Entity resolution, or data matching, is the problem of finding and linking different mentions of the same entity within a single data source or across multiple data sources. Fuzzy data matching, or fuzzy record matching, goes by many names: entity matching, record matching, record linkage, dedupe, deduplication, entity resolution, merge purge, reference matching, etc. The entity can be of any type: person, organization, address, product, etc. Each entity also has attributes such as email id, URL, phone number, house number, brand, model, capacity, etc., and each attribute can be captured differently in two records of the same entity.
Take for example a customer record.
Even within just two records, we can see that the first name is spelled differently and carries a salutation, the middle name is missing, the telephone number is different, address 1 varies, and address 2 is omitted in the first instance.
As humans, we can read these entries and infer that they probably refer to the same individual. But a computer only understands equality or the lack of it, so how can we reconcile these two records into one single golden copy? If we can't do that, our master data gets corrupted. We are unable to discover relationships and patterns and cannot make effective decisions. As our data grows, the problem becomes more widespread and tougher to solve.
We could follow a traditional rule-based approach. Essentially, a rule-based system is a big if-then of multiple conditions, like match the first 2 characters of the first name followed by the first 3 characters of the last name, and if country 2 is a substring of country 1… It's easy to see that discovering and maintaining these rules will be a big challenge. We could get more technical and apply an approximate string matching algorithm like Jaro or Levenshtein distance. To be more precise, we would typically use a combination of some kind of edit distance or vector distance on the characters of each field, apply a weight to the score of each field, and compute an overall score. We would then pick the pairs whose score is above a particular threshold. Needless to say, choosing the right algorithm and tuning the right weights will remain a challenge. Also, with the variety of data we see in terms of vendors, customers, products, organizations, etc., it is a herculean task to address data matching for all the different entities.
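The weighted-score approach described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular product works: the field names, the weights in `WEIGHTS`, the `THRESHOLD` value, and the sample records are all assumptions made up for the example, and the similarity is a plain Levenshtein distance normalized by the longer string's length.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the normalized strings are identical."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty values count as equal
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical weights and threshold; in practice these need tuning per entity type.
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "city": 0.3}
THRESHOLD = 0.8


def record_score(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    weighted = sum(w * field_similarity(r1.get(f, ""), r2.get(f, ""))
                   for f, w in weights.items())
    return weighted / total


def is_match(r1: dict, r2: dict, threshold: float = THRESHOLD) -> bool:
    return record_score(r1, r2) >= threshold


# Made-up sample records, for illustration only.
record_a = {"first_name": "Jon", "last_name": "Smith", "city": "New York"}
record_b = {"first_name": "John", "last_name": "Smyth", "city": "New York"}
print(record_score(record_a, record_b), is_match(record_a, record_b))
```

Even in this tiny sketch, the result depends entirely on the chosen weights and threshold, which is exactly the tuning burden described above.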
There is another challenge. Even after defining the rules for similarity, we still have to deal with the scale of the problem. As there are no unique identifiers or equal keys to compare, for every n records we have, every rule that we create will run on n*(n-1)/2 unique pairs. With a mere 100,000 records, we have 4,999,950,000 potential pairs to compute similarity on. Even with a very fast way to compute similarity within a pair, say 1 millisecond per 100 pairs, we need a whopping 13.9 hours or so.
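The arithmetic behind these numbers is easy to check. The short sketch below (the function names are our own, chosen for illustration) counts the unique pairs and converts the assumed rate of 100 pairs per millisecond into hours.

```python
def pair_count(n: int) -> int:
    """Number of unique unordered pairs among n records: n*(n-1)/2."""
    return n * (n - 1) // 2


def hours_to_compare(n: int, pairs_per_ms: float = 100) -> float:
    """Hours needed to score every pair at the given rate."""
    milliseconds = pair_count(n) / pairs_per_ms
    return milliseconds / 1000 / 3600


print(pair_count(100_000))                   # 4999950000
print(round(hours_to_compare(100_000), 2))   # 13.89
```

Because the pair count grows quadratically, doubling the number of records roughly quadruples the run time.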
Entity resolution is challenging because:
- It is tough to define matching rules for an attribute
- It is tough to combine matching rules for different attributes of a record
- It is time consuming to define rules for each entity type
- Multiple languages like Chinese, Japanese, Thai, German, French have their own notions of text similarity
- It is tough to control the run time performance of the matching. Even with a few thousand records, the number of comparisons is large.
We have chosen to build Reifier using Spark so that we can provide big data matching and entity resolution with ease.
Learn how we are solving this problem of entity resolution using Spark with our AI Engine for Data Matching and check matching samples.
Contact us for an evaluation.
Posted on July 25th, 2016