Entity resolution, or data matching, is the problem of finding and linking different mentions of the same entity within a single data source or across multiple data sources. It goes by many names: fuzzy data matching, fuzzy record matching, entity matching, record matching, record linkage, dedupe, deduplication, merge purge, reference matching etc. The entity to be resolved can be of any type: person, organization, address, product etc. Each entity has its attributes, such as email id, url, phone number, house number, brand, model or capacity, and each attribute can be captured differently in two records of the same entity.
Entity Resolution – an example
Take for example a customer record.
Even within just two records, we can see that the first name is spelled differently and carries a salutation, the middle name is missing, the telephone number is different, address 1 varies and address 2 is omitted in the first instance.
As humans, we can read these entries and infer that they probably reference the same individual. But a computer only understands equality or the lack of it, so how can we reconcile these two records into one single golden copy? If we cannot do that, our master data gets corrupted. We are unable to discover relationships and patterns, and we cannot make effective decisions. As our data grows, the problem becomes more widespread and tougher to solve.
Entity resolution by rules
To build an entity resolution system, we could follow a traditional rule-based approach. Essentially, a rule-based system is a big if-then of multiple conditions. Some sample rules can be:
- match first 2 characters of first name
- match first 3 characters of last name
- check if country 2 is a substring of country 1
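The sample rules above can be sketched as a single if-then check. This is only an illustration: the function name `rules_match`, the field names (`first_name`, `last_name`, `country`) and the sample records are assumptions made for the example, not part of any particular product.

```python
def rules_match(a: dict, b: dict) -> bool:
    """Return True only when all three sample rules hold for records a and b."""
    first_a, first_b = a["first_name"].lower(), b["first_name"].lower()
    last_a, last_b = a["last_name"].lower(), b["last_name"].lower()
    country_1, country_2 = a["country"].lower(), b["country"].lower()
    return (
        first_a[:2] == first_b[:2]      # first 2 characters of first name
        and last_a[:3] == last_b[:3]    # first 3 characters of last name
        and country_2 in country_1      # country 2 is a substring of country 1
    )


# Hypothetical records, made up for this sketch.
rec_1 = {"first_name": "Jonathan", "last_name": "Smith",
         "country": "United States of America"}
rec_2 = {"first_name": "Jon", "last_name": "Smithe",
         "country": "United States"}
rec_3 = {"first_name": "Mr. Jonathan", "last_name": "Smith",
         "country": "United States"}

print(rules_match(rec_1, rec_2))  # the three rules hold for this pair
print(rules_match(rec_1, rec_3))  # the salutation breaks the first-name rule
```

The second pair shows how brittle such rules are: a salutation in the first name is enough to make the prefix rule fail, even though a human would treat the records as the same person.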
We could even get more advanced in our rule-based entity resolution system and apply an approximate string matching measure like Jaro similarity or Levenshtein distance.
It's easy to see that discovering and maintaining these rules will be a big challenge. There are bound to be rules we miss. Ordering the rules and updating them as new records bring new variations is tricky. To get more precise, we would typically use some kind of edit distance or vector distance on the characters of each field, apply a weight to the score of each field, and compute an overall score. We would then pick the pairs whose score is above a particular threshold. Needless to say, choosing the right algorithm and the right weights will remain a challenge. Also, with the variety of data we see in terms of vendors, customers, products, organizations etc, it is a herculean task to address data matching for all the different entities.
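A minimal sketch of this weighted scoring, using Levenshtein distance as the per-field measure, could look as follows. The field names, the weights in `WEIGHTS` and the value of `THRESHOLD` are hypothetical choices made for the illustration; in practice they are exactly the knobs that are hard to tune.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(s: str, t: str) -> float:
    """Normalise the edit distance to a score between 0.0 and 1.0."""
    s, t = s.strip().lower(), t.strip().lower()
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


# Hypothetical weights and threshold, chosen only for this sketch.
WEIGHTS = {"first_name": 0.3, "last_name": 0.5, "city": 0.2}
THRESHOLD = 0.8


def record_score(a: dict, b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    return sum(w * similarity(a[f], b[f]) for f, w in weights.items()) / total


def is_match(a: dict, b: dict, threshold: float = THRESHOLD) -> bool:
    """A pair is a match when its overall score reaches the threshold."""
    return record_score(a, b) >= threshold


rec_a = {"first_name": "Jonathan", "last_name": "Smith", "city": "New York"}
rec_b = {"first_name": "Jon", "last_name": "Smyth", "city": "New York"}
print(record_score(rec_a, rec_b), is_match(rec_a, rec_b))
```

With these made-up weights the pair scores about 0.71 and falls below the 0.8 threshold, although a reader would likely call it a match. Lowering the threshold or reweighting the fields fixes this pair but may let false matches through elsewhere.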
Entity Resolution is compute heavy due to number of comparisons
There is another challenge with entity resolution. Even after defining the rules for similarity and data matching, we still have to deal with the scale of the problem. As there are no unique identifiers or equal keys to compare, every rule that we create has to run on all n*(n-1)/2 unique pairs of our n records. With a mere 100,000 records, we have 4,999,950,000 potential pairs to compute similarity on. Even with a very fast way to compute similarity within a pair, say 1 millisecond per 100 pairs, we need a whopping 13.9 hours (49,999,500 milliseconds).
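The arithmetic is easy to reproduce. The helper names `pair_count` and `hours_to_compare` below are made up for this sketch, and the throughput of 100 pairs per millisecond is simply the assumption used in the text.

```python
def pair_count(n: int) -> int:
    """Number of unique unordered pairs among n records: n*(n-1)/2."""
    return n * (n - 1) // 2


def hours_to_compare(n: int, pairs_per_ms: float = 100) -> float:
    """Hours needed to compare every pair at the given throughput."""
    milliseconds = pair_count(n) / pairs_per_ms
    return milliseconds / 1000 / 3600


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} records -> {pair_count(n):>15,} pairs, "
          f"{hours_to_compare(n):,.2f} hours")
```

Because the pair count grows quadratically, ten times more records means roughly a hundred times more comparisons, so 100,000 records take a little under 14 hours at this rate.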
Entity Resolution – Challenges
Entity Resolution is challenging because
- It is tough to define matching rules for an attribute
- Combining matching rules for different attributes of a record is again challenging
- It is time consuming to define rules for each entity type
- Multiple languages like Chinese, Japanese, Thai, German, French have their own notions of text similarity
- It is tough to control the run time performance of the matching. Even with a few thousand records, the number of comparisons is large.
Reifier and Entity Resolution
We have chosen to build Reifier using AI and Spark so that we can provide big data matching and entity resolution with ease.
Learn how we are solving this problem of entity resolution using Spark with our AI Engine for Data Matching and check matching samples.
Contact us for an evaluation.