Deduplication deals with the same entity being represented in slightly different ways. Some fields may contain minor typographical errors, middle initials or names may be present or absent, suffixes and prefixes may vary, or fields may be missing altogether. Despite all these differences, records need to be identified as duplicates when they refer to the same real-world entity.
Large lists are often the result of merging multiple files, each sourced from a different database. Some fields may be present in one file and missing in another. A typical customer record has a name and address, and sometimes an email address or phone number. Names may be further split into first name, last name, initial, title and so on, each of which could sit in a field of its own or be concatenated with the others into a single field. Addresses are generally not standardized. A trained human can handle all these differences as long as the data is small, but as the number of records grows, doing it manually is no longer feasible. Because the differences between attributes are subtle and there is no common identifier, coding rules by hand is difficult.
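To make the difficulty concrete, here is a minimal Python sketch of the kind of hand-written normalization and fuzzy comparison one might start with for names. It is only an illustration and not Reifier's method: the NOISE_TOKENS list and the function names are assumptions made up for this example, and the comparison uses the standard library's difflib.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of titles and suffixes to ignore; a real list would be much longer.
NOISE_TOKENS = {"mr", "mrs", "ms", "dr", "jr", "sr"}


def normalize_name(raw: str) -> str:
    """Lowercase, strip punctuation, and drop titles, suffixes and single-letter initials.

    Simplification: only the letters a-z are kept, so accented characters are lost.
    """
    tokens = re.sub(r"[^a-z\s]", " ", raw.lower()).split()
    kept = [t for t in tokens if t not in NOISE_TOKENS and len(t) > 1]
    return " ".join(kept)


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


print(normalize_name("Dr. John A. Smith Jr."))
print(name_similarity("John Smith", "Jon Smith"))
print(name_similarity("John Smith", "Mary Jones"))
```

Even this small example needs a hand-maintained token list and still leaves open what score should count as a match, which is where rule writing starts to get hard as more fields and sources are added.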
With Reifier’s Spark and AI based fuzzy matching engine, there is no need to define complex rules. Reifier learns the match rules much as a human would, from simple yes or no answers about which pairs of records match and which do not. Once trained, Reifier removes duplicates to create a clean and comprehensive list.
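The general idea of learning from yes and no answers can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of how Reifier works internally: it learns a single similarity cutoff from a handful of made-up labeled pairs, whereas a real engine would learn from many fields and far more examples. All names and data here are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def learn_threshold(labeled_pairs):
    """Pick the similarity cutoff that classifies the most labeled pairs correctly.

    labeled_pairs: list of (record_a, record_b, is_match) tuples.
    Candidate cutoffs are the midpoints between adjacent distinct scores.
    """
    scored = [(similarity(a, b), is_match) for a, b, is_match in labeled_pairs]
    scores = sorted({score for score, _ in scored})
    if len(scores) < 2:
        raise ValueError("need labeled pairs with at least two distinct scores")
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])]
    best_cutoff, best_correct = candidates[0], -1
    for cutoff in candidates:
        correct = sum((score >= cutoff) == is_match for score, is_match in scored)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


def is_duplicate(a: str, b: str, cutoff: float) -> bool:
    """Apply the learned cutoff to a new pair of records."""
    return similarity(a, b) >= cutoff


# Made-up yes/no answers, as a reviewer might give them.
training = [
    ("John Smith", "Jon Smith", True),
    ("Acme Corp", "ACME Corp.", True),
    ("John Smith", "Mary Jones", False),
    ("Acme Corp", "Apex Ltd", False),
]
cutoff = learn_threshold(training)
print(is_duplicate("Jane Doe", "Jane  Doe", cutoff))
```

The point of the sketch is the workflow: the reviewer only answers yes or no, and the decision boundary is derived from those answers rather than written by hand.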
Posted on May 7th, 2017