Deduplication has to cope with the same entity being represented in slightly different ways. Some fields may contain minor typographical errors, middle initials or names may be present or absent, and some fields may be missing altogether. Each of these differences is small, yet the records still need to be identified as duplicates when they refer to the same real-world entity.
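To make these variations concrete, here is a minimal Python sketch that scores two records field by field and skips fields that are missing on either side. It uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure. The sample records, field names, and helper functions are hypothetical and only for illustration; this is not a description of how Reifier computes similarity.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Return a 0..1 similarity for two field values, or None if either is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def record_similarity(rec_a, rec_b, fields):
    """Average the similarity over the fields that are present in both records."""
    scores = []
    for field in fields:
        score = field_similarity(rec_a.get(field), rec_b.get(field))
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical sample records: a and b describe the same person, c does not.
a = {"name": "John A. Smith", "address": "12 Main Street", "email": "jsmith@example.com"}
b = {"name": "Jon Smith", "address": "12 Main St", "email": None}
c = {"name": "Maria Gonzalez", "address": "48 Oak Avenue", "email": None}

FIELDS = ["name", "address", "email"]
score_ab = record_similarity(a, b, FIELDS)  # typo, missing initial, missing email
score_ac = record_similarity(a, c, FIELDS)  # a genuinely different customer

print(f"a vs b: {score_ab:.2f}")
print(f"a vs c: {score_ac:.2f}")
```

The likely duplicate pair scores much higher than the unrelated pair even though no field matches exactly. Deciding where to draw the line between the two is the hard part.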
Large lists are often the result of merging multiple files, each sourced from a different database. Some fields may be present in one file and missing in another. A typical customer record has a name and address, and sometimes an email address or phone number. Names may be further split into first name, last name, initial, title and so on, each of which may sit in a field of its own or be concatenated with the others in a single field. Addresses are generally not standardized. A trained human can handle all these differences as long as the data set is small, but as the number of records grows, doing this manually becomes impractical. Because the differences are subtle and there is no common identifier, hand-coding the matching rules is difficult.
With Reifier’s Spark-based fuzzy matching engine, there is no need to define complex rules. Reifier learns the match rules much as a human would, from simple yes and no answers about matches and non-matches. Once trained, Reifier easily removes duplicates to create a clean and comprehensive list.
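The idea of learning from yes and no answers instead of hand-written rules can be sketched in a few lines of Python. This toy example learns a single similarity cutoff from labeled pairs. It is an illustrative assumption only, not Reifier's actual training method, and the labeled pairs and function names are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive 0..1 string similarity (a stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def learn_threshold(labeled_pairs):
    """Pick the similarity cutoff that classifies the most labeled pairs correctly."""
    scored = [(similarity(a, b), is_match) for a, b, is_match in labeled_pairs]
    best_threshold, best_correct = 1.0, -1
    # Try each observed score as a cutoff; keep the first one with the most correct labels.
    for t in sorted({score for score, _ in scored}):
        correct = sum((score >= t) == is_match for score, is_match in scored)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold

def is_duplicate(a, b, threshold):
    """Apply the learned cutoff to a new, unlabeled pair."""
    return similarity(a, b) >= threshold

# Hypothetical yes/no answers supplied by a human reviewer.
labeled = [
    ("John A. Smith", "Jon Smith", True),
    ("Acme Corp.", "ACME Corporation", True),
    ("John Smith", "Maria Gonzalez", False),
    ("Acme Corp.", "Globex Inc.", False),
]

threshold = learn_threshold(labeled)
print(is_duplicate("Jon Smith", "John Smith", threshold))
print(is_duplicate("John Smith", "Globex Inc.", threshold))
```

A single cutoff on one string score is far cruder than a trained matching engine, but it shows the principle: the reviewer supplies the answers, and the decision boundary is derived from them.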