UBM Asia collects contact information from visitors to its leading trade fairs. Because the visitors are international, the data is multilingual, with a mix of English, Chinese, Thai, Turkish, Korean and Japanese. A typical dataset holds about a million records. At the end of each fair, the data is collated and fed into a CRM (Customer Relationship Management) system for future correspondence, offerings and promotions.
The contact information is riddled with poor-quality data: missing fields, typographical and lexical differences, and field swapping across multiple entries for the same person. Visitors often provide a shared company sales or marketing email, phone number and address instead of their own. In other cases the same visitor provides different emails or phone numbers, or an official address in one entry and a personal address in another. There are also misspellings, partial names with a missing first, middle or last name, leading and trailing spaces, and other typographical variations across all fields. As a consequence of these duplicates, UBM Asia was:
- Missing cost-saving opportunities
- Delivering a sub-optimal customer experience, as the same customer was approached multiple times for the same offer
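To make these variations concrete, here is a minimal Python sketch of two entries that plausibly belong to the same visitor. The records, field names and the helper functions `field_similarity` and `record_similarity` are all hypothetical, invented for illustration; this is not the approach UBM Asia used, only a demonstration that an exact comparison misses a pair that an approximate string comparison (here the standard library's `difflib.SequenceMatcher`) still scores as similar.

```python
from difflib import SequenceMatcher

# Hypothetical records, invented for illustration only.
record_a = {
    "name": "Priya Raman ",               # trailing space
    "email": "priya.raman@example.com",   # personal email
    "company": "Acme Textiles Pvt Ltd",
}
record_b = {
    "name": "priya ramen",                # different case, misspelling
    "email": "sales@example.com",         # shared company email
    "company": "Acme Textiles Private Limited",
}


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity after trimming whitespace and ignoring case."""
    return SequenceMatcher(None, a.strip().casefold(), b.strip().casefold()).ratio()


def record_similarity(r1: dict, r2: dict, fields=("name", "email", "company")) -> float:
    """Average the per-field similarities (an unweighted, illustrative choice)."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)


# An exact, join-style comparison treats the two entries as different people.
exact_match = all(record_a[f] == record_b[f] for f in record_a)

# An approximate comparison gives a graded score instead of a yes/no answer.
score = record_similarity(record_a, record_b)

print("exact match:", exact_match)
print("approximate similarity:", round(score, 2))
```

Even this toy version relies on hand-picked cleanup steps and an arbitrary averaging rule, and it compares only one pair; scoring every pair among a million records this way would not be practical.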
The sheer size of the data and the nuanced differences between records make manual deduplication impossible. Because exact matches are rare, database joins and filtering are ruled out too. The deduplication solution therefore needed to:
- Handle different variations in fields across records
- Be managed directly by business users
- Work without data massaging, normalization and preprocessing
- Support different geographies: even when the names are in English, there are regional differences between an event held in India and one held in Singapore
- 5x speedup on deployment time and 2x on accuracy