We have been working on version 2 of Reifier for some time. We are now at a stage where the core functionality is taking shape, and we have a solid foundation on which to deliver the next generation of fuzzy matching, reference and master data management using AI. When we started Reifier, we designed it as a file-based big data application working on data residing in local, S3 or HDFS files: cleanly formatted delimited text files, with the same structure for all the records one would like to match. At that stage, it felt right to keep this simplification, because we wanted to make sure that the most critical pieces of the puzzle were highly accurate and performant. Big data preprocessing and matching was sure to be the focal point of our system, and we focused relentlessly on keeping it at least 20x more accurate, performant and scalable than traditional rule-based or threshold-based systems.
We also wanted to make sure there was a market for our work, and we reasoned that once we had sufficient traction, we could pull in additional data sources later. After multiple deployments, we felt constrained by our first version. We have a big vision and roadmap for Reifier as the data management application of the future, using AI on large data sets to uniquely link records and build golden copies. In the near future, Reifier will be able to understand entities, prescribe and suggest business actions, drive enterprise revenue and proactively manage risk. Our customers have been asking us for more data sources, more formats, more data massaging, a peek inside the models, online training and better data stewardship, and we have been aching to deliver them. Their asking is a good sign: it solidifies their relationship with Reifier and shows their need to integrate it more and more into their work.
Reifier v2 is our answer to many of these requests. We have migrated from Spark's RDD-based API to the Dataset API, and this has led to a much smaller code base, even as we have gained the ability to support diverse formats like JSON, XML, Parquet, Avro and CSV, as well as data sources like Salesforce, Redshift, Cassandra and HBase. Earlier, we had our own internal data structures to represent records, with data marshalling and unmarshalling based on field types. Now we just use Spark's Row class. We have also moved from our home-grown feature engineering, evaluation and transform pipeline to the Spark ML Pipeline, just plugging in our custom models, transformers and evaluators. All this has dramatically reduced our code base while letting us focus on our core algorithms and leave the rest to Spark.
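To make the idea of plugging custom stages into a pipeline concrete, here is a minimal sketch in plain Python. It is not Reifier's code and it does not use Spark's API: the classes `Stage`, `Normalize`, `AddTokenCount` and `Pipeline`, and the column names, are hypothetical stand-ins. In Spark itself, the equivalent roles are played by the Transformer and Pipeline classes of the ML package, operating on Datasets of Row objects rather than lists of dictionaries.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical stand-in for a Spark Row: a plain dict of column name -> value.
Record = Dict[str, Any]


class Stage:
    """A pipeline stage: takes records in, returns new records out."""

    def transform(self, records: List[Record]) -> List[Record]:
        raise NotImplementedError


@dataclass
class Normalize(Stage):
    """Lowercases a text column and collapses repeated whitespace."""

    column: str

    def transform(self, records: List[Record]) -> List[Record]:
        return [
            {**r, self.column: " ".join(r[self.column].lower().split())}
            for r in records
        ]


@dataclass
class AddTokenCount(Stage):
    """Adds a column holding the number of whitespace-separated tokens."""

    input_col: str
    output_col: str

    def transform(self, records: List[Record]) -> List[Record]:
        return [
            {**r, self.output_col: len(r[self.input_col].split())}
            for r in records
        ]


@dataclass
class Pipeline:
    """Runs the stages in order, feeding each stage's output to the next."""

    stages: List[Stage]

    def transform(self, records: List[Record]) -> List[Record]:
        for stage in self.stages:
            records = stage.transform(records)
        return records
```

Because every stage shares the same `transform` contract, a custom stage can be added or swapped without touching the rest of the pipeline, which is the property that lets a framework own the plumbing while the application supplies only its own logic.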
We are excited by the recent changes, and though we have spent a good part of last year and this year moving the code, our slimmer, leaner code base will allow us to move rapidly in the direction we want to take Reifier. Watch this space as we share more details about Reifier v2, the absolutely groundbreaking master data management engine for your data!
Posted on August 7th, 2018