The festive season is around the corner, and we wish you all a very happy, clean and joyful Diwali. Now that we have mentioned clean, we can't help but think of clean data, but that's for another day!
Cheers!
Posted on November 4th, 2018
We have been working on version 2 of Reifier for some time. We are now at a stage where the core functionality is beginning to take shape, and we have a solid foundation to deliver the next generation of fuzzy matching, reference and master data management using AI. When we started Reifier, we designed it as a file-based big data application working on data residing in local, S3 or HDFS files: cleanly formatted delimited text files, with the same structure for all the records one would like to match. At that stage, it felt right to keep this simplification, because we wanted to make sure that the most critical pieces of the puzzle were highly accurate and performant. The big data preprocessing and matching was sure to be the focal point of our system, and we focused relentlessly on keeping it at least 20x more accurate, performant and scalable than traditional rule-based or threshold-based systems.
We also wanted to make sure there was a market for our work, and we reasoned that once we had sufficient traction, we could pull in additional data sources later. After multiple deployments, we felt restrained by our first version. There is a big vision and roadmap for Reifier as the data management application of the future, using AI on large data sets to uniquely link records and build golden copies. In the near future, Reifier will be able to understand entities, prescribe and suggest business actions, drive enterprise revenue and proactively manage risk. Our customers have been asking us for more data sources, more formats, more data massaging, a peek inside the models, online training and better data stewardship, and we have been aching to deliver them. These requests are a good sign: they solidify our customers' relationship with Reifier and reflect their need to integrate it more and more into their work.
Reifier v2 is our answer to many of these dreams. We have migrated from the RDD-based API to the Dataset API, which has led to a much smaller code base even as we gain the ability to support diverse formats such as JSON, XML, Parquet, Avro and CSV, as well as data sources such as Salesforce, Redshift, Cassandra and HBase. Earlier, we had our own internal data structures to represent records, with data marshalling and unmarshalling based on field types. Now we just use the Row class. We have also moved from our home-grown feature engineering, evaluation and transform pipeline to the ML Pipeline in Spark, just plugging in our custom models, transformers and evaluators. All this has dramatically reduced our code base while letting us focus on our core algorithms and leave the rest to Spark.
We are excited by the recent changes. Though we have spent a good part of last year and this year moving the code, our slimmer, leaner code base will allow us to move rapidly in the direction we want to take Reifier. Watch this space as we unfold more details about Reifier v2, the absolutely groundbreaking master data management engine for your data!
Posted on August 7th, 2018
It seems even wild animals can benefit from fuzzy matching! In a new proposal, the Thane Wildlife Warden has requested that forest and wildlife crime data be linked with police records, so that character certificates appropriately reflect an individual's crime history. Here is the news article.
Posted on May 9th, 2018
After the festivities of the new year, the team has been working on deploying Reifier on the Databricks cloud. Databricks provides a unified analytics platform with a fully managed Spark service. Some of our newest customers run massive analytic and predictive workloads on the Databricks cloud. They want the ability to run Reifier from within their data processing pipeline: building 360 views, removing duplicates, understanding relationships and consolidating accounts. Reifier was architected from the ground up to be a Spark application, and we had invested the effort to become Apache Spark Certified. As the Databricks platform provides an optimized Spark environment, only minimal changes around the creation of the SparkContext were needed at our end to make the transition.
We are happy to announce that we have tested Reifier on the Databricks platform. Reifier has always supported multiple deployment models, enabling enterprises to master their customer, vendor, location, employee and other data seamlessly on an infrastructure of their choice: AWS, Azure, Google Cloud or their own data centre. With Reifier on Databricks, we give our customers yet another option and more flexibility.
Posted on February 5th, 2018
Wishing you all a very happy, healthy and prosperous new year! Hope the new year brings in new joy and successes. May unified data lead you to greater customer acquisition, enhanced compliance and operational efficiency and smoother mergers and acquisitions!
Posted on December 28th, 2017
Effective selling means knowing your customer, understanding their pain points, learning their journey and helping them achieve their goals. (more…)
Posted on May 1st, 2017
As a company, we are committed to building tools and techniques that enable our customers to make informed choices by utilizing and maximizing the value of their data assets. Business is tough and conditions are hard, but we want to make sure that we make a difference to your bottom line. Here is the first tip of the series. Stay tuned for simple techniques, analysis and tips for better business.
Posted on April 30th, 2017
Kudos! Reifier got covered in Analytics India Magazine, the number one platform for analytics, data science and big data, dedicated to passionately championing and promoting the ecosystem in India.
Read the full story here
Posted on April 30th, 2017
..Or rather – the golden record: the one single version of the entity record, which combines data from different sources and holds the most trusted information about the entity.
When we started working with Reifier, we were very heavily focussed on matching, and we worked tirelessly to make it the easiest, fastest and most accurate matching engine on the planet. We got some rave reviews, and also tuned the technology on diverse customer data sets, uncovering some beautiful record linkages and matches across languages, multi-domain entities and locale-specific data.
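To make the idea of a golden record concrete, here is a minimal sketch of two simple merge rules: keeping the most recently updated value, and preferring a trusted source system. It is an illustration only, not Reifier's implementation; the record layout, field names, source names and function names are all assumptions made up for this example.

```python
from datetime import date

# Hypothetical matched records for one entity, coming from two source systems.
records = [
    {"source": "crm", "updated": date(2017, 1, 10),
     "name": "A. Kumar", "phone": None, "city": "Pune"},
    {"source": "billing", "updated": date(2017, 3, 2),
     "name": "Anil Kumar", "phone": "555-0100", "city": None},
]

METADATA_FIELDS = ("source", "updated")


def _merge_in_order(ordered_records):
    """Build one record; later records overwrite earlier ones, empty values never win."""
    golden = {}
    for rec in ordered_records:
        for field, value in rec.items():
            if field in METADATA_FIELDS or value is None:
                continue
            golden[field] = value
    return golden


def merge_last_updated(records):
    """Per field, keep the non-empty value from the most recently updated record."""
    return _merge_in_order(sorted(records, key=lambda r: r["updated"]))


def merge_source_priority(records, priority):
    """Per field, keep the non-empty value from the most trusted source.

    `priority` lists source names from most trusted to least trusted.
    """
    rank = {source: i for i, source in enumerate(priority)}
    # Process the least trusted source first so trusted sources overwrite it.
    ordered = sorted(records, key=lambda r: rank[r["source"]], reverse=True)
    return _merge_in_order(ordered)


by_recency = merge_last_updated(records)
by_source = merge_source_priority(records, priority=["crm", "billing"])
print(by_recency)
print(by_source)
```

With this sample data the two rules agree on phone and city but disagree on the name, which hints at why choosing and maintaining merge rules takes real thought.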
Our earlier users had simple merge functions: take the last updated record, or prefer a specific source system. However, over the last few months, detailed discussions have led us to uncover some deep challenges in defining and maintaining merge rules. We think it's time for some disruption. Watch this space as we fill in more details on the ideal merge.
Posted on April 14th, 2017
I recently subscribed to an email newsletter from one of the most popular investment magazines in India. Somehow I subscribed to the same newsletter at two different places on the website, giving the same email address at both places.
The result of this dual request is that I get two copies of the same email every time they send out a post. Thanks to Gmail, which consolidates the emails in the same thread, I have to read only one of them and can delete/archive both of the mails with the same button click.
I know it is good for the magazine to have a large subscription count, and sending out an additional email doesn't cost much. But duplicate customers in the subscriber list give a false count of subscribers, which leads to bad marketing and sales decisions. Also, from the customer's point of view, receiving multiple copies is an irritant and gives a bad impression of the magazine.
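As a rough sketch of how little it takes to catch this particular kind of duplicate, the snippet below collapses signups that share an email address. The signup records, field names and function names are hypothetical, and treating addresses as equal after trimming whitespace and lowercasing is an assumption of this example. It is exact matching on a single key, not fuzzy matching.

```python
def normalize_email(email):
    """Trim whitespace and lowercase so trivially different spellings compare equal."""
    return email.strip().lower()


def dedupe_subscribers(signups):
    """Keep the first signup seen for each normalized email address."""
    unique = {}
    for signup in signups:
        key = normalize_email(signup["email"])
        unique.setdefault(key, signup)  # later duplicates are ignored
    return list(unique.values())


# Hypothetical signups: the same reader subscribed from two forms on the site.
signups = [
    {"email": "reader@example.com", "form": "homepage"},
    {"email": "Reader@Example.com ", "form": "article-footer"},
    {"email": "other@example.com", "form": "homepage"},
]

subscribers = dedupe_subscribers(signups)
print(len(signups), "signups,", len(subscribers), "unique subscribers")
# prints: 3 signups, 2 unique subscribers
```

The subscriber count used for marketing decisions would then come from `subscribers` rather than from the raw signups.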
In this case, a little attention from management to deduplicating customer records can lead to quick rewards: a better understanding of the customer and a better reputation in the customer's eyes.
Posted on April 14th, 2017