We got the Reifier application certified on Apache Spark, which means that Reifier works with ease across different distributions of Apache Spark.
We are also happy to be covered on the Databricks blog.
Posted on December 20th, 2016
Understanding entities and their relationships is at the core of any security, surveillance and intelligence operation. By using Reifier, intelligence agencies can discover the people entity graph and decipher hidden relationships and patterns. Understanding how people are connected is the most essential analysis, over which other layers of interaction can be juxtaposed to complete the picture.
As part of the customer journey, we also provide thought leadership on applying AI to intelligence and security scenarios. Here is a non-confidential presentation highlighting some of our thoughts and use cases for AI in intelligence and security.
It was a pleasure to once again be a part of the convention held by the Department of Financial Studies, New Delhi. I thoroughly enjoyed the discussion with the students on big data, AI and use cases in the financial industry. Here are some videos from the convention.
- Sonal
Posted on November 11th, 2016
According to a recent article in Harvard Business Review, the cost of dirty data in the US alone is a staggering $3 trillion a year. Needless to say, the worldwide figure will be much higher.
Dirty, duplicate data hampers productivity, slows decision making and hurts operations. While counting these costs, we must also ask why the problem persists. There can be multiple factors: the lack of a clear data policy, the difficulty of coordinating the different departments that produce and consume data, and the absence of accessible tooling.
Easy-to-use tools in the hands of business analysts can make a big difference to data quality, reduce reliance on IT and improve efficiency.
Posted on September 26th, 2016
The 42nd Annual Conference on Very Large Data Bases has a number of interesting workshops and sessions on data quality. The 11th International Workshop on Quality in Databases, with a special focus on Big Data Integration and Quality, is also part of the event. The data cleansing research track covers data repair using regular expressions, timestamp cleaning with temporal constraints and other interesting topics. There is also a session on progressive and iterative data cleaning for statistical modelling, which looks very interesting.
Posted on September 8th, 2016
Happy to join ML-India, the buzzing machine learning and data science community. Hoping to connect with academic institutions, corporates and practitioners doing cutting-edge data science in India.
Posted on September 8th, 2016
A very timely piece on customer data utilized for effective customer journeys. The company combined customer data from different formats and channels to create a holistic customer journey: understanding the customer lifecycle, crafting personalized messages and selecting appropriate channels. Needless to say, entity resolution and record matching are a critical piece of this approach. Key decision makers also need to be fully bought in, and coordination across departments is a must for the data sharing that makes effective marketing possible.
Posted on August 24th, 2016
There is an interesting article at TechCrunch on how startups can compete with enterprises in artificial intelligence and machine learning. The article talks about the importance of data for AI and ML, and correctly points out how difficult it is for startups to compete in the space: big companies already own proprietary data sets spanning multiple industries. This puts startups at a disadvantage, because as pure algorithms become more and more open source, it's the company with the best data that wins.
When we think about generic systems like Siri and Cortana, they are built for mass deployment and work through lots of data, learning as different users interact with them. The more users they have, the more they learn about different intents and responses. Many of these systems have generic models across all users, plus user-specific personal models, juxtaposed to give the right information or action at the right time. Clearly, more and better data make for better learning, and that translates into a more usable system. The more usable the system becomes, the more people use it, and this feedback loop makes it even smarter. The trick here is to start with a huge initial base of data and augment it further through user interaction. It is definitely difficult for a startup to start with that base data.
The second type of system is narrow, domain-specific AI, like chatbots or natural language interfaces for retail banking. Gathering your own and your team's statements for the last couple of years can be a good place to start here. If you can get a single bank interested and ready to share its call centre logs and scripts, a good deal of learning can be done. Startups, ever pressed for resources, are always very creative in making the most of even the smallest bits of data they can get.
In the case of Reifier, when we were building our entity resolution engine, we started with a single synthetic data set of about 50 records (!!) containing first name, last name, street address, city, state and date of birth. This data set had a good number of duplicate records, which we wanted to match and link together. We built and tested the core algorithms on this data. Working off a small, narrow dataset, we could pinpoint exactly how each record was being indexed, the number of buckets we were creating, how many comparisons we were making and the final identity matching scores we were getting.
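To make the indexing-and-buckets idea above concrete, here is a minimal sketch of blocking-based candidate generation. The blocking key, field names and scoring function are illustrative assumptions for this post, not Reifier's actual implementation:

```python
from collections import defaultdict

def blocking_key(record):
    # Illustrative blocking key: first letter of last name plus city.
    return (record["last_name"][:1].lower(), record["city"].lower())

def candidate_pairs(records):
    # Index records into buckets; only records sharing a bucket are compared.
    buckets = defaultdict(list)
    for i, rec in enumerate(records):
        buckets[blocking_key(rec)].append(i)
    for ids in buckets.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                yield ids[a], ids[b]

def match_score(r1, r2):
    # Toy similarity: fraction of fields that agree exactly.
    fields = ["first_name", "last_name", "street", "city", "state", "dob"]
    return sum(r1[f] == r2[f] for f in fields) / len(fields)

records = [
    {"first_name": "Jon", "last_name": "Smith", "street": "12 Oak St",
     "city": "Austin", "state": "TX", "dob": "1980-01-02"},
    {"first_name": "John", "last_name": "Smith", "street": "12 Oak St",
     "city": "Austin", "state": "TX", "dob": "1980-01-02"},
    {"first_name": "Mary", "last_name": "Jones", "street": "4 Elm Rd",
     "city": "Dallas", "state": "TX", "dob": "1975-06-09"},
]

pairs = list(candidate_pairs(records))
scores = {(a, b): match_score(records[a], records[b]) for a, b in pairs}
```

Because only records sharing a bucket are compared, the two "Smith" records form the single candidate pair instead of all three possible pairs; a real engine would use fuzzier blocking keys and string-similarity measures rather than exact field equality.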
We then built bigger data sets and tested how well we were generalizing to them. Although we were working on a big data stack using Apache Spark, it helped to stay on a single machine, where we could easily isolate whether issues were due to our own code. Then we could run on a bigger cluster and see how well the work was being distributed. Along the way, a single-page website about what we were doing got us talking to multiple early adopters struggling with the problem and willing to share their much bigger and more diverse data sets, so we could tune the algorithms and glean early feedback. In fact, Reifier Interactive Learner was born out of the multiple POC requests we were handling, where there was no labelled data to learn from. We realized that if we had to test the system on different data types and fields and demonstrate value, we had to have a quick way to build labelled training data.
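The idea of quickly building labelled training data can be sketched as a simple pair-labelling loop: show a reviewer candidate pairs, record a yes/no judgement for each, and keep the results as training examples. The function names and the simulated reviewer below are hypothetical; this is not a description of how Reifier Interactive Learner actually works:

```python
def label_pairs(pairs, records, ask):
    # Collect human yes/no judgements on candidate pairs to build
    # labelled training data when none exists.
    labelled = []
    for a, b in pairs:
        is_match = ask(records[a], records[b])  # True/False from a reviewer
        labelled.append((a, b, is_match))
    return labelled

# Simulated reviewer: call two records a match if the surnames agree.
records = [{"name": "Jon Smith"}, {"name": "John Smith"}, {"name": "Mary Jones"}]
ask = lambda r1, r2: r1["name"].split()[-1] == r2["name"].split()[-1]
training = label_pairs([(0, 1), (0, 2)], records, ask)
```

In practice the `ask` callback would be an interactive UI rather than a rule, and the resulting labels would feed a supervised matching model.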
As we are bootstrapped and pretty frugal, we tried to maximize the throughput of our own personal machines to handle larger numbers of records, resorting to cloud resources only when absolutely necessary. That led us to optimize the performance of the clustering algorithms, reducing the number of comparisons to about 0.05% of the total possible number of pairs.
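To put that 0.05% figure in perspective: naive all-pairs matching over n records costs n(n-1)/2 comparisons, so the saving grows quadratically with data size. The one-million-record count below is an illustrative assumption, not a figure from our benchmarks:

```python
n = 1_000_000                      # illustrative record count (an assumption)
all_pairs = n * (n - 1) // 2       # naive all-pairs comparisons
blocked = all_pairs * 5 // 10_000  # ~0.05% of the possible pairs, as in the text

print(all_pairs)  # about half a trillion naive comparisons
print(blocked)    # about 250 million comparisons after the reduction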
As a startup, the going is never easy. But that is what draws us founders to the startup world: the magic of building something unique and valuable despite the odds. We are an imaginative bunch, all set to defy the odds, and a few of us surely will.
Signing off with one of my favorite quotes from Alice In Wonderland.
Alice – “This is impossible”
The Mad Hatter – “Only if you believe it is”
Let us continue to dream of at least 6 impossible things before breakfast 🙂
Posted on August 3rd, 2016
Originally published at: Medium.com
So my cofounder and I finally managed to watch The Martian. Wrapped up as we are in our data matching engine Reifier, the movie still completely engrossed us. The storyline, the direction, the performances and the effects were stunning. However, rushing back to work, we could not help but muse about some key takeaways for the startup world.
Posted on July 24th, 2016