There is an interesting article at TechCrunch on how startups can compete with enterprises in artificial intelligence and machine learning. The article talks about the importance of data for AI and ML and correctly points out how difficult it is for startups to compete in this space, as big companies already have proprietary data sets spanning multiple industries. This puts startups at a disadvantage: the algorithms themselves are increasingly open source, so it's the company with the best data that wins.
When we think about generic systems like Siri and Cortana, they are built for mass deployment and work through lots of data, learning as different users interact with them. The more users they have, the more they learn about different intents and responses. Many of these systems have generic models shared across all users plus user-specific personal models, combined to give the right information or action at the right time. Clearly, more and better data makes for better learning, and that makes the system more usable. The more usable the system becomes, the more people use it and the more often they do, and this feedback loop makes it even smarter. The trick here is to have a huge initial base of data and augment it further through user interaction. It is definitely difficult for a startup to start with this base data.
The second type of system is narrow, domain-specific AI, like chatbots or natural language interfaces for retail banking. Gathering your own and your team's bank statements for the last couple of years can be a good place to start here. If you can get a single bank interested and ready to share its call centre logs and scripts, a good deal of learning can be done. Startups, ever pressed for resources, are always creative about getting the most out of even the smallest bits of data they can get.
In the case of Reifier, when we were building our entity resolution engine, we started with a single synthetic data set of about 50 records(!!) containing first name, last name, street address, city, state and date of birth. This data set had a good number of duplicate records, which we wanted to match and link together. We built and tested the core algorithms on this data. Working off a small, narrow data set, we were able to pinpoint exactly how each record was getting indexed, how many buckets we were making, how many comparisons we were making and the final identity matching scores we were getting.
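To make the indexing, buckets, comparisons and scores above concrete, here is a minimal sketch of blocking-based matching on a tiny synthetic data set. It is not Reifier's actual algorithm: the records, the blocking rule (first letter of last name plus birth year), the `similarity` function and the 0.85 threshold are all assumptions chosen purely for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical synthetic records; the fields are a subset of those named in the post.
records = [
    {"id": 1, "first": "John", "last": "Smith", "city": "Austin", "dob": "1980-02-11"},
    {"id": 2, "first": "Jon", "last": "Smith", "city": "Austin", "dob": "1980-02-11"},
    {"id": 3, "first": "Mary", "last": "Jones", "city": "Denver", "dob": "1975-07-30"},
    {"id": 4, "first": "Marie", "last": "Jones", "city": "Denver", "dob": "1975-07-30"},
    {"id": 5, "first": "Alan", "last": "Brown", "city": "Boston", "dob": "1990-12-01"},
]

FIELDS = ("first", "last", "city", "dob")


def blocking_key(rec):
    # Assumed indexing rule: first letter of last name plus year of birth.
    return rec["last"][0].lower() + rec["dob"][:4]


def similarity(a, b):
    # Average per-field string similarity, between 0.0 and 1.0.
    ratios = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in FIELDS
    ]
    return sum(ratios) / len(ratios)


def resolve(recs, threshold=0.85):
    # Index every record into a bucket, then compare only within buckets.
    buckets = defaultdict(list)
    for rec in recs:
        buckets[blocking_key(rec)].append(rec)

    comparisons = 0
    matches = []
    for bucket in buckets.values():
        for a, b in combinations(bucket, 2):
            comparisons += 1
            score = similarity(a, b)
            if score >= threshold:
                matches.append((a["id"], b["id"], round(score, 3)))
    return buckets, comparisons, matches


buckets, comparisons, matches = resolve(records)
print("buckets:", {key: [r["id"] for r in recs] for key, recs in buckets.items()})
print("comparisons made:", comparisons)
print("matched pairs with scores:", matches)
```

On a data set this small, every bucket, comparison and score can be read off by eye, which is the kind of inspection described above.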
We then built bigger data sets and tested how well we were generalizing to them. Although we were working off a big data stack using Apache Spark, it helped to stay on a single machine, where we could easily tell whether an issue was due to our own code. Then we could run on a bigger cluster and see how well the work was being distributed. Along the way, a single-page website about what we were doing got us talking to multiple early adopters who were struggling with the problem and willing to share their much bigger and more diverse data sets, so we could tune the algorithms and glean early feedback. In fact, Reifier Interactive Learner was born out of the multiple POC requests we were handling, where there was no labelled data to learn from. We realized that if we had to test the system on different data types and fields and demonstrate value, we had to have a quick way to build labelled training data.
As we are bootstrapped and pretty frugal, we tried to maximize the throughput of our own personal machines to handle a larger number of records, resorting to cloud resources only when absolutely necessary. That led us to optimize the performance of the clustering algorithms, reducing the number of comparisons to about 0.05% of the total number of possible pairs.
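To get a feel for what 0.05% means, recall that n records give n(n-1)/2 possible pairs. The short sketch below works through that arithmetic; the record count of one million is a hypothetical figure chosen for illustration, not a number from our own runs, and the function names are ours.

```python
def total_pairs(n):
    # Number of unordered pairs among n records: n * (n - 1) / 2.
    return n * (n - 1) // 2


def comparisons_after_pruning(n, basis_points=5):
    # 0.05% equals 5 basis points (5 / 10,000); integer math avoids float rounding.
    return total_pairs(n) * basis_points // 10_000


n = 1_000_000  # hypothetical data set size
print(f"possible pairs:       {total_pairs(n):,}")
print(f"0.05% of those pairs: {comparisons_after_pruning(n):,}")
```

For the original 50-record data set, the same formula gives 1,225 possible pairs, which is why brute-force comparison is harmless there and hopeless at scale.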
As a startup, the going is never easy. But that is still what leads us founders to the startup world – the magic of building something unique and valuable despite the odds. We are the imaginative bunch all set to defy the odds – and a few of us surely will.
Signing off with one of my favorite quotes from Alice In Wonderland.
Alice – “This is impossible”
The Mad Hatter – “Only if you believe it is”
Let us continue to dream of at least 6 impossible things before breakfast 🙂
Posted on August 3rd, 2016