Simple record deduplication using Hadoop and Python

In this post, we explain a simple Python-based MapReduce program to deduplicate user records. The data files are in text format; many users appear more than once, and we want to keep a single copy of each record. Hadoop Streaming lets Python programs talk to the Hadoop framework through standard input and output. We create [...]
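
To make the idea concrete, here is a minimal sketch of what such a job could look like. It is not necessarily the program described in the full post. It assumes one record per line, that two records are duplicates only when their lines are identical, and that records contain no tab characters (Hadoop Streaming splits key from value at the first tab). The file name dedup.py and its map/reduce arguments are hypothetical.

```python
#!/usr/bin/env python
"""Hypothetical dedup.py: a sketch of a Hadoop Streaming dedup job.

Assumptions (not from the original post): one record per line,
exact-match duplicates, and no tab characters inside a record.
"""
import sys
from itertools import groupby


def mapper(lines):
    """Emit each non-empty record unchanged, so the whole line is the key."""
    for line in lines:
        record = line.rstrip("\n")
        if record:
            yield record


def reducer(lines):
    """Keep one copy of each key.

    Hadoop sorts mapper output by key before the reduce step, so
    identical records arrive next to each other and groupby can
    collapse each run into a single line.
    """
    # Drop the (empty) value that may follow the key after a tab.
    keys = (line.rstrip("\n").split("\t", 1)[0] for line in lines)
    for key, _run in groupby(keys):
        yield key


# Only read stdin when a step is named explicitly: "map" or "reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same script can be checked locally without a cluster by letting sort stand in for the shuffle phase, for example `cat users.txt | python dedup.py map | sort | python dedup.py reduce` (users.txt is a placeholder file name). On a cluster, the two steps would be handed to Hadoop Streaming through its `-mapper` and `-reducer` options.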
