Using Hadoop’s DistributedCache
While working with MapReduce applications, there are times when we need to share files globally with all nodes in the cluster. This can be a shared library to be accessed by each task, a global lookup file holding key-value pairs, or jars and archives containing executable code. Hadoop’s MapReduce project provides this functionality through a distributed cache. The distributed cache is configured as part of the job and provides read-only data to the application across all machines. The framework copies the cached file(s) to the local disk of each worker node. If archives are distributed, they are unarchived and their execute permissions are set.
Distributing files is pretty straightforward. To cache a file addToCache.txt already sitting on HDFS (assumed here to live under /user/hadoop), one can set up the job as:
Job job = new Job(conf);
DistributedCache.addCacheFile(new URI("/user/hadoop/addToCache.txt"), job.getConfiguration());
Other URI schemes, such as a fully qualified hdfs:// URI, can also be specified.
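For instance, a complete driver might look like the sketch below. The namenode host and port, the input/output paths, and the LookupMapper class (sketched further down) are illustrative assumptions; the "#lookup" fragment names the symlink created in each task’s working directory.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "distributed cache example");
        job.setJarByClass(CacheDriver.class);
        job.setMapperClass(LookupMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Fully qualified hdfs:// URI; the "#lookup" fragment names the
        // local symlink visible in each task's working directory.
        DistributedCache.addCacheFile(
                new URI("hdfs://namenode:9000/user/hadoop/addToCache.txt#lookup"),
                job.getConfiguration());
        // Older Hadoop releases require symlink creation to be enabled explicitly.
        DistributedCache.createSymlink(job.getConfiguration());

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}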
Now, in the Mapper/Reducer, one can access the file as:
// getLocalCacheFiles() returns an array of local paths, one per cached file.
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new FileInputStream(cacheFiles[0].toString());
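Putting this together, a minimal Mapper sketch might load the cached file once in setup() and use it as an in-memory lookup table during map(). The class name and the tab-separated key-value format of the cached file are assumptions for illustration:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The cached file has already been copied to this node's local disk.
        Path[] cacheFiles = context.getLocalCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Assumes each line of the cached file is a tab-separated key-value pair.
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            } finally {
                reader.close();
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit each input line together with its value from the cached lookup table.
        String match = lookup.get(value.toString());
        if (match != null) {
            context.write(value, new Text(match));
        }
    }
}
Loading the file once in setup() rather than in map() matters here: map() is invoked once per input record, while the cached lookup data only needs to be read once per task.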