Accessing HDFS over FTP
- Apr 19
The Hadoop Distributed File System (HDFS) provides several interfaces through which clients can interact with it. Besides the HDFS shell, the file system is exposed through WebDAV, Thrift, FTP and FUSE. In this post, we access HDFS over FTP, using Hadoop 0.20.2.
1. Download the hdfs-over-ftp tar from https://issues.apache.org/jira/secure/attachment/12409518/hdfs-over-ftp-0.20.0.tar.gz
2. Untar hdfs-over-ftp-0.20.0.tar.gz.
3. We [...]
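Once an FTP front end for HDFS is running, any standard FTP client should be able to talk to it. The following is a minimal sketch using Python's standard `ftplib`, not part of the hdfs-over-ftp distribution. The host, port, user name, password and file path are placeholders that depend on how the server is configured.

```python
from ftplib import FTP


def connect(host, port, user, password):
    """Open an FTP session; all four arguments are deployment-specific."""
    ftp = FTP()
    ftp.connect(host, port)
    ftp.login(user, password)
    return ftp


def list_directory(ftp, path="/"):
    """Return the entry names under `path` using the FTP NLST command."""
    return ftp.nlst(path)


def read_file(ftp, path):
    """Download `path` with RETR and return its contents as bytes."""
    chunks = []
    ftp.retrbinary("RETR " + path, chunks.append)
    return b"".join(chunks)


# Example usage (placeholder values, assumed rather than taken from the post):
#   ftp = connect("localhost", 2222, "hadoop", "secret")
#   print(list_directory(ftp, "/"))
#   data = read_file(ftp, "/user/hadoop/sample.txt")
#   ftp.quit()
```

The helpers take the session object as a parameter, so they can be exercised against a stub before pointing them at a real server.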
Simple record deduplication using Hadoop and Python
- Mar 26
In this post, we explain a simple Python-based MapReduce program to deduplicate user records. The data files are in text format and many users are repeated; we want to keep a single copy of each record. Hadoop Streaming lets Python programs talk to the Hadoop framework. We create [...]
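One way such a streaming job could look is sketched below. This is an illustration of the general idea, not the program from the post: the mapper emits each whole record as the key, and the reducer, which receives keys in sorted order, prints each distinct key once. It assumes records are single lines containing no tab characters, and the script name `dedup.py` is hypothetical.

```python
import sys
from itertools import groupby


def map_records(lines):
    """Emit each non-empty record as a key followed by an empty value."""
    for line in lines:
        record = line.rstrip("\n")
        if record:
            yield record + "\t"


def reduce_records(sorted_lines):
    """Yield each distinct key once; input must already be sorted by key."""
    # Assumption: records contain no tabs, so the key is the whole record.
    keys = (line.rstrip("\n").split("\t", 1)[0] for line in sorted_lines)
    for key, _group in groupby(keys):
        yield key


if __name__ == "__main__":
    # The role is chosen by the first argument: "map" or "reduce".
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role == "map":
        for out in map_records(sys.stdin):
            print(out)
    elif role == "reduce":
        for out in reduce_records(sys.stdin):
            print(out)
```

With Hadoop Streaming, the same script would be passed through the `-mapper` and `-reducer` options (for example `"dedup.py map"` and `"dedup.py reduce"`). Locally, the pipeline can be simulated by sorting the mapper output before handing it to the reducer.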
Data loading from Hadoop to MySQL using HIHO
- Mar 09
HIHO comes with a MySQL connector, which can import and export data between a MySQL database and Hadoop. In this post, we demonstrate how to load data into MySQL with HIHO. MySQL Connector/J for Java exposes a high-performance data import facility for MySQL. The MySQL JDBC driver allows one to use the LOAD DATA statement [...]
Testing Hadoop Map Reduce jobs
- Feb 25
Complex applications are always a challenge to test, and the greater the complexity, the greater the need for unit and integration testing. With Hadoop, the distributed file system and parallel job execution across a cluster of machines make testing extremely tough. For HIHO, we use JUnit 4 and Mockito to unit test our code. We [...]