Nov 12, 2013

Building Hadoop 2.2.0

I am learning the new YARN and MapReduce that come with the stable Hadoop 2.2.0 release, and I figured the best way to find out how they work is to read the sources.

Prerequisites (copied from hadoop-common repository)

* Unix System
* JDK 1.6+
* Maven 3.0 or later
* Findbugs 1.3.9 (if running findbugs)
* ProtocolBuffer 2.5.0
* CMake 2.6 or newer (if compiling native code)
* Internet connection for first build (to fetch all Maven and Hadoop dependencies)
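
Before going further, it can help to see which of these tools are already on the PATH. The following is only a minimal sketch: check_tool is a helper name I made up, and the list of command names (java, mvn, protoc, cmake) is my assumption about how the prerequisites above are usually invoked.

```shell
#!/bin/sh
# Report whether each command-line tool needed for the build is on the PATH.
# check_tool is a hypothetical helper; the tool list is an assumption
# derived from the prerequisites above.

check_tool() {
  # Prints "ok <tool>" or "missing <tool>"; returns non-zero when missing.
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok $1"
  else
    echo "missing $1"
    return 1
  fi
}

for tool in java mvn protoc cmake; do
  # "|| true" keeps the loop going when a tool is missing.
  check_tool "$tool" || true
done
```

Anything reported as missing is covered by one of the installation steps below, except CMake, which is only needed for native code.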


Linux: I am using a rather old 32-bit Debian 6.0.6.
debian@debian:~$ uname -a
Linux debian 2.6.32-5-686 #1 SMP Sun Sep 23 09:49:36 UTC 2012 i686 GNU/Linux

Java: I have Java 1.7 installed, the newest version at the time of writing.
debian@debian:~$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) Server VM (build 24.45-b08, mixed mode)

Build and install the protocolbuffer-compiler 2.5.0

The newest version, and the one Hadoop requires, is 2.5.0. At the time of writing it is only available in the Debian experimental repository, and I could not get it installed via apt-get. If your Linux distribution provides 2.5.0 in its software repository, use that one.

First you are going to need g++. My virtual machine had very little software installed, so I had to install g++ first:

  $ aptitude install g++

  $ tar -xvzf protobuf-2.5.0.tar.gz
  $ cd protobuf-2.5.0
  $ ./configure --disable-shared #[1]
  $ make install

The commands above compile protoc and, if all goes well, install it as /usr/local/bin/protoc.
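
To confirm that the right compiler ended up on the PATH, something like the following can be used. It is a sketch that relies on `protoc --version` printing a line of the form "libprotoc 2.5.0"; protoc_version is a helper name I made up for illustration.

```shell
#!/bin/sh
# Check that the installed protoc is the version Hadoop 2.2.0 expects.
# protoc_version is a hypothetical helper: it takes the output of
# `protoc --version` (e.g. "libprotoc 2.5.0") and prints the second field.

protoc_version() {
  echo "$1" | awk '{ print $2 }'
}

required="2.5.0"
if command -v protoc >/dev/null 2>&1; then
  found=$(protoc_version "$(protoc --version)")
  if [ "$found" = "$required" ]; then
    echo "protoc $found is installed"
  else
    echo "protoc $found found, but Hadoop 2.2.0 needs $required"
  fi
else
  echo "protoc is not on the PATH"
fi
```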

Install Maven 3.0+

Choose a 3.0+ version from the link below. I used 3.1.1, the newest one available at the time of writing.

You need a binary tar.gz: 

Put Maven in its place:
  $ tar -xvzf apache-maven-3.1.1-bin.tar.gz
  $ mkdir -p /usr/local/maven/
  $ mv apache-maven-3.1.1 /usr/local/maven
  $ ln -s /usr/local/maven/apache-maven-3.1.1 /usr/local/maven/current

Put a symlink into /usr/sbin:
  $ ln -s /usr/local/maven/current/bin/mvn /usr/sbin/mvn

This is, in fact, the same way you install the Oracle JDK/JRE. The other way is to add the application's .../bin folder to the $PATH variable at the end of /etc/bash.bashrc.
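
A minimal sketch of that second approach, assuming Maven lives under /usr/local/maven/current as set up above. path_append is a helper name I made up; it only adds the directory when it is not already listed, so the lines can be sourced more than once without growing the PATH.

```shell
# Lines that could go at the end of /etc/bash.bashrc.
# path_append is a hypothetical helper, not something bash provides.

path_append() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, nothing to do
    *) PATH="$PATH:$1" ;;     # otherwise append at the end
  esac
}

# Assumes the "current" symlink created in the Maven step above.
path_append /usr/local/maven/current/bin
export PATH
```

With this in place the symlink in /usr/sbin is not needed.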

Install Git

This is available from repository:
  $ aptitude install git

Clone hadoop-common

Go to your Eclipse workspace, or create one if you don't have any. I put it into my home:
  $ mkdir -p ~/Development/workspace_eclipse_java

Clone the git repository:
  $ git clone hadoop-common

Install hadoop Maven plugin

Hadoop has its own Maven plugin, which the other modules use during the build. Install it first:
  $ cd hadoop-common/hadoop-maven-plugins
  $ mvn install

First build everything

I found the project setup and build well documented. Everything is written down in BUILDING.txt [2].

First you need to build the whole hadoop-common tree so that Maven installs the Hadoop jars and caches their dependencies in your local repository. That way, Eclipse will be able to resolve all the inter-project dependencies.

  $ cd hadoop-common
  $ mvn install -DskipTests -nsu #-nsu (--no-snapshot-updates) suppresses SNAPSHOT updates
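
Once the build finishes, the Hadoop modules should show up in the local repository under the org.apache.hadoop group. A small sketch for listing them follows; list_artifacts is a helper name I made up, and the repository location is assumed to be the Maven default under the home directory.

```shell
#!/bin/sh
# List the artifact directories of one group inside a local Maven repository.
# list_artifacts is a hypothetical helper.
#   $1 = repository root, $2 = group ID (dots become directory separators)

list_artifacts() {
  group_dir="$1/$(echo "$2" | tr . /)"
  for d in "$group_dir"/*/; do
    # The glob stays unexpanded when nothing matches, hence the -d check.
    if [ -d "$d" ]; then
      basename "$d"
    fi
  done
}

# Assumes the default local repository location.
list_artifacts "$HOME/.m2/repository" org.apache.hadoop
```

If the list is empty, the install step did not complete and Eclipse will not be able to resolve the Hadoop jars later.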

Generate Eclipse projects

I am only interested in YARN and MapReduce components, so I will:
  $ cd hadoop-yarn-project
  $ mvn eclipse:eclipse -DskipTests

Set M2_REPO variable in Eclipse

If it is not set yet, you have to create a variable in Eclipse pointing to your local Maven repository, because every dependency in the generated .classpath files starts with M2_REPO/..

  [Window] => [Preferences]
  Java -> Build Path -> Classpath Variables

Add a new one named M2_REPO pointing to your local Maven repository, which by default is at /home/username/.m2/repository.
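
To see how much a generated project depends on this variable, the entries can be counted from the shell. This is a sketch under the assumption that the generated entries carry an attribute of the form path="M2_REPO/..."; count_m2_entries is a helper name I made up.

```shell
#!/bin/sh
# Count the entries in a generated .classpath file that rely on M2_REPO.
# count_m2_entries is a hypothetical helper; the path="M2_REPO/..." form
# of the entries is an assumption about the generated file.

count_m2_entries() {
  grep -c 'path="M2_REPO/' "$1"
}

classpath_file=".classpath"   # run this inside a generated project directory
if [ -f "$classpath_file" ]; then
  echo "M2_REPO entries: $(count_m2_entries "$classpath_file")"
fi
echo "Default repository: $HOME/.m2/repository"
```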

Import projects into Eclipse

  [File] => [Import]
  General -> Existing Projects into workspace
Set your root directory to the hadoop component you want to import. In my case it's 

I highly recommend creating a working set for every Hadoop component, since each of them consists of several Eclipse projects.
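
To get an idea of how many projects a component contains before importing it, the directories that hold a generated .project file can be listed. This is a minimal sketch; list_eclipse_projects is a helper name I made up, and hadoop-yarn-project is simply the component used earlier.

```shell
#!/bin/sh
# List every directory below a component that contains an Eclipse .project
# file. list_eclipse_projects is a hypothetical helper.

list_eclipse_projects() {
  find "$1" -type f -name .project | while read -r f; do
    dirname "$f"
  done | sort
}

# Run from inside hadoop-common after generating the Eclipse projects.
component="hadoop-yarn-project"
if [ -d "$component" ]; then
  list_eclipse_projects "$component"
fi
```

Each directory printed becomes one project in Eclipse, which is why grouping them into a working set per component keeps the workspace manageable.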