Installing Hadoop in Pseudo Distributed Mode

Introduction

The aim of this article is to provide the reader with a walkthrough of setting up Hadoop 2.7.1 in Pseudo Distributed Mode (= Single Node Cluster) on Ubuntu 14.04. The article does not cover Hadoop basics (e.g. what a NameNode, a DataNode, or HDFS is). Ok, let's go.

Installation walkthrough

1) First of all – if it's not already installed – install a Java 7 runtime (the headless OpenJDK JRE is sufficient for running Hadoop):

root@roadrunner:/home/edi# apt-get install openjdk-7-jre-headless

2) Create a user called hadoop:

root@roadrunner:/home/edi# adduser hadoop
Adding user `hadoop' ...
Adding new group `hadoop' (1001) ...
Adding new user `hadoop' (1001) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
...
root@roadrunner:/home/edi#

3) Add user hadoop to sudo group:

root@roadrunner:/home/edi# adduser hadoop sudo
Adding user `hadoop' to group `sudo' ...
Adding user hadoop to group sudo
Done.
root@roadrunner:/home/edi#

4) Switch to user hadoop:

root@roadrunner:/home/edi# su hadoop
hadoop@roadrunner:/home/edi$ cd ~
hadoop@roadrunner:~$ 

5) Download and unpack hadoop framework:

hadoop@roadrunner:~$ wget http://www.us.apache.org/dist/hadoop/common/stable2/hadoop-2.7.1.tar.gz
hadoop@roadrunner:~$ tar -zxf hadoop-2.7.1.tar.gz

6) Move the unpacked framework to /usr/local/:

hadoop@roadrunner:~$ sudo mv hadoop-2.7.1 /usr/local/
[sudo] password for hadoop:
hadoop@roadrunner:~$

7) Find out your Java home location, as you will need it in the next step.
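
One way to discover it (a sketch; the path varies per machine) is to resolve the real path of the java binary and strip the trailing /bin/java:

```shell
# Resolve the java binary through its symlinks and cut off "/bin/java".
# On my machine this prints /usr/lib/jvm/java-7-openjdk-i386/jre;
# yours may differ.
java_bin=$(readlink -f "$(command -v java)" 2>/dev/null || true)
echo "${java_bin%/bin/java}"
```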

8) Edit the .bashrc of user hadoop:

hadoop@roadrunner:~$ vi .bashrc

While in the editor, add the following at the end of the file (please note that your JAVA_HOME location may differ from mine):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
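
After saving, reload the file with `source ~/.bashrc` (or open a new shell). As a quick sanity check (a sketch), you can set the variables as above and confirm that the Hadoop bin directory really ended up on PATH:

```shell
# Re-create the relevant exports and verify PATH picked them up.
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) echo "PATH contains $HADOOP_HOME/bin" ;;
  *)                      echo "PATH is missing $HADOOP_HOME/bin" ;;
esac
```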

To verify, open a new terminal, switch to the hadoop user, and check that Hadoop works by executing its version command:

edi@roadrunner:~$ su hadoop
Password:
hadoop@roadrunner:/home/edi$ hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /usr/local/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar

Everything is fine so far, so let's continue.

9) Set Java home in hadoop-env.sh:

hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh

While in the editor, search for JAVA_HOME and edit it accordingly (please note that your JAVA_HOME location may differ from mine):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre
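
If you prefer scripting the change, the same edit can be done with sed (a sketch; adjust the Java path to your own installation):

```shell
# Replace the existing JAVA_HOME line in hadoop-env.sh in place.
f=$HADOOP_HOME/etc/hadoop/hadoop-env.sh
if [ -f "$f" ]; then
  sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre|' "$f"
  grep '^export JAVA_HOME' "$f"   # show the line we just wrote
else
  echo "hadoop-env.sh not found - is HADOOP_HOME set?"
fi
```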

10) Edit core-site.xml. Hadoop will be configured to run locally on port 9000, using HDFS as the default filesystem.

hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

Change the configuration element to this:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
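
Once the exports from step 8 are in your shell, you can let Hadoop itself confirm the value (a sketch; `hdfs getconf` only reads the configuration, no daemons need to be running):

```shell
# Ask Hadoop which default filesystem it resolved from core-site.xml.
if command -v hdfs >/dev/null 2>&1; then
  hdfs getconf -confKey fs.defaultFS   # should print hdfs://localhost:9000
else
  echo "hdfs command not found - check step 8"
fi
```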

11) Edit hdfs-site.xml. Hadoop will be configured to use separate storage locations for the NameNode and the DataNode. As we have only one DataNode, we set the data replication factor to 1.

hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Change the configuration element to this:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
  </property>
</configuration>

12) Edit yarn-site.xml. Configure YARN, a framework for job scheduling and cluster resource management.

hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

Change the configuration element to this:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

13) Copy mapred-site.xml.template to mapred-site.xml and edit it. Configure Hadoop to use YARN as the runtime framework for MapReduce.

hadoop@roadrunner:~$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

Change the configuration element to this:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

14) Configuration is done; now format the HDFS filesystem:

hadoop@roadrunner:~$ hdfs namenode -format

You will see a bunch of info messages on the terminal when executing the above command. Afterwards – if everything is ok – a new directory "hadoopinfra" has been created in hadoop's home directory, as specified in hdfs-site.xml:

hadoop@roadrunner:~$ ls
hadoop-2.7.1.tar.gz  hadoopinfra
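
As a further check (a sketch), the freshly formatted NameNode directory should now contain a current subdirectory holding, among other files, a VERSION file:

```shell
# List the NameNode metadata created by the format command.
ls ~/hadoopinfra/hdfs/namenode/current 2>/dev/null \
  || echo "namenode directory not found - did the format step succeed?"
```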

15) Install OpenSSH. Hadoop needs SSH to execute operations like starting and stopping the HDFS NameNode and DataNode daemons in the cluster.

hadoop@roadrunner:~$ sudo apt-get install openssh-server

16) We need a key pair, used by Hadoop for SSH communication (important: create it without a passphrase):

hadoop@roadrunner:~$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
....
hadoop@roadrunner:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hadoop@roadrunner:~$ chmod 0600 ~/.ssh/authorized_keys
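
SSH is picky about file modes: with the default StrictModes setting, sshd ignores an authorized_keys file that is writable by group or others and falls back to password prompts. A demonstration of the required modes, using a scratch directory so nothing real is touched:

```shell
# Re-create the expected layout in a temporary directory and show the modes:
# 700 on the .ssh directory, 600 on authorized_keys.
demo=$(mktemp -d)
mkdir -p "$demo/.ssh" && chmod 700 "$demo/.ssh"
touch "$demo/.ssh/authorized_keys" && chmod 0600 "$demo/.ssh/authorized_keys"
stat -c '%a %n' "$demo/.ssh" "$demo/.ssh/authorized_keys"
rm -rf "$demo"
```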

17) Start the NameNode and DataNode daemons:

hadoop@roadrunner:~$ start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.7.1/logs/hadoop-hadoop-namenode-roadrunner.out
localhost: starting datanode, logging to /usr/local/hadoop-2.7.1/logs/hadoop-hadoop-datanode-roadrunner.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.7.1/logs/hadoop-hadoop-secondarynamenode-roadrunner.out
hadoop@roadrunner:~$

18) Start YARN:

hadoop@roadrunner:~$ start-yarn.sh
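
At this point all five daemons should be up. `jps`, the JVM process lister, is a quick way to see them (a sketch; process IDs will differ, and note that jps ships with the JDK, not with the headless JRE installed in step 1):

```shell
# List running JVMs; after start-dfs.sh and start-yarn.sh you should see
# NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found - it is part of the JDK (e.g. openjdk-7-jdk)"
fi
```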

19) Check that the NameNode web application is working: http://localhost:50070

[Screenshot: Hadoop NameNode web application]

20) Check that the YARN ResourceManager web application is working: http://localhost:8088/

[Screenshot: YARN ResourceManager web application]

Summary

If you followed my walkthrough successfully, you should now have a Hadoop installation in pseudo-distributed mode running on your (local) machine. As the installation is just the beginning, I wish you a nice time exploring the Hadoop framework 😉