Introduction
The aim of this article is to walk the reader through setting up Hadoop 2.7.1 in pseudo-distributed mode (i.e. a single-node cluster) on Ubuntu 14.04. It does not cover Hadoop basics (e.g. what a NameNode is, what a DataNode is, what HDFS is, etc.). OK, let’s go.
Installation walkthrough
1) First of all – if it is not already installed – install Java 7 (the headless OpenJDK 7 JRE is sufficient to run Hadoop):
root@roadrunner:/home/edi# apt-get install openjdk-7-jre-headless
2) Create a user called hadoop:
root@roadrunner:/home/edi# adduser hadoop
Adding user `hadoop' ...
Adding new group `hadoop' (1001) ...
Adding new user `hadoop' (1001) with group `hadoop' ...
Creating home directory `/home/hadoop' ...
…
root@roadrunner:/home/edi#
3) Add user hadoop to sudo group:
root@roadrunner:/home/edi# adduser hadoop sudo
Adding user `hadoop' to group `sudo' ...
Adding user hadoop to group sudo
Done.
root@roadrunner:/home/edi#
4) Switch to user hadoop:
root@roadrunner:/home/edi# su hadoop
hadoop@roadrunner:/home/edi$ cd ~
hadoop@roadrunner:~$
5) Download and unpack the Hadoop framework:
hadoop@roadrunner:~$ wget http://www.us.apache.org/dist/hadoop/common/stable2/hadoop-2.7.1.tar.gz
hadoop@roadrunner:~$ tar -zxf hadoop-2.7.1.tar.gz
6) Move the unpacked framework to /usr/local/:
hadoop@roadrunner:~$ sudo mv hadoop-2.7.1 /usr/local/
[sudo] password for hadoop:
hadoop@roadrunner:~$
7) Find out your Java home location, as you will need it in the next step.
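If you are unsure where your Java home is, one way to find it is to resolve the java binary and strip the trailing /bin/java. A sketch – the resulting path depends on which JDK/JRE is installed on your system:

```shell
# Resolve the real location of the java binary (following symlinks),
# then strip "/bin/java" to get the Java home.
java_path=$(readlink -f "$(command -v java)")
java_home=${java_path%/bin/java}
echo "$java_home"
```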
8) Edit bashrc of user hadoop:
hadoop@roadrunner:~$ vi .bashrc
While in the editor, add the following at the end of the file (please note that your JAVA_HOME location may differ from mine):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
Reload the configuration (source ~/.bashrc) or open a new terminal, switch to the hadoop user, and check that Hadoop works by running its version command:
edi@roadrunner:~$ su hadoop
Password:
hadoop@roadrunner:/home/edi$ hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
This command was run using /usr/local/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar
Everything is fine so far, so let’s continue.
9) Set Java home in hadoop-env.sh:
hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
While in the editor, search for JAVA_HOME and edit it accordingly (please note that your JAVA_HOME location may differ from mine):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre
10) Edit core-site.xml. Hadoop will be configured to run locally on port 9000 using HDFS as the default filesystem. (In Hadoop 2.x the preferred name for this property is fs.defaultFS; the older fs.default.name still works but is deprecated.)
hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/core-site.xml
Change the configuration element to this:
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
11) Edit hdfs-site.xml. Hadoop will be configured to use separate storage locations for the NameNode and the DataNode. Since we have only one DataNode, we set the data replication factor to 1.
hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Change the configuration element to this:
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>
12) Edit yarn-site.xml. Configure YARN, the framework for job scheduling and cluster resource management.
hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
Change the configuration element to this:
<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>
13) Copy mapred-site.xml.template to mapred-site.xml and edit it. Configure Hadoop to use YARN as the framework for MapReduce:
hadoop@roadrunner:~$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
hadoop@roadrunner:~$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
Change the configuration element to this:
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
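As an aside, the config edits from steps 10 to 13 can also be scripted with here-documents instead of editing each file by hand. A minimal sketch for mapred-site.xml, written to a scratch directory so it is safe to try; for the real setup, write to $HADOOP_HOME/etc/hadoop instead:

```shell
# Write mapred-site.xml via a here-document. The mktemp scratch dir
# stands in for $HADOOP_HOME/etc/hadoop so this can be run anywhere.
conf=$(mktemp -d)
cat > "$conf/mapred-site.xml" <<'EOF'
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
EOF
grep -q '<value>yarn</value>' "$conf/mapred-site.xml" && echo "mapred-site.xml written"
```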
14) Configuration done, now format the HDFS filesystem. (Hadoop 2.x prefers the hdfs command; the older hadoop namenode -format still works but prints a deprecation warning.)

hadoop@roadrunner:~$ hdfs namenode -format
You will see a bunch of INFO messages in the terminal when executing the above command. Afterwards, if everything is ok, a new directory "hadoopinfra" has been created in hadoop's home directory, as specified in hdfs-site.xml:
hadoop@roadrunner:~$ ls
hadoop-2.7.1.tar.gz  hadoopinfra
15) Install OpenSSH. Hadoop needs SSH to execute operations such as starting and stopping the DFS NameNode and DataNode daemons in the cluster.
hadoop@roadrunner:~$ sudo apt-get install openssh-server
16) We need a key pair that Hadoop uses for SSH communication (important: without a passphrase):
hadoop@roadrunner:~$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
....
hadoop@roadrunner:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hadoop@roadrunner:~$ chmod 0600 ~/.ssh/authorized_keys
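The three commands above can also be run non-interactively; ssh-keygen's -N "" flag supplies the empty passphrase directly. A sketch against a scratch directory so it can be tried safely – for the real setup, use ~/.ssh and the default id_rsa path:

```shell
# Non-interactive variant of the key setup. $demo stands in for ~/.ssh
# so this is safe to run anywhere; -N "" means "no passphrase", which
# Hadoop requires for unattended SSH logins.
demo=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$demo/id_rsa"
cat "$demo/id_rsa.pub" >> "$demo/authorized_keys"
chmod 0600 "$demo/authorized_keys"   # sshd ignores group/world-readable files
stat -c %a "$demo/authorized_keys"
```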
17) Start the NameNode and DataNode daemons:
hadoop@roadrunner:~$ start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.7.1/logs/hadoop-hadoop-namenode-roadrunner.out
localhost: starting datanode, logging to /usr/local/hadoop-2.7.1/logs/hadoop-hadoop-datanode-roadrunner.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.7.1/logs/hadoop-hadoop-secondarynamenode-roadrunner.out
hadoop@roadrunner:~$
18) Start YARN:
hadoop@roadrunner:~$ start-yarn.sh
19) Check that the NameNode web UI is working: http://localhost:50070
20) Check that the YARN ResourceManager web UI is working: http://localhost:8088/
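You can also double-check from the shell with jps, which lists the running JVM processes. Note that jps ships with the JDK (package openjdk-7-jdk), not with the headless JRE from step 1, so you may need to install it first. A sketch that checks for the five daemons a pseudo-distributed setup should run (it degrades gracefully if jps is absent):

```shell
# Check which of the five expected Hadoop daemons are running.
# "jps" comes from the JDK, so guard against it being missing.
running=$(jps 2>/dev/null || true)
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  # -w avoids "NameNode" matching inside "SecondaryNameNode"
  if echo "$running" | grep -qw "$daemon"; then
    echo "$daemon up"
  else
    echo "$daemon DOWN"
  fi
done
```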
Summary
If you followed my walkthrough successfully, you should now have a Hadoop installation in pseudo-distributed mode running on your (local) machine. As the installation is just the beginning, I now wish you a nice time exploring the Hadoop framework 😉