Setting up a Single Node Hadoop Cluster on an Ubuntu Linux Machine

This post concerns the setup of a single-node Hadoop cluster on an Ubuntu Linux machine. The three primary sources for this post are the Hadoop Quick Start Guide, the Hadoop Cluster Setup documentation and Michael Noll’s page.

The Need for Hadoop – Big Data Analytics Problem

Big Data is data with high velocity (generated at a fast pace), high volume (a large number of attributes and an even larger number of rows) and high variety (structured, semi-structured and unstructured). Although Big Data has existed for a long time, and its definition is relative to each organization, the problems of analyzing Big Data to derive valuable information have been recognized only recently. Organizations are now unable to efficiently execute queries on their RDBMS, or to efficiently derive value using data mining techniques such as classification, cluster analysis and association rule mining. For instance, a query over several million call detail records, or over customers’ ATM transactions from the past 6 months, can easily take 7-10 minutes. Similarly, constructing an efficient decision tree classifier, or generating clusters of your customers in CRM, can force you to reduce the volume of the input data set – which means compromising on the more valuable information that could have been learnt from more data. In 2004, Google published a paper describing MapReduce – a paradigm that solves such Big Data analysis problems through two functions, Map and Reduce, which operate on key-value pairs. This paradigm later became the basis of the Hadoop framework.

Hadoop – Described Briefly

Hadoop is a Java-based open-source framework which attempts to solve the Big Data analysis problem through distributed computing techniques. It is licensed under the Apache License and developed as the Apache Hadoop project. The architecture has one master node connected to one or more slave nodes – this is a typical HDFS cluster. Using MapReduce, the master “maps” data into key-value pairs and distributes it to the slaves, which run on commodity hardware. Each slave processes its assigned task and sends back the result. The results from all slaves are then “reduced” by the master to produce the final output. For instance, each row of a big table could be converted into a CSV string as the value, with the row number as the key. A single query over this table could then be run separately on n different nodes, and the n results reduced later on, e.g., by aggregating n different counts into a single output. The primary processing components of Hadoop are as follows:

  • Hadoop Distributed File System (HDFS): HDFS manages all file-related operations which occur during MapReduce operations, e.g., creation, deletion etc. It is maintained independently from the local file system. It stores large files across multiple nodes and replicates data across multiple nodes to achieve reliability. All communication between nodes is TCP/IP-based.
  • Primary Name Node: Manages the file indexes of HDFS and is typically controlled by the master.
  • Secondary Name Node: Generates snapshots of the primary name node’s memory structures, in order to prevent file corruption.
  • Data Node: A typical node in the HDFS cluster which processes MapReduce tasks – even the master can be a data node.
  • Job Tracker: Manages the scheduling of MapReduce tasks to/from the slaves – typically controlled by the master.
  • Task Tracker: Available at each Data Node to coordinate MapReduce tasks with the Job Tracker.
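
To make the counting example above concrete, the following shell sketch imitates the two phases locally. It is only an analogy, not Hadoop code: big_table.csv and the search pattern "2013-06" are hypothetical, GNU split is assumed, and the three chunks stand in for blocks that HDFS would place on three different slave nodes.

# "map": split the table into 3 chunks and let each (pretend) node count its own matching rows
$ split -n l/3 big_table.csv chunk_
$ for f in chunk_*; do grep -c "2013-06" "$f"; done > partial_counts.txt
# "reduce": aggregate the partial counts into a single result
$ awk '{sum += $1} END {print sum}' partial_counts.txt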

——————————————————————————————————————————————————————————–

Single-Node Cluster Setup

Now that we have a grasp of Hadoop basics, let us get down to business. A single-node Hadoop cluster includes a single machine which acts as both master and slave, with all Hadoop components initialized only on this machine. Here is a basic snapshot:

[Snapshot: single-node cluster setup]

Our System configuration is:

  • Ubuntu Linux 12.04 LTS
  • Hadoop version: 1.1.2 (stable)
  • Oracle Sun Java 7u25 – JDK + JRE (JPS package bundled with this distribution allows viewing of active Hadoop components)
  • Node: HP G62 Laptop – Core i5 2.6GHz – 3 GB RAM – 64-bit
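
If you want to confirm the corresponding details of your own machine before proceeding, the standard Ubuntu tools below print the release, architecture and available memory:

$ lsb_release -d
$ uname -m
$ free -m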

——————————————————————————————————————————————————————————–

Installing Java

1) Remove any previous OpenJDK installations

$ sudo apt-get purge openjdk-\*

2) Make directory to hold Sun Java

$ sudo mkdir -p /usr/local/java

3) Download the appropriate version of the Oracle Sun Java JDK and JRE from Oracle’s website: JDK Download page and JRE Download page. As we are using a 64-bit machine, the JDK version is jdk-7u25-linux-x64.tar.gz and the JRE version is jre-7u25-linux-x64.tar.gz. The downloaded files will be placed in the /home/”your_user_name”/Downloads folder.

4) Copy the downloaded files to the Java directory

$ cd /home/"your_user_name"/Downloads
$ sudo cp -r jdk-7u25-linux-x64.tar.gz /usr/local/java
$ sudo cp -r jre-7u25-linux-x64.tar.gz /usr/local/java

5) Unpack the compressed binaries

$ cd /usr/local/java
$ sudo tar xvzf jdk-7u25-linux-x64.tar.gz
$ sudo tar xvzf jre-7u25-linux-x64.tar.gz

6) Cross-check the extracted binaries:

$ ls -a

The following two folders should be created: jdk1.7.0_25 and jre1.7.0_25

7) To make the JDK/JRE paths available system-wide, add them to /etc/profile. First open the file:

$ sudo nano /etc/profile

and add the following lines at the end:

JAVA_HOME=/usr/local/java/jdk1.7.0_25
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
JRE_HOME=/usr/local/java/jre1.7.0_25
PATH=$PATH:$HOME/bin:$JRE_HOME/bin
export JAVA_HOME
export JRE_HOME
export PATH

Save and exit (CTRL+X, then press “Y”)
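
The new variables take effect on the next login; if you want to check them right away without waiting for the reboot in step 10, you can reload the profile in the current terminal:

$ source /etc/profile
$ echo $JAVA_HOME
/usr/local/java/jdk1.7.0_25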

8) Inform Ubuntu of the Oracle Sun Java locations so that the new binaries can be used as system alternatives:

Register the JDK compiler (javac):

$ sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.7.0_25/bin/javac" 1

Register the JRE (java):

$ sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jre1.7.0_25/bin/java" 1

Register Java Web Start (javaws):

$ sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/local/java/jre1.7.0_25/bin/javaws" 1

9) Make Oracle Sun JDK/JRE the default on your Ubuntu system:
Set JRE:

$ sudo update-alternatives --set java /usr/local/java/jre1.7.0_25/bin/java

Set javac Compiler:

$ sudo update-alternatives --set javac /usr/local/java/jdk1.7.0_25/bin/javac

Set Java Web Start:

$ sudo update-alternatives --set javaws /usr/local/java/jre1.7.0_25/bin/javaws
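
You can optionally confirm that the Oracle binaries are now the active alternatives:

$ update-alternatives --display java
$ update-alternatives --display javac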

10) Restart your system to reload /etc/profile

$ sudo reboot

11) Check your Java version to verify the installation and settings:

$ java -version

Output should be something like this:

java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

12) Verify that jps (the JVM Process Status tool) is available:

$ jps

This will show the process ID of the jps process itself; once the Hadoop daemons are started later on, they will also appear in this listing.
——————————————————————————————————————————————————————————–

Adding a Dedicated Hadoop System User:

It is advisable to create a separate user account for all Hadoop-related processing, to keep things simple and separate from other processes. For this, create a Hadoop user group and add a Hadoop user (hdpuser) to this group. This user will have full rights over all Hadoop-related folders and files.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hdpuser

When adding hdpuser, specify a password and fill in the user details as prompted.

Grant sudo privileges to hdpuser by opening the /etc/sudoers file

$ sudo nano /etc/sudoers

and adding the following line:

hdpuser ALL=(ALL:ALL) ALL
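
An equivalent alternative on Ubuntu, if you prefer not to edit /etc/sudoers by hand, is to add hdpuser to the existing sudo group:

$ sudo adduser hdpuser sudo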

——————————————————————————————————————————————————————————–

Configuring Secure Shell (SSH)

In Hadoop, communication between master and slave nodes is done through SSH. First ensure that SSH server is installed and running:

$ sudo apt-get install openssh-server

For the single-node setup on the current machine, we configure SSH access to localhost for hdpuser.

1) Switch to the hdpuser account and generate an SSH key. The commands and output are shown below:

tariqmahmood@ubuntu:/$ su - hdpuser
Password:
hdpuser@ubuntu:~$ ssh-keygen -t rsa -P ""

The directory /home/hdpuser/.ssh will be created and the generated key will be saved in /home/hdpuser/.ssh/id_rsa. A fingerprint and an image of the key will also be displayed. The key has no password (the empty quotes), so that Hadoop nodes can interact with each other without prompting for a password on every inter-node communication.

2) Enable SSH access on the current machine by authorizing the generated key

hdpuser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
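
If the login test in the next step still prompts for a password, the usual cause is overly permissive modes on the key files; tightening them is a harmless precaution:

hdpuser@ubuntu:~$ chmod 700 $HOME/.ssh
hdpuser@ubuntu:~$ chmod 600 $HOME/.ssh/authorized_keys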

3) Test SSH by connecting to localhost from the current machine as hdpuser. A part of the output is shown below:

hdpuser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is 06:77:ed:68:ed:e4:8a:8a:f7:6d:36:4e:ae:c2:7a:85.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.

This also adds localhost to the list of known hosts of hdpuser (available in /home/hdpuser/.ssh/known_hosts) – to make it a trusted host.

——————————————————————————————————————————————————————————–

Disabling IPv6

In order to prevent any “clash” between the IPv6 and IPv4 protocols, it is best to disable IPv6, as all Hadoop communication between nodes is IPv4-based. For this, first open the file /etc/sysctl.conf

hdpuser@ubuntu:~$ sudo nano /etc/sysctl.conf

and add the following lines at the end

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Save and exit (CTRL+X, then press “Y”). Restart your machine for the changes to take effect.

If the following command returns 1 (after reboot), it means IPv6 is disabled.

hdpuser@ubuntu:~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
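
Alternatively, instead of rebooting, the same settings can be applied to the running kernel directly, after which the check above should return 1 immediately:

hdpuser@ubuntu:~$ sudo sysctl -p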

——————————————————————————————————————————————————————————–

Download and Install Hadoop

Download version 1.1.2 (the stable version) from the Hadoop Download Page. The file is hadoop-1.1.2.tar.gz and will be placed in the /home/”your_user_name”/Downloads folder.

1) Make the Hadoop installation directory – I have selected /usr/hadoop

hdpuser@ubuntu:~$ sudo mkdir -p /usr/hadoop

2) Copy the Hadoop archive to the installation directory

hdpuser@ubuntu:/$ sudo cp -r /home/tariqmahmood/Downloads/hadoop-1.1.2.tar.gz /usr/hadoop

3) Extract the Hadoop archive and cross-check the extracted folder (hadoop-1.1.2)

hdpuser@ubuntu:/usr/hadoop$ sudo tar xvzf hadoop-1.1.2.tar.gz
hdpuser@ubuntu:/usr/hadoop$ ls -a

4) Rename this folder to “hadoop”

hdpuser@ubuntu:/usr/hadoop$ sudo mv hadoop-1.1.2 hadoop

5) Make hdpuser the owner of this folder

hdpuser@ubuntu:/usr/hadoop$ sudo chown -R hdpuser:hadoop hadoop

You can verify that the Hadoop installation is now available in /usr/hadoop/hadoop.
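
As a quick sanity check, you can ask the freshly extracted distribution for its version; it should report Hadoop 1.1.2 along with its build details (JAVA_HOME is already available from /etc/profile, so the script can find Java):

hdpuser@ubuntu:/usr/hadoop$ hadoop/bin/hadoop version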
——————————————————————————————————————————————————————————–

Update .bashrc with Hadoop-related environment variables

1) Access .bashrc – available in $HOME/.bashrc

hdpuser@ubuntu:/usr/hadoop$ sudo nano $HOME/.bashrc

2) Add the following lines at the end:

# Set HADOOP_HOME
export HADOOP_HOME=/usr/hadoop/hadoop
# Set JAVA_HOME
export JAVA_HOME=/usr/local/java/jdk1.7.0_25
# Add Hadoop bin directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
# An optional helper for viewing LZO-compressed files in Hadoop, e.g., when working with HBase
lzohead () { hadoop fs -cat $1 | lzop -dc | head -1000 | less; }

Save and exit (CTRL+X, then press “Y”)

3) Restart Machine
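
Restarting is the simplest way to pick up the new variables, but reloading .bashrc in the current shell works just as well if you want to continue immediately:

hdpuser@ubuntu:~$ source $HOME/.bashrc
hdpuser@ubuntu:~$ echo $HADOOP_HOME
/usr/hadoop/hadoop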

——————————————————————————————————————————————————————————–

Hadoop Configuration

Update JAVA_HOME in hadoop-env.sh

1) Access hadoop-env.sh – available in $HADOOP_HOME/conf

hdpuser@ubuntu:/$ sudo nano `echo $HADOOP_HOME`/conf/hadoop-env.sh

2) Add the line:

export JAVA_HOME=/usr/local/java/jdk1.7.0_25

3) Save and exit (CTRL+X, then press “Y”)

——————————————————————————————————————————————————————————–

Create a Directory to hold Hadoop’s Temporary Files:

1) Create the directory – in my case, /usr/hadoop/tmp

hdpuser@ubuntu:/$ sudo mkdir -p /usr/hadoop/tmp

2) Give hdpuser ownership of this directory

hdpuser@ubuntu:/$ sudo chown hdpuser:hadoop /usr/hadoop/tmp

——————————————————————————————————————————————————————————–

Modify $HADOOP_HOME/conf/core-site.xml – Core Configuration

1) Access $HADOOP_HOME/conf/core-site.xml

hdpuser@ubuntu:/$ sudo nano `echo $HADOOP_HOME`/conf/core-site.xml

2) Add the following lines between the <configuration> tags

<property>
   <name>hadoop.tmp.dir</name>
   <value>/usr/hadoop/tmp</value>
   <description>Hadoop's temporary directory</description>
</property>
<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:54310</value>
   <description>Specifying HDFS as the default file system. The URI will be used to access and monitor HDFS processes, files, folders etc.</description>
</property>

——————————————————————————————————————————————————————————–

Modify $HADOOP_HOME/conf/mapred-site.xml – MapReduce configuration

1) Access $HADOOP_HOME/conf/mapred-site.xml

hdpuser@ubuntu:/$ sudo nano `echo $HADOOP_HOME`/conf/mapred-site.xml

2) Add the following lines between the <configuration> tags

<property>
   <name>mapred.job.tracker</name>
   <value>localhost:54311</value>
   <description>The URI is used to monitor the status of MapReduce tasks</description>
</property>

——————————————————————————————————————————————————————————–

Modify $HADOOP_HOME/conf/hdfs-site.xml – File Replication

Hadoop stores files as a sequence of blocks which are replicated for fault tolerance (see here for more information). We need to specify the number of replications in hdfs-site.xml – since this is a single-node cluster, we set it to 1.

1) Access $HADOOP_HOME/conf/hdfs-site.xml

hdpuser@ubuntu:/$ sudo nano `echo $HADOOP_HOME`/conf/hdfs-site.xml

2) Add the following lines between the <configuration> tags:

<property>
   <name>dfs.replication</name>
   <value>1</value>
   <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
</property>

——————————————————————————————————————————————————————————–

Initializing the Single-Node Cluster

Formatting the Name Node:

When setting up the cluster for the first time, we need to format the Name Node of HDFS. Do that as follows – the generated output is shown.

hdpuser@ubuntu:/$ /usr/hadoop/hadoop/bin/hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.

13/07/18 10:46:40 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
************************************************************/
13/07/18 10:46:40 INFO util.GSet: VM type       = 64-bit
13/07/18 10:46:40 INFO util.GSet: 2% max memory = 17.77875 MB
13/07/18 10:46:40 INFO util.GSet: capacity      = 2^21 = 2097152 entries
13/07/18 10:46:40 INFO util.GSet: recommended=2097152, actual=2097152
13/07/18 10:46:41 INFO namenode.FSNamesystem: fsOwner=hdpuser
13/07/18 10:46:41 INFO namenode.FSNamesystem: supergroup=supergroup
13/07/18 10:46:41 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/07/18 10:46:41 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/07/18 10:46:41 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/07/18 10:46:41 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/07/18 10:46:41 INFO common.Storage: Image file of size 113 saved in 0 seconds.
13/07/18 10:46:41 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/usr/hadoop/tmp/dfs/name/current/edits
13/07/18 10:46:41 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/usr/hadoop/tmp/dfs/name/current/edits
13/07/18 10:46:42 INFO common.Storage: Storage directory /usr/hadoop/tmp/dfs/name has been successfully formatted.
13/07/18 10:46:42 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************

Starting HDFS daemon:

To start the single-node cluster, we first start HDFS. Type the following command – the output generated is shown.

hdpuser@ubuntu:/$ /usr/hadoop/hadoop/bin/start-dfs.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-secondarynamenode-ubuntu.out

Run jps to ensure that Name Node, Secondary Name Node and Data Node are active:

hdpuser@ubuntu:/$ jps
3536 DataNode
3789 Jps
3738 SecondaryNameNode
3305 NameNode

Starting MapReduce daemon:

Type the following command to start MapReduce – the output generated is shown:

hdpuser@ubuntu:/$ /usr/hadoop/hadoop/bin/start-mapred.sh
Warning: $HADOOP_HOME is deprecated.

starting jobtracker, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-tasktracker-ubuntu.out

Run jps to ensure that Task Tracker and Job Tracker are active:

hdpuser@ubuntu:/$ jps
4086 TaskTracker
3536 DataNode
4830 Jps
3738 SecondaryNameNode
3305 NameNode
3891 JobTracker

Stopping MapReduce daemon:

Type the following command to stop MapReduce – the output generated is shown:

hdpuser@ubuntu:/$ /usr/hadoop/hadoop/bin/stop-mapred.sh
Warning: $HADOOP_HOME is deprecated.

stopping jobtracker
localhost: stopping tasktracker

Stopping HDFS daemon:

Type the following command to stop HDFS – the output generated is shown:

hdpuser@ubuntu:/$ /usr/hadoop/hadoop/bin/stop-dfs.sh
Warning: $HADOOP_HOME is deprecated.

stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Starting/Stopping both HDFS and MapReduce Together:

We can start both of them together as follows:

hdpuser@ubuntu:~$ /usr/hadoop/hadoop/bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/hadoop/hadoop/libexec/../logs/hadoop-hdpuser-tasktracker-ubuntu.out

For stopping:

hdpuser@ubuntu:~$ /usr/hadoop/hadoop/bin/stop-all.sh
Warning: $HADOOP_HOME is deprecated.

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

——————————————————————————————————————————————————————————–

Running a MapReduce Job

Let us now run a MapReduce job on the single-node cluster. The Hadoop distribution provides a Word Count example, which counts the number of occurrences of each unique word in a given set of text files. The input is a set of text files, and the output is text files in which each line contains a word (key) and the count of how often it occurred (value), separated by a tab. The Map phase maps each line of input to such (key, value) pairs, and the per-word counts are aggregated in the Reduce phase. See this Word Count blog for more functional details.
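
As a quick aside, the same counting logic can be previewed with ordinary shell tools before involving Hadoop at all. This one-liner is only an illustration: run it from whichever directory holds the downloaded text files (step 1 below), and note that it splits on whitespace only, so its counts will not exactly match the MapReduce output.

hdpuser@ubuntu:~$ cat *.txt | tr -s ' \t' '\n' | sort | uniq -c | awk '{printf "%s\t%s\n", $2, $1}' | head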

1) Download the following three Sherlock Holmes detective stories from Project Gutenberg: Adventures of Sherlock Holmes, Memoirs of Sherlock Holmes, and Return of Sherlock Holmes. The download format is Plain Text UTF-8.

2) Create a directory to hold these files and make hdpuser the owner:

hdpuser@ubuntu:~$ sudo mkdir -p /usr/hadoop/BookDB
hdpuser@ubuntu:~$ sudo chown hdpuser:hadoop /usr/hadoop/BookDB

3) Copy the three downloaded files to the BookDB folder (using the cp -r command) and verify
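
For example, assuming the downloads are sitting in your Downloads folder and have been renamed as shown in the listing below, the copy would look like this:

hdpuser@ubuntu:~$ sudo cp -r /home/"your_user_name"/Downloads/*Sherlock_Holmes.txt /usr/hadoop/BookDB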

hdpuser@ubuntu:/usr/hadoop/BookDB$ ls -l
 total 1710
 -rw-r--r-- 1 root root 594933 Jul 19 00:48 Adventures_Sherlock_Holmes.txt
 -rw-r--r-- 1 root root 502760 Jul 19 00:48 Memoirs_Sherlock_Holmes.txt
 -rw-r--r-- 1 root root 641720 Jul 19 00:48 Return_Sherlock_Holmes.txt

4) Start all the Hadoop processes:

hdpuser@ubuntu:/$ /usr/hadoop/hadoop/bin/start-all.sh

5) Copy the text files from the local file system to HDFS – I copy them into the /HDFS/BookDB folder as follows:

hdpuser@ubuntu:/usr/hadoop/hadoop$ bin/hadoop dfs -copyFromLocal /usr/hadoop/BookDB /HDFS/BookDB
 Warning: $HADOOP_HOME is deprecated 

6) Cross-check copied files

hdpuser@ubuntu:/usr/hadoop/hadoop$ bin/hadoop dfs -ls /HDFS/BookDB
 Warning: $HADOOP_HOME is deprecated.

Found 3 items
 -rw-r--r--   1 hdpuser supergroup     594933 2013-07-19 00:59 /HDFS/BookDB/Adventures_Sherlock_Holmes.txt
 -rw-r--r--   1 hdpuser supergroup     502760 2013-07-19 00:59 /HDFS/BookDB/Memoirs_Sherlock_Holmes.txt
 -rw-r--r--   1 hdpuser supergroup     641720 2013-07-19 00:59 /HDFS/BookDB/Return_Sherlock_Holmes.txt 

7) Execute the following command to run the Word Count example – I store my MapReduce results in /HDFS/BookDB-Output

hdpuser@ubuntu:/usr/hadoop/hadoop$ bin/hadoop jar hadoop-examples-1.1.2.jar wordcount /HDFS/BookDB /HDFS/BookDB-Output
 Warning: $HADOOP_HOME is deprecated.

The generated output is as follows:

13/07/19 01:02:06 INFO input.FileInputFormat: Total input paths to process : 3
 13/07/19 01:02:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
 13/07/19 01:02:06 WARN snappy.LoadSnappy: Snappy native library not loaded
 13/07/19 01:02:07 INFO mapred.JobClient: Running job: job_201307190058_0001
 13/07/19 01:02:08 INFO mapred.JobClient:  map 0% reduce 0%
 13/07/19 01:02:17 INFO mapred.JobClient:  map 66% reduce 0%
 13/07/19 01:02:21 INFO mapred.JobClient:  map 100% reduce 0%
 13/07/19 01:02:25 INFO mapred.JobClient:  map 100% reduce 33%
 13/07/19 01:02:28 INFO mapred.JobClient:  map 100% reduce 100%
 13/07/19 01:02:29 INFO mapred.JobClient: Job complete: job_201307190058_0001
 13/07/19 01:02:29 INFO mapred.JobClient: Counters: 29
 13/07/19 01:02:29 INFO mapred.JobClient:   Job Counters
 13/07/19 01:02:29 INFO mapred.JobClient:     Launched reduce tasks=1
 13/07/19 01:02:29 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=17945
 13/07/19 01:02:29 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
 13/07/19 01:02:29 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
 13/07/19 01:02:29 INFO mapred.JobClient:     Launched map tasks=3
 13/07/19 01:02:29 INFO mapred.JobClient:     Data-local map tasks=3
 13/07/19 01:02:29 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10976
 13/07/19 01:02:29 INFO mapred.JobClient:   File Output Format Counters
 13/07/19 01:02:29 INFO mapred.JobClient:     Bytes Written=307141
 13/07/19 01:02:29 INFO mapred.JobClient:   FileSystemCounters
 13/07/19 01:02:29 INFO mapred.JobClient:     FILE_BYTES_READ=629707
 13/07/19 01:02:29 INFO mapred.JobClient:     HDFS_BYTES_READ=1739796
 13/07/19 01:02:29 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1465107
 13/07/19 01:02:29 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=307141
 13/07/19 01:02:29 INFO mapred.JobClient:   File Input Format Counters
 13/07/19 01:02:29 INFO mapred.JobClient:     Bytes Read=1739413
 13/07/19 01:02:29 INFO mapred.JobClient:   Map-Reduce Framework
 13/07/19 01:02:29 INFO mapred.JobClient:     Map output materialized bytes=629719
 13/07/19 01:02:29 INFO mapred.JobClient:     Map input records=36291
 13/07/19 01:02:29 INFO mapred.JobClient:     Reduce shuffle bytes=629719
 13/07/19 01:02:29 INFO mapred.JobClient:     Spilled Records=87734
 13/07/19 01:02:29 INFO mapred.JobClient:     Map output bytes=2947831
 13/07/19 01:02:29 INFO mapred.JobClient:     Total committed heap usage (bytes)=574816256
 13/07/19 01:02:29 INFO mapred.JobClient:     CPU time spent (ms)=14380
 13/07/19 01:02:29 INFO mapred.JobClient:     Combine input records=313344
 13/07/19 01:02:29 INFO mapred.JobClient:     SPLIT_RAW_BYTES=383
 13/07/19 01:02:29 INFO mapred.JobClient:     Reduce input records=43867
 13/07/19 01:02:29 INFO mapred.JobClient:     Reduce input groups=28423
 13/07/19 01:02:29 INFO mapred.JobClient:     Combine output records=43867
 13/07/19 01:02:29 INFO mapred.JobClient:     Physical memory (bytes) snapshot=689762304
 13/07/19 01:02:29 INFO mapred.JobClient:     Reduce output records=28423
 13/07/19 01:02:29 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3994656768
 13/07/19 01:02:29 INFO mapred.JobClient:     Map output records=313344

8) List the output files generated on HDFS:

hdpuser@ubuntu:/usr/hadoop/hadoop$ bin/hadoop dfs -ls /HDFS/BookDB-Output
 Warning: $HADOOP_HOME is deprecated.

Found 3 items
 -rw-r--r--   1 hdpuser supergroup          0 2013-07-19 01:02 /HDFS/BookDB-Output/_SUCCESS
 drwxr-xr-x   - hdpuser supergroup          0 2013-07-19 01:02 /HDFS/BookDB-Output/_logs
 -rw-r--r--   1 hdpuser supergroup     307141 2013-07-19 01:02 /HDFS/BookDB-Output/part-r-00000

9) Create a directory on the local file system to hold the HDFS output – I chose /usr/hadoop/BookDB-Output

hdpuser@ubuntu:/usr/hadoop/hadoop$ sudo mkdir -p /usr/hadoop/BookDB-Output

10) Concatenate output data in HDFS into the local output directory

hdpuser@ubuntu:/usr/hadoop/hadoop$ sudo bin/hadoop dfs -getmerge /HDFS/BookDB-Output /usr/hadoop/BookDB-Output
 13/07/19 01:08:32 INFO util.NativeCodeLoader: Loaded the native-hadoop library

This generates two files: BookDB-Output and .BookDB-Output.crc

11) Access the concatenated output file:

hdpuser@ubuntu:/usr/hadoop/hadoop$ sudo nano /usr/hadoop/BookDB-Output/BookDB-Output

A part of the output is shown below:

"'"A    1
"'"Ah,  1
"'"And  1
"'"But  2
"'"For  1
"'"Gone!        1
"'"Ha,  1
"'"He   1
"'"Hullo,       1
"'"I    3
"'"I'd  2
"'"I'm  1

——————————————————————————————————————————————————————————–

Web Interfaces To Monitor Hadoop Processes

Hadoop provides three web interfaces to monitor the status and progress of HDFS and MapReduce processes. They are as follows:

  1. http://localhost:50070/ –  Name Node monitoring interface
  2. http://localhost:50030/ –  Job Tracker monitoring interface
  3. http://localhost:50060/ – Task Tracker monitoring interface
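
If you are working over SSH without a browser, much of the Name Node summary is also available from the command line:

hdpuser@ubuntu:/usr/hadoop/hadoop$ bin/hadoop dfsadmin -report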

The following is a snapshot of our Name Node daemon:

[Screenshot: Name Node web interface]

It summarizes the cluster status in terms of DFS resource utilization, live nodes, dead nodes, etc. The HDFS directories can also be browsed – a snapshot of the BookDB-Output folder on HDFS is shown below:

[Screenshot: browsing the BookDB-Output folder in HDFS]

Here is the snapshot from the Job Tracker interface:

[Screenshot: Job Tracker web interface]

As our Word Count job has already completed, many of the entries are 0.

——————————————————————————————————————————————————————————–

Data Node Failure – Namespace Issue

A Data Node can fail because of incompatible namespace IDs between the Name Node and that Data Node. For instance, when I ran the above Word Count example again, my (single) Data Node failed. To resolve this issue:
1) To verify the namespace ID issue, access the Data Node log hadoop-hdpuser-datanode-ubuntu.log, available in $HADOOP_HOME/logs. You will see an incompatible namespaceID error: ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /usr/hadoop/tmp/dfs/data: namenode namespaceID = 1955919901; datanode namespaceID = 545241855 (your IDs will differ).
2) Stop all Hadoop processes.
3) Note the Name Node’s namespace ID (1955919901 in this case). It is available in the Name Node’s VERSION file at /usr/hadoop/tmp/dfs/name/current/VERSION.
4) Open the Data Node’s VERSION file at /usr/hadoop/tmp/dfs/data/current/VERSION and replace the value of the namespaceID variable with 1955919901.
5) Restart the Hadoop processes.

It is advisable to repeat this activity at each Data Node every time you format the Name Node.
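
A sketch of the same fix from the command line, using the example ID from above and the hadoop.tmp.dir configured earlier (your paths and ID may differ):

hdpuser@ubuntu:~$ /usr/hadoop/hadoop/bin/stop-all.sh
hdpuser@ubuntu:~$ grep namespaceID /usr/hadoop/tmp/dfs/name/current/VERSION
namespaceID=1955919901
hdpuser@ubuntu:~$ sudo nano /usr/hadoop/tmp/dfs/data/current/VERSION     # set namespaceID to the value above
hdpuser@ubuntu:~$ /usr/hadoop/hadoop/bin/start-all.sh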