HDFS Architecture

Namenode High Availability

Environment Variable setup for HDFS client

Let's assume that you have downloaded Apache Hadoop 2.7.7 and unpacked it under /home/user/opt/hadoop-2.7.7.

In order to set the environment variables permanently (so that they are defined every time you log in), we can use shell profiles. Each shell may use a different profile file, but in this tutorial we assume CentOS Linux with Bash. Open the ~/.bashrc file using an editor (a command-line editor such as nano or vi, or a GUI editor such as gedit).

The first environment variable is PATH, which lets you run Hadoop commands without typing their absolute path. Add the following line to the file:

export PATH=$PATH:/home/user/opt/hadoop-2.7.7/bin

The second environment variable is JAVA_HOME, which points to the installation directory of your JDK/JRE. Add the following line to the file:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
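To see why this one-liner works: readlink -f resolves the /usr/bin/java symlink to the real binary, and the sed expression strips the trailing bin/java, leaving the JDK root. A small sketch of the same transformation on a hypothetical resolved path:

```shell
# Hypothetical path, as readlink -f /usr/bin/java might return it
resolved=/usr/lib/jvm/java-8-openjdk/bin/java

# Strip the trailing "bin/java" to get the JDK root directory
java_home=$(echo "$resolved" | sed "s:bin/java::")

echo "$java_home"   # -> /usr/lib/jvm/java-8-openjdk/
```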

If you are a Mac user, use

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

The next environment variable is HADOOP_CONF_DIR, which indicates where the Hadoop configuration files are located. Let's assume that you have downloaded the configuration files from the namenode and put them under /home/user/opt/hadoop-2.7.7/etc/my-cluster. Add the following lines to the file:


export HADOOP_HOME=/home/user/opt/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/my-cluster

Note that HADOOP_HOME is not strictly necessary; defining it is just good practice and keeps the HADOOP_CONF_DIR definition readable.

Save the file and run source ~/.bashrc to set up the environment variables for the current session. From now on, whenever you log in as user, all of these environment variables will be set automatically.
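After sourcing the file, you can quickly verify that the variables are in place (a small sketch; the exact paths printed depend on your installation):

```shell
# Each of these should print a non-empty path
echo "JAVA_HOME=$JAVA_HOME"
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"

# PATH should now include the Hadoop bin directory
case "$PATH" in
  *hadoop*) echo "hadoop is on PATH" ;;
  *)        echo "hadoop is NOT on PATH" ;;
esac
```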

If everything is set properly, the following command should return the content of the root directory of your Hadoop cluster.

hadoop fs -ls /

Impersonate Hadoop user

Hadoop commands and libraries use the user currently logged into the client machine as the effective user (on a Linux machine, run the whoami command to see it). If this user is different from the user that has permissions on the HDFS directories and files, you may need to impersonate the user that does. To achieve this, use the HADOOP_USER_NAME environment variable.

export HADOOP_USER_NAME=user_with_permission
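Using export makes the impersonation apply to every Hadoop command in the current shell session. If you only need it for a single command, you can prefix that command instead (user_with_permission is a placeholder for the real account; the hadoop invocation assumes the client setup described above):

```shell
# Session-wide: every subsequent hadoop command runs as user_with_permission
export HADOOP_USER_NAME=user_with_permission

# One-off: the variable is set only for this single command
HADOOP_USER_NAME=user_with_permission hadoop fs -ls /
```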

MapReduce

Introduction to Zookeeper

The code related to this section can be found on GitHub.

Installing Apache Zookeeper

The Apache Zookeeper distribution includes both the server and CLI binaries. This guide covers only the Linux environment; for Mac and Windows, a couple of tweaks are needed.

Download Apache Zookeeper

Go to the Apache Zookeeper releases page and follow the steps to download the version that you need. You will get a .tar.gz archive.


mkdir -p ~/opt
mv ~/Downloads/zookeeper-*.tar.gz ~/opt/
cd ~/opt
tar -xf zookeeper-*.tar.gz
mv ./zookeeper-*/ ./zookeeper
                        

Then open the ~/.bashrc file and add the following lines.


export ZOOKEEPER_HOME=~/opt/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin
                        

Then finish by running source ~/.bashrc.

Run Zookeeper Server locally

Open the $ZOOKEEPER_HOME/conf/zoo.cfg file (if it doesn't exist, create it) and add the following lines:


tickTime=2000
dataDir=/home/user/opt/zookeeper/data
clientPort=2181

Note that zoo.cfg does not expand shell variables such as $ZOOKEEPER_HOME, so dataDir must be an absolute path; adjust it to your own installation directory and create it with mkdir -p if it does not exist.

Start the server with zkServer.sh start. You can check it with zkServer.sh status and stop it later with zkServer.sh stop.

Run Zookeeper CLI

To run the Zookeeper CLI, just execute zkCli.sh -server localhost:2181.
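
Once inside the CLI, you can experiment with a few basic commands at the prompt; the znode name /demo and its data are arbitrary examples. ls / lists the children of the root znode, create makes a new znode with some data, get reads the data back, delete removes the znode, and quit exits the CLI.

```
ls /
create /demo "hello"
get /demo
delete /demo
quit
```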