Environment Variable setup for HDFS client
Let's assume that you have downloaded Apache Hadoop 2.7.7 and unpacked it under /home/user/opt/hadoop-2.7.7.
In order to set the environment variables permanently (so that they are defined every time you log in), we can use shell profiles. Each shell may use a different file, but in this tutorial we assume Linux CentOS with Bash. Open the ~/.bashrc file in an editor (you can use a command-line editor such as nano or vi, or a GUI editor such as gedit).
The first environment variable is PATH, which lets you run Hadoop commands without typing their absolute path. Add the following line to the file:
export PATH=$PATH:/home/user/opt/hadoop-2.7.7/bin
The second environment variable is JAVA_HOME, which points to the installation directory of your JDK/JRE. Add the following line to the file:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
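This resolves the /usr/bin/java symlink to the actual binary and strips the trailing bin/java. For example (the path here is hypothetical), if /usr/bin/java points to /usr/lib/jvm/java-1.8.0/jre/bin/java, then JAVA_HOME becomes /usr/lib/jvm/java-1.8.0/jre/.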
If you are a Mac user, use:
export JAVA_HOME=$(/usr/libexec/java_home -v1.8)
The next environment variable, HADOOP_CONF_DIR, indicates where the configuration files of Hadoop are located. Let's assume that you have downloaded the configuration files from the namenode and put them under /home/user/opt/hadoop-2.7.7/etc/my-cluster. Add the following lines to the file:
export HADOOP_HOME=/home/user/opt/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/my-cluster
Note that HADOOP_HOME is not strictly necessary, but defining it is a good practice.
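For reference, a client configuration directory typically contains at least core-site.xml and hdfs-site.xml. As a sketch, a minimal core-site.xml pointing the client at the namenode could look like the following (the hostname and port are placeholders for your cluster's actual values):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>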
Save ~/.bashrc and run source ~/.bashrc to set up the environment variables for the current session. From now on, whenever you log in as the user user, all these environment variables are set automatically.
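To quickly check that the variables are defined, you can print them:
echo $JAVA_HOME
echo $HADOOP_CONF_DIR
which hadoop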
If everything is set properly, the following command should return the contents of the root directory of your Hadoop cluster.
hadoop fs -ls /
Impersonate Hadoop user
Hadoop commands and libraries use the user currently logged into the client machine as the effective user (if you are on a Linux machine, just run the whoami command to see it).
If this user is different from the user that has permission on the HDFS directories and files, you might need to impersonate the user that does. To achieve this, you can use the HADOOP_USER_NAME environment variable:
export HADOOP_USER_NAME=user_with_permission
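If you only need to impersonate for a single command, you can also set the variable inline (user_with_permission is a placeholder for the actual account name):
HADOOP_USER_NAME=user_with_permission hadoop fs -ls /
Note that this simple form of impersonation only works on clusters that do not use Kerberos authentication.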
Introduction to Zookeeper
The code related to this section can be found on GitHub.
Installing Apache Zookeeper
The Apache Zookeeper distribution includes both the server and CLI binaries. This guide covers only the Linux environment; for Mac and Windows, a couple of tweaks are needed.
Download Apache Zookeeper
Go to the Apache Zookeeper releases page and follow the steps to download the version that you need. You will get a TAR file.
mkdir -p ~/opt
mv ~/Downloads/zookeeper-*.tar.gz ~/opt/
cd ~/opt
tar -xf zookeeper-*.tar.gz
mv ./zookeeper-*/ ./zookeeper
Then open the ~/.bashrc file and add the following lines:
export ZOOKEEPER_HOME=~/opt/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin
Then finish by running source ~/.bashrc.
Run Zookeeper Server locally
Open the $ZOOKEEPER_HOME/conf/zoo.cfg file (if it doesn't exist, create one) and add the following lines. Note that zoo.cfg is a plain Java properties file, so shell variables such as $ZOOKEEPER_HOME are not expanded in it; dataDir must be a literal path (adjust it to wherever you unpacked Zookeeper):
tickTime=2000
dataDir=/home/user/opt/zookeeper/data
clientPort=2181
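Here tickTime is Zookeeper's basic time unit in milliseconds (used for heartbeats and session timeouts), dataDir is where Zookeeper stores its in-memory database snapshots and transaction logs, and clientPort is the TCP port on which the server listens for client connections.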
Run the server:
zkServer.sh start
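To confirm that the server is up, check its status; for this single-server setup it should report standalone mode:
zkServer.sh status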
Run Zookeeper CLI
In order to run the Zookeeper CLI, just execute zkCli.sh -server localhost:2181.
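Once connected, you can try a few basic commands to verify that everything works. The following short session creates, reads, and removes a hypothetical znode named /demo:
ls /
create /demo "hello"
get /demo
delete /demo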