How to set up Hadoop 2.9 in Pseudo Cluster mode on a remote PC using SSH

In my <other tutorial> we learned what Hadoop is, why Hadoop is so awesome, and what Hadoop is used for. Now I will show you how to set up Hadoop 2.9 in Pseudo Cluster mode on a VM using SSH.

Download Hadoop 2.9

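wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz

Make sure to download the binary release and not the -src archive, which only contains the source code and would have to be compiled first.

Then extract it
tar -xvzf hadoop-2.9.0.tar.gz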

Remember where you extracted this to, because we will need to add the path to the environment variables later!
To get the path of the current folder, use the handy command
pwd

Install SSH and rsync
sudo apt-get install ssh
sudo apt-get install rsync

Set up SSH connection to localhost
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys
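
Before moving on, it can help to confirm that the key-based login actually works, because Hadoop's start scripts connect to localhost over SSH. The following is a minimal sketch, not part of the original steps. The function name check_passwordless_ssh is made up for illustration. It assumes you have already connected once with plain ssh localhost to accept the host key, since BatchMode refuses to answer any prompt.

```shell
# Hypothetical helper: returns 0 if we can log in to the given host
# without any password or passphrase prompt.
check_passwordless_ssh() {
    host="${1:-localhost}"
    # BatchMode=yes makes ssh fail instead of asking for a password;
    # ConnectTimeout keeps the check from hanging if sshd is not running.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null
}

if check_passwordless_ssh localhost; then
    echo "Passwordless SSH to localhost works"
else
    echo "Passwordless SSH to localhost failed; check ~/.ssh/authorized_keys and that sshd is running"
fi
```

If the check fails even though the key was appended, a passphrase on the key is a likely cause, because BatchMode cannot prompt for it.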

Set up Hadoop Environment Variables

gedit ~/.bashrc

and add the following lines, which define the variables Hadoop needs
export HADOOP_HOME=/path/to/hadoop/folder
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
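
After saving the file, the variables only take effect in a new terminal or after running source ~/.bashrc. As a rough sanity check (a sketch, not part of the original steps), the snippet below sets the two most important variables the same way and confirms that the Hadoop bin folder ended up on the PATH. The fallback location $HOME/hadoop-2.9.0 is only an assumed placeholder; use the path you noted earlier.

```shell
# Assumed placeholder location; replace with the path you noted earlier.
export HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-2.9.0}"
export PATH="$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin"

# Check that the bin folder is one of the PATH entries.
case ":$PATH:" in
    *":$HADOOP_HOME/bin:"*) echo "Hadoop bin folder is on the PATH" ;;
    *) echo "Hadoop bin folder is missing from the PATH" ;;
esac

# If the hadoop command can be found, print its version.
if command -v hadoop >/dev/null 2>&1; then
    hadoop version
else
    echo "hadoop command not found; check that HADOOP_HOME points to the extracted folder"
fi
```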

The next step is to edit the hadoop-env.sh file, located inside your Hadoop folder at etc/hadoop/hadoop-env.sh.
We will add your Java home path to the Hadoop settings.
Change
export JAVA_HOME=${JAVA_HOME}
to
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
To make sure you use the right path, run
echo $JAVA_HOME
in your terminal to print the Java home path.
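
If echo $JAVA_HOME prints nothing, the variable is simply not set in your shell. One common workaround, sketched here under the assumption that a java binary is on your PATH, is to follow the symlinks of that binary and strip the trailing /bin/java. The helper name guess_java_home is hypothetical, and the result is only a candidate: on some Java 8 packages it ends in /jre, so compare it with the folders under /usr/lib/jvm before using it.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from a java binary path.
guess_java_home() {
    java_bin="$1"
    # Resolve symlinks such as /usr/bin/java -> /etc/alternatives/java -> real JDK.
    real_bin="$(readlink -f "$java_bin")"
    # Strip the trailing /bin/java to get the installation folder.
    dirname "$(dirname "$real_bin")"
}

if [ -n "$JAVA_HOME" ]; then
    echo "JAVA_HOME is already set: $JAVA_HOME"
elif command -v java >/dev/null 2>&1; then
    echo "Candidate JAVA_HOME: $(guess_java_home "$(command -v java)")"
else
    echo "No java found on the PATH; install a JDK first"
fi
```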

Enable Pseudo Cluster Mode

Now we can finally set up the configuration for Hadoop pseudo-distributed mode.
The files to edit are located inside the $HADOOP_HOME/etc/hadoop folder. In each file, place the properties between the <configuration> and </configuration> tags.

hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/user/hadoop/data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/user/hadoop/data/hdfs/datanode</value>
</property>
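
The two directory properties point at folders on the local disk. Hadoop can usually create them itself, but creating them up front makes permission problems show up early. A small sketch, assuming the data root is $HOME/hadoop/data/hdfs (which matches the example values above only if your home directory is /home/user); HDFS_DATA_ROOT is a variable name chosen just for this example.

```shell
# Variable name chosen for this example; adjust to match hdfs-site.xml.
HDFS_DATA_ROOT="${HDFS_DATA_ROOT:-$HOME/hadoop/data/hdfs}"

# -p creates missing parent folders and does not fail if they exist.
mkdir -p "$HDFS_DATA_ROOT/namenode" "$HDFS_DATA_ROOT/datanode"

# Confirm the current user may write to both folders.
for dir in namenode datanode; do
    if [ -w "$HDFS_DATA_ROOT/$dir" ]; then
        echo "$HDFS_DATA_ROOT/$dir is writable"
    else
        echo "$HDFS_DATA_ROOT/$dir is NOT writable"
    fi
done
```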

mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
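
Hadoop 2.x releases typically ship a mapred-site.xml.template file in the configuration folder rather than a ready-made mapred-site.xml, so the file may need to be created first. A minimal sketch of that step; the function name ensure_mapred_site is made up for illustration, and the folder argument is assumed to be your etc/hadoop directory.

```shell
# Hypothetical helper: create mapred-site.xml from the template if it is missing.
ensure_mapred_site() {
    conf_dir="$1"
    if [ -f "$conf_dir/mapred-site.xml" ]; then
        echo "mapred-site.xml already exists"
    elif [ -f "$conf_dir/mapred-site.xml.template" ]; then
        cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
        echo "Created mapred-site.xml from the template"
    else
        echo "No mapred-site.xml or template found in $conf_dir" >&2
        return 1
    fi
}

# Only run against a real installation if HADOOP_HOME is set.
if [ -n "$HADOOP_HOME" ]; then
    ensure_mapred_site "$HADOOP_HOME/etc/hadoop" || true
fi
```

An existing mapred-site.xml is never overwritten, so the helper is safe to run more than once.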

yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

Then format the file system by running the following from inside your Hadoop folder

bin/hdfs namenode -format
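
Formatting wipes the NameNode metadata, so it should only be done once on a fresh setup. As a hedged sketch (not from the original steps), the helper below checks for the current/VERSION file that a formatted NameNode directory contains. The name namenode_is_formatted is hypothetical, and the default path assumes the NameNode directory from hdfs-site.xml lives under $HOME/hadoop/data/hdfs/namenode.

```shell
# Hypothetical helper: returns 0 if the directory looks like a formatted NameNode.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Assumed location; adjust to the NameNode directory in your hdfs-site.xml.
NAMENODE_DIR="${NAMENODE_DIR:-$HOME/hadoop/data/hdfs/namenode}"

if namenode_is_formatted "$NAMENODE_DIR"; then
    echo "NameNode directory is already formatted; skipping format"
else
    echo "NameNode directory is not formatted yet; run: bin/hdfs namenode -format"
fi
```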

and we are done!

To see how to run Hadoop, check this article out!

What is Hadoop and why is it awesome?

Introduction

Hadoop provides big companies a means to distribute and store huge amounts of data across not just one computer but many! You can imagine it like your normal Windows or Unix file system, only distributed. At first you might think: wow, OK, so what? Now I have my 4K video on 5 different computers. What do we gain from that?

Usually, only one computer supplies us with the data we want, and this one computer has only one network connection and limited bandwidth. If you have multiple computers in different locations using different connections to the internet, their bandwidth adds up and you get a noticeable performance boost.

You can imagine it as one person having to deliver a giant rocket consisting of multiple big parts. That one person can only deliver one rocket part at a time. If we use two people, we have already doubled our speed! The same principle applies to downloading and uploading.

So instead of just one computer supplying you with a limited data stream, you have multiple computers serving you data at the same time!

If you want to set up Hadoop on your local machine, check <THIS> out!

If you are interested in setting up a Hadoop pseudo cluster, check <THIS> out!

If you want to learn about basic Java interaction with Hadoop, such as downloading from and uploading to a distributed file system, check <THIS> out!

Notation

Model

FileSystem Class