Category Archives: Hadoop

This category contains tutorials about the distributed file system Hadoop.
These articles show you:
– how to set up Hadoop,
– how to set up Hadoop on a cluster of computers as a distributed file system,
– how to upload and download data from DFS (distributed file system),
– how to connect Hadoop with Spark.

Java Spark Tips, Tricks and Basics 3 – How to select columns for nested Datasets / Dataframes in Spark Java

How to select columns from a nested Dataset/Dataframe in Spark Java

 

Let’s assume we have some nested data stored somewhere.

We first load it into a dataframe.
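For example, something along these lines, assuming a hypothetical nested JSON file on HDFS (the path and field names are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("NestedSelect").getOrCreate();

// Hypothetical nested JSON, e.g. {"name":"Anna","address":{"city":"Berlin","zip":"10115"}}
Dataset<Row> df = spark.read().json("hdfs://localhost:9000/data/people.json");
df.printSchema();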

 

 

 

We can now get a dataframe containing only one of the nested columns with the following command
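For instance, assuming a nested “address” struct as in the sketch above:

import static org.apache.spark.sql.functions.col;

// Select only the nested "city" field via dot notation
Dataset<Row> cities = df.select(col("address.city"));
cities.show();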

 

 

And so on. You just have to use “.” as the separator to select any nested column.

 

Java Spark Tips, Tricks and Basics 1 – How to read images as Datasets / Dataframes from Hadoop in Spark Java

This tutorial will show you how to read a folder of images stored in Hadoop.

Just use the following command and update the path to your image folder in HDFS.

We will be using ImageSchema and its readImages function.
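A minimal sketch of the call (the HDFS path is a placeholder for your own image folder):

import org.apache.spark.ml.image.ImageSchema;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Reads every image in the folder into a Dataset with an "image" struct column
Dataset<Row> images = ImageSchema.readImages("hdfs://localhost:9000/data/images/");
images.printSchema();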

 

That’s it already!

 

Spark Feature Engineering Tutorial 4 – RCV1 Newswire stories categorized

Spark Feature Engineering Tutorial – 4 – transforming RCV1 dataset

What is the data?

The dataset was provided by the Journal of Machine Learning Research in 2004 as a new benchmark for text categorization research. You can read more about the journal that released the dataset over here.

Where to get the data?

The data provider is www.csie.ntu.edu.tw/

 

 

What is the data about?

It contains information about newswire stories and their categorization.

 

Let’s load the data

We can load the data into Spark with this command:
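For example, assuming the RCV1 file was downloaded in libsvm format (the HDFS path is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("RCV1Tutorial").getOrCreate();

// RCV1 from the LIBSVM collection comes as label followed by index:value pairs
Dataset<Row> data = spark.read().format("libsvm").load("hdfs://localhost:9000/data/rcv1_train.binary");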

 

Let’s check out the data

We should check out the data schema and a few rows. This is how you can do it:
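For example, with the standard Dataset inspection calls:

// Print the schema and the first few rows
data.printSchema();
data.show(5, false);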

 


Your console output should look like this

Very nice, the data is already in a nice format.

Using String Indexer

We will use the StringIndexer to index the classes we have.
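A minimal sketch, assuming the default libsvm column names “label” and “features” (the output column name is made up):

import org.apache.spark.ml.feature.StringIndexer;

// Maps the raw label column to an indexed label column
StringIndexer indexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("indexedLabel");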

 

This will define the first column as the label column.

Using the Vector Assembler
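Roughly like this; again the column names are only examples:

import org.apache.spark.ml.feature.VectorAssembler;

// Wraps the existing feature vector into the column the classifier will read
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"features"})
        .setOutputCol("assembledFeatures");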

 

This will define the 2nd column as the feature column.

Build the pipeline
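A sketch, reusing the two stages defined above:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;

// The stages are applied in the order given here
Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{indexer, assembler});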

 

This will tell Spark in which order to apply the transformers.

Instantiate the pipeline
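For example, fitting the pipeline on the loaded Dataframe:

import org.apache.spark.ml.PipelineModel;

// Fit the pipeline on the original data to obtain a reusable model
PipelineModel model = pipeline.fit(data);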

 

This will fit the pipeline on the original data and return a model.

Get the transformed dataset
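For example, using the fitted model from the previous step:

// Apply the fitted pipeline to produce the transformed Dataframe
Dataset<Row> transformed = model.transform(data);
transformed.show(5);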

 

This will apply the transformation to the dataset and return the transformed Dataframe.

Let’s check out our transformed data

This looks pretty good, but we do not need the original label and features columns anymore.

Drop useless columns
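For example, keeping only the newly generated columns:

// Keep only the indexed label and the assembled feature vector
Dataset<Row> cleaned = transformed.drop("label", "features");
cleaned.printSchema();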

 

The cleaned dataset

This is now our struct, perfect!

We are ready to do some machine learning on this.

Let’s test whether we transformed our data properly by applying a linear classifier to it!

Define the linear classifier
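As one possible choice, a plain logistic regression on the transformed columns could look roughly like this:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;

// A simple linear classifier reading the columns produced by the pipeline
LogisticRegression lr = new LogisticRegression()
        .setLabelCol("indexedLabel")
        .setFeaturesCol("assembledFeatures")
        .setMaxIter(10);

LogisticRegressionModel lrModel = lr.fit(cleaned);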

 

Call this on your dataset and you should get no errors if you followed everything in this tutorial!

Here is the full code:

 

Deployment Cycle with Spark and Hadoop in Java

This article will show you one of many possible cycles to deploy your code as quickly and efficiently as possible. Also, we will talk a little about what Hadoop and Spark actually are and how we can use them to make awesome distributed computations!

What are Hadoop and Spark for?

You use your Spark cluster to do very computationally expensive tasks in a distributed fashion.

Hadoop provides the data in a distributed fashion, making it available from multiple nodes and thereby increasing the rate at which every node in the cluster network gets its data.

We will write our code in Java and define cluster computations using the open source Apache Spark library.

After defining the code, we will use Maven to create a fat jar from it, which will contain all the dependencies.

We will make the jar available from multiple sources, so that multiple computation nodes from our Spark cluster can download it at the same time; this is achieved by distributing it through Hadoop.

What does a deployment cycle with Spark and Hadoop look like in Java?

A typical cycle could look like this:

  1. Write code in Java
  2. Compile code into a fat Jar
  3. Make jar available in Hadoop cloud
  4. Launch the Spark driver, which can allocate a dynamic number of nodes to take care of the computations defined within the jar.

1. Write code in Java

You will have to define a main function within a main class. This will be the code that the cluster runs first, so everything starts from this function.
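A minimal sketch of such an entry point (class name, app name, and the computation itself are placeholders):

import org.apache.spark.sql.SparkSession;

public class MainClass {

    // The cluster executes this method first; everything starts from here
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ExampleJob")
                .getOrCreate();

        // ... define your distributed computation here ...

        spark.stop();
    }
}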

2. Compile code into fat Jar

mvn clean compile assembly:single

3. Make jar available from Hadoop cloud

Go into your Hadoop web interface and browse the file system

3.1 Create a folder in the cloud and upload the jar

After uploading your jar into your Hadoop cloud, it will be available to any computer that can talk to the Hadoop cloud. It is now available in a distributed fashion on all the Hadoop nodes and is ready for highly efficient and fast data exchange with any cluster; in our example we use a Spark cluster.

If your Hadoop node is called hadoop_fs and its port is 9000, your jar is available to any node under the following URL:

hdfs://hadoop_fs:9000/jars/example.jar

4. Launch distributed Spark Computation

To launch the driver, you need the spark-submit script. The most straightforward way to get it is to just download the Spark distribution and unzip it.

wget http://apache.lauf-forum.at/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz

4.1 Launch Spark driver from command line.

Go to the directory where you have unzipped your Spark library, for me it would be

loan@Y510P:~/Libraries/Apache/Spark/spark-2.3.0-bin-hadoop2.7$

The script ./bin/spark-submit has all the functionality we will require.

4.2 Gathering the Parameters

You need the following parameters to launch your jar in the cluster:

  • Spark master URL
  • Hadoop jar URL
  • Name of your main class
  • --deploy-mode set to cluster to run the computation in cluster mode

4.3 Final step: Put the parameters together and launch the jar in the cluster

./bin/spark-submit --class com.package.name.mainClass --master spark://10.0.1.10:6066 --deploy-mode cluster hdfs://hadoop_fs:9000/jars/example.jar

This tells the Spark cluster where the jar we want to run is located. It will launch a user-defined (or appropriate) number of executors and finish the computation in a distributed fashion.

Your task should now show up in your Spark web interface.

What have you learned:

  • How to turn your Java code into a fat jar
  • How to deploy your fat jar into the Hadoop cloud
  • How to run your code distributed in Spark, using Hadoop as the data source

Spark Error “Exception in thread “main” java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem”

I recently did some feature engineering on a few datasets with spark.

I wanted to make the datasets available in our Hadoop cluster, so I used our normal dataset upload pattern, but ran into these nasty little errors:

Exception in thread “main” java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem

and also

Could not find CoarseGrainedScheduler

 

The full exception stack looks like this:

 

 

So what do these errors mean and why do they occur?

  • It often has something to do with not initializing the Spark session properly
  • It usually means there is a wrong value for the master location
  • Double-check the master address for the Spark session; by default it should use port 7077 and NOT 6066
  • Check whether the version of the Spark cluster is the same as the Spark version in the jar / of the job you want to submit

How do I fix “Could not find CoarseGrainedScheduler” or “Can only call getServletHandlers on a running MetricsSystem”?

  • Update the master URL (see the sketch below)
  • Update your dependencies (POM/Gradle/jar) so that you use the same Spark version as the cluster
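For example, a session pointing at a standalone master on port 7077 might look like this (the host is a placeholder):

import org.apache.spark.sql.SparkSession;

// Use the standalone master port 7077, not the REST submission port 6066
SparkSession spark = SparkSession.builder()
        .appName("DatasetUpload")
        .master("spark://10.0.1.10:7077")
        .getOrCreate();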

Now your error should be fixed. Have fun with your Spark Cluster!

 

How to set up Hadoop 2.9 Pseudo Cluster mode on a remote PC using SSH

In my <other tutorial> we learned what Hadoop is, why Hadoop is so awesome and what Hadoop is used for. Now I will show you how to set up Hadoop 2.9 in pseudo cluster mode on a VM using SSH.

Download Hadoop 2.9

wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.9.0/hadoop-2.9.0-src.tar.gz

Then unzip it
tar -xvzf hadoop-2.9.0-src.tar.gz

Remember where you extracted this to, because we will need to add the path to the environment variables later!
To get the path use the handy command
pwd

Install SSH and rsync
sudo apt-get install ssh
sudo apt-get install rsync

Set up SSH connection to localhost
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod og-wx ~/.ssh/authorized_keys

Set up Hadoop Environment Variables

sudo gedit ~/.bashrc

and enter the following text (thereby adding the following variables):
export HADOOP_HOME=/path/to/hadoop/folder
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

The next step is to edit the hadoop-env.sh file located inside your Hadoop folder at etc/hadoop/hadoop-env.sh.
We will add your Java home path to the Hadoop settings.
Change
export JAVA_HOME=${JAVA_HOME}
to
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
To make sure you use the right path, write
echo $JAVA_HOME
in your terminal to receive the Java home path.

Enable Pseudo Cluster Mode

Now we can finally set up the configuration for Hadoop pseudo-distributed mode.
The necessary files to edit are located inside the HadoopBase/etc/hadoop folder.

hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/user/hadoop/data/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/user/hadoop/data/hdfs/datanode</value>
</property>

mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

core-site.xml

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

 

 

Then format the file system

bin/hdfs namenode -format

and we are done!

To see how to run Hadoop check this article out!

What is Hadoop and why is it awesome?

Introduction

Hadoop provides big companies a means to distribute and store huge amounts of data on not just one computer but multiple! You can imagine it like your normal Windows or Unix file system, only distributed! At first you might think: wow, OK, so what, now I have my 4K video on 5 different computers, what do we get from that?

Usually, only one computer supplies us with the data we want; this one computer has only one network connection and limited bandwidth. If you have multiple computers in different locations using different connections to the internet, their bandwidth adds up and a big performance boost will be noticeable.

You can imagine it as one person having to deliver a giant rocket consisting of multiple big parts. That one person can only deliver one rocket part at a time. If we use two persons, we have already doubled our speed! The same principle applies to downloading and uploading.

So instead of just one computer supplying you with a limited data stream, you have multiple computers serving you data at the same time!

 

If you want to set up Hadoop on your local machine, check <THIS> out!

If you are interested in setting up a Hadoop pseudo cluster, check <THIS> out!

If you want to learn about basic Java interacting with Hadoop, downloading and uploading from a distributed file system, check <THIS> out!

 
