Deployment Cycle with Spark and Hadoop in Java

This article will show you one of many possible cycles to deploy your code as quickly and efficiently as possible. We will also talk a little about what Hadoop and Spark actually are and how we can use them to make awesome distributed computations!

What are Hadoop and Spark for?

You use your Spark cloud to do very computationally expensive tasks in a distributed fashion.

Hadoop provides the data in a distributed fashion, making it available from multiple nodes and thereby increasing the rate at which every node in the cluster network can fetch its data.

We will write our code in Java and define cluster computations using the open source Apache Spark library.

After defining the code, we will use Maven to create a fat jar from it, which will contain all the dependencies.

We will make the jar available from multiple sources, so that multiple computation nodes from our Spark cluster can download it at the same time. This is achieved by making the data available in a distributed fashion through Hadoop.

What does a deployment cycle with Spark and Hadoop look like in Java?

A typical cycle could look like this:

  1. Write code in Java
  2. Compile code into a fat Jar
  3. Make jar available in Hadoop cloud
  4. Launch the Spark driver, which can allocate a dynamic number of nodes to take care of the computations defined within the jar.

1. Write code in Java

You will have to define a main class with a main method. This will be the code that the cluster runs first, so everything starts from this method.
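Below is a minimal sketch of such an entry point. The package and class name (com.package.name.mainClass) are only placeholders chosen to match the spark-submit example further down, and the computation itself is just an illustration:

package com.package.name;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class mainClass {
    public static void main(String[] args) {
        // The master URL is passed later via spark-submit, so we do not hard-code it here.
        SparkConf conf = new SparkConf().setAppName("distributed-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Example computation: sum the squares of 1..1000, spread across the executors.
        List<Integer> numbers = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            numbers.add(i);
        }

        long sumOfSquares = sc.parallelize(numbers)
                .map(x -> (long) x * x)
                .reduce(Long::sum);

        System.out.println("Sum of squares: " + sumOfSquares);
        sc.stop();
    }
}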

2. Compile code into fat Jar

mvn clean compile assembly:single
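This command only produces a fat jar if your pom.xml configures the Maven assembly plugin. A minimal sketch could look like the following, where the main class is a placeholder you would adapt to your project:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <mainClass>com.package.name.mainClass</mainClass>
      </manifest>
    </archive>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
</plugin>

The resulting fat jar ends up in the target/ directory with a jar-with-dependencies suffix.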

3. Make jar available from Hadoop cloud

Go into your Hadoop web interface and browse the file system.

3.1 Create a folder in the cloud and upload the jar

After uploading your jar into your Hadoop cloud, it will be available to any computer that can talk to the Hadoop cloud. It is now available in a distributed fashion on all the Hadoop nodes and is ready for highly efficient and fast data exchange with any cluster; in our example we use a Spark cluster.
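If you prefer the command line over the web interface, the standard HDFS shell can do the same. The folder name /jars and the local file name target/example.jar are assumptions chosen to match the URL used below:

hdfs dfs -mkdir -p hdfs://hadoop_fs:9000/jars
hdfs dfs -put target/example.jar hdfs://hadoop_fs:9000/jars/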

If your Hadoop node is called hadoop_fs and its port is 9000, your jar is available to any node under the following URL:

hdfs://hadoop_fs:9000/jars/example.jar

4. Launch distributed Spark Computation

To launch the driver, you need the spark-submit script. The most straightforward way to get it is to download the Spark distribution and unpack it.

wget http://apache.lauf-forum.at/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
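Then unpack the archive (assuming the file name from the download above):

tar -xzf spark-2.3.2-bin-hadoop2.7.tgz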

4.1 Launch Spark driver from command line.

Go to the directory where you have unzipped your Spark library; for me it would be

loan@Y510P:~/Libraries/Apache/Spark/spark-2.3.2-bin-hadoop2.7$

The ./bin/spark-submit script has all the functionality we will require.

4.2 Gathering the Parameters

You need the following parameters to launch your jar in the cluster:

  • Spark master URL
  • Hadoop jar URL
  • Name of your main class
  • --deploy-mode set to cluster, to run the computation in cluster mode

4.3 Final step: Put the parameters together and launch the jar in the cluster

./bin/spark-submit --class com.package.name.mainClass --master spark://10.0.1.10:6066 --deploy-mode cluster hdfs://hadoop_fs:9000/jars/example.jar

This tells the Spark cluster where the jar we want to run is located. It will launch a user-defined (or appropriate) number of executors and finish the computation in a distributed fashion.

Your task should now show up in your Spark web interface.

What you have learned:

  • How to turn your Java code into a fat jar
  • How to deploy your fat jar into the Hadoop cloud
  • How to run your code distributed in Spark, using Hadoop as the data source
