Tag Archives: distributed computing

Spark Encoders, what are they there for and why do we need them?

  • The main purpose of Encoders is serialization and deserialization (SerDe).
  • They are needed because Spark does not store data as JVM objects, but in its very own binary format.
  • Spark comes with a lot of built-in encoders.
  • An Encoder provides information about a table's schema, without having to deserialize the whole object.
  • Encoders are necessary when mapping datasets.
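
To make these points a bit more concrete, here is a minimal sketch in Java. The class name EncoderSketch and the sample words are made up for illustration, and it assumes spark-sql is on the classpath and that a local master is fine for trying things out. It shows two things: asking a built-in encoder for its schema without touching any data, and handing an encoder to map so Spark knows how to serialize the results.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

// Hypothetical class name, used only for this sketch.
final class EncoderSketch {

    // A built-in encoder can describe its schema without any data being read.
    // For Encoders.STRING() this is a single column called "value".
    static String[] schemaFields() {
        return Encoders.STRING().schema().fieldNames();
    }

    // map() needs an encoder for the result type, here Encoders.INT().
    static List<Integer> wordLengths(SparkSession spark, List<String> words) {
        Dataset<String> ds = spark.createDataset(words, Encoders.STRING());
        Dataset<Integer> lengths =
            ds.map((MapFunction<String, Integer>) String::length, Encoders.INT());
        return lengths.collectAsList();
    }

    public static void main(String[] args) {
        // Assumption: a local master with one thread is enough for a quick try.
        SparkSession spark = SparkSession.builder()
            .appName("encoder-sketch")
            .master("local[1]")
            .getOrCreate();
        try {
            System.out.println(Arrays.toString(schemaFields()));
            System.out.println(wordLengths(spark, Arrays.asList("spark", "encoder")));
        } finally {
            spark.stop();
        }
    }
}
```

The cast to MapFunction is there because map is overloaded for Scala and Java functions, and a bare lambda would be ambiguous.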

Spark Java Library overview


This is an overview of the full ecosystem, but we will just take a look at the most important classes for everyday usage.

Spark provides a huge library of useful classes and interfaces. It is important to know what functionality your tool offers, otherwise you will not be able to use it to its fullest potential.


That’s why I give you this great overview:


What is SparkSession for?

The SparkSession provides the main entry point for Dataset functionality in Spark.

It provides ways to


What is SparkContext for?

The SparkContext provides the main entry point for Spark functionality.

The context represents the connection to a Spark cluster, and you can use it to perform operations on the cluster, like

  • broadcast variables to the nodes in your cluster
  • create Resilient Distributed Datasets (RDDs)
  • create accumulators
  • distribute files to the nodes of your Spark cluster
  • distribute jars to the nodes of your Spark cluster
  • submit jobs to the cluster


What is the SparkLauncher for?

This class allows you to launch any application in your Spark cluster programmatically from Java! It will launch the new Spark application as a child process.


What is the InProcessLauncher for?

This allows you to launch a new Spark application within the invoking process.


What is SparkAppHandle for?

A SparkAppHandle object is returned after starting a Spark app with the SparkLauncher.

The handle gives you options to monitor and control the running application.
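
Here is a rough sketch of how the launcher and the handle fit together. The jar path, the main class and the master URL are placeholders I made up, and actually launching something requires a Spark installation the launcher can find (for example through SPARK_HOME). Treat it as an illustration under those assumptions, not as a ready-made deployment script.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Hypothetical class name, used only for this sketch.
final class LauncherSketch {

    // Only configures the launcher, nothing is started yet.
    static SparkLauncher configure(String appJar, String mainClass, String master) {
        return new SparkLauncher()
            .setAppResource(appJar)
            .setMainClass(mainClass)
            .setMaster(master)
            .setConf(SparkLauncher.DRIVER_MEMORY, "1g");
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        CountDownLatch finished = new CountDownLatch(1);

        // Placeholder values: replace them with your own jar, class and master.
        SparkAppHandle handle = configure(
                "/path/to/my-app.jar", "com.example.MyApp", "spark://master-host:7077")
            .startApplication(new SparkAppHandle.Listener() {
                @Override
                public void stateChanged(SparkAppHandle h) {
                    System.out.println("state: " + h.getState());
                    if (h.getState().isFinal()) {
                        finished.countDown();
                    }
                }

                @Override
                public void infoChanged(SparkAppHandle h) {
                    System.out.println("app id: " + h.getAppId());
                }
            });

        // Block until the application reaches a final state.
        finished.await();
        System.out.println("final state: " + handle.getState());
    }
}
```

The handle also offers stop() and kill() if you need to end the application from the launching side.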


What is SparkAppHandle.Listener for?


What is the Spark SQLContext for?

The SQLContext was the main entry point for working with structured data (rows and columns) in Spark 1.x.

Since Spark 2.0 it has been replaced by the SparkSession and is only kept for backward compatibility.


What are UserDefinedFunctions for?


Spark Feature Engineering Tutorial 3 – Forest Alpha Brainwave Data transformation

Spark Java dataset transformation tutorial

In this tutorial we will learn how to transform the ALPHA train set of brainwave data provided by one of the Machine Learning labs of Technische Universität Berlin.


Downloading the data using Wget

I saved you guys some time, you do not need to open another browser tab! Wget the juicy data straight from the source.

Helper Script


Download the feature data


Download the label Data


Then convert the data with the provided Python helper script



What does our data look like?

Label is a value of either -1 or 1. It tells us whether the particle with the corresponding features is an instance of the particle we want to classify or not.

1 would imply the features describe the particle, -1 would imply the features describe a different particle.

Features is a tuple, where the first entry is the feature count (500) and the second is an array of all the feature values.

For our ML format, we want all the label columns to contain only doubles, no arrays.
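
One way to picture the target format is a flat row of doubles: the label first, followed by one value per feature. The small sketch below uses plain Java without Spark. The class and method names are made up, and the idea that every feature ends up in its own slot is my reading of the goal, not necessarily the exact layout used later.

```java
import java.util.Arrays;

// Hypothetical helper, used only to illustrate the flat "doubles only" layout.
final class RowFlattener {

    // Turns (label, (featureCount, featureValues)) into [label, f0, f1, ...].
    static double[] flatten(int label, int featureCount, double[] featureValues) {
        if (label != -1 && label != 1) {
            throw new IllegalArgumentException("label must be -1 or 1, got " + label);
        }
        if (featureValues.length != featureCount) {
            throw new IllegalArgumentException(
                "expected " + featureCount + " features, got " + featureValues.length);
        }
        double[] row = new double[featureCount + 1];
        row[0] = label; // the int label is widened to a double
        System.arraycopy(featureValues, 0, row, 1, featureCount);
        return row;
    }

    public static void main(String[] args) {
        // Three features instead of 500 to keep the output readable.
        double[] row = flatten(1, 3, new double[] {0.5, -1.25, 2.0});
        System.out.println(Arrays.toString(row)); // prints [1.0, 0.5, -1.25, 2.0]
    }
}
```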


What Spark objects will we be needing?


How do we get our data in the proper format?


First we load the raw data into our dataframe



Then we define our label column and the struct fields


StructField for the label


This will be our label column.

The struct field will be used to tell Spark which schema our new rows and dataframes should have. Fields will hold the definition for every column of our new dataframe.

The first entry of the StructField is for labels, the second one is for the Features.


StructField for the features


We will have a vector type for the second column, so it can encode a feature vector of any size for us.


Build the struct from the fields


The schema will be built by the createStructType(fields) method, which generates a nice struct that represents all our defined fields and is very nice to work with.
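
A minimal sketch of such a schema could look like this. The column names "label" and "features" and the class name SchemaSketch are assumptions for illustration, and the sketch needs spark-sql and spark-mllib on the classpath. It uses SQLDataTypes.VectorType() as the vector column type.

```java
import org.apache.spark.ml.linalg.SQLDataTypes;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical class name, used only for this sketch.
final class SchemaSketch {

    static StructType buildSchema() {
        // First field: the label as a plain double, not nullable.
        StructField label =
            DataTypes.createStructField("label", DataTypes.DoubleType, false);

        // Second field: an ML vector, so feature vectors of any size fit.
        StructField features =
            DataTypes.createStructField("features", SQLDataTypes.VectorType(), false);

        StructField[] fields = new StructField[] {label, features};
        return DataTypes.createStructType(fields);
    }

    public static void main(String[] args) {
        // Prints the schema as a tree, no SparkSession required.
        buildSchema().printTreeString();
    }
}
```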


Initialize our new DF


This writes the label column from the original data into our new dataset and also initializes it


Convert the Dataset to a list for easier looping



It is time to get loopy … and to extract relevant data from the original dataset


The Actual Loop


    //in this loop we create a new row for every Row in the original Dataset




will not satisfy our needs, as our row will only have 2 columns,  feature count and the feature array in one column


Create the dataset after the loop with this line



Defining the Pipeline


If you head over to https://spark.apache.org/docs/latest/ml-pipeline.html you can see the toolkit Spark provides for creating a pipeline.


Most interesting for us today, the


How to use the StringIndexer


How to use the VectorAssembler

      //Get all the features, they are in all cols except 0 and 1


Chain the indexers into a Pipeline

Now after defining the Indexers and Assemblers, we can stuff them in the Pipeline like this
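
As a rough sketch of how the pieces can fit together: the column names "label", "labelIndex" and "features" and the class name PipelineSketch are my assumptions, and the only thing taken from the text is that the feature columns are all columns except the first two. It needs spark-mllib on the classpath.

```java
import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;

// Hypothetical class name, used only for this sketch.
final class PipelineSketch {

    // All columns except 0 and 1 are treated as feature columns.
    // Assumes the input has at least two columns.
    static String[] featureColumns(String[] allColumns) {
        return Arrays.copyOfRange(allColumns, 2, allColumns.length);
    }

    static Pipeline buildPipeline(String[] allColumns) {
        // Indexes the (assumed) "label" column into "labelIndex".
        StringIndexer labelIndexer = new StringIndexer()
            .setInputCol("label")
            .setOutputCol("labelIndex");

        // Packs the individual feature columns into one vector column.
        VectorAssembler assembler = new VectorAssembler()
            .setInputCols(featureColumns(allColumns))
            .setOutputCol("features");

        // The stages run in the order given here.
        return new Pipeline()
            .setStages(new PipelineStage[] {labelIndexer, assembler});
    }
}
```

With a dataset at hand, pipeline.fit(dataset) returns a PipelineModel, and calling transform(dataset) on that model produces the transformed dataset.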



Instantiate the pipeline

Instantiate an instance of the pipeline with



Apply the Pipeline to a dataset, transforming it into Spark Format



Validate The Dataset Schema

To see if we have done everything correctly, print out the schema



It should look like this


Spark Error “Exception in thread “main” java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem”

I recently did some feature engineering on a few datasets with Spark.

I wanted to make the datasets available in our Hadoop cluster, so I used our normal dataset upload pattern, but ran into these nasty little errors



and also




The full exception stack looks like this



So what do these errors mean and why do they occur?

  • They often have something to do with not initializing the Spark session properly
  • Usually this means there is a wrong value for the master location
  • Double check the master address for the Spark session, by default it should use port 7077 and NOT 6066
  • Check if the version of the Spark cluster is the same as the Spark version in the jar of the job you want to submit
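
If you want to catch the port mix-up before the job is even submitted, a tiny sanity check like the following can help. It is plain Java and only a sketch: the class name and the messages are made up, and it only looks at standalone master URLs of the form spark://host:port. As far as I know, 6066 is the default port of the standalone master's REST submission server, which is why it tends to sneak into master URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, used only for this sketch.
final class MasterUrlCheck {

    static final int STANDALONE_MASTER_PORT = 7077;
    static final int REST_SUBMISSION_PORT = 6066;

    // Returns "ok" or a short hint about what looks wrong.
    static String check(String masterUrl) {
        URI uri;
        try {
            uri = new URI(masterUrl);
        } catch (URISyntaxException e) {
            return "not a valid URL: " + masterUrl;
        }
        if (!"spark".equals(uri.getScheme())) {
            return "not a standalone master URL, expected spark://host:port";
        }
        if (uri.getPort() == REST_SUBMISSION_PORT) {
            return "port 6066 is the REST submission port, try "
                + STANDALONE_MASTER_PORT;
        }
        if (uri.getPort() == -1) {
            return "no port found, the default master port is "
                + STANDALONE_MASTER_PORT;
        }
        return "ok";
    }

    public static void main(String[] args) {
        // master-host is a placeholder host name.
        System.out.println(check("spark://master-host:6066"));
        System.out.println(check("spark://master-host:7077"));
    }
}
```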

How do I fix “Could not find CoarseGrainedScheduler” or “Can only call getServletHandlers on a running MetricsSystem”?

  • Update master URL
  • Update dependencies POM/Gradle/Jar, so you use the same version as the cluster

Now your error should be fixed. Have fun with your Spark Cluster!


How to fix Java Error “The trustAnchors parameter must be non-empty” while building Maven Single Jar

Recently I tried to deploy my Java project as a single fat jar. To my rude awakening, Maven did not feel like creating jars anymore and complained with errors like:

You might encounter an error like this if you recently reinstalled your Java and want to build a jar with Maven











The console will spit out something like this:




How to fix the error

To fix the error, we will download the Java JDK and extract the certificates into our local Java installation.
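
Before replacing anything, it can be worth confirming that the trust store of your Java installation really is the problem. The sketch below counts the trusted certificates in a key store. The class name is made up, it assumes Java 9 or newer (for KeyStore.getInstance(File, char[])), and it assumes the store sits at lib/security/cacerts under java.home with the usual default password "changeit". Your distribution may keep it somewhere else.

```java
import java.io.File;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.util.Collections;

// Hypothetical helper, used only for this sketch.
final class TrustStoreCheck {

    // An empty store is the situation the "trustAnchors" error complains about.
    static KeyStore emptyStore() {
        try {
            KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
            ks.load(null, null); // initializes a store without any entries
            return ks;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the entries that are trusted certificates.
    static int countTrustedCerts(KeyStore ks) {
        try {
            int count = 0;
            for (String alias : Collections.list(ks.aliases())) {
                if (ks.isCertificateEntry(alias)) {
                    count++;
                }
            }
            return count;
        } catch (KeyStoreException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed location and password, see the notes above.
        File cacerts = new File(System.getProperty("java.home"), "lib/security/cacerts");
        KeyStore ks = KeyStore.getInstance(cacerts, "changeit".toCharArray());
        System.out.println(cacerts + " holds " + countTrustedCerts(ks)
            + " trusted certificates");
    }
}
```

If the count is 0, or the file cannot be loaded at all, the missing certificates are a very likely cause of the error.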

Get Java


Extract the JDK



Copy the certificates into your Java directory


And now you can finally build your jar with


What is Hadoop and why is it awesome?


Hadoop provides big companies a means to distribute and store huge amounts of data on not only one computer but multiple! You can imagine it like your normal Windows or Unix filesystem, only distributed! At first you might think, wow ok so what, now I have my 4K video on 5 different computers, what do we get from that?

Usually, only one computer supplies us with the data we want. This one computer only has 1 network connection and limited bandwidth. If you have multiple computers in different locations using different connections to the internet, their bandwidth adds up and a high performance boost will be noticeable.

You can imagine it as 1 person having to deliver a giant rocket consisting of multiple big parts. That one person can only deliver 1 rocket part at a time. If we use 2 persons, we have already doubled our speed! The same principle applies to downloading and uploading.

So instead of just 1 computer supplying you with a limited data stream, you have multiple computers serving you data at the same time!


If you want to set up Hadoop on your local machine, check <THIS> out!

If you are interested in setting up a Hadoop pseudo cluster, check <THIS> out!

If you want to learn about basic Java interaction with Hadoop, downloading from and uploading to a Distributed File System, check <THIS> out!




FileSystem Class