Tag Archives: Cluster Computing

Spark Kubernetes error Can only call getServletHandlers on a running MetricsSystem – How to fix it

Did you encounter an error like

java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem

The fix is pretty straight forward.
The error is caused by running a different Spark version in the cluster then the one used for Spark submit.

Double-check the cluster Spark version via for the cluster UI or checking the Spark Master pod.
Then check the version of your spark-submit, it is usually shown during submitting a job.

This is the full stack trace :

java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.(SparkContext.scala:516)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at phabmacsJobs.KafkaJsonWriter$.main(KafkaJsonWriter.scala:32)
at phabmacsJobs.KafkaJsonWriter.main(KafkaJsonWriter.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.(SparkContext.scala:516)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at phabmacsJobs.KafkaJsonWriter$.main(KafkaJsonWriter.scala:32)
at phabmacsJobs.KafkaJsonWriter.main(KafkaJsonWriter.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Things I wish I knew about Azure functions , before working with Azure

Every function in Azure consists of a trigger, input and output bindings and the code defining the function of course!

What are Triggers?

Triggers are mechanisms that trigger the execution of your function. You can setup triggers for a HTTP request, a database update or almost anything.

What are bindings?

The bindings define which resources our function will have access to. It will be provided as a parameter to the function

How to configure bindings and triggers?

Every function is accompanied with a function.json, whcih defines the bindings, the directions and triggers. For compiled languages, so any non scripting language, we do not have to create the function.json file ourselves, since it can be automatically generated from the function code. But for scirpting languages, we must define the function.json ourselves.

What are Durable Functions?

Durable Functions extends Azures classical functions with functions that can have a state AND are still in a serverless enviroment! Durable Functions are also nessecary, if you want to create an Orchestrator Function. The Durable Functions made up of different classical Azure Functions.

What are some Durable Function patterns?

Often, one common pattern for the Durable function is that you chain together a bunch of normal functions and their output it piped to the next function, this together. There are also Fan-out/fan-in patterns, which runs a bunch of functions in parallel and waits for all of then to finish, to return the final result. Then there is Async HTTP Api calls and liek the name implies, it enables us to make API calls that are not synchronous. Also, there is one pattern to program a human in the loob, called human interaction . You can check for more patterns the offical docs here

Java Spark Tips, Tricks and Basics 7 – How to accumulate a variable in Spark cluster? Why do we need to accumulate variables?

Why do we need Spark accumulators

An accumulator is a shared variable across all the nodes and it is used to accumulate values of a type ( Long or Double).

It is necessary to use an accumulator, to implement a distributed counting variable which can be updated by multiple processes.

Nodes may not read the value of an accumulator, but the driver has full access to it.

Nodes can only accumulate values into the accumulator.

You will find the functionality for this in the accumulator Class of Spark. Keep in mind, that we are using the AccumulatorV2, older accumulators are deprecated for Spark version below 2.0

 

Don’t forget to register your accumulator to the Spark Context if you create it separately.

 

What did we learn?

In this short tutorial, you learned what Spark Accumulators are for,  what accumulators do  and how to use them in Java.

Java Spark Tips, Tricks and Basics 6 – How to broadcast a variable to Spark cluster? Why do we need to broadcast variables?

Why do we need Spark broadcasters?

Spark is all about cluster computing. In a cluster of nodes, each node of course has it’s personal private memory.

If we want all the nodes in the cluster to work towards a common goal,  having shared variables just seems necessary.

Let’s say we want to sum up all the rows in a CSV table with 1 million lines. It makes just sense, to let 1 node work with 1/2 million and the other work with the other 1/2 million rows. Both calculate their results and then the driver program will combine their results.

Broadcasting allows us to create a read-only cached copy of a variable on every node in our cluster. The distribution of those variables is handled by efficient broadcast algorithms implemented by Spark under the hood. This will also take the burden of thinking about serialization and deserialization since good old Spark takes care of that!

This great functionality for broadcasting is provided by the SparkContext class.  Alternatively, one can also consider to use the broadcast class right away, do your work

How to broadcast a variable in Spark Java

What did we learn?

In this short tutorial, you learned what Spark Broadcast is for,  what Broadcast does and how to use it in Java.

Java Spark – Errors while using map function in cluster mode – Spark java

Ever tried a map function, for each function or a simmilar Lambda function and it runs in local mode but you cannot get it running in cluster mode?

Then you just found your solution!

 

First, go to your java root directory and call

 

If you have this error :

 

Or a Stack trace like this :

 

Java Spark Tips, Tricks and Basics 3 – How to select columns for nested Datasets / Dataframes in Spark Java

How to select columns from a nested Dataset/Dataframe in Spark java

 

Let’s assume we have nested data that looks like this

Let’s say we have the data stored and we load into a dataframe frist

 

 

 

We can now get a dataframe, only containing one of the nested colmns with the following command

 

 

And so on. So you just have to use “.” as separate to select any nested column.

 

Java Spark Tips, Tricks and Basics 2 – How to add columns to Datasets / Dataframes in Spark Java

This tutorial will show you how to add a new column to an already existing dataset /dataframe .

 

First we create a dataset.

 

 

Then we add a column with lit

 

and we are done!

Java Spark Tips, Tricks and Basics 1 – How to read images as Datasets / Dataframes from Hadoop in Spark Java

This tutorial will show you how to read a folder of images from a Hadoop folder.

Just use the following command and update the path to your image folder in the Hadoop HDFS

We will be using  Image Schema   and it’s  readImages function.

 

That’s it already!

 

Spark Feature Engineering Tutorial 4 – RCV1 Newswire stories categorized

Spark Feature Engineering Tutorial – 4 – transforming RCV1 dataset

What is the data?

The dataset was provided by the Jorunal of achien Learning research in 2004 as new benchmark for text categorization research. You can read more about the journal that has released the dataset over here .

Where to get the data ?

www.csie.ntu.edu.tw/
is the data Provider

 

 

What is the data about?

It contains information about Newswire stories and their categorization

 

Lets load the data

We can load the data into spark with this command

 

Let’s checkout the data

We should checkout the data schema and a few rows. This is how you can do it

 

;

Your console output should look like this

Very nice, the data is aleady in a nice format.

Using String Indexer

We will use the stirng indexer, to index the amount of classes we have.

 

This will define the first column as label column.

Using the Vector Assembler

 

This will define the 2nd column as feature column

Build the pieline

 

This will tell spark in which order to apply the transformers

Instantiate the pipeline

 

This will apply the pipeline on the original Datas and return a model.

Get the transformed dataset

 

This will apply the transformation on the dataset and returns the transformed Dataframe.

Lets checkout ou transformed data

This looks pretty good, but we do not need the Label and features column anymore.

Drop useless columns

 

The cleaned dataset

This is now our struct, perfect!

We are ready to do soem machine learning on this.

Let’s test if we transformed our data properly, by applying a linear classyfier to it!.

Define the linear classifier

 

Call your dataset with this function and you should get no errors, if you did everything like in this tutorial!

Here is the full code :

 

Spark Feature Engineering Tutorial 2 – Forest Covertype Data transformation

Getting to know the Data

Today we gonna checkout the forest covertype data which contains information about which tree type is the most predominant in a forest area.

Get the data : http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info

Let’s imagine you want to buy a big piece of forest land but you have no about the covertype of that area since nobody had the time to count the occurence of each tree in that forest. An approach to this, would be to predict the forest covertype with a trained neural network!

When we checkout the data is spark, we see there are 55 columns, it should look like this

There are 581,012 different datapoints or obserations in the dataset
There are 10 quantitative variables
There are 4 binary wilderness areas
40 binary soil type variabls
One of 7 forest cover types aka the labels we want to predict

In our data we find the labels in the last column called “_c54”

What Spark objects will we need?

https://spark.apache.org/docs/latest/api/java/index.html Get your documentation out, it’s time to program!

We will need the docs for Pipelines, Vectors, StringIndexer, VectorIndexer, Estimators, Transformers, and VectorAssembler

What is the vector indexer for?

The vector indexer enables us to detect whether the features of our data are categorical or continuous. We achieve this by passing a parameter N to Max Categories().

When the vector indexer is called during the pipeline execution process, it looks if there are more than N different values for each feature. If a feature has more than N different values, it is declared continuous. If a feature has N or fewer different values in its feature, this feature is declared categorical.

A pipeline consists of a sequence of stages in which each stage either has an estimator or transformer to be executed by calling Pipeline.fit(). On each estimator, the fit() method is called to generate a transformer which then transforms the data in the pipeline.

Create a Spark Session

 

 

Loading the data into Spark

 

Cast the columns to double

Since the columns are nativly interpreted as Strings, we have to cast them

 

Get the column names

 

Create the feature vector

What does a neural network like to eat the most That’s right feature vector! Time to cook up some crispy feature vectors for our ML Algorithms!
Since _c54 is the label, we will tell our Vector assembler to use all fields except the last one as input.
fieldNames[fieldNames.length-1]
This is the label column. We want to use the columns from _c0 to _c53 as features. That is why we have -2 in the solution. In code it looks like this :

 

Build the pipeline

Our previously defined transformers and Assembles now all go into a pipeline, which executes then sequentially on the data .

 

 

Transform our Data into ML format!

 

Test if it works

Now we can test our data with a sample classifier, add this function to your code and give it your transformed datase!

 

Enjoy and happy coding!