Java Spark – Errors while using map function in cluster mode – Spark java

Ever tried a map function, a forEach function or a similar lambda function that runs in local mode but that you cannot get running in cluster mode?

Then you just found your solution!

 

First, go to your java root directory and call

 

If you have this error :

 

Or a Stack trace like this :

 

Java Spark Tips, Tricks and Basics 3 – How to select columns for nested Datasets / Dataframes in Spark Java

How to select columns from a nested Dataset/Dataframe in Spark java

 

Let’s assume we have nested data that looks like this:

Let’s say we have the data stored and we load it into a dataframe first:

 

 

 

We can now get a dataframe containing only one of the nested columns with the following command:

 

 

And so on. You just have to use “.” as the separator to select any nested column.

 

Java Spark Tips, Tricks and Basics 2 – How to add columns to Datasets / Dataframes in Spark Java

This tutorial will show you how to add a new column to an already existing dataset/dataframe.

 

First we create a dataset.

 

 

Then we add a column with lit

 

and we are done!

Java Spark Tips, Tricks and Basics 1 – How to read images as Datasets / Dataframes from Hadoop in Spark Java

This tutorial will show you how to read a folder of images from a Hadoop folder.

Just use the following command and update the path to your image folder in the Hadoop HDFS

We will be using ImageSchema and its readImages function.

 

That’s it already!

 

Spark Feature Engineering Tutorial 4 – RCV1 Newswire stories categorized

Spark Feature Engineering Tutorial – 4 – transforming RCV1 dataset

What is the data?

The dataset was provided by the Journal of Machine Learning Research in 2004 as a new benchmark for text categorization research. You can read more about the journal that released the dataset over here.

Where to get the data?

www.csie.ntu.edu.tw/ is the data provider.

 

 

What is the data about?

It contains information about newswire stories and their categorization.

 

Let’s load the data

We can load the data into Spark with this command:

 

Let’s check out the data

We should check out the data schema and a few rows. This is how you can do it:

 

Your console output should look like this

Very nice, the data is already in a nice format.

Using String Indexer

We will use the StringIndexer to index the classes we have.

 

This will define the first column as label column.

Using the Vector Assembler

 

This will define the 2nd column as feature column

Build the pipeline

 

This will tell Spark in which order to apply the transformers.

Instantiate the pipeline

 

This will fit the pipeline on the original data and return a model.

Get the transformed dataset

 

This will apply the transformation to the dataset and return the transformed Dataframe.

Let’s check out our transformed data

This looks pretty good, but we do not need the Label and features column anymore.

Drop useless columns

 

The cleaned dataset

This is now our struct, perfect!

We are ready to do some machine learning on this.

Let’s test if we transformed our data properly by applying a linear classifier to it!

Define the linear classifier

 

Call your dataset with this function and you should get no errors, if you did everything like in this tutorial!

Here is the full code :

 

Spark Feature Engineering Tutorial 2 – Forest Covertype Data transformation

Getting to know the Data

Today we are going to check out the forest covertype data, which contains information about which tree type is the most predominant in a forest area.

Get the data : http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info

Let’s imagine you want to buy a big piece of forest land but you have no information about the covertype of that area, since nobody had the time to count the occurrence of each tree in that forest. One approach would be to predict the forest covertype with a trained neural network!

When we check out the data in Spark, we see there are 55 columns. It should look like this:

There are 581,012 datapoints (observations) in the dataset
There are 10 quantitative variables
There are 4 binary wilderness area variables
There are 40 binary soil type variables
There is one of 7 forest cover types per datapoint, aka the label we want to predict

In our data we find the labels in the last column called “_c54”

What Spark objects will we need?

https://spark.apache.org/docs/latest/api/java/index.html Get your documentation out, it’s time to program!

We will need the docs for Pipelines, Vectors, StringIndexer, VectorIndexer, Estimators, Transformers, and VectorAssembler

What is the vector indexer for?

The vector indexer enables us to detect whether the features of our data are categorical or continuous. We achieve this by passing a parameter N to setMaxCategories().

When the vector indexer is fitted during the pipeline execution process, it checks how many distinct values each feature has. If a feature has more than N distinct values, it is declared continuous. If a feature has N or fewer distinct values, it is declared categorical.
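The rule above can be sketched in plain Java without any Spark dependency. This is only an illustration of the decision, not Spark’s implementation; the class and method names are made up for this example, and the sample values are the ones commonly used to explain maxCategories.

```java
import java.util.HashSet;
import java.util.Set;

final class MaxCategoriesRule {

    /** Returns true when the feature has at most maxCategories distinct values. */
    static boolean isCategorical(double[] featureValues, int maxCategories) {
        Set<Double> distinct = new HashSet<>();
        for (double v : featureValues) {
            distinct.add(v);
            if (distinct.size() > maxCategories) {
                return false; // more than N distinct values: continuous
            }
        }
        return true; // N or fewer distinct values: categorical
    }

    public static void main(String[] args) {
        double[] feature0 = {-1.0, 0.0, 0.0, -1.0};
        double[] feature1 = {1.0, 3.0, 5.0, 3.0};
        System.out.println(isCategorical(feature0, 2)); // true
        System.out.println(isCategorical(feature1, 2)); // false
    }
}
```

With N = 2, the first feature has two distinct values and counts as categorical, while the second has three and counts as continuous.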

A pipeline consists of a sequence of stages in which each stage either has an estimator or transformer to be executed by calling Pipeline.fit(). On each estimator, the fit() method is called to generate a transformer which then transforms the data in the pipeline.

Create a Spark Session

 

 

Loading the data into Spark

 

Cast the columns to double

Since the columns are natively interpreted as Strings, we have to cast them.

 

Get the column names

 

Create the feature vector

What does a neural network like to eat the most? That’s right, feature vectors! Time to cook up some crispy feature vectors for our ML algorithms!
Since _c54 is the label, we will tell our VectorAssembler to use all fields except the last one as input.
fieldNames[fieldNames.length-1]
This is the label column. We want to use the columns from _c0 to _c53 as features, so the index of the last feature column is fieldNames.length-2. That is why we have -2 in the solution. In code it looks like this:
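The index arithmetic can be checked in isolation with a small plain-Java sketch. It assumes the column names are available as a String array (in Spark, the dataset’s columns() method returns one); the class and method names are made up for this example, and the 55 generated names only stand in for the real columns.

```java
import java.util.Arrays;

final class FeatureColumns {

    /** Returns every column name except the last one, which holds the label. */
    static String[] featuresWithoutLabel(String[] fieldNames) {
        // copyOfRange excludes the upper bound, so index length-1 (the label) is dropped
        return Arrays.copyOfRange(fieldNames, 0, fieldNames.length - 1);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the 55 column names _c0 .. _c54
        String[] fieldNames = new String[55];
        for (int i = 0; i < fieldNames.length; i++) {
            fieldNames[i] = "_c" + i;
        }
        String[] features = featuresWithoutLabel(fieldNames);
        System.out.println(features.length);                    // 54
        System.out.println(features[features.length - 1]);      // _c53
        System.out.println(fieldNames[fieldNames.length - 1]);  // _c54
    }
}
```

An array like this is the kind of input that VectorAssembler’s setInputCols expects.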

 

Build the pipeline

Our previously defined transformers and assemblers now all go into a pipeline, which executes them sequentially on the data.

 

 

Transform our Data into ML format!

 

Test if it works

Now we can test our data with a sample classifier. Add this function to your code and give it your transformed dataset!

 

Enjoy and happy coding!

Spark Feature Engineering Tutorial 1 – Quantum Data transformation

Transforming Quantum Datasets into Spark ML Format – tackling the Particle Physics tasks by cs-Cornell

Since Spark ML algorithms only work on datasets in the correct data format, it is necessary to transform your data to the proper data format.

But what is the right data format?

To find out, today we are looking at quantum physics particle data! The challenge provided by Cornell is to learn a classification rule that differentiates between two types of particles generated in high energy collider experiments!

It has 78 attributes and 2 classes that we want to find.

First, we head to http://osmot.cs.cornell.edu/kddcup/datasets.html and register. Then we download the datasets provided.

It comes with 2 datasets, one for the classification of 2 different types of particles that are

Downloading the data using Wget

Helper Script

 

Download the train data

 

Then convert the data with the helper script

Convert Data with the provided python helper script

 

Use a text editor like Nano to check out the data. The phy_test.dat looks something like this:

As we can see, the data is separated by tabs!

What is the data format?

Each line in the training and test files describes one example. The structure of each line is as follows:

  • The first element of each line is an EXAMPLE ID which uniquely describes the example. You will need this EXAMPLE ID when submitting results.
  • The second element is the example’s class. Positive examples are denoted by 1, negative examples by 0. Test examples have a “?” in this position. This is a balanced problem so the target values are roughly half 0s and 1s.
  • All of the following elements are feature values. There are 78 feature values in each line.
  • Missing values: columns 22, 23, 24, and 46, 47, 48 use a value of “999” to denote “not available,” and columns 31 and 57 use “9999” to denote “not available.” These are the column numbers in the data tables starting with 1 for the first column (the case ID numbers). If you remove the first two columns (the case ID numbers and the targets), and start numbering the columns at the first attribute, these are attributes 20, 21, 22, and 44, 45, 46, and 29 and 55, respectively. You may treat missing values any way you want, including coding them as a unique value, imputing missing values, using learning methods that can handle missing values, ignoring these attributes, etc.

The elements in each line are separated by whitespace.
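To make the column numbering concrete, here is a minimal plain-Java sketch that splits one line on whitespace and marks the “not available” sentinels as NaN. Replacing them with NaN is just one of the options mentioned above and an assumption of this example; the class and method names are made up for illustration.

```java
import java.util.Set;

final class PhysicsLineParser {

    // 1-based column numbers (column 1 is the example ID), as listed above
    private static final Set<Integer> MISSING_999 = Set.of(22, 23, 24, 46, 47, 48);
    private static final Set<Integer> MISSING_9999 = Set.of(31, 57);

    /** Splits one line on whitespace and returns the feature values, with NaN for "not available". */
    static double[] parseFeatures(String line) {
        String[] parts = line.trim().split("\\s+");
        // parts[0] is the example ID, parts[1] the class, the rest are features
        double[] features = new double[parts.length - 2];
        for (int i = 2; i < parts.length; i++) {
            double value = Double.parseDouble(parts[i]);
            int column = i + 1; // convert 0-based array index to 1-based column number
            boolean missing = (MISSING_999.contains(column) && value == 999.0)
                    || (MISSING_9999.contains(column) && value == 9999.0);
            features[i - 2] = missing ? Double.NaN : value;
        }
        return features;
    }

    public static void main(String[] args) {
        // Made-up example line: ID, unknown class "?", then 78 features all equal to 999
        StringBuilder line = new StringBuilder("1\t?");
        for (int i = 0; i < 78; i++) {
            line.append('\t').append("999");
        }
        double[] features = parseFeatures(line.toString());
        System.out.println(Double.isNaN(features[19])); // attribute 20 (column 22): true
        System.out.println(features[0]);                // attribute 1 (column 3): 999.0
    }
}
```

Only the listed columns treat 999 or 9999 as a sentinel, so the same number in any other column is kept as a regular value.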

So we have an ID, 2 classes and continuous attributes

What Spark objects will we need?

https://spark.apache.org/docs/latest/api/java/index.html Get your documentation out, it’s time to program!

We will need the docs for DataFrame, Pipelines, Vectors, StringIndexer, VectorIndexer, Estimators, Transformers, Parameter and VectorAssembler.

A pipeline consists of a sequence of stages in which each stage either has an estimator or transformer to be executed by calling Pipeline.fit(). On each estimator, the fit() method is called to generate a transformer which then transforms the data in the pipeline.
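The fit-then-transform idea can be illustrated with a tiny plain-Java model. The Transformer and Estimator interfaces below are hypothetical stand-ins written for this sketch, not Spark’s classes, and the “dataset” is just a list of doubles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class MiniPipeline {

    /** Hypothetical stand-in: a transformer maps a dataset to a new dataset. */
    interface Transformer extends UnaryOperator<List<Double>> {}

    /** Hypothetical stand-in: an estimator learns from data and returns a transformer. */
    interface Estimator {
        Transformer fit(List<Double> data);
    }

    /** Fits the stages in order; each stage is either an Estimator or a Transformer. */
    static Transformer fit(List<Object> stages, List<Double> data) {
        List<Transformer> fitted = new ArrayList<>();
        List<Double> current = data;
        for (Object stage : stages) {
            Transformer t = (stage instanceof Estimator)
                    ? ((Estimator) stage).fit(current)
                    : (Transformer) stage;
            fitted.add(t);
            current = t.apply(current); // later stages see the transformed data
        }
        // The fitted "model" applies all transformers in the same order
        return input -> {
            List<Double> out = input;
            for (Transformer t : fitted) {
                out = t.apply(out);
            }
            return out;
        };
    }

    /** Example transformer: doubles every value. */
    static Transformer doubler() {
        return xs -> xs.stream().map(x -> 2 * x).collect(Collectors.toList());
    }

    /** Example estimator: learns the mean and returns a transformer that subtracts it. */
    static Estimator centerer() {
        return data -> {
            double mean = data.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            return xs -> xs.stream().map(x -> x - mean).collect(Collectors.toList());
        };
    }

    public static void main(String[] args) {
        List<Double> data = List.of(1.0, 2.0, 3.0);
        Transformer model = fit(List.of(doubler(), centerer()), data);
        System.out.println(model.apply(data)); // [-2.0, 0.0, 2.0]
    }
}
```

The estimator stage is fitted on the output of the stage before it, which is why the learned mean is 4.0 and not 2.0.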

What is the vector indexer for?

The vector indexer enables us to detect whether the features of our data are categorical or continuous. We achieve this by passing a parameter N to setMaxCategories().

When the vector indexer is fitted during the pipeline execution process, it checks how many distinct values each feature has. If a feature has more than N distinct values, it is declared continuous. If a feature has N or fewer distinct values, it is declared categorical.

– E.g.: Feature 0 has unique values {-1.0, 0.0} and feature 1 values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 is categorical and feature 1 is continuous.

What is a Pipeline for?

A pipeline is the thing where we put all our Transformers and Estimators.

 

Loading the data into Spark

 

 

We specify the tab delimiter with .option("sep", "\t")

Doing this, Spark will label all the columns with _c0 to _c80.

_c0 represents the data point’s ID

_c1 represents the class

_c2 – _c80 are feature values

Define the labels

Define the column that contains the data labels with a StringIndexer. We know _c1 contains the class values, so we tell our indexer that this is the input column; the output column will be “indexedLabel” in our example.
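To get a feeling for what the indexer does with the class values, here is a plain-Java sketch of frequency-based label indexing. By default Spark’s StringIndexer gives index 0 to the most frequent label; the alphabetical tie-break below is an assumption of this sketch, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LabelIndexerSketch {

    /** Maps each distinct label to an index; the most frequent label gets 0.0. */
    static Map<String, Double> fit(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        List<String> ordered = new ArrayList<>(counts.keySet());
        // Sort by descending frequency; ties are broken alphabetically (assumption of this sketch)
        ordered.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        Map<String, Double> index = new LinkedHashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), (double) i);
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up class values as they might appear in column _c1
        List<String> labels = List.of("1", "0", "1", "1", "0");
        System.out.println(fit(labels)); // {1=0.0, 0=1.0}
    }
}
```

Note that the index is not the parsed number: the class “1” ends up with index 0.0 here simply because it occurs more often.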

 


As we can see by calling this code, all our columns are currently of type String, but there are Double values in the rows.

 

 

Cast data types

This neat loop updates all column data types to double:

 

 

Index the features

Now it is time to use our VectorAssembler. We give it the names of the columns that should be assembled into the features vector. Since column 0 is an ID and column 1 is a label, we start taking features from index 2.

 

Putting it all together

After defining all our estimators and transformers, it is time to pass them all as an ordered array of stages to the Pipeline. It will call every component of our Pipeline sequentially to generate the desired data format. Calling the transform function on the dataset gives us the processed data.

 

 

Verify that we transformed our data properly

Let’s create a simple classification tree to see if Spark ML can work with our data!

Here is the function, which you can just copy (make sure you have the same col names!):

 

 

If all went well, you should get an output similar to this:

Test Error = 0.3079487863430248
Accuracy: 0.6920512136569752
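
The two numbers add up to 1, because the test error is simply 1 minus the accuracy. A minimal plain-Java sketch of that relationship, with made-up predictions and labels and a class name invented for this example:

```java
final class EvaluationSketch {

    /** Fraction of predictions that match the labels exactly. */
    static double accuracy(double[] predictions, double[] labels) {
        if (predictions.length != labels.length || labels.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (predictions[i] == labels[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        // Made-up values: three of four predictions are right
        double[] predictions = {1.0, 0.0, 1.0, 1.0};
        double[] labels = {1.0, 0.0, 0.0, 1.0};
        double accuracy = accuracy(predictions, labels);
        System.out.println("Test Error = " + (1.0 - accuracy)); // Test Error = 0.25
        System.out.println("Accuracy: " + accuracy);            // Accuracy: 0.75
    }
}
```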

Congratulations! We have imported a dataset from the Internet and built a pipeline that transforms its columns into the format Spark ML expects!

Spark Encoders, what are they there for and why do we need them?

  • The main purpose of Encoders is serialization and deserialization (SerDe).
  • They are needed because Spark does not store data as JVM objects, but in its very own binary format.
  • Spark comes with a lot of built-in encoders.
  • An Encoder provides information about a table’s schema without having to deserialize the whole object.
  • Encoders are necessary when mapping datasets.

Spark Java Library overview

This is an overview of the full ecosystem, but we will just take a look at the most important classes for everyday usage.

Spark provides a huge library of useful classes and interfaces. It is important to know what functionality your tool offers, otherwise you cannot use it to its fullest potential.

 

That’s why I give you this great overview:

 

What is SparkSession for?

The SparkSession provides the main entry point for Dataset functionality in Spark.

It provides ways to

 

What is SparkContext for?

The SparkContext provides the main entry point for Spark functionality.

The context represents the connection to a Spark cluster, and you can use it to do operations in the cluster like

  • broadcast variables to the nodes in your cluster
  • create Resilient Distributed Datasets (RDDs)
  • create accumulators
  • distribute files to the nodes of your Spark cluster
  • distribute jars to the nodes of your Spark cluster
  • submit jobs to the cluster

 

What is the Spark Launcher for?

This class allows you to launch any application in your Spark cluster programmatically from Java! It will launch a new Spark application as a child process.

 

What is the InProcessLauncher for?

This allows you to launch a new Spark application within the invoking process.

 

What is SparkAppHandle for?

The SparkAppHandle object is returned after starting a Spark app with the SparkLauncher.

The handle gives you options to monitor and control the running application.

 

What is SparkAppHandle.Listener for?

 

What is SQLContext for?

SQLContext was the main entry point for working with structured data (DataFrames and SQL) in Spark 1.x. Since Spark 2.0 it has been superseded by SparkSession.

It wraps a SparkContext, which represents the connection to a Spark cluster and lets you do operations in the cluster like

  • broadcast Variables to a cluster
  • create Resilient Distributed Datasets
  • create accumulators

 

What are UserDefinedFunctions for?