Spark Feature Engineering Tutorial 3 – Forest Alpha Brainwave Data Transformation

Spark Java dataset transformation tutorial

In this tutorial we will learn how to transform the ALPHA train set of brainwave data provided by one of the machine learning labs of Technische Universität Berlin.


Downloading the data using Wget

I saved you guys some time: you do not need to open another browser tab! Wget the juicy data straight from the source.

Helper Script


Download the feature data


Download the label Data


Then convert the data with the helper script

Convert the data with the provided Python helper script



What does our data look like?

Label is a value of either -1 or 1, so it tells us whether the particle with the corresponding features is an instance of the particle we want to classify or not.

A 1 implies the features describe the particle; a -1 implies the features describe a different particle.

Features is a tuple where the first entry is the feature count (500) and the second is an array of all the feature values.

For our ML format, we want the label column to contain only doubles, no arrays.


What Spark objects will we be needing?
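As a sketch, these are the classes the following steps use; the Spark 2.x package layout is an assumption, so adjust the `ml` packages to your Spark version:

```java
import org.apache.spark.sql.SparkSession;   // entry point to Spark SQL
import org.apache.spark.sql.Dataset;        // our dataframes
import org.apache.spark.sql.Row;            // single records
import org.apache.spark.sql.RowFactory;     // builds new Rows in the loop
import org.apache.spark.sql.types.DataTypes;     // factory for types and fields
import org.apache.spark.sql.types.StructField;   // one column definition
import org.apache.spark.sql.types.StructType;    // the whole schema
import org.apache.spark.ml.linalg.VectorUDT;     // vector column type
import org.apache.spark.ml.linalg.Vectors;       // builds feature vectors
import org.apache.spark.ml.Pipeline;             // chains the stages
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;    // indexes the label
import org.apache.spark.ml.feature.VectorAssembler;  // merges feature columns
```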


How do we get our data in the proper format?


First we load the raw data into our dataframe
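A hedged sketch of the loading step. The libsvm input format and the temp-file setup are assumptions — point `load()` at wherever the helper script wrote the converted data; here we write a tiny made-up sample so the snippet runs on its own:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// local[*] runs Spark inside this JVM -- handy for tutorials
SparkSession spark = SparkSession.builder()
        .appName("AlphaBrainwaveTransform")
        .master("local[*]")
        .getOrCreate();

// Tiny made-up sample standing in for the converted ALPHA train file
Path sample = Files.createTempFile("alpha", ".libsvm");
Files.write(sample, Arrays.asList("1 1:0.1 2:0.2", "-1 1:0.3 2:0.4"));

// Load the raw data into our dataframe
Dataset<Row> raw = spark.read().format("libsvm").load(sample.toString());
```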



Then we define our label column and the struct fields.


StructField for the label
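A sketch of the label field, assuming the column is simply named `label`:

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;

// One double per row (-1.0 or 1.0); the column is never null
StructField labelField = DataTypes.createStructField("label", DataTypes.DoubleType, false);
```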


This will be our label column.

The StructFields will be used to tell Spark which schema our new Rows and dataframes should have; the fields list holds the definition for every column of our new dataframe.

The first entry of the fields list is for the label, the second one is for the features.


StructField for the features
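A sketch of the features field, with `features` as an assumed column name:

```java
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;

// VectorUDT stores an ML vector, so this column can hold a feature
// vector of any length (500 entries for the ALPHA data)
StructField featuresField = DataTypes.createStructField("features", new VectorUDT(), false);
```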


We will have a vector type for the second column, so it can encode a feature vector of any size for us.


Build the struct from the fields
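A sketch of the build step, using the two field definitions from above:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Collect both field definitions, then build the schema from them
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("label", DataTypes.DoubleType, false));
fields.add(DataTypes.createStructField("features", new VectorUDT(), false));
StructType schema = DataTypes.createStructType(fields);
```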


The schema is built by the createStructType(fields) method, which generates a StructType that represents all our defined fields and is very pleasant to work with.


Initialize our new DF
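A sketch of this step; the tiny `raw` dataframe here is a made-up stand-in for the data loaded earlier:

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().appName("init").master("local[*]").getOrCreate();

// Stand-in for the raw data loaded earlier (values are made up)
StructType rawSchema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("label", DataTypes.DoubleType, false)));
Dataset<Row> raw = spark.createDataFrame(
        Arrays.asList(RowFactory.create(1.0), RowFactory.create(-1.0)), rawSchema);

// Initialize the new dataframe with just the label column
Dataset<Row> newDf = raw.select("label");
```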


This writes the label column from the original data into our new dataset and also initializes it.


Convert the Dataset to a list for easier looping
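A sketch with a small stand-in dataset; in the tutorial you would call this on the raw data loaded above:

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("toList").master("local[*]").getOrCreate();
Dataset<Row> raw = spark.range(3).toDF("id");  // stand-in for the raw dataset

// collectAsList pulls every Row to the driver: fine for tutorial-sized
// data, but avoid it on datasets that exceed driver memory
List<Row> originalRows = raw.collectAsList();
```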



It is time to get loopy … and to extract relevant data from the original dataset


The Actual Loop


In this loop we create a new Row for every Row in the original Dataset.




Simply reusing the original rows will not satisfy our needs, as each row has only 2 columns, with the feature count and the feature array packed into a single column.
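The loop can be sketched like this, under the assumption that column 0 holds the label and column 1 the (count, values) tuple described above; the two stand-in rows are made up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Stand-in rows; the real data has 500-entry feature arrays
List<Row> originalRows = Arrays.asList(
        RowFactory.create(1.0, RowFactory.create(3, new double[] {0.1, 0.2, 0.3})),
        RowFactory.create(-1.0, RowFactory.create(3, new double[] {0.4, 0.5, 0.6})));

// Build one new Row per original Row
List<Row> newRows = new ArrayList<>();
for (Row row : originalRows) {
    double label = row.getDouble(0);             // column 0: the label
    Row tuple = row.getStruct(1);                // column 1: (count, values)
    double[] values = (double[]) tuple.get(1);   // drop the count, keep the values
    newRows.add(RowFactory.create(label, Vectors.dense(values)));
}
```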


Create the dataset after the loop with this line
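A sketch of that line; `schema` and `newRows` are rebuilt here (with one made-up row) so the snippet stands on its own:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().appName("build").master("local[*]").getOrCreate();

// schema and newRows as built in the previous steps
StructType schema = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("label", DataTypes.DoubleType, false),
        DataTypes.createStructField("features", new VectorUDT(), false)));
List<Row> newRows = Arrays.asList(RowFactory.create(1.0, Vectors.dense(0.1, 0.2, 0.3)));

// The line in question: build the ML-ready dataset from rows plus schema
Dataset<Row> mlData = spark.createDataFrame(newRows, schema);
```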



Defining the Pipeline


If you head over to the Spark documentation, you can see the toolkit Spark provides for creating a pipeline.


Most interesting for us today are the StringIndexer and the VectorAssembler.


How to use the StringIndexer
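A sketch of the indexer setup; the output column name `indexedLabel` is an assumption:

```java
import org.apache.spark.ml.feature.StringIndexer;

// Maps the raw label values (-1.0 / 1.0) to indices 0.0 / 1.0,
// with the most frequent value getting index 0
StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("indexedLabel");
```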


How to use the VectorAssembler

Get all the features; they are in all columns except 0 and 1.
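A sketch following that column layout; the column names are stand-ins. If your dataframe already holds a single vector column (as in the schema we built), you can skip the assembler — it is needed when each feature sits in its own column:

```java
import java.util.Arrays;
import org.apache.spark.ml.feature.VectorAssembler;

// Stand-in layout: label, feature count, then one column per feature
String[] allCols = {"label", "featureCount", "f1", "f2", "f3"};
// Skip column 0 (label) and column 1 (feature count)
String[] featureCols = Arrays.copyOfRange(allCols, 2, allCols.length);

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(featureCols)
        .setOutputCol("assembledFeatures");
```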


Chain the indexers into a Pipeline

Now, after defining the indexers and assemblers, we can stuff them into the Pipeline like this:
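A sketch of the chaining step, with the indexer and assembler redefined so the snippet stands on its own (column names are stand-ins):

```java
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("label").setOutputCol("indexedLabel");
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[] {"f1", "f2"}).setOutputCol("assembledFeatures");

// Stages run in order: first index the label, then assemble the features
PipelineStage[] stages = new PipelineStage[] { labelIndexer, assembler };
```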



Instantiate the pipeline

Create an instance of the pipeline with
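A sketch, with a single stage inlined to keep it short; in the tutorial you would pass the full stages array from the previous step:

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;

// stages as assembled above (one stage here to keep the sketch short)
PipelineStage[] stages = new PipelineStage[] {
        new StringIndexer().setInputCol("label").setOutputCol("indexedLabel") };

// One Pipeline object wraps all stages
Pipeline pipeline = new Pipeline().setStages(stages);
```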



Apply the Pipeline to a dataset, transforming it into Spark's ML format
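A self-contained sketch of the apply step; the two-feature dataset is made up, the real one carries 500 features per row:

```java
import java.util.Arrays;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

SparkSession spark = SparkSession.builder().appName("apply").master("local[*]").getOrCreate();

// Tiny made-up dataset standing in for the real one
Dataset<Row> mlData = spark.createDataFrame(
        Arrays.asList(RowFactory.create(1.0, 0.1, 0.2), RowFactory.create(-1.0, 0.3, 0.4)),
        DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("label", DataTypes.DoubleType, false),
                DataTypes.createStructField("f1", DataTypes.DoubleType, false),
                DataTypes.createStructField("f2", DataTypes.DoubleType, false))));

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {
        new StringIndexer().setInputCol("label").setOutputCol("indexedLabel"),
        new VectorAssembler().setInputCols(new String[] {"f1", "f2"})
                .setOutputCol("assembledFeatures") });

// fit() learns the indexer mapping, transform() applies every stage
PipelineModel model = pipeline.fit(mlData);
Dataset<Row> transformed = model.transform(mlData);
```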



Validate The Dataset Schema

To see if we have done everything correctly, print out the schema.
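The call itself is one line; here with a small stand-in dataframe so the snippet runs on its own — in the tutorial you would call it on the transformed dataset:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("schema").master("local[*]").getOrCreate();
Dataset<Row> transformed = spark.range(1).toDF("id");  // stand-in for the transformed data

// printSchema writes the column tree to stdout
transformed.printSchema();
```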



It should look like this:
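Roughly like the following sketch, assuming the column names used in this tutorial; nullability flags and column order depend on your exact setup:

```
root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- indexedLabel: double (nullable = false)
```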

