Spark Feature Engineering Tutorial 3 – Forest Alpha Brainwave Data transformation

Spark Java dataset transformation tutorial

In this tutorial we will learn how to transform the ALPHA training set of brainwave data provided by one of the machine learning labs of Technische Universität Berlin.

 

Downloading the data using Wget

I saved you guys some time: you do not need to open another browser tab! Just wget the juicy data straight from the source.

Helper Script

 

Download the feature data

 

Download the label data

 

Then convert the data with the helper script

Convert the data with the provided Python helper script

 

 

What does our data look like?

Label is a value of either -1 or 1, telling us whether the particle with the corresponding features is an instance of the particle we want to classify or not.

A 1 would imply the features describe the particle; a -1 would imply the features describe a different particle.

Features is a tuple, where the first entry is the feature count (500) and the second is an array of all the feature values.

For our ML format, we want all of the columns to contain only doubles, no arrays.

 

What Spark objects will we be needing?
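Roughly, we will need a SparkSession, the Dataset/Row API, the schema classes (StructType, StructField, DataTypes), the ml vector helpers, and the feature transformers together with the Pipeline itself. A minimal setup sketch (the app name and master are placeholders):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.ml.Pipeline;
    import org.apache.spark.ml.PipelineModel;
    import org.apache.spark.ml.PipelineStage;
    import org.apache.spark.ml.feature.StringIndexer;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.ml.linalg.SQLDataTypes;
    import org.apache.spark.ml.linalg.Vector;
    import org.apache.spark.ml.linalg.Vectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // one SparkSession for the whole tutorial
    SparkSession spark = SparkSession.builder()
            .appName("AlphaTransformation")   // placeholder name
            .master("local[*]")               // placeholder master
            .getOrCreate();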

 

How do we get our data into the proper format?

 

First, we load the raw data into our DataFrame.
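A sketch of the loading step, assuming the helper script produced a libsvm-style text file (the path is a placeholder):

    // read the converted file; the ALPHA set has 500 features per sample
    Dataset<Row> raw = spark.read()
            .format("libsvm")
            .option("numFeatures", "500")
            .load("data/alpha_train.libsvm");   // placeholder path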

 

 

Then we define our label column and the StructField for it.

 

StructField for the label
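A sketch of the label field, assuming we call the column "label" and store it as a plain double:

    // first column of the new schema: the label as a double
    StructField labelField = DataTypes.createStructField("label", DataTypes.DoubleType, false);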

 

This will be our label column.

The StructField will be used to tell Spark which schema our new Rows and DataFrames should have. The fields list will hold the definition of every column of our new DataFrame.

The first entry of the fields list is for the label, the second one is for the features.

 

StructField for the features
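A sketch of the feature field, using Spark's ml vector type (exposed through SQLDataTypes) and the placeholder column name "features":

    // second column of the new schema: a vector that holds all feature values
    StructField featuresField = DataTypes.createStructField("features", SQLDataTypes.VectorType(), false);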

 

We will have a vector type for the second column, so it can encode a feature vector of any size for us.

 

Build the struct from the fields
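A sketch of putting the two fields together, reusing the variables from above:

    // the order of the fields is the order of the columns in the new DataFrame
    List<StructField> fields = Arrays.asList(labelField, featuresField);
    StructType schema = DataTypes.createStructType(fields);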

 

The schema is built by the createStructType(fields) method, which generates a nice StructType that represents all our defined fields and is very pleasant to work with.

 

Initialize our new DF
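One way this initialization might look, simply selecting the label column out of the raw data (a sketch, reusing the raw Dataset from the loading step):

    // start the new Dataset off with just the label column
    Dataset<Row> labelOnly = raw.select("label");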

 

This writes the label column from the original data into our new Dataset and also initializes it.

 

Convert the Dataset to a list for easier looping
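A sketch using collectAsList():

    // pull all rows to the driver so we can loop over them with plain Java
    List<Row> rowList = raw.collectAsList();

Collecting everything to the driver keeps the loop simple, but it only works as long as the data still fits into driver memory.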

 

 

It is time to get loopy … and to extract relevant data from the original dataset

 

The Actual Loop

 

    //in this loop we create a new row for every Row in the original Dataset
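    // Sketch only: this assumes column 0 holds the label and column 1 the packed
    // feature data, readable as an ml Vector. Adjust the accessors to whatever the
    // converted file actually contains.
    List<Row> newRows = new ArrayList<>();
    for (Row row : rowList) {
        double label = ((Number) row.get(0)).doubleValue();   // the -1 / 1 label
        double[] values = ((Vector) row.get(1)).toArray();    // unpack the feature values
        newRows.add(RowFactory.create(label, Vectors.dense(values)));
    }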

 

Simply copying each row over as-is will not satisfy our needs, as our new row would only have 2 columns, with the feature count and the feature array packed into one column.

 

Create the Dataset after the loop with this line:
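A sketch of that line, using the row list and the schema from above:

    // build the new DataFrame from the collected rows and our hand-made schema
    Dataset<Row> newData = spark.createDataFrame(newRows, schema);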

 

 

Defining the Pipeline

 

If you head over to https://spark.apache.org/docs/latest/ml-pipeline.html you can see the toolkit Spark provides for creating a pipeline.

 

Most interesting for us today are the StringIndexer and the VectorAssembler.

 

How to use the String Indexer
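The StringIndexer encodes the values of a column into numeric indices. A minimal sketch that indexes the label column (the output column name is a placeholder):

    // turn the -1 / 1 labels into 0-based indices
    StringIndexer labelIndexer = new StringIndexer()
            .setInputCol("label")
            .setOutputCol("indexedLabel");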

 

How to use the Vector Assembler
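The VectorAssembler packs a list of numeric columns into a single vector column. A minimal sketch, assuming a dataset (called wideData here, a placeholder) in which the first two columns hold label information and every remaining column holds one feature value: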

      //Get all the features, they are in all cols except 0 and 1
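      String[] featureCols = Arrays.stream(wideData.columns())
              .skip(2)                        // skip columns 0 and 1
              .toArray(String[]::new);

      VectorAssembler assembler = new VectorAssembler()
              .setInputCols(featureCols)
              .setOutputCol("assembledFeatures");   // placeholder output name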

 

Chain the indexers into a Pipeline

Now, after defining the indexers and assemblers, we can stuff them into the Pipeline like this:
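A sketch, reusing the labelIndexer and the assembler defined above:

    // the stages run in the order they appear in the array
    PipelineStage[] stages = new PipelineStage[]{labelIndexer, assembler};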

 

 

Instantiate the pipeline

Instantiate the Pipeline with:
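For example (a sketch, using the stages array from above):

    Pipeline pipeline = new Pipeline().setStages(stages);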

 

 

Apply the Pipeline to a dataset, transforming it into Spark's ML format
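A sketch of fitting the pipeline and applying the resulting model, here to the newData set built after the loop:

    // fit() produces a PipelineModel, transform() applies it to the data
    PipelineModel model = pipeline.fit(newData);
    Dataset<Row> transformed = model.transform(newData);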

 

 

Validate The Dataset Schema

To see if we have done everything correctly, print out the schema.
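For example, with the transformed Dataset from the previous step:

    transformed.printSchema();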

 

 

It should look like this:

 
