Spark Feature Engineering Tutorial 4 – RCV1 Newswire stories categorized

Spark Feature Engineering Tutorial – 4 – transforming RCV1 dataset

What is the data?

The dataset was published by the Journal of Machine Learning Research in 2004 as a new benchmark for text categorization research. You can read more about the journal that released the dataset here.

Where to get the data?
is the data Provider



What is the data about?

It contains newswire stories and the categories assigned to them.


Let’s load the data

We can load the data into Spark with this command:
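A minimal PySpark sketch of the load, assuming the RCV1 file was downloaded in LIBSVM text format and that a `SparkSession` already exists. The function name `load_rcv1` and the path in the usage comment are hypothetical.

```python
def load_rcv1(spark, path):
    """Read a LIBSVM-formatted file into a Spark DataFrame.

    Assumption: the RCV1 download is in LIBSVM text format.
    `spark` is an existing SparkSession.
    """
    # The built-in "libsvm" data source yields two columns:
    # `label` (double) and `features` (sparse vector).
    return spark.read.format("libsvm").load(path)


# Usage, with a hypothetical local path:
# df = load_rcv1(spark, "data/rcv1_train.binary")
```

If your copy of the data is in another format, swap the `format(...)` call for the matching reader.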


Let’s check out the data

We should check out the data schema and a few rows. This is how you can do it:
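A small sketch of that inspection step, assuming `df` is the DataFrame loaded earlier. The helper name `preview` is hypothetical; `printSchema` and `show` are the standard DataFrame methods.

```python
def preview(df, n=5):
    """Print the schema and the first `n` rows of a Spark DataFrame."""
    df.printSchema()            # column names and types
    df.show(n, truncate=True)   # first n rows, long cells shortened


# Usage:
# preview(df)
```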



Your console output should look like this

Very nice, the data is already in a nice format.

Using the String Indexer

We will use the StringIndexer to index the classes we have.


This defines the first column as the label column.
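To make the indexing concrete, here is a plain-Python illustration of what `pyspark.ml.feature.StringIndexer` computes with its default `frequencyDesc` ordering: the most frequent label gets index 0.0, the next one 1.0, and so on. This is not Spark code, the helper `index_labels` is hypothetical, and the labels are made up for the example.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to an index, most frequent label first.

    Plain-Python illustration of StringIndexer's default
    `frequencyDesc` ordering; this is not Spark code.
    """
    # StringIndexer casts numeric labels to strings before indexing.
    counts = Counter(str(label) for label in labels)
    # Sort by descending count; ties fall back to alphabetical order.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up labels for illustration, not real RCV1 data.
mapping = index_labels([1.0, -1.0, 1.0, 1.0, -1.0])
print(mapping)  # {'1.0': 0.0, '-1.0': 1.0}
```

In the pipeline itself, Spark learns this mapping when the indexer is fitted and writes the index into a new output column.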

Using the Vector Assembler


This defines the second column as the features column.

Build the pipeline


This tells Spark the order in which to apply the transformers.

Instantiate the pipeline


This fits the pipeline on the original data and returns a model.

Get the transformed dataset


This applies the transformation to the dataset and returns the transformed DataFrame.
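Put together, the indexer, assembler, pipeline, fit and transform steps could look like the sketch below. It assumes the input DataFrame has a `label` and a `features` column. The output column names `indexedLabel` and `assembledFeatures` and the function name are hypothetical.

```python
def build_and_apply_pipeline(df):
    """Index the label, assemble the features, and transform `df`.

    Assumptions: `df` has a `label` and a `features` column; the output
    column names `indexedLabel` and `assembledFeatures` are hypothetical.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
    assembler = VectorAssembler(inputCols=["features"],
                                outputCol="assembledFeatures")

    # Stages run in list order: indexer first, then assembler.
    pipeline = Pipeline(stages=[indexer, assembler])

    model = pipeline.fit(df)      # estimator stages learn from the data
    return model.transform(df)    # original columns plus the two new ones


# Usage:
# transformed = build_and_apply_pipeline(df)
```

Only the indexer has something to learn during `fit`; the assembler is a plain transformer, so the order matters mainly for readability here.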

Let’s check out our transformed data

This looks pretty good, but we do not need the original label and features columns anymore.

Drop useless columns
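A minimal sketch of the drop, assuming the original input columns are named `label` and `features`. The helper name is hypothetical; `DataFrame.drop` accepts several column names at once.

```python
def drop_original_columns(df, cols=("label", "features")):
    """Return `df` without the original input columns.

    Assumption: the originals are named `label` and `features`.
    """
    # DataFrame.drop returns a new DataFrame; `df` itself is unchanged.
    return df.drop(*cols)


# Usage:
# cleaned = drop_original_columns(transformed)
```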


The cleaned dataset

This is now our schema, perfect!

We are ready to do some machine learning on this.

Let’s test whether we transformed our data properly by applying a linear classifier to it!

Define the linear classifier


Call this function with your dataset and you should get no errors if you followed every step in this tutorial!
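One way such a function could look, using logistic regression as a stand-in for the linear classifier, which is an assumption on my part. The column names `indexedLabel` and `assembledFeatures` and the function name are hypothetical, so adjust them to whatever your transformed DataFrame contains.

```python
def fit_linear_classifier(df, label_col="indexedLabel",
                          features_col="assembledFeatures"):
    """Fit a logistic regression as a smoke test for the transformed data.

    Assumption: logistic regression stands in for the linear classifier;
    the default column names are hypothetical.
    """
    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(labelCol=label_col, featuresCol=features_col,
                            maxIter=10)
    # fit() raises if the columns are missing or have the wrong type.
    return lr.fit(df)


# Usage:
# lr_model = fit_linear_classifier(cleaned)
```

A small `maxIter` keeps the check quick; the goal here is only to confirm that the label and feature columns have the shape Spark ML expects.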

Here is the full code:

