Getting to know the Data
Today we are going to check out the forest covertype dataset, which contains information about which tree type is predominant in a forest area.
Get the data: http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info
Let’s imagine you want to buy a big piece of forest land, but you have no idea about the cover type of that area, since nobody had the time to count the occurrence of each tree in that forest. One approach would be to predict the forest cover type with a trained neural network!
When we check out the data in Spark, we see that there are 55 columns. It should look like this:
There are 581,012 different data points or observations in the dataset
There are 10 quantitative variables
There are 4 binary wilderness area variables
There are 40 binary soil type variables
One of 7 forest cover types, aka the label we want to predict
In our data we find the labels in the last column called “_c54”
What Spark objects will we need?
Get your documentation out, it’s time to program: https://spark.apache.org/docs/latest/api/java/index.html
We will need the docs for Pipelines, Vectors, StringIndexer, VectorIndexer, Estimators, Transformers, and VectorAssembler
What is the vector indexer for?
The vector indexer enables us to detect whether the features of our data are categorical or continuous. We achieve this by passing a parameter N to setMaxCategories().
When the vector indexer is called during pipeline execution, it checks how many distinct values each feature has. If a feature has more than N distinct values, it is declared continuous. If a feature has N or fewer distinct values, it is declared categorical.
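To make that concrete, here is a minimal sketch of how a VectorIndexer could be set up; the column names and the threshold of 40 are assumptions for illustration, not values prescribed by the dataset:

import org.apache.spark.ml.feature.VectorIndexer;

// Sketch: features with at most 40 distinct values are treated as categorical,
// everything above that threshold is treated as continuous.
VectorIndexer featureIndexer = new VectorIndexer()
        .setInputCol("features")          // assumed name of an assembled feature vector column
        .setOutputCol("indexedFeatures")
        .setMaxCategories(40);            // this is N; pick a value that fits your data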
A pipeline consists of a sequence of stages, where each stage is either an Estimator or a Transformer, and the whole sequence is executed by calling Pipeline.fit(). On each Estimator, fit() is called to produce a Transformer, which then transforms the data as it passes through the pipeline.
Create a Spark Session
SparkSession spark = SparkSession.builder()
        .appName("ML_Formatting_spark")
        .master("local")
        .config("spark.driver.maxResultSize", "4g")
        .config("spark.driver.cores", 1)
        .config("spark.driver.memory", "4g")
        .config("spark.executor.cores", 1)
        .config("spark.executor.memory", "4g")
        .getOrCreate();
Loading the data into Spark
Dataset<Row> data = spark.read()
        .format("csv")
        .load("/path/to/data/covtype.data");
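To double-check the load, you can peek at the schema and the row count; this is just a quick sanity check, not part of the final pipeline:

// Expect 55 columns (_c0 .. _c54) and 581,012 rows
data.printSchema();
System.out.println("Rows: " + data.count());
data.show(5);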
Cast the columns to double
Since the columns are natively interpreted as Strings, we have to cast them:
// Cast every column from String to Double
for (String c : data.columns()) {
    data = data.withColumn(c, data.col(c).cast("Double"));
}
Get the column names
String[] fieldNames = data.schema().fieldNames();
Create the feature vector
What does a neural network like to eat the most? That’s right, feature vectors! Time to cook up some crispy feature vectors for our ML algorithms!
Since _c54 is the label, we will tell our VectorAssembler to use all fields except the last one as input.
fieldNames[fieldNames.length - 1] is the label column. We want to use the columns from _c0 to _c53 as features. Because Arrays.copyOfRange() treats its end index as exclusive, we pass fieldNames.length - 1 as the upper bound. In code it looks like this:
// Use _c0 .. _c53 as inputs; the end index of copyOfRange is exclusive, so the label _c54 is left out.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(Arrays.copyOfRange(fieldNames, 0, fieldNames.length - 1))
        .setOutputCol(GlobalDefinition.FEATURE_COL);
Build the pipeline
Our previously defined assembler and a label indexer now go into a pipeline, which then executes them sequentially on the data.
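The labelIndexer used in the snippet below is not defined anywhere in the snippets above; here is a minimal sketch of what it could look like, assuming a StringIndexer that maps the label column _c54 to an indexedLabel column (the output column name is an assumption chosen to match the sample classifier further down):

import org.apache.spark.ml.feature.StringIndexer;

// Assumed definition: index the label column so that classifiers can consume it
StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("_c54")
        .setOutputCol("indexedLabel");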
// Chain the label indexer and the assembler in a Pipeline.
Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{labelIndexer, assembler});
// Train the model. This also runs the indexer.
PipelineModel model = pipeline.fit(data);
Transform our Data into ML format!
Dataset<Row> transformed_ML_data = model.transform(data);
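To see what the ML format actually looks like, you can print a few rows of the label and feature columns (the indexedLabel column assumes the labelIndexer sketched above; GlobalDefinition.FEATURE_COL is the assembler's output column):

// Show the original label, the indexed label and the assembled feature vector
transformed_ML_data
        .select("_c54", "indexedLabel", GlobalDefinition.FEATURE_COL)
        .show(5, false);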
Test if it works
Now we can test our data with a sample classifier. Add this function to your code and give it your transformed dataset!
private void applyClassifier(Dataset<Row> dataset) {
    // Split the data into training and test sets (30% held out for testing).
    Dataset<Row>[] splits = dataset.randomSplit(new double[] { 0.7, 0.3 });
    Dataset<Row> trainingData = splits[0];
    Dataset<Row> testData = splits[1];

    // Train a DecisionTree model on the indexed label and feature columns.
    DecisionTreeClassifier dt = new DecisionTreeClassifier()
            .setLabelCol("indexedLabel")
            .setFeaturesCol("indexedFeatures");

    // Put the tree into a Pipeline.
    Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { dt });

    System.out.println("Start Training");

    // Train the model.
    PipelineModel model = pipeline.fit(trainingData);

    // Select (prediction, true label) and compute the test error.
    MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
            .setLabelCol("indexedLabel")
            .setPredictionCol("prediction")
            .setMetricName("accuracy");

    // Make predictions on the held-out test data.
    Dataset<Row> predictions = model.transform(testData);

    double accuracy = evaluator.evaluate(predictions);
    System.out.println("Test Error = " + (1.0 - accuracy));
    System.out.println("Accuracy: " + accuracy);
}
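Calling it is then a one-liner. One caveat: the classifier reads the indexedLabel and indexedFeatures columns, so make sure your pipeline actually produces columns with those names, for example via the StringIndexer sketched above and a VectorIndexer stage, or adjust setLabelCol/setFeaturesCol to the column names your pipeline emits:

// Run the sample classifier on the ML-formatted data
applyClassifier(transformed_ML_data);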
Enjoy and happy coding!