Category Archives: Java

IntelliJ plugin for Microsoft Azure tutorial: deploying a web app

In this tutorial we will check out how to get the Microsoft Azure plugin and how to use it.

First of all, start your IDE, press Shift twice in quick succession, and enter “plugin” to get quickly to the plugin install menu.

Then just type Azure and install the first plugin suggested, which is developed by Microsoft.

After installing the plugin and creating an account on the Azure website, you can log in to your account through IntelliJ.

Select the Tools menu in the top toolbar and log in to Azure using interactive mode; just type in the credentials you used when creating your Azure account.

Preparing the resource groups

I was following this great tutorial from Microsoft, but I, and probably a lot of other people, encountered an error when trying to launch a web app right after creating a new Azure account.

Before you can launch anything in Azure, you need a resource group. Even though the tutorial from Microsoft does not state it explicitly, you should really create a resource group before attempting this.

Here is how you create a resource group:

Log in to your Azure account on the Microsoft website and head to “My Account”

Next select “Create a resource”

And then select “Web App”

Enter a name for the app and the resource group, then click Create New

After creating the group, we are finally ready to deploy our app with IntelliJ!

Start a new project, select a web app in Maven, and make sure you are creating the project from an archetype!

Then just go to the root folder of your project and right-click it in IntelliJ. You should now see the Azure options, which let you deploy your web app to the cloud!

If you have not logged in to Azure yet, do it now.

Then you have the option to use an existing Web App or a new one. We want a new one, but we will use an existing resource group! For some reason, creating a resource group with the IntelliJ plugin seems to result in exceptions. The only way to avoid those so far is to create the group manually in Azure and then use that one for further deployment.

After hitting run and waiting a few seconds, your console should update with a URL to your freshly deployed web app.

Thanks for reading and have fun in the cloud!

Java Spark Tips, Tricks and Basics 7 – How to accumulate a variable in Spark cluster? Why do we need to accumulate variables?

Why do we need Spark accumulators?

An accumulator is a variable shared across all the nodes, and it is used to accumulate values of a numeric type (Long or Double).

An accumulator is needed to implement a distributed counter that can be updated by multiple processes.

Nodes may not read the value of an accumulator, but the driver has full access to it.

Nodes can only accumulate values into the accumulator.
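The add-only rule can be illustrated without a cluster. The following is a minimal sketch in plain Java, not Spark's API: a `LongAdder` stands in for a long accumulator, a parallel stream stands in for the worker tasks, and the class and method names are made up for this illustration. In Spark itself the corresponding classes are `LongAccumulator` and `DoubleAccumulator`, which extend `AccumulatorV2`.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

// Plain-Java analogy (NOT Spark's API): LongAdder plays the accumulator.
final class AccumulatorAnalogy {

    // Counts the even numbers in [1, upTo]. The parallel stream plays the
    // role of the worker tasks: they only ever add to the counter.
    static long countEvens(int upTo) {
        LongAdder evenCount = new LongAdder();
        IntStream.rangeClosed(1, upTo)
                 .parallel()
                 .forEach(n -> {
                     if (n % 2 == 0) {
                         evenCount.increment(); // "worker" side: add only, never read
                     }
                 });
        // "Driver" side: read the total once all the work has finished.
        return evenCount.sum();
    }

    public static void main(String[] args) {
        System.out.println(countEvens(1_000));
    }
}
```

The workers never look at the running total; only the code that started the work reads it at the end, which mirrors the driver reading an accumulator's value after an action has completed.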

You will find the functionality for this in Spark's accumulator classes. Keep in mind that we are using AccumulatorV2; the older Accumulator API has been deprecated since Spark 2.0.

 

Don’t forget to register your accumulator with the SparkContext if you create it separately.

 

What did we learn?

In this short tutorial, you learned what Spark accumulators are for, what they do, and how to use them in Java.

Java Spark Tips, Tricks and Basics 6 – How to broadcast a variable to Spark cluster? Why do we need to broadcast variables?

Why do we need Spark broadcast variables?

Spark is all about cluster computing. In a cluster of nodes, each node of course has its own private memory.

If we want all the nodes in the cluster to work towards a common goal, having shared variables seems necessary.

Let’s say we want to sum up all the rows in a CSV table with 1 million lines. It makes sense to let one node work on half a million rows and another node work on the other half million. Both calculate their results, and then the driver program combines them.
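The split-and-combine idea from this example can be sketched in plain Java, without Spark. This is only an illustration under stated assumptions: two threads stand in for the two nodes, each row's value is assumed to be its row number, and all names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.stream.LongStream;

// Plain-Java illustration (NOT Spark code) of splitting work and combining results.
final class SplitSumSketch {

    // Sums the values lo..hi (inclusive); stands in for one node's share of the rows.
    static long partialSum(long lo, long hi) {
        return LongStream.rangeClosed(lo, hi).sum();
    }

    // Splits rows 1..rowCount into two halves, sums each half on its own thread,
    // then adds the two partial results together, as the driver would.
    static long splitAndCombine(long rowCount) {
        long mid = rowCount / 2;
        CompletableFuture<Long> firstHalf =
                CompletableFuture.supplyAsync(() -> partialSum(1, mid));
        CompletableFuture<Long> secondHalf =
                CompletableFuture.supplyAsync(() -> partialSum(mid + 1, rowCount));
        return firstHalf.join() + secondHalf.join();
    }

    public static void main(String[] args) {
        System.out.println(splitAndCombine(1_000_000L));
    }
}
```

Neither half needs to know anything about the other; only the final addition happens in one place.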

Broadcasting allows us to create a read-only cached copy of a variable on every node in our cluster. The distribution of those variables is handled by efficient broadcast algorithms implemented by Spark under the hood. This also takes away the burden of thinking about serialization and deserialization, since good old Spark takes care of that!

This great functionality for broadcasting is provided by the SparkContext class. Alternatively, you can consider working with the Broadcast class directly.

How to broadcast a variable in Spark Java

What did we learn?

In this short tutorial, you learned what Spark Broadcast is for, what Broadcast does, and how to use it in Java.

Java Spark Tips, Tricks and Basics 3 – How to select columns for nested Datasets / Dataframes in Spark Java

How to select columns from a nested Dataset/Dataframe in Spark Java

 

Let’s assume we have nested data that looks like this:

Let’s say we have the data stored and we load it into a dataframe first:

 

 

 

We can now get a dataframe containing only one of the nested columns with the following command:

 

 

And so on. So you just have to use “.” as a separator to select any nested column.

 

Java Spark Tips, Tricks and Basics 2 – How to add columns to Datasets / Dataframes in Spark Java

This tutorial will show you how to add a new column to an already existing dataset/dataframe.

 

First we create a dataset.

 

 

Then we add a column with lit():

 

and we are done!

Java Spark Tips, Tricks and Basics 1 – How to read images as Datasets / Dataframes from Hadoop in Spark Java

This tutorial will show you how to read a folder of images from a Hadoop folder.

Just use the following command and update the path to your image folder in the Hadoop HDFS.

We will be using ImageSchema and its readImages function.

 

That’s it already!

 

Spark Feature Engineering Tutorial 2 – Forest Covertype Data transformation

Getting to know the Data

Today we are going to check out the forest covertype data, which contains information about which tree type is predominant in a forest area.

Get the data : http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info

Let’s imagine you want to buy a big piece of forest land, but you have no idea about the cover type of that area, since nobody had the time to count the occurrences of each tree type in that forest. One approach would be to predict the forest cover type with a trained neural network!

When we check out the data in Spark, we see there are 55 columns. It should look like this:

There are 581,012 data points, or observations, in the dataset
There are 10 quantitative variables
There are 4 binary wilderness area variables
There are 40 binary soil type variables
The last column holds one of 7 forest cover types, i.e. the label we want to predict
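The column counts above add up to the 55 columns Spark reports (10 + 4 + 40 + 1). As a small sketch, assuming the columns appear in the file in the same order as in the list above, a zero-based column index (the N in Spark's default name `_cN`) can be mapped to its group like this. The class and method names are hypothetical helpers for illustration only.

```java
// Illustrative helper, assuming the column order matches the list above.
final class CovtypeColumns {

    static final int QUANTITATIVE = 10; // columns _c0 .. _c9
    static final int WILDERNESS = 4;    // columns _c10 .. _c13
    static final int SOIL = 40;         // columns _c14 .. _c53
    static final int LABEL = 1;         // column  _c54

    static int totalColumns() {
        return QUANTITATIVE + WILDERNESS + SOIL + LABEL;
    }

    // Returns the group a zero-based column index belongs to.
    static String groupOf(int index) {
        if (index < 0 || index >= totalColumns()) {
            throw new IllegalArgumentException("No such column: " + index);
        }
        if (index < QUANTITATIVE) {
            return "quantitative";
        }
        if (index < QUANTITATIVE + WILDERNESS) {
            return "wilderness";
        }
        if (index < QUANTITATIVE + WILDERNESS + SOIL) {
            return "soil";
        }
        return "label";
    }

    public static void main(String[] args) {
        System.out.println(totalColumns() + " columns, _c54 is the " + groupOf(54));
    }
}
```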

In our data we find the labels in the last column called “_c54”

What Spark objects will we need?

https://spark.apache.org/docs/latest/api/java/index.html Get your documentation out, it’s time to program!

We will need the docs for Pipelines, Vectors, StringIndexer, VectorIndexer, Estimators, Transformers, and VectorAssembler

What is the vector indexer for?

The vector indexer enables us to detect whether the features of our data are categorical or continuous. We achieve this by passing a parameter N to setMaxCategories().

When the vector indexer is called during the pipeline execution process, it checks, for each feature, whether there are more than N distinct values. If a feature has more than N distinct values, it is declared continuous. If a feature has N or fewer distinct values, it is declared categorical.
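The decision rule described above can be written down in a few lines of plain Java. This is a sketch of the rule only, not Spark's VectorIndexer implementation; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Plain-Java sketch of the "at most N distinct values => categorical" rule.
final class MaxCategoriesRule {

    // Returns true if the feature has N (maxCategories) or fewer distinct values,
    // false as soon as more than N distinct values have been seen.
    static boolean isCategorical(double[] featureValues, int maxCategories) {
        Set<Double> distinct = new HashSet<>();
        for (double value : featureValues) {
            distinct.add(value);
            if (distinct.size() > maxCategories) {
                return false; // too many distinct values: treat as continuous
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] binaryFlag = {0, 1, 1, 0, 1};
        double[] measurement = {2596, 2590, 2804};
        System.out.println(isCategorical(binaryFlag, 2));
        System.out.println(isCategorical(measurement, 2));
    }
}
```

With N = 2, a binary 0/1 column counts as categorical, while a measurement column with three or more distinct values counts as continuous.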

A pipeline consists of a sequence of stages, in which each stage is either an estimator or a transformer. Calling Pipeline.fit() runs the stages in order: on each estimator, the fit() method is called to generate a transformer, which then transforms the data before it is passed on to the next stage.
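To make the estimator/transformer relationship concrete, here is a minimal plain-Java sketch of the pattern. It does not use Spark's classes: the interfaces, the two toy stages, and the `double[]` dataset are all made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the pipeline pattern (hypothetical types, NOT Spark's API).
final class MiniPipeline {

    interface Stage {}
    interface Transformer extends Stage { double[] transform(double[] data); }
    interface Estimator extends Stage { Transformer fit(double[] data); }

    // Estimator: learns the mean, then yields a transformer that subtracts it.
    static final Estimator MEAN_CENTERER = data -> {
        double sum = 0;
        for (double v : data) sum += v;
        final double mean = data.length == 0 ? 0 : sum / data.length;
        return d -> Arrays.stream(d).map(v -> v - mean).toArray();
    };

    // Plain transformer: nothing to learn, it just doubles every value.
    static final Transformer DOUBLER = d -> Arrays.stream(d).map(v -> v * 2).toArray();

    // Walks the stages in order; estimators are fitted on the data as
    // transformed by all earlier stages.
    static List<Transformer> fit(List<Stage> stages, double[] data) {
        List<Transformer> fitted = new ArrayList<>();
        double[] current = data;
        for (Stage stage : stages) {
            Transformer t = (stage instanceof Estimator)
                    ? ((Estimator) stage).fit(current)
                    : (Transformer) stage;
            current = t.transform(current);
            fitted.add(t);
        }
        return fitted;
    }

    // Applies the fitted transformers one after another.
    static double[] transform(List<Transformer> fitted, double[] data) {
        double[] current = data;
        for (Transformer t : fitted) current = t.transform(current);
        return current;
    }

    // Fits [center, double] on {1, 2, 3} and transforms the same data.
    static double[] demo() {
        double[] data = {1, 2, 3};
        List<Stage> stages = Arrays.asList(MEAN_CENTERER, DOUBLER);
        return transform(fit(stages, data), data);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

The point of the pattern is that fitting produces a list of plain transformers, so the same learned steps can later be applied to new data.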

Create a Spark Session

 

 

Loading the data into Spark

 

Cast the columns to double

Since the columns are natively interpreted as Strings, we have to cast them

 

Get the column names

 

Create the feature vector

What does a neural network like to eat the most? That’s right, feature vectors! Time to cook up some crispy feature vectors for our ML algorithms!
Since _c54 is the label, we will tell our VectorAssembler to use all fields except the last one as input.
fieldNames[fieldNames.length-1]
This is the label column. We want to use the columns from _c0 to _c53 as features, so the last feature sits at index fieldNames.length-2. That is why we have -2 in the solution. In code it looks like this:
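The index arithmetic can be checked in isolation with plain Java. This is a sketch under assumptions: the column names follow Spark's default `_c0` to `_c54` pattern, and the helper names are hypothetical. The resulting array of feature names is the kind of value that could be handed to VectorAssembler's setInputCols().

```java
import java.util.Arrays;

// Plain-Java sketch of splitting column names into features and label.
final class FeatureColumns {

    // Builds default-style column names _c0 .. _c(count-1).
    static String[] defaultNames(int count) {
        String[] names = new String[count];
        for (int i = 0; i < count; i++) {
            names[i] = "_c" + i;
        }
        return names;
    }

    // The label is the last column.
    static String labelColumn(String[] fieldNames) {
        return fieldNames[fieldNames.length - 1];
    }

    // The features are all columns except the last one,
    // i.e. indices 0 .. fieldNames.length-2.
    static String[] featureColumns(String[] fieldNames) {
        return Arrays.copyOfRange(fieldNames, 0, fieldNames.length - 1);
    }

    public static void main(String[] args) {
        String[] fieldNames = defaultNames(55);
        String[] features = featureColumns(fieldNames);
        System.out.println("label: " + labelColumn(fieldNames));
        System.out.println("features: " + features[0] + " .. " + features[features.length - 1]);
    }
}
```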

 

Build the pipeline

Our previously defined transformers and assemblers now all go into a pipeline, which executes them sequentially on the data.

 

 

Transform our Data into ML format!

 

Test if it works

Now we can test our data with a sample classifier. Add this function to your code and give it your transformed dataset!

 

Enjoy and happy coding!