Category Archives: Distributed Systems

Kubernetes Helm Install Error: could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Error: could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

If you ever run into this error message when installing a chart with Helm into Kubernetes, try closing your Kubernetes connection and opening it again!
In Azure AKS that would be:
az aks browse --resource-group bachelor-ckl --name aks-ckl

This resolves the Helm install error quoted above.

Happy DevOps-ing!

Creating Cronjobs

What is a cronjob or a crontab file?

Crontab (cron table) is a text file that specifies the schedule of cron jobs. There are two types of crontab files. The system-wide crontab files and individual user crontab files.

User crontab files are stored under the user’s name and their location varies by operating system. On Red Hat based systems such as CentOS, crontab files are stored in the /var/spool/cron directory, while on Debian and Ubuntu they are stored in the /var/spool/cron/crontabs directory.

Although you can edit the user crontab files manually, it is recommended to use the crontab command.

/etc/crontab and the files inside the /etc/cron.d directory are system-wide crontab files which can be edited only by the system administrators.
In most Linux distributions you can also put scripts inside the /etc/cron.{hourly,daily,weekly,monthly} directories, and the scripts will be executed every hour/day/week/month.

Linux Crontab Command

The crontab command allows you to install or open a crontab file for editing. You can use the crontab command to view, add, remove or modify cron jobs using the following options:

  • crontab -e – Edit crontab file, or create one if it doesn’t already exist.
  • crontab -l – Display crontab file contents.
  • crontab -r – Remove your current crontab file.
  • crontab -i – Remove your current crontab file with a prompt before removal.
  • crontab -u – Edit another user’s crontab file. Requires system administrator privileges.

Azure Security – Security methods overview

There are many ways to make your cloud system more secure. Here is a little overview of the most common and useful techniques for achieving a safe cloud infrastructure.

Account Shared Access Signature

An account SAS is a signature that enables the client to access resources in one or more of the storage services. Everything you can do with a service SAS, you can do with an account SAS as well. So basically, the account SAS is used for delegating access to a group of services.

Service Shared Access Signature

A service SAS is a signature that is used to delegate access to exactly one resource.

Stored Access Policy

A stored access policy gives you more fine-tuned control over a service SAS on the server side. A stored access policy (SAP) can be used to group shared access signatures and to provide additional restrictions for signatures that are bound by that policy. You can use a SAP on blob containers, file shares, queues, and tables.

Role-Based Access Control (RBAC)

RBAC lets you grant resource access in a much more fine-grained way than the other methods.

IntelliJ plugin for Microsoft Azure: tutorial on deploying a web app

In this tutorial we will check out how to get the Microsoft Azure plugin and how to use it.

First of all, start your IDE, hit Shift twice in quick succession, and enter “plugin” to quickly get to the plugin install menu.

Then just type Azure and install the first plugin suggested, which is developed by Microsoft.

After installing the plugin and creating an account on the Azure website, you can log in to your account through IntelliJ.

Select the Tools tab in the top toolbar and log in to Azure using interactive mode; just type in the credentials you used when creating your Azure account.

Preparing the resource groups

I was following this great tutorial from Microsoft, but I, and probably a lot of other people, encountered an error when trying to launch a web app right after having created a new account in Azure.

Before you can launch anything in Azure, you need resource groups. Even though the tutorial from Microsoft does not state it explicitly, you should really create a resource group before attempting this.

Here is how you create a resource group:

Login to your Azure account on the Microsoft website and head to “My Account”

Next select “Create a resource”

And then select “Web App”

Enter a name for the app and the resource group, then click Create New.

After creating the group, we are finally ready to deploy our app with IntelliJ!

Start a new project, select a Maven web app, and make sure you are creating the project from an archetype!

Then just go to the root folder of your project and right-click it in IntelliJ. You should now see the Azure options, which let you deploy your web app to the cloud!

If you did not log in to Azure before, do it now.

Then you have the option to use an existing web app or create a new one. We want a new one, but we will use an existing resource group! For some reason, creating a resource group with the IntelliJ plugin seems to result in exceptions. The only way to avoid those so far is to create the group manually in Azure and then use it for further deployment.

After hitting run and waiting a few seconds, your console should update with a URL to your freshly deployed web app.

Thanks for reading and have fun in the cloud!

Java Spark Tips, Tricks and Basics 7 – How to accumulate a variable in a Spark cluster? Why do we need to accumulate variables?

Why do we need Spark accumulators

An accumulator is a shared variable across all the nodes, and it is used to accumulate values of a given type (e.g. Long or Double).

It is necessary to use an accumulator to implement a distributed counting variable which can be updated by multiple processes.

Nodes may not read the value of an accumulator, but the driver has full access to it.

Nodes can only accumulate values into the accumulator.

You will find the functionality for this in the accumulator classes of Spark. Keep in mind that we are using AccumulatorV2; the older accumulator API is deprecated since Spark 2.0.
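As a minimal sketch (assuming Spark 2.x or later in local mode; the accumulator name and sample values are made up for illustration), creating and using a LongAccumulator could look like this:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

import java.util.Arrays;

public class AccumulatorExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("AccumulatorExample")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // The driver creates the accumulator; the executors may only add to it.
        LongAccumulator sum = jsc.sc().longAccumulator("sumAccumulator");

        jsc.parallelize(Arrays.asList(1L, 2L, 3L, 4L))
           .foreach(x -> sum.add(x));

        // Only the driver is allowed to read the accumulated value.
        System.out.println("Accumulated sum: " + sum.value());

        spark.stop();
    }
}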

 

Don’t forget to register your accumulator with the SparkContext if you create it separately.
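For example, if you instantiate an AccumulatorV2 yourself instead of asking the SparkContext for one, a registration sketch could look like this (the variable and accumulator names are placeholders):

import org.apache.spark.util.LongAccumulator;

LongAccumulator manualCounter = new LongAccumulator();
// Without this registration, updates coming from the executors would not be tracked.
spark.sparkContext().register(manualCounter, "manualCounter");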

 

What did we learn?

In this short tutorial, you learned what Spark accumulators are for, what they do, and how to use them in Java.

Java Spark Tips, Tricks and Basics 6 – How to broadcast a variable to a Spark cluster? Why do we need to broadcast variables?

Why do we need Spark broadcast variables?

Spark is all about cluster computing. In a cluster of nodes, each node of course has its own private memory.

If we want all the nodes in the cluster to work towards a common goal,  having shared variables just seems necessary.

Let’s say we want to sum up all the rows in a CSV table with 1 million lines. It just makes sense to let one node work with half a million rows and the other node work with the other half a million rows. Both calculate their results, and then the driver program combines them.

Broadcasting allows us to create a read-only cached copy of a variable on every node in our cluster. The distribution of those variables is handled by efficient broadcast algorithms implemented by Spark under the hood. This also takes away the burden of thinking about serialization and deserialization, since good old Spark takes care of that!

This broadcasting functionality is provided by the SparkContext class. Alternatively, you can also work with the resulting Broadcast object directly and do your work through it.

How to broadcast a variable in Spark Java
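Here is a minimal sketch, assuming Spark 2.x or later in local mode; the lookup table and all names are invented for illustration:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BroadcastExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("BroadcastExample")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // A small lookup table that every node should have a read-only cached copy of.
        Map<String, Integer> countryCodes = new HashMap<>();
        countryCodes.put("DE", 49);
        countryCodes.put("US", 1);

        Broadcast<Map<String, Integer>> broadcastCodes = jsc.broadcast(countryCodes);

        JavaRDD<String> countries = jsc.parallelize(Arrays.asList("DE", "US", "DE"));

        // Each task reads from the local cached copy via value(); the map is not shipped with every task.
        JavaRDD<Integer> codes = countries.map(c -> broadcastCodes.value().get(c));

        System.out.println(codes.collect());
        spark.stop();
    }
}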

What did we learn?

In this short tutorial, you learned what Spark broadcast variables are for, what they do, and how to use them in Java.

Java Spark Tips, Tricks and Basics 2 – How to add columns to Datasets / Dataframes in Spark Java

This tutorial will show you how to add a new column to an already existing dataset / dataframe.

 

First we create a dataset.
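For example, a small sketch (the schema and the sample rows are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class AddColumnExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("AddColumnExample")
                .master("local[*]")
                .getOrCreate();

        // A tiny example dataframe with two columns.
        StructType schema = new StructType()
                .add("name", DataTypes.StringType)
                .add("age", DataTypes.IntegerType);

        Dataset<Row> df = spark.createDataFrame(
                Arrays.asList(RowFactory.create("Alice", 29), RowFactory.create("Bob", 31)),
                schema);

        df.show();
    }
}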

 

 

Then we add a column with lit
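Continuing from the dataframe above, a sketch using lit could look like this (the column name and the constant value are arbitrary):

import static org.apache.spark.sql.functions.lit;

// Adds a new column "country" holding the constant value "DE" for every row.
Dataset<Row> withCountry = df.withColumn("country", lit("DE"));
withCountry.show();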

 

and we are done!

Java Spark Tips, Tricks and Basics 1 – How to read images as Datasets / Dataframes from Hadoop in Spark Java

This tutorial will show you how to read a folder of images from a Hadoop folder.

Just use the following command and update the path to your image folder in Hadoop HDFS.

We will be using ImageSchema and its readImages function.
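A minimal sketch, assuming Spark 2.3/2.4 where ImageSchema.readImages is available (the HDFS path is a placeholder you need to replace):

import org.apache.spark.ml.image.ImageSchema;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadImagesExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ReadImagesExample")
                .master("local[*]")
                .getOrCreate();

        // Placeholder path: point this at your own image folder in HDFS.
        Dataset<Row> images = ImageSchema.readImages("hdfs://namenode:9000/user/me/images/");

        images.printSchema();
        images.show(5);

        spark.stop();
    }
}

Note that in newer Spark versions (3.x) readImages was removed in favor of the built-in image data source, i.e. spark.read().format("image").load(path).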

 

That’s it already!

 

Spark Feature Engineering Tutorial 4 – RCV1 Newswire stories categorized

Spark Feature Engineering Tutorial – 4 – transforming RCV1 dataset

What is the data?

The dataset was provided by the Journal of Machine Learning Research in 2004 as a new benchmark for text categorization research. You can read more about the journal that released the dataset over here.

Where to get the data ?

The data provider is www.csie.ntu.edu.tw/

 

 

What is the data about?

It contains newswire stories and their categorization.

 

Let’s load the data

We can load the data into Spark with this command:
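Assuming you downloaded the RCV1 files in LIBSVM format, a load sketch could look like this (the path is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("Rcv1FeatureEngineering")
        .master("local[*]")
        .getOrCreate();

// Placeholder path: adjust it to wherever you stored the downloaded file.
Dataset<Row> data = spark.read().format("libsvm").load("data/rcv1_train.binary");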

 

Let’s check out the data

We should check out the data schema and a few rows. This is how you can do it:
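Continuing with the data object from the previous snippet:

data.printSchema();
data.show(5);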

 


Your console output should show a schema with a double label column and a vector features column, plus the first few rows.

Very nice, the data is already in a nice format.

Using String Indexer

We will use the string indexer to index the classes we have.
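A sketch of the indexer definition (the output column name is just an assumption; pick whatever fits your project):

import org.apache.spark.ml.feature.StringIndexer;

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("label")          // the first column of the libsvm dataframe
        .setOutputCol("labelIndex");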

 

This will define the first column as label column.

Using the Vector Assembler
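A sketch of the assembler (again, the output column name is just an assumption):

import org.apache.spark.ml.feature.VectorAssembler;

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"features"})  // the second column of the libsvm dataframe
        .setOutputCol("featureVector");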

 

This will define the 2nd column as feature column

Build the pipeline
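A sketch, reusing the two stages defined above:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;

Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{labelIndexer, assembler});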

 

This will tell Spark in which order to apply the transformers.

Fit the pipeline
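For example:

import org.apache.spark.ml.PipelineModel;

PipelineModel model = pipeline.fit(data);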

 

This will apply the pipeline to the original data and return a model.

Get the transformed dataset
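Continuing with the fitted model from the previous step:

Dataset<Row> transformed = model.transform(data);
transformed.show(5);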

 

This will apply the transformation to the dataset and return the transformed dataframe.

Let’s check out our transformed data

This looks pretty good, but we do not need the label and features columns anymore.

Drop useless columns
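A sketch of this step, dropping the two original columns:

Dataset<Row> cleaned = transformed.drop("label", "features");
cleaned.printSchema();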

 

The cleaned dataset

This is now our schema, perfect!

We are ready to do some machine learning on this.

Let’s test whether we transformed our data properly by applying a linear classifier to it!

Define the linear classifier
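A sketch of such a test function, assuming the column names used in the snippets above and using a LogisticRegression as the linear classifier:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

private static void testLinearClassifier(Dataset<Row> dataset) {
    LogisticRegression lr = new LogisticRegression()
            .setLabelCol("labelIndex")
            .setFeaturesCol("featureVector")
            .setMaxIter(10);

    LogisticRegressionModel lrModel = lr.fit(dataset);
    lrModel.transform(dataset).select("labelIndex", "prediction").show(5);
}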

 

Call this function with your dataset and you should get no errors if you did everything like in this tutorial!

Here is the full code :

 

Spark Feature Engineering Tutorial 2 – Forest Covertype Data transformation

Getting to know the Data

Today we are going to check out the forest covertype dataset, which contains information about which tree type is the most predominant in a forest area.

Get the data: http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info

Let’s imagine you want to buy a big piece of forest land, but you have no idea about the covertype of that area, since nobody had the time to count the occurrence of each tree in that forest. One approach would be to predict the forest covertype with a trained neural network!

When we check out the data in Spark, we see there are 55 columns. It should look like this:

  • There are 581,012 different datapoints or observations in the dataset
  • There are 10 quantitative variables
  • There are 4 binary wilderness areas
  • There are 40 binary soil type variables
  • There is one of 7 forest cover types, aka the labels we want to predict

In our data we find the labels in the last column called “_c54”

What Spark objects will we need?

https://spark.apache.org/docs/latest/api/java/index.html Get your documentation out, it’s time to program!

We will need the docs for Pipelines, Vectors, StringIndexer, VectorIndexer, Estimators, Transformers, and VectorAssembler

What is the vector indexer for?

The vector indexer enables us to detect whether the features of our data are categorical or continuous. We achieve this by passing a parameter N to setMaxCategories().

When the vector indexer is called during the pipeline execution, it checks whether there are more than N different values for each feature. If a feature has more than N different values, it is declared continuous. If a feature has N or fewer different values, it is declared categorical.

A pipeline consists of a sequence of stages, where each stage is either an estimator or a transformer. The stages are executed by calling Pipeline.fit(): on each estimator, the fit() method is called to generate a transformer, which then transforms the data as it moves through the pipeline.

Create a Spark Session
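For example (local mode; the app name is arbitrary):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("CovertypeFeatureEngineering")
        .master("local[*]")
        .getOrCreate();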

 

 

Loading the data into Spark
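A sketch, assuming you downloaded covtype.data as a headerless CSV file (the path is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> data = spark.read()
        .option("header", "false")
        .csv("data/covtype.data");   // columns will be named _c0 .. _c54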

 

Cast the columns to double

Since the columns are natively interpreted as strings, we have to cast them.
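One way to sketch the cast is a loop over all columns, continuing with the data object from above:

for (String column : data.columns()) {
    data = data.withColumn(column, data.col(column).cast("double"));
}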

 

Get the column names
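For example:

String[] fieldNames = data.schema().fieldNames();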

 

Create the feature vector

What does a neural network like to eat the most? That’s right, feature vectors! Time to cook up some crispy feature vectors for our ML algorithms!
Since _c54 is the label, we will tell our VectorAssembler to use all fields except the last one as input.
fieldNames[fieldNames.length - 1]
This is the label column. We want to use the columns from _c0 to _c53 as features, i.e. the indices 0 up to fieldNames.length - 2. That is why we have -2 in the solution. In code it looks like this:
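A sketch of this step; I also define the StringIndexer and VectorIndexer mentioned earlier here, since they go into the pipeline in the next step (all output column names and the maxCategories value are assumptions):

import java.util.Arrays;

import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.feature.VectorIndexer;

String labelColumn = fieldNames[fieldNames.length - 1];                              // "_c54"
String[] featureColumns = Arrays.copyOfRange(fieldNames, 0, fieldNames.length - 1);  // "_c0" .. "_c53"

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(featureColumns)
        .setOutputCol("features");

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol(labelColumn)
        .setOutputCol("label");

VectorIndexer featureIndexer = new VectorIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(10);   // features with at most 10 distinct values (the binary columns) are treated as categorical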

 

Build the pipeline

Our previously defined transformers and the assembler now all go into a pipeline, which then executes them sequentially on the data.
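A sketch, reusing the stages defined above (the assembler has to come before the vector indexer, since the indexer works on the assembled vector):

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;

Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{labelIndexer, assembler, featureIndexer});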

 

 

Transform our Data into ML format!
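Fitting and applying the pipeline could then look like this (column names as assumed above):

import org.apache.spark.ml.PipelineModel;

PipelineModel pipelineModel = pipeline.fit(data);

Dataset<Row> mlData = pipelineModel.transform(data)
        .select("label", "indexedFeatures");

mlData.show(5);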

 

Test if it works

Now we can test our data with a sample classifier. Add this function to your code and give it your transformed dataset!
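A sketch of such a test function, using a DecisionTreeClassifier as the sample classifier and the column names assumed above:

import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

private static void testClassifier(Dataset<Row> mlData) {
    // A simple multiclass classifier as a smoke test for the transformed data.
    DecisionTreeClassifier dt = new DecisionTreeClassifier()
            .setLabelCol("label")
            .setFeaturesCol("indexedFeatures");

    DecisionTreeClassificationModel model = dt.fit(mlData);
    model.transform(mlData).select("label", "prediction").show(5);
}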

 

Enjoy and happy coding!