Category Archives: Machine Learning Projects

TensorFlow 1.x and 2.x Saving Error: Using a `tf.Tensor` as a Python `bool` is not allowed

Ever tried to save a TensorFlow model with tf.compat.v1.saved_model.simple_save or a similar TF saving function?

Ever encountered, in TF 1.x,
TypeError: Using a tf.Tensor as a Python bool is not allowed. Use if t is not None: instead of if t: to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.
or, in TF 2.x,
OperatorNotAllowedInGraphError: using a tf.Tensor as a Python bool is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

DON'T BE FOOLED! This error is not what it seems. At least it wasn't for me…

This was the save call that was causing the error:


token_tensor = tf.ones((input_len, batch_size), "int32", "token_tensor")
segment_tensor = tf.ones((input_len, batch_size), "int32", "segment_tensor")
mask_tensor = tf.ones((input_len, batch_size), "float32", "mask_tensor")
seq_out = model.get_sequence_output()

with tf.compat.v1.Session() as sess:
    tf.compat.v1.saved_model.simple_save(
        sess,
        export_dir,
        inputs={'input': token_tensor, 'segment': segment_tensor, 'mask': mask_tensor},
        outputs=seq_out,
        legacy_init_op=init_op
    )

See the error? It's very minor…
The problem was: the output tensor IS NOT INSIDE OF A DICT!
Duuuuh! Isn't that obvious to infer from the error message?

Looking at the source code of the save function is what actually made me see the issue!
simple_save.py

So here is the fix: just wrap your output tensor in a dict (the same goes for the inputs)!

token_tensor = tf.ones((input_len, batch_size), "int32", "token_tensor")
segment_tensor = tf.ones((input_len, batch_size), "int32", "segment_tensor")
mask_tensor = tf.ones((input_len, batch_size), "float32", "mask_tensor")
seq_out = model.get_sequence_output()

with tf.compat.v1.Session() as sess:
    tf.compat.v1.saved_model.simple_save(
        sess,
        export_dir,
        inputs={'input': token_tensor, 'segment': segment_tensor, 'mask': mask_tensor},
        outputs={"out": seq_out},
        legacy_init_op=init_op
    )

Happy TensorFlow hacking!

This is the full stack trace in TF 1.x:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
45 outputs= seq_out, #{'output': mask_tensor, 'norms': mask_tensor},
46 #outputs={'word_emb': model_wordembedding_output, 'sentence_emb': model_sentence_embedding_output},
---> 47 legacy_init_op=init_op
48 )
49 # print('saving done')

/home/loan/venv/XLNET_jupyter_venv/lib/python2.7/site-packages/tensorflow/python/util/deprecation.pyc in new_func(*args, **kwargs)
322 'in a future version' if date is None else ('after %s' % date),
323 instructions)
--> 324 return func(*args, **kwargs)
325 return tf_decorator.make_decorator(
326 func, new_func, 'deprecated',

/home/loan/venv/XLNET_jupyter_venv/lib/python2.7/site-packages/tensorflow/python/saved_model/simple_save.pyc in simple_save(session, export_dir, inputs, outputs, legacy_init_op)
79 signature_def_map = {
80 signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
---> 81 signature_def_utils.predict_signature_def(inputs, outputs)
82 }
83 b = builder.SavedModelBuilder(export_dir)

/home/loan/venv/XLNET_jupyter_venv/lib/python2.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.pyc in predict_signature_def(inputs, outputs)
195 if inputs is None or not inputs:
196 raise ValueError('Prediction inputs cannot be None or empty.')
--> 197 if outputs is None or not outputs:
198 raise ValueError('Prediction outputs cannot be None or empty.')
199

/home/loan/venv/XLNET_jupyter_venv/lib/python2.7/site-packages/tensorflow/python/framework/ops.pyc in __nonzero__(self)
702 TypeError.
703 """
--> 704 raise TypeError("Using a tf.Tensor as a Python bool is not allowed. "
705 "Use if t is not None: instead of if t: to test if a "
706 "tensor is defined, and use TensorFlow ops such as "

TypeError: Using a tf.Tensor as a Python bool is not allowed. Use if t is not None: instead of if t: to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.

And in TensorFlow 2.x:

---------------------------------------------------------------------------
OperatorNotAllowedInGraphError Traceback (most recent call last)
in
86 inputs=bert_inputs,
87 outputs=table_tensor,
---> 88 legacy_init_op=init_op
89 )
90

~/venv/XLNET_py3_venv/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py in new_func(*args, **kwargs)
322 'in a future version' if date is None else ('after %s' % date),
323 instructions)
--> 324 return func(*args, **kwargs)
325 return tf_decorator.make_decorator(
326 func, new_func, 'deprecated',

~/venv/XLNET_py3_venv/lib/python3.7/site-packages/tensorflow_core/python/saved_model/simple_save.py in simple_save(session, export_dir, inputs, outputs, legacy_init_op)
79 signature_def_map = {
80 signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
---> 81 signature_def_utils.predict_signature_def(inputs, outputs)
82 }
83 b = builder.SavedModelBuilder(export_dir)

~/venv/XLNET_py3_venv/lib/python3.7/site-packages/tensorflow_core/python/saved_model/signature_def_utils_impl.py in predict_signature_def(inputs, outputs)
195 if inputs is None or not inputs:
196 raise ValueError('Prediction inputs cannot be None or empty.')
--> 197 if outputs is None or not outputs:
198 raise ValueError('Prediction outputs cannot be None or empty.')
199

~/venv/XLNET_py3_venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in __bool__(self)
755 TypeError.
756 """
--> 757 self._disallow_bool_casting()
758
759 def __nonzero__(self):

~/venv/XLNET_py3_venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in _disallow_bool_casting(self)
524 else:
525 # Default: V1-style Graph execution.
--> 526 self._disallow_in_graph_mode("using a tf.Tensor as a Python bool")
527
528 def _disallow_iteration(self):

~/venv/XLNET_py3_venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py in _disallow_in_graph_mode(self, task)
513 raise errors.OperatorNotAllowedInGraphError(
514 "{} is not allowed in Graph execution. Use Eager execution or decorate"
--> 515 " this function with @tf.function.".format(task))
516
517 def _disallow_bool_casting(self):

OperatorNotAllowedInGraphError: using a tf.Tensor as a Python bool is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

Java Spark Tips, Tricks and Basics 6 – How to broadcast a variable to the Spark cluster? Why do we need to broadcast variables?

Why do we need Spark broadcasters?

Spark is all about cluster computing. In a cluster of nodes, each node of course has its own private memory.

If we want all the nodes in the cluster to work towards a common goal, having shared variables just seems necessary.

Let's say we want to sum up all the rows in a CSV table with 1 million lines. It just makes sense to let one node work on half a million rows and another node work on the other half a million. Both calculate their partial results, and then the driver program combines them.

Broadcasting allows us to create a read-only cached copy of a variable on every node in our cluster. The distribution of those variables is handled by efficient broadcast algorithms implemented by Spark under the hood. This also takes the burden of thinking about serialization and deserialization off of us, since good old Spark takes care of that!

This great broadcasting functionality is provided by the SparkContext class. Alternatively, one can also work with the returned Broadcast object directly.

How to broadcast a variable in Spark Java
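
A minimal sketch of broadcasting a variable with JavaSparkContext.broadcast(); the lookup map, app name and local master are made-up example values:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("BroadcastExample")
        .master("local[*]")
        .getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

Map<Integer, String> lookup = new HashMap<>();
lookup.put(1, "one");
lookup.put(2, "two");

// Ship a read-only cached copy of the map to every node once
Broadcast<Map<Integer, String>> broadcastLookup = jsc.broadcast(lookup);

// Every task reads the broadcast value instead of shipping the map with each closure
JavaRDD<String> names = jsc.parallelize(Arrays.asList(1, 2, 1))
        .map(id -> broadcastLookup.value().get(id));
names.collect().forEach(System.out::println);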

What did we learn?

In this short tutorial, you learned what Spark broadcast variables are for, what Broadcast does, and how to use it in Java.

Spark Feature Engineering Tutorial 4 – RCV1 Newswire stories categorized

Spark Feature Engineering Tutorial 4 – Transforming the RCV1 dataset

What is the data?

The dataset was provided by the Journal of Machine Learning Research in 2004 as a new benchmark for text categorization research. You can read more about the journal that released the dataset over here.

Where to get the data?

The data provider is www.csie.ntu.edu.tw/

 

 

What is the data about?

It contains newswire stories and their categorization.

 

Let's load the data

We can load the data into Spark with this command:
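
A minimal sketch in Java, assuming the RCV1 file was downloaded in LIBSVM format from the page above, that `spark` is an existing SparkSession, and that the file is called rcv1_train.binary (adjust the path to your download):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Load the LIBSVM-formatted file into a DataFrame with a "label" and a "features" column
Dataset<Row> data = spark.read()
        .format("libsvm")
        .load("rcv1_train.binary");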

 

Let's check out the data

We should check out the data schema and a few rows. This is how you can do it:
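
A sketch of that inspection, using the `data` DataFrame from the loading step:

data.printSchema();   // column names and types
data.show(5);         // first five rows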

 


Your console output should look like this

Very nice, the data is already in a nice format.

Using String Indexer

We will use the StringIndexer to index the classes we have.
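
A sketch, assuming the label column from the LIBSVM load is called "label" and the indexed output column should be "indexedLabel":

import org.apache.spark.ml.feature.StringIndexer;

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("indexedLabel");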

 

This will define the first column as the label column.

Using the Vector Assembler
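
A sketch, assuming the vector column from the LIBSVM load is called "features" and the assembled output column should be "assembledFeatures":

import org.apache.spark.ml.feature.VectorAssembler;

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"features"})
        .setOutputCol("assembledFeatures");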

 

This will define the 2nd column as the feature column.

Build the pipeline
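
A sketch, reusing the indexer and assembler defined above:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;

Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{labelIndexer, assembler});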

 

This will tell Spark in which order to apply the transformers.

Instantiate the pipeline
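
A sketch of fitting the pipeline on the loaded DataFrame:

import org.apache.spark.ml.PipelineModel;

PipelineModel model = pipeline.fit(data);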

 

This will fit the pipeline on the original data and return a model.

Get the transformed dataset
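
A sketch, using the fitted model from the previous step:

Dataset<Row> transformed = model.transform(data);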

 

This will apply the transformation on the dataset and return the transformed DataFrame.

Let's check out our transformed data

This looks pretty good, but we do not need the original label and features columns anymore.

Drop useless columns
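
A sketch that keeps only the indexed label and the assembled feature vector:

Dataset<Row> cleaned = transformed.drop("label", "features");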

 

The cleaned dataset

This is now our data structure, perfect!

We are ready to do some machine learning on this.

Let's test if we transformed our data properly by applying a linear classifier to it!

Define the linear classifier
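
A sketch of such a test function; LogisticRegression stands in here as the linear classifier, and the column names follow the sketches above:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

static void testLinearClassifier(Dataset<Row> cleaned) {
    LogisticRegression lr = new LogisticRegression()
            .setLabelCol("indexedLabel")
            .setFeaturesCol("assembledFeatures")
            .setMaxIter(10);
    // Fit on the cleaned data and show a few predictions
    lr.fit(cleaned).transform(cleaned).show(5);
}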

 

Call this function with your dataset and you should get no errors, if you did everything like in this tutorial!

Here is the full code :

 

Spark Feature Engineering Tutorial 2 – Forest Covertype Data transformation

Getting to know the Data

Today we're gonna check out the forest covertype data, which contains information about which tree type is the most predominant in a forest area.

Get the data : http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info

Let's imagine you want to buy a big piece of forest land, but you have no idea about the covertype of that area, since nobody had the time to count the occurrence of each tree type in that forest. One approach would be to predict the forest covertype with a trained neural network!

When we check out the data in Spark, we see there are 55 columns. It should look like this:

  • There are 581,012 different data points or observations in the dataset
  • There are 10 quantitative variables
  • There are 4 binary wilderness area variables
  • There are 40 binary soil type variables
  • One of 7 forest cover types is the label we want to predict

In our data we find the labels in the last column called “_c54”

What Spark objects will we need?

https://spark.apache.org/docs/latest/api/java/index.html Get your documentation out, it’s time to program!

We will need the docs for Pipelines, Vectors, StringIndexer, VectorIndexer, Estimators, Transformers, and VectorAssembler

What is the vector indexer for?

The VectorIndexer enables us to detect whether the features of our data are categorical or continuous. We achieve this by passing a parameter N to setMaxCategories().

When the vector indexer is called during the pipeline execution process, it looks if there are more than N different values for each feature. If a feature has more than N different values, it is declared continuous. If a feature has N or fewer different values in its feature, this feature is declared categorical.

A pipeline consists of a sequence of stages, where each stage is either an estimator or a transformer, and it is executed by calling Pipeline.fit(). On each estimator, the fit() method is called to generate a transformer, which then transforms the data in the pipeline.

Create a Spark Session
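
A minimal sketch (the app name and the local master are example values):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("CovertypeFeatureEngineering")
        .master("local[*]")
        .getOrCreate();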

 

 

Loading the data into Spark
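
A sketch, assuming the covtype data was downloaded and unpacked as covtype.data (a headerless CSV) next to the program:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> data = spark.read()
        .format("csv")
        .option("header", "false")
        .load("covtype.data");   // columns are auto-named _c0 ... _c54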

 

Cast the columns to double

Since the columns are natively interpreted as Strings, we have to cast them:
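
One way to do that, sketched with the functions.col helper:

import static org.apache.spark.sql.functions.col;

for (String c : data.columns()) {
    data = data.withColumn(c, col(c).cast("double"));
}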

 

Get the column names
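
A sketch, reading the names out of the DataFrame schema:

String[] fieldNames = data.schema().fieldNames();   // _c0 ... _c54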

 

Create the feature vector

What does a neural network like to eat the most? That's right, feature vectors! Time to cook up some crispy feature vectors for our ML algorithms!
Since _c54 is the label, we will tell our VectorAssembler to use all fields except the last one as input.
fieldNames[fieldNames.length - 1] is the label column. We want to use the columns from _c0 to _c53 as features, so the last feature column sits at index fieldNames.length - 2. In code it looks like this:
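
A sketch using Arrays.copyOfRange (the end index is exclusive, so fieldNames.length - 1 keeps everything up to and including index fieldNames.length - 2):

import java.util.Arrays;
import org.apache.spark.ml.feature.VectorAssembler;

String[] featureCols = Arrays.copyOfRange(fieldNames, 0, fieldNames.length - 1);

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(featureCols)
        .setOutputCol("features");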

 

Build the pipeline

Our previously defined transformers and assemblers now all go into a pipeline, which then executes them sequentially on the data.
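
A sketch, assuming a StringIndexer for the label column _c54 and a VectorIndexer (with the maxCategories mechanism explained above) in addition to the assembler; the output column names and the maxCategories value are example choices:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorIndexer;

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("_c54")
        .setOutputCol("indexedLabel");

VectorIndexer featureIndexer = new VectorIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(4);   // example: at most 4 distinct values => categorical

Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{assembler, labelIndexer, featureIndexer});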

 

 

Transform our Data into ML format!
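
A sketch, fitting the pipeline and transforming the casted DataFrame:

import org.apache.spark.ml.PipelineModel;

PipelineModel model = pipeline.fit(data);
Dataset<Row> transformed = model.transform(data);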

 

Test if it works

Now we can test our data with a sample classifier. Add this function to your code and give it your transformed dataset!
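
A sketch of such a function; a DecisionTreeClassifier stands in as the sample classifier, and the column names follow the pipeline sketch above:

import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

static void testClassifier(Dataset<Row> transformed) {
    DecisionTreeClassifier dt = new DecisionTreeClassifier()
            .setLabelCol("indexedLabel")
            .setFeaturesCol("indexedFeatures");
    // Fit on the transformed data and show a few predictions
    dt.fit(transformed).transform(transformed)
            .select("indexedLabel", "prediction")
            .show(5);
}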

 

Enjoy and happy coding!

Spark Feature Engineering Tutorial 1 – Quantum Data transformation

Transforming Quantum Datasets into Spark ML Format – tackling the Particle Physics tasks by cs-Cornell

Since Spark ML algorithms only work on datasets in the correct data format, it is necessary to transform your data to the proper data format.

But what is the right data format?

To find out, today we are looking at quantum physics particle data! The challenge provided by Cornell is to learn a classification rule that differentiates between two types of particles generated in high energy collider experiments!

It has 78 attributes and 2 classes that we want to find.

First, we head to http://osmot.cs.cornell.edu/kddcup/datasets.html and register. Then we download the datasets provided.

It comes with 2 datasets; the one we use is for the classification of 2 different types of particles.

Downloading the data using Wget

Helper Script

 

Download the train data

 

Then convert the data with the helper script

Convert the data with the provided Python helper script

 

Use a text editor like Nano to check out the data. The phy_test.dat looks something like this:

As we can see, the data is separated by tabs!

What is the data format?

Each line in the training and test files describe one example. The structure of each line is as follows:

  • The first element of each line is an EXAMPLE ID which uniquely describes the example. You will need this EXAMPLE ID when submitting results.
  • The second element is the example’s class. Positive examples are denoted by 1, negative examples by 0. Test examples have a “?” in this position. This is a balanced problem so the target values are roughly half 0s and 1s. All of the following elements are feature values. There are 78 feature values in each line.
  • Missing values: columns 22, 23, 24, and 46, 47, 48 use a value of “999” to denote “not available,” and columns 31 and 57 use “9999” to denote “not available.” These are the column numbers in the data tables starting with 1 for the first column (the case ID numbers). If you remove the first two columns (the case ID numbers and the targets), and start numbering the columns at the first attribute, these are attributes 20, 21, 22, and 44, 45, 46, and 29 and 55, respectively. You may treat missing values any way you want, including coding them as a unique value, imputing missing values, using learning methods that can handle missing values, ignoring these attributes, etc..

The elements in each line are separated by whitespace.

So we have an ID, a class with 2 possible values, and continuous attributes.

What Spark objects will we need?

https://spark.apache.org/docs/latest/api/java/index.html Get your documentation out, it’s time to program!

We will need the docs for DataFrame, Pipelines, Vectors, StringIndexer, VectorIndexer, Estimators, Transformers, Parameter and VectorAssembler.

A pipeline consists of a sequence of stages, where each stage is either an estimator or a transformer, and it is executed by calling Pipeline.fit(). On each estimator, the fit() method is called to generate a transformer, which then transforms the data in the pipeline.

What is the vector indexer for?

The VectorIndexer enables us to detect whether the features of our data are categorical or continuous. We achieve this by passing a parameter N to setMaxCategories().

When the vector indexer is called during the pipeline execution process, it looks if there are more than N different values for each feature. If a feature has more than N different values, it is declared continuous. If a feature has N or fewer different values in its feature, this feature is declared categorical.

What is a Pipeline for?

A pipeline is the thing where we put all our Transformers and Estimators.

E.g.: feature 0 has unique values {-1.0, 0.0} and feature 1 has values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 is declared categorical and feature 1 continuous.

 

Loading the data into Spark
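
A sketch, assuming `spark` is an existing SparkSession and the converted training file is called phy_train.dat:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> data = spark.read()
        .format("csv")
        .option("sep", "\t")        // the file is tab-separated
        .option("header", "false")
        .load("phy_train.dat");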

 

 

We specify the tab delimiters with .option(“sep”, “\t”)

Doing this, Spark will label all the columns from _c0 to _c80.

_c0 represents the data point's ID

_c1 represents the class

_c2 – _c80 are the feature values

Define the labels

Define the column that contains the data labels with a StringIndexer. We know _c1 contains the class values, so we tell our indexer that that is the input column and the output column will be “indexedLabel” in our example.
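
A sketch of that indexer:

import org.apache.spark.ml.feature.StringIndexer;

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("_c1")
        .setOutputCol("indexedLabel");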

 


As we can see by calling this code, all our columns are currently of type String, even though there are double values in the rows.
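
A sketch of that check:

data.printSchema();   // every column is reported as string at this point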

 

 

Cast data types

This sexy loop updates all column data types to double
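
A sketch of such a loop:

import static org.apache.spark.sql.functions.col;

for (String c : data.columns()) {
    data = data.withColumn(c, col(c).cast("double"));
}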

 

 

Index the features

Now it is time to use our VectorAssembler. We give it all the column names as input and a range of which columns to index into the features. Since column 0 is an ID and column 1 is a label, we start indexing features from the index 2.
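
A sketch, using Arrays.copyOfRange to skip the ID column (_c0) and the label column (_c1):

import java.util.Arrays;
import org.apache.spark.ml.feature.VectorAssembler;

String[] fieldNames = data.schema().fieldNames();
String[] featureCols = Arrays.copyOfRange(fieldNames, 2, fieldNames.length);

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(featureCols)
        .setOutputCol("features");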

 

Putting it all together

After defining all our estimators and transformers, it is time to pass them all as a set to the Pipeline constructor. It will call every component of our pipeline sequentially to generate the desired data format. Calling the transform function on the dataset serves us the processed data.
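
A sketch, combining the indexer and the assembler from above and transforming the data:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;

Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{labelIndexer, assembler});
PipelineModel model = pipeline.fit(data);
Dataset<Row> transformed = model.transform(data);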

 

 

Verify that we transformed our data properly

Let’s create a simple classification tree to see if Spark ML can work with our data!

Here is the function, which you can just copy (make sure you have the same col names!):
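
A sketch of such a function; a DecisionTreeClassifier with a simple train/test split and a MulticlassClassificationEvaluator for the accuracy, using the column names from the sketches above:

import org.apache.spark.ml.classification.DecisionTreeClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

static void testTree(Dataset<Row> transformed) {
    Dataset<Row>[] splits = transformed.randomSplit(new double[]{0.7, 0.3});
    DecisionTreeClassifier dt = new DecisionTreeClassifier()
            .setLabelCol("indexedLabel")
            .setFeaturesCol("features");
    Dataset<Row> predictions = dt.fit(splits[0]).transform(splits[1]);

    double accuracy = new MulticlassClassificationEvaluator()
            .setLabelCol("indexedLabel")
            .setPredictionCol("prediction")
            .setMetricName("accuracy")
            .evaluate(predictions);
    System.out.println("Test Error = " + (1.0 - accuracy));
    System.out.println("Accuracy: " + accuracy);
}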

 

 

If all went well, you should get an output similar to this:

Test Error = 0.3079487863430248
Accuracy: 0.6920512136569752

Congratulations! We have imported a dataset from the Internet and built a pipeline to transform its columns into the proper Spark machine learning format!

Spark Encoders, what are they there for and why do we need them?

  • The main purpose of Encoders is performing serialization and deserialization (SerDe),
  • since Spark does not store data as JVM objects, but in its very own binary format.
  • Spark comes with a lot of built-in Encoders.
  • An Encoder provides information about a table's schema without having to deserialize the whole object.
  • Encoders are necessary when mapping Datasets (see the sketch below).
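
A minimal sketch of where an Encoder shows up in practice, assuming `spark` is an existing SparkSession:

import java.util.Arrays;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

Dataset<String> words = spark.createDataset(
        Arrays.asList("spark", "encoders"), Encoders.STRING());

// The map call needs an Encoder for the result type so Spark knows how to
// serialize the mapped values into its internal binary format.
Dataset<Integer> lengths = words.map(
        (MapFunction<String, Integer>) s -> s.length(), Encoders.INT());
lengths.show();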

Spark Java Library overview


This is an overview of the full ecosystem, but we will just take a look at the most important classes for everyday usage.

Spark provides a huge library of useful classes and interfaces. It is important to know what functionality your tool offers, otherwise you are not able to use it to its fullest potential.

 

That’s why I give you this great overview :

 

What is SparkSession for?

The SparkSession provides the main entry point for DataFrame and Dataset functionality in Spark.

It provides ways to create DataFrames and Datasets, read data from external sources, register temporary views and run SQL queries, as the sketch below shows.
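
A minimal sketch (the app name, local master and file path are example values):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("Example")
        .master("local[*]")
        .getOrCreate();

Dataset<Row> df = spark.read().json("people.json");   // placeholder data file
df.createOrReplaceTempView("people");
spark.sql("SELECT COUNT(*) FROM people").show();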

 

What is SparkContext for?

The SparkContext provides the main entry point for Spark functionality.

The context represents the connection to a Spark cluster, and you can do cluster operations with it, like:

  • broadcast variables to the nodes of your cluster
  • create Resilient Distributed Datasets
  • create accumulators
  • distribute files to the nodes of your Spark cluster
  • distribute jars to the nodes of your Spark cluster
  • submit jobs to the cluster

 

What is the Spark Launcher for?

This class allows you to launch any application in your Spark cluster programmatically from Java! It will launch a new Spark application as a child process.
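
A minimal sketch; the jar path, main class and master URL are placeholders:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/example.jar")        // placeholder jar path
        .setMainClass("com.package.name.mainClass")    // placeholder main class
        .setMaster("spark://10.0.1.10:7077")           // placeholder master URL
        .setDeployMode("cluster")
        .startApplication();

System.out.println("Current state: " + handle.getState());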

 

What is the InProcessLauncher for?

This allows you to launch a new Spark application within the invoking process.

 

What is SparkAppHandle for?

The SparkAppHandle object is returned after starting a Spark app with the SparkLauncher (see the sketch above).

It gives you options to monitor and control the running application, for example querying its state, getting its application id, or stopping it.

 

What is SparkAppHandle.Listener for?
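
SparkAppHandle.Listener is a callback interface you can pass to startApplication(); its methods are invoked whenever the state or information of the launched application changes. A minimal sketch (the log messages are placeholders):

import org.apache.spark.launcher.SparkAppHandle;

SparkAppHandle.Listener listener = new SparkAppHandle.Listener() {
    @Override
    public void stateChanged(SparkAppHandle handle) {
        System.out.println("State changed to " + handle.getState());
    }

    @Override
    public void infoChanged(SparkAppHandle handle) {
        System.out.println("App id: " + handle.getAppId());
    }
};
// new SparkLauncher()... .startApplication(listener);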

 

What is the Spark SQLContext for?

The SQLContext is the older entry point for working with structured data, i.e. DataFrames and Spark SQL queries.

In current Spark versions this functionality is wrapped by the SparkSession, so in new code you will usually go through the SparkSession instead of creating a SQLContext yourself.

 

What are UserDefinedFunctions for?
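
User-defined functions (UDFs) let you register your own column-level logic and use it in Spark SQL queries and DataFrame expressions. A minimal sketch, assuming `spark` is an existing SparkSession:

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF that upper-cases a string
spark.udf().register("toUpper",
        (UDF1<String, String>) s -> s == null ? null : s.toUpperCase(),
        DataTypes.StringType);

spark.sql("SELECT toUpper('hello spark')").show();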

 

Deployment Cycle with Spark and Hadoop with Java

This article will show you one of many possible cycles to deploy your code as quickly and efficiently as possible. Also, we will talk a little about what Hadoop and Spark actually are and how we can use them to make awesome distributed computations!

What are Hadoop and Spark for?

You use your Spark cloud to do very computationally expensive tasks in a distributed fashion.

Hadoop provides the data in a distributed fashion, making it available from multiple nodes and thereby increasing the rate at which every node in the cluster network gets its data.

We will write our code in Java and define cluster computations using the open source Apache Spark library.

After defining the code, we will use Maven to create a fat jar from it, which will contain all the dependencies.

We will make the jar available from multiple sources, so that multiple computation nodes of our Spark cluster can download it at the same time. This is achieved by distributing the data through Hadoop.

What does a deployment cycle with Spark and Hadoop look like in Java?

A typical cycle could look like this :

  1. Write code in Java
  2. Compile code into a fat Jar
  3. Make jar available in Hadoop cloud
  4. Launch Spark Driver which can allocate a dynamic amount of nodes to take care of the computations defined within the jar.

1. Write code in Java

You will have to define a main method in a main class. This will be the code that the cluster runs first, so everything starts from this method.
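
A minimal sketch of such an entry point; the package and class name match the placeholder used in the spark-submit command further down:

package com.package.name;

import org.apache.spark.sql.SparkSession;

public class mainClass {
    public static void main(String[] args) {
        // Everything the cluster executes starts from here
        SparkSession spark = SparkSession.builder().appName("ClusterJob").getOrCreate();

        // ... define your distributed computation here ...

        spark.stop();
    }
}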

2. Compile code into fat Jar

mvn clean compile assembly:single

3. Make jar available from Hadoop cloud

Go into your Hadoop web interface and browse the file system

3.1 Create a folder in the cloud and upload the jar

After uploading your jar into your Hadoop cloud, it will be available to any computer that can talk to the Hadoop cloud. It is now available in a distributed fashion on all the Hadoop nodes and is ready for highly efficient and fast data exchange with any cluster; in our example we use a Spark cluster.

If your hadoop node is called hadoop_fs and port is 9000, your jar is available to any node under the following URL:

hdfs://hadop_fs:9000/jars/example.jar

4. Launch distributed Spark Computation

To launch the driver, you need the spark-submit script. The most straightforward way to get it is to just download the Spark distribution and unzip it.

wget http://apache.lauf-forum.at/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz

4.1 Launch Spark driver from command line.

Go to the directory where you have unzipped your Spark library, for me it would be

loan@Y510P:~/Libraries/Apache/Spark/spark-2.3.0-bin-hadoop2.7$

The script ./bin/spark-submit has all the functionality we will require.

4.2 Gathering the Parameters

You need the following parameters to launch your jar in the cluster:

  • Spark master URL
  • Hadoop jar URL
  • Name of your main class
  • Define --deploy-mode as cluster to run the computation in cluster mode

4.3 Final step: Put the parameters together and launch the jar in the cluster

./bin/spark-submit --class com.package.name.mainClass --master spark://10.0.1.10:6066 --deploy-mode cluster hdfs://hadop_fs:9000/jars/example.jar

This tells the Spark cluster where the jar we want to run is located. It will launch a user-defined (or otherwise appropriate) number of executors and finish the computation in a distributed fashion.

Your task should now show up in your Spark web interface.

What have you learned :

  • How to turn your Java code into a fat jar
  • How to deploy your fat jar into the Hadoop cloud
  • How to run your code distributed in Spark, using Hadoop as the data source

Convolution Operators


Overview of  a few convolution operators

Sobel Operator for Edge Detection

The Sobel operator calculates the pixel's first-order derivatives in the X and Y directions.

If the derivative is small, there is little structure in that area.

If the derivative is high in one direction, we have an edge in that direction.

If the derivative is high in two directions, we have a corner / point of interest.
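
A minimal sketch of the Sobel operator in Java on a grayscale image stored as a 2D double array (no image library assumed); it computes the gradient magnitude from the X and Y derivatives:

// 3x3 Sobel kernels for the x and y derivatives
static final double[][] SOBEL_X = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
static final double[][] SOBEL_Y = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};

static double[][] sobelMagnitude(double[][] img) {
    int h = img.length, w = img[0].length;
    double[][] out = new double[h][w];
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            double gx = 0, gy = 0;
            // Convolve the 3x3 neighborhood with both kernels
            for (int ky = -1; ky <= 1; ky++) {
                for (int kx = -1; kx <= 1; kx++) {
                    gx += SOBEL_X[ky + 1][kx + 1] * img[y + ky][x + kx];
                    gy += SOBEL_Y[ky + 1][kx + 1] * img[y + ky][x + kx];
                }
            }
            out[y][x] = Math.sqrt(gx * gx + gy * gy);  // gradient magnitude
        }
    }
    return out;
}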

Gaussian Smoothing

Replace each pixel by its local, Gaussian-weighted average.

It is used for scale space.

The Laplace Operator

The difference compared to the Sobel operator is that it uses the second-order derivative.

This makes the Laplace operator very sensitive to noise.

Edges are where the second derivative crosses zero (second derivative = 0 marks an extremum of the gradient!).

Laplacian of Gaussian (LoG) Filter:

  1. First smooth with a Gaussian filter.

  2. Then find zero crossings with the Laplacian filter.

  3. Both steps can also be combined into a single LoG convolution.

Difference of Gaussians (DoG)

The LoG does not have to be calculated explicitly; it can also be approximated by taking the difference between two Gaussian filters at different scales (DoG).

SIFT Detector

  1. Multiple DoG filters are applied to the image at different scales.
  2. The resulting images are stacked on top of each other to create a 3D volume.
  3. Points that are local extrema inside the 3D volume are considered points of interest.
  4. Bad points are removed, like candidates in smooth regions or directly on top of edges.

SIFT Descriptor

  1. For a detected point of interest, choose a 16×16 region around the point.
  2. Compute the gradient for each pixel.
  3. Subdivide the region into 16 groups of 4×4 pixels.
  4. Compute an orientation histogram for each group.
  5. Glue the histograms together to get a 128-element feature vector.

Characteristics

  • SIFT is very resilient (invariant) to constant intensity changes, as it is based on gradients.
  • Very invariant to contrast changes, as the histogram binning provides normalization.
  • Invariant to small deformations.

Scale invariance comes from the SIFT detector.