Java Spark Tips, Tricks and Basics 6 – How to broadcast a variable to a Spark cluster? Why do we need to broadcast variables?

Why do we need Spark broadcast variables?

Spark is all about cluster computing. In a cluster of nodes, each node of course has its own private memory.

If we want all the nodes in the cluster to work towards a common goal, they need some way to share variables.

Let’s say we want to sum up all the rows in a CSV table with 1 million lines. With two worker nodes, it makes sense to let one node work on half a million rows and the other node on the remaining half a million. Both calculate their partial results, and the driver program then combines them.
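To make the split-and-combine idea concrete, here is a minimal sketch in plain Java. No Spark is involved: the class name `SplitSum` and both method names are made up for illustration, and the two "nodes" are simply two method calls on different halves of an array.

```java
import java.util.stream.LongStream;

public class SplitSum {

    // Sums rows[from, to): a stand-in for the work one node does on its share.
    static long partialSum(long[] rows, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += rows[i];
        }
        return sum;
    }

    // Plays the role of the driver: hands out two halves and combines the results.
    static long combinedSum(long[] rows) {
        int mid = rows.length / 2;
        long first = partialSum(rows, 0, mid);            // "node 1"
        long second = partialSum(rows, mid, rows.length); // "node 2"
        return first + second;
    }

    public static void main(String[] args) {
        // Hypothetical data: the values 1..1,000,000 instead of a real CSV file.
        long[] rows = LongStream.rangeClosed(1, 1_000_000).toArray();
        System.out.println(combinedSum(rows)); // prints 500000500000
    }
}
```

In a real Spark job the splitting and combining is done by the framework; the sketch only shows why each worker needs nothing but its own share of the rows for this kind of calculation.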

Broadcasting allows us to create a read-only cached copy of a variable on every node in our cluster. The distribution of those variables is handled by efficient broadcast algorithms implemented by Spark under the hood. This also takes away the burden of thinking about serialization and deserialization, since good old Spark takes care of that!

This great functionality for broadcasting is provided by the SparkContext class (JavaSparkContext in the Java API). Alternatively, one can also consider using the Broadcast class right away, do your work

How to broadcast a variable in Spark Java
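The following is a minimal sketch of how this can look, assuming Spark's Java API is on the classpath and the job runs in local mode. The class name `BroadcastExample`, the helper methods and the country lookup table are hypothetical and only serve as an illustration. The Spark calls used are `JavaSparkContext.broadcast(...)` to create the broadcast variable and `Broadcast.value()` to read it inside a task.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastExample {

    // Hypothetical lookup table: country code -> country name.
    // The broadcast value has to be serializable; HashMap is.
    static Map<String, String> countryNames() {
        Map<String, String> names = new HashMap<>();
        names.put("DE", "Germany");
        names.put("FR", "France");
        names.put("US", "United States");
        return names;
    }

    static List<String> resolve(JavaSparkContext sc, List<String> codes) {
        // Create the broadcast variable: a read-only copy cached on every node.
        Broadcast<Map<String, String>> lookup = sc.broadcast(countryNames());

        JavaRDD<String> rdd = sc.parallelize(codes);

        // Inside the task, read the cached copy through value().
        return rdd
                .map(code -> lookup.value().getOrDefault(code, "unknown"))
                .collect();
    }

    public static void main(String[] args) {
        // local[*] is an assumption for trying this out on a single machine.
        SparkConf conf = new SparkConf()
                .setAppName("BroadcastExample")
                .setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println(resolve(sc, Arrays.asList("DE", "US", "XX")));
            // prints [Germany, United States, unknown]
        }
    }
}
```

The lambda only captures the small `Broadcast` handle, not the map itself, so the lookup table is not shipped again with every task.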

What did we learn?

In this short tutorial, you learned what a Spark broadcast variable is for, what it does and how to use it in Java.
