Apache Spark, an open source cluster computing system, is growing fast. Apache Spark has a growing ecosystem of libraries and framework to enable advanced data analytics. Apache Spark's rapid success is due to its power and and ease-of-use. It is more productive and has faster runtime than the typical MapReduce based analysis. Apache Spark provides in-memory, distributed computing. It has APIs in Java, Scala, Python, and R. The Spark Ecosystem is shown below.
The entire ecosystem is built on top of the core engine. The core enables in memory computation for speed and its API has support for Java, Scala, Python, and R.Streaming enables processing streams of data in real time. Spark SQL enables users to query structured data and you can do so with your favorite language, a DataFrame resides at the core of Spark SQL, it holds data as a collection of rows and each column in the row is named, with DataFrames you can easily select, plot, and filter data. MLlib is a Machine Learning framework. GraphX is an API for graph structured data. This was a brief overview on the ecosystem.
A little history about Apache Spark:
The reason people are so interested in Apache Spark is it puts the power of Hadoop in the hands of developers. It is easier to setup an Apache Spark cluster than an Hadoop Cluster. It runs faster. And it is a lot easier to program. It puts the promise and power of Big Data and real time analysis in the hands of the masses. With that in mind, let's introduce
A great way to experiment with Apache Spark is to use the available interactive shells. There is a Python Shell and a Scala shell.
To download Apache Spark go here , and get the latest pre built version so we can run the shell out of the box.
Right now Apache Spark is version 1.4.1 released on July 15, 2015.
cd spark-1.4.1-bin-hadoop2.4 ./bin/pyspark
We won't use the Python shell here in this section.
The Scala interactive shell runs on the JVM therefore it enables you to use Java libraries.
Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.4.1 /_/ Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25) Type in expressions to have them evaluated. Type :help for more information. 15/08/24 21:58:29 INFO SparkContext: Running Spark version 1.4.1
The following is a simple exercise just to get you started with the shell. You might not understand what we are doing right now but we will explain in detail later. With the Scala shell, do the following:
textFile.first() res3: String = # Apache Spark
You can filter the RDD
val linesWithSpark = textFile.filter(line => line.contains("Spark")) linesWithSpark.count() res10: Long = 19
To find the line with the most amount of words in the RDD
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b) res11: Int = 14
Line 14 has the most words.
You can also import Java libraries for example like the
import java.lang.Math textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) res12: Int = 14
We can easily cache data in memory for example. Lets cache the filtered RDD
linesWithSpark.cache() res13: linesWithSpark.type = MapPartitionsRDD at filter at <console>:23 linesWithSpark.count() res15: Long = 19
This was a brief overview on how to use the Spark interactive shell.
Spark enables users to execute tasks in parallel on a cluster. This parallelism is made possible by using one of the main component of Spark, a RDD. A RDD (Resilient distributed data) is a representation of data. A RDD is data that can be partitioned on a cluster (sharded data if you will). The partitioning enables the execution of tasks in parallel. The more partitions you have, the more parallelism you can do. The diagram bellow is a representation of a RDD:
Think of each column as a partition, you can easily assign these partitions to nodes on a cluster.
In order to create a RDD, you can read data from an external storage; for example from Cassandra or Amazon Simple Storage Service, HDFS, or any data that offers Hadoop input format. You can also create a RDD by reading a text file, an array, or JSON. On the other hand if the data is local to your application you just need to parallelize it then you will be able to apply all the Spark features on it and do analysis in parallel across the Apache Spark Cluster. To test it out, with a Scala Spark shell:
val thingsRDD = sc.parallelize(List("spoon", "fork", "plate", "cup", "bottle")) thingsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD at parallelize at <console>:24
thingsRDD.count() res16: Long = 5
In order to work with Spark you need to start with a Spark Context. When you are using a shell, Spark Context already exists as
What can we do with a RDD?
With a RDD, we can either transform data or take actions on that data. This means with a transformation we can change its format, search for something, filter data etc. With actions you make changes, you pull data out, collect data, and even
For example, lets create a RDD
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
Using the previous diagram where we showed how a
It is worth mentioning, we also have what is called a Pair RDD, this kind of RDD is used when we have a key/value paired data. For example if we have data like the following table, Fruits matching its color:
We can execute a
pairRDD.groupByKey() Banana [Yellow] Apple [Red, Green] Kiwi [Green] Figs [Black]
This transformation just grouped 2 values which are (Red and Green) with one key which is (Apple). These are examples of transformation changes so far.
Once we have filtered a RDD, we can collect/materialize its data and make it flow into our application, this is an example of an action. Once we do this, all the data in the RDD are gone, but we can still call some operations on the RDD's data since they are still in memory.
Important to note that every time we call an action in Spark for example a count() action, Spark will go over all the transformations and computations done to that point and then return the count number, this will be somewhat slow. To fix this problem and increase the performance speed you can cache a RDD in memory. This way when you call an action time after time, you won't have to start the process from the beginning, you just get the results of the cached RDD from memory.
If you like to delete the RDD
Otherwise Spark automatically delete the oldest cashed RDD using the least recently used logic (LRU).
Here is a list to summarize the Spark process from start to end:
Here is a list of some of the transformations that can be used on a RDD:
Here is a list of some of the actions that can be made on a RDD:
For the full lists with their descriptions, check out the following Spark documentation.
Have a team who wants to get started with Apache Spark?
Check out our Apache Spark QuickStart Course for real-time analytics.
This two-day course introduces experienced developers and architects to Apache Spark™. Developers will be enabled to build real-world, high-speed, real-time analytics systems. This course has extensive hands-on examples. The idea is introduce key concepts that make Apache Spark™ such an important technology. This course should prepare architects, development managers, and developers to understand the possibilities with Apache Spark™.