How was my first contact with Apache Spark? To be honest, it was not a piece of cake.
There were several hurdles. First, after downloading Spark you have to manually download Scala and set up the Scala environment variables to make it work.
Second, I had an issue with JVM memory. After sbt package, setting _JAVA_OPTIONS to some -Xmx value solved it and Spark started. But later the JVM memory error returned, so I simply switched the JVM from jdk7u79 to jdk8 and... no more tears.
Count the words
I like learning by example, so let's take a simple use case: counting the words on a webpage. To start, I just copied the text from the page into a text file. What to do? First, make your first RDD, loading the file via
val dlines = sc.textFile("C://path_to_file//site.txt")
then check that it actually contains data and print the first 10 lines.
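A quick sanity check might look like this (a sketch assuming the interactive spark-shell, where sc and dlines are already defined):

```scala
// count() is an action, so it forces the file to actually be read
println(dlines.count())

// Peek at the first 10 lines of the file
dlines.take(10).foreach(println)
```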
Now we want to split each line into words by spaces:
val word_arrays = dlines.map(ln => ln.split(" "))
Having this strange array of arrays, let's flatten it into an RDD of all the words. Note that flatMap works directly on the RDD, so there is no need to collect() everything to the driver first:
val dwords = word_arrays.flatMap(y => y)
With an RDD made up of words it is possible to have fun. To count the words, we pair each word with the number 1. After that we accumulate all the pairs with the magic of reduceByKey:
val words_nr = dwords.map( s=>(s,1))
val counts = words_nr.reduceByKey((a, b) => a + b)
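To see what reduceByKey actually does, here is a tiny sketch on hand-made pairs (the sample words are made up for illustration):

```scala
// Hypothetical sample: three (word, 1) pairs
val sample = sc.parallelize(Seq(("spark", 1), ("scala", 1), ("spark", 1)))

// reduceByKey merges the values of equal keys with the given function,
// combining locally on each partition before shuffling across the cluster
val merged = sample.reduceByKey(_ + _)

// Yields ("spark", 2) and ("scala", 1), in some order
merged.collect().foreach(println)
```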
We have all the counts, but they are in no particular order. Let us sort them and see which words are most popular on the site.
val sorted = counts.sortBy(k => k._2, ascending = false)
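Put together, the whole pipeline chains into a few lines; this is just the steps above in one expression, nothing new:

```scala
// Load, split, pair, count, and sort in one chained expression
val top = sc.textFile("C://path_to_file//site.txt")
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

// Print the ten most frequent words
top.take(10).foreach(println)
```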
Further nice things
Of course what is here is not everything: you can nicely communicate with databases or use Hadoop-friendly Parquet files. There is a rich set of add-ons, with MLlib, used for machine learning, worth a mention.
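For instance, saving the word counts as Parquet takes only a couple of lines once the RDD is turned into a DataFrame (a sketch assuming Spark 1.x with an ambient sqlContext; the output path is made up):

```scala
// Needed for the .toDF conversion on RDDs of tuples
import sqlContext.implicits._

// Turn the (word, count) RDD into a DataFrame with named columns
val countsDF = counts.toDF("word", "count")

// Write it out as Parquet, a columnar format Hadoop tools understand
countsDF.write.parquet("C://path_to_file//word_counts.parquet")
```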