SPARK(ling)

Problems

How was my first contact with Apache Spark? To be honest it was not piece of cake.
There were several cases. First, after donwloading spark you have to manually download scala and set up scala environmental variables to make it living.
Second I had issue with JVM memory, after sbt package and setting _JAVAOPTIONS to some -Xmx values it was solved and Spark started. But later on JVM memory error returned then I have just changed JVM from jdk7u79 to jdk8 and .. no more tears.

Sample use-case

count the words

I like learning by example. So we have simple use case. Count the words from webpage. To start with I have just copied data from page to text file. What to do? First make your first RDD,
load it via

[code lang=”scala”]
val dlines = sc.textFile("C://path_to_file//site.txt")
[/code]

check if it has any data and print first 10 lines

[code lang=”scala”]
dlines.count()
dlines.take(10).foreach(println)
[/code]

Now we may want to split words in lines by space

[code lang=”scala”]
val word_arrays = dlines.map(ln=>ln.split(" "))
[/code]

Having this strange array of array let’s flatten it making array of all words here

[code lang=”scala”]
val words = word_arrays.collect().flatMap(y=>y)
val dwords = sc.parallelize(words)
[/code]

With RDD made up from words it is possible to have fun. To count the words we assign to each word one number. After that we shall accumulate all by magic reduceByKey

[code lang=”scala”]
val words_nr = dwords.map( s=>(s,1))
val counts = words_nr.reduceByKey((a, b) => a + b)
counts.take(10).foreach(println)
[/code]

We have all the counts, but there is mess. Let us sort and see what words are most popular on the site.

[code lang=”scala”]
scala> val sorted = counts.sortBy(k=>k._2, false)
scala> sorted.take(20).foreach(println)
//resulting
(the,339)
(a,194)
(to,186)
(of,168)
(,144)
(in,118)
(and,99)
(is,97)
(on,88)
(Spark,77)
(for,65)
(be,62)
(an,59)
(RDD,52)
(that,52)
(can,49)
(each,45)
(or,45)
(are,44)
(as,43)
[/code]

Voila!

Further nice things

Of course what is here is not everything, you may nicely communicate with databases or use Hadoop-dedicated parquets. There is rich number of addons, with worth to mention MLib used in machine learning.