Hadoop 3 features
by Kamil Owczarek, 9 AM.
We started to learn how HDFS handles data. It is distributed storage system, where data is replicated by default.. Then we moved to bigdata’s hello world using Hadoop MapReduce, thinking of steps needed to do the word count. As Hadoop was developed there was YARN introduced to handle resources. The major flaw of Map-Reduce was people sometimes do not care about reduce phase, so what is the point of writing to file on each map phase, especially if we have many of them.
As hadoop 3 comes there are some features which can change future of Spark.
* Erasure coding
This is a feature helping you to have replication with less disk space required, having the same failover, but it comes with a cost of processing.
* Opportunistic containers, allow the user to focus more on the main part of computation, skipping some not needed tasks (f.e. Spark UI).
NameNodes allow to keep address-data in many places, helping to avoid SPoF
* Custom resource, we can use not only CPU and memory, but harness the power of GPU to run computations.
For a reason Hadoop 3 runs on Java8 , Spark can use Scala 2.12 using Java lambdas naturally compiled from Scala lambdas, f.e.
Spark will benefit from Hadoop 3, because it will be able to run faster, use all of the resources efficiently, make use of Java and Scala interfaces and increase availability.
by Kuba Nabrdalik, 10 AM
If you want to test efficiently, it is hard to do integration tests – they are slow. Solution is to create modules, separate entities, which makes them good canditates for microservices. TDD can start with writing sample inmemory DB-like HashMap. Doing good unit test means testing behaviour not looking underneath, because we don’t want to end up testing an implementation. We can be lost if we write too much, so test should be concise, we are to check relevant matters. If you want to simplify things, imagine you are using whiteboard to draw it. Some time ago there was a buzz about plain text tests that could be maintained by business. But face the reality, it is very unlikely they are willing to support it, so much better would be to write nice Spec using JUnit and care ourseleves. All of the examples were shown using Spock tests which look very „spoko” (PL). Writing specifications in plain English will prevent you from unneccessary tests. I learned also I should keep my tests running under 10 seconds not to lose interest, and to be distracted while executing them.
by Michał Matłoka, 11 AM
Of course there is a hype about that, so we are to be careful, BigData is not a magic tool we can use with any problem. There are various use cases, f.e. IoT or data analysis. Then we moved to exemplary problem: air pollution data. We have many NoSQL tools, which were presented shortly. Then we could differentiate between stream processing and batch processing – the first is about real-time data, the latter is done on demand. Stream processing can show up in few forms, f.e. microbatching (old Spark Streaming) or native (Flink). Do you know differences between Lambda and Kappa? Prepare as it is nice BigData interview question.
by Markus Winand, 1 PM
It started with clear statement, if you compare SQL with other tools basing on its implementation from 1992 it is not fair. So Markus took challenge to say about some of features that came after that memorable year.
With SQL 1999 we had
WITH keyword makes it possible to replace old queries nesting. You can then make sth like literate programming in SQL.
WITH RECURSIVE allows you to create loops inside of the query, then reusing dataset. Where to use: generating random data or filling gaps when needed for joins.
In SQL 2003 we received
OVER , which makes aggregation possible when used with
PARTITION BY. We can also use it with
ORDER BY to make running totals and moving averages (f.e. balance after transactions).
SQL 2006 introduced
XMLTABLE can parse contents of XML in the cell, there we should use XPath to create temporary table. In SQL 2008
FETCH FIRST is standard way of fetching first rows of table. From SQL 2011
OFFSET allows to use pagination, but it is not efficient so do not use it!
LAG enble you to refer to next and previous rows (fe balance calculating). Lately (2011) system versioning was done to version all changes in table with WITH VERSIONING. Tbe bottom line is to be remembered:
SQL has evolved beyond the relational idea.
As me being Luke since I was born I am even more to follow the guideline found in the URL of the speakers blog: https://use-the-index-luke.com/
by Marcin Szymaniuk, 2:10 PM
We were told some use cases.
User activity can be checked. We can find value out of it, f.e. what are the features we should care about. Spark makes good use of HDFS, any type of data can be utilized.
Network improvement. Problem: we check how customers are using telco network, and scoring features. Thanks to that we can understand how likely it is they will stay with us. But why do we care about fast analysis? Because we want to learn about data instantly, and not to store if we don’t need to, to draw conclusions and make decisions.
geospatial data. We have many sources but we want to blend it into one map.
Then we were told about basic building blocks. RDD / DataFrame. Task (f.e. map) can be done easily, there is no need to shuffle data between partitions. These are narrow ones. We also have wide transformations (f.e. group by). Task for file system is done per block usually. While doing join we should be careful about load of the task (controlled by spark properties). The way to solve it is to introduce randomness into data, by creating column with salt. Then jobs will be spread between executors nicely. At the end we were presented with another pyramid. This one proves, there is a lot of more work to be done in cleaning and data-ingestion than ML and cool stuff.
3:10PM, Philip Krenn
This was one of the most awaited talks by me. Elasticsearch advocate was telling what are the new features introduced in Elasticsearch v6 and more. Presentation started with nice offtopic I was not aware of: topic of Elasticsearch started when one husband wanted to help his wife with recipes management, so he started to automate search process.
Talking about ES6 there were some features introduced. ES5 were silently falling when extra parameters were introduced, the newer version is return meaningful errors that help to fix the problem. Moreover we have upgrade assistant, allowing to perform rolling upgrades, without need to stop Elastic cluster. We also have transaction log, useful when one node was unavailable. We will need to get rid of types, too. Soon we will have only indices (as data type). This will follow Lucene idea a bit closer, because it was always artificial to have document types from Lucene index perspective. And finally we will have chance to change sharding settings without need of reindexing.
What I extremely like in this presentation it was fully interactive and speaker was showing nearly all the things he mentioned during the talk. Good job, continue with work like that guys! Devs want to see code.
4.20PM by Andrzej Ludwikowski
When can it be said perfomance test is ok? If they simulate production env closely. We need monitoring to start with performance. And monitoring should cover various areas of development – JVM, app, server metrics etc. To analyze monitoring we’d better first understand how does the app behave when it is healthy. Logging is important, and graylog was suggested, however ELK was not considered as a whole (rather Logstash itself). JVM Profiling was mentioned, if you claim you’re senior developer, it is a must. Why?
„Without data you’re just another person with an opinion”.
Mathematics is something we should care, usually avg, median; percentiles should be considered. What tools we may use to execute performance tests: Gatling, JMeter (heavier). Why Gatling is better: uses programming lang and using less memory. We were shown some samples in Gatling DSL. It is nice in case of error management, cause it will handle even distrubuted messaging scenario delays. In cloud era we would like to test distributed system, too, but there are web services allowing to run it from cloud against the cloud (f.e. Flood.io)
by Christian Vorhemus, 11 AM, Friday
First we were given a quiz by Christian, how was are quantum computers compared to classical ones in terms of classical sort of data. The answer is: they are NOT. So the presentation we were given was to help us think and demistify quantum possibilities. On the other hand we expected to receive some nice piece of knowledge. And we were satisfied.
From the very beginning Christian introduced to us all of the building block and explained why searching for an element in the list has complexity of O(sq n) – much less then O(n) in classical computing. However there is some debt that comes along with the technical improvement. We have to face some quantum problems like tunelling. So how to buld an circuit? There are a bit of different from classical logic elements. Speaker tried to show Hadamard , X, and CNOT gates explaining how do they work, however I found it hard as my math background is not so strong.
Stream processing in telco
by Marcin Próchnik, 11AM, Friday
This was one of the most awaited by me as it tells about my niche – data processing and streaming. As I have some experience in Kafka+Flink I was eager to hear what issues do they have as a team working with streams and why did Nussknacker came to existance. So basically it was a story about Nussnacker, which is framework written by guys from Toux , and it is serving as nice GUI over bulding aggregations and alerting system based on Kafka and Flink. Pretty nice tool.
What are guys in domain of telco business interested in? There are several use cases, when stream processing brings added value. First we have marketing. It is good to reach the customer with fully tailored offers, helping him to get what he wants, f.e. when travelling in some area outside of the country he can be offered with good roaming offer (from what I understood). Secondly we could be interested in fraud detection, as someone could use services breaking rules in the same time. That fraud detection can be implemented by aggregations over some time windows.
Challenges they faced were pretty understandable. If we have lots of jobs defined on top of Flink we could have thread overflow 🙂 And we have to maintain these JVM threads in some manner. Another challenge is data enrichment. After ingesting it we may have a need with fetching data from other sources. If they had Kafka streams when they started working on the solution who knows if the tool had the same engine as it has now.
Some year ago I lost my enthusiasm toward the JVM techniques. However after this conference I understood that it would be great to 1) learn Scala and 2) dig in into JVM internals.
This is because I would like to do better data processing and both Spark and Kafka are natively using Scala, so why not to follow the best practices when using this tools. Scala has some different attitude toward programming so I guess I could be better software engineer aftewards, having another Pokemon in my toolbelt.
I was touched by expression heard during the conference we should have some knowledge on JVM. I must admit I was not so zealous about it lastly, rather acting as a client of the machine than machine-keeper. But I have to grow up, now we do maintain a lot of Java tools on prod, and working as prod-responsible developer I have to make right decisions basing not on my opinions but on facts.