luk creates datamining / integration / software

Tour of Scala, first steps (Thu, 31 May 2018)

Motivation

I attended Geecon 2018, this year's edition of the IT conference. Two main points convinced me. First, this is the JVM, and as far as I can see there is a lot of JVM code running all around the world, so even if the whole world were magically converted to Python and the like, it would take ages, and someone has to keep that code running. Second, this is the language of Spark and Kafka, two projects I am interested in in the area of stream processing in the Java world.

Therefore I would like to be more fluent in it and write these processing pipelines in the language these tools suggest, so I can make use of all of their features.

First snacks

Scala is nice. It is less verbose. And as far as I can see it learned some lessons from Java and built the fixes into its syntax (e.g. immutability by default). This makes my first steps pleasant. I especially like the following:
– inferred types
– immutability
– in-line functions


I decided to take the Tour of Scala. You can see the way I am learning in my repo on GitHub here. Learning is done by writing a unit test and checking its results with sbt test; then I proceed to the next point. It helps me to:
1. understand each point better
2. have fun while learning
3. make a step-by-step report, easy to review

This is nice. It took a while to find the best tool, to get it all up and running on my Windows PC, and to have it running in IntelliJ, but finally I succeeded.

This is a sample of how I play with it.
I write a spec in Scala:

it should "make Fenix singleton by it being an Object" in {
  Fenix.createOrRecreateFromAshes()
  val historyOfFenix = Fenix.createOrRecreateFromAshes()
  assert(historyOfFenix == 2)
}

that uses the following object:

object Fenix {
  var livesCount = 0
  def createOrRecreateFromAshes() = {
    livesCount += 1
    livesCount
  }
}

This and much more at my GitHub; you can check it, no worries.

Geecon 2018, day 2&3 (Fri, 11 May 2018)

Hadoop 3 features

by Kamil Owczarek, 9 AM.
We started by learning how HDFS handles data. It is a distributed storage system where data is replicated by default. Then we moved to big data's hello world using Hadoop MapReduce, thinking through the steps needed to do a word count. As Hadoop developed, YARN was introduced to handle resources. The major flaw of MapReduce was that people sometimes do not care about the reduce phase, so what is the point of writing to a file after each map phase, especially if we have many of them?
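The word-count "hello world" mentioned above can be sketched in plain Python, as a toy simulation of the map, shuffle and reduce phases (not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(line):
    # emit (word, 1) pairs, like a Hadoop mapper
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # group values by key, like the framework's shuffle step
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # sum the counts for each word, like a Hadoop reducer
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["hello big data", "hello hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
# counts == {'hello': 2, 'big': 1, 'data': 1, 'hadoop': 1}
```

In real Hadoop the shuffle writes intermediate files to disk between the phases, which is exactly the cost the speaker questioned when no reduce is needed.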

As Hadoop 3 arrives, there are some features which can change the future of Spark.
* Erasure coding
This is a feature that gives you replication with less disk space required and the same failover guarantees, but it comes at a cost of extra processing.
* Opportunistic containers allow the user to focus more on the main part of the computation, skipping some tasks that are not needed (e.g. Spark UI).
* Multiple NameNodes keep address data in many places, helping to avoid a SPoF.
* Custom resources: we can use not only CPU and memory, but also harness the power of GPUs to run computations.
Importantly, Hadoop 3 runs on Java 8, so Spark can use Scala 2.12, with Java lambdas naturally compiled from Scala lambdas.

Spark will benefit from Hadoop 3, because it will be able to run faster, use all of the resources efficiently, make use of Java and Scala interfaces and increase availability.


by Kuba Nabrdalik, 10 AM
If you want to test efficiently, integration tests are hard to rely on: they are slow. The solution is to create modules, separate entities, which also makes them good candidates for microservices. TDD can start with writing a sample in-memory DB-like HashMap. A good unit test means testing behaviour, not looking underneath, because we don't want to end up testing an implementation. We can get lost if we write too much, so tests should be concise; we are there to check relevant matters. If you want to simplify things, imagine you are using a whiteboard to draw it. Some time ago there was a buzz about plain-text tests that could be maintained by the business. But face reality: it is very unlikely they are willing to support them, so it is much better to write a nice spec using JUnit and take care of it ourselves. All of the examples were shown using Spock tests, which look very „spoko" (PL). Writing specifications in plain English will save you from unnecessary tests. I also learned I should keep my tests running under 10 seconds, so I don't lose interest or get distracted while executing them.

Big Data

by Michał Matłoka, 11 AM
Of course there is hype around it, so we have to be careful: BigData is not a magic tool we can apply to any problem. There are various use cases, e.g. IoT or data analysis. Then we moved to an example problem: air pollution data. We have many NoSQL tools, which were presented briefly. Then we could differentiate between stream processing and batch processing: the first is about real-time data, the latter is done on demand. Stream processing can show up in a few forms, e.g. micro-batching (old Spark Streaming) or native (Flink). Do you know the differences between the Lambda and Kappa architectures? Prepare, as it is a nice BigData interview question.


by Markus Winand, 1 PM

It started with a clear statement: if you compare SQL with other tools based on its implementation from 1992, it is not fair. So Markus took on the challenge of presenting some of the features that came after that memorable year.
With SQL:1999 we got the WITH keyword, which makes it possible to replace old-style query nesting. You can then do something like literate programming in SQL. WITH RECURSIVE allows you to create loops inside a query, reusing the dataset. Where to use it: generating random data, or filling gaps when needed for joins.
In SQL:2003 we received OVER, which makes aggregation possible when used with PARTITION BY. We can also use it with ORDER BY to compute running totals and moving averages (e.g. the balance after each transaction).
SQL:2006 introduced XMLTABLE, which can parse the XML content of a cell; there we use XPath to create a temporary table. In SQL:2008, FETCH FIRST is the standard way of fetching the first rows of a table. From SQL:2011, OFFSET allows pagination, but it is not efficient, so do not use it! LEAD and LAG enable you to refer to the next and previous rows (e.g. for balance calculations). Also in 2011, system versioning was added to version all changes in a table. The bottom line to be remembered:

SQL has evolved beyond the relational idea.
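A couple of these post-1992 features can be tried from Python's built-in sqlite3 module (a sketch; SQLite has supported WITH RECURSIVE since 3.8.3 and window functions since 3.25, so a reasonably recent Python is assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# WITH RECURSIVE (SQL:1999): generate the numbers 1..5 as a dataset
numbers = conn.execute("""
    WITH RECURSIVE nums(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM nums WHERE n < 5
    )
    SELECT n FROM nums
""").fetchall()
# numbers == [(1,), (2,), (3,), (4,), (5,)]

# OVER ... ORDER BY (SQL:2003): a running total, e.g. balance after transactions
conn.execute("CREATE TABLE tx (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO tx VALUES (?, ?)", [(1, 100), (2, -30), (3, 50)])
balances = conn.execute("""
    SELECT id, SUM(amount) OVER (ORDER BY id) AS balance FROM tx
""").fetchall()
# balances == [(1, 100), (2, 70), (3, 120)]
```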

Being Luke since I was born, I am all the more inclined to follow the guideline found in the URL of the speaker's blog:


by Marcin Szymaniuk, 2:10 PM

We were told some use cases.

User activity can be analyzed. We can find value in it, e.g. which features we should care about. Spark makes good use of HDFS; any type of data can be utilized.
Network improvement. Problem: we check how customers are using the telco network, scoring features. Thanks to that we can understand how likely they are to stay with us. But why do we care about fast analysis? Because we want to learn about the data instantly, draw conclusions and make decisions, and not store it if we don't need to.
Geospatial data. We have many sources, but we want to blend them into one map.
Then we were told about the basic building blocks: RDD / DataFrame. A task like map can be done easily; there is no need to shuffle data between partitions. These are the narrow transformations. We also have wide transformations (e.g. group by). A file-system task is usually done per block. When doing a join we should be careful about the load of each task (controlled by Spark properties). One way to solve skew is to introduce randomness into the data by creating a column with a salt; then jobs will be spread between executors nicely. At the end we were presented with another pyramid. This one shows there is a lot more work to be done in cleaning and data ingestion than in ML and the cool stuff.
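The salting trick can be illustrated without Spark (a pure-Python sketch of the idea; in Spark you would add a random salt column, group by (key, salt), then combine the partial aggregates):

```python
import random
from collections import Counter

random.seed(42)

# heavily skewed key distribution: almost everything lands on one key
records = [("hot_key", 1)] * 1000 + [("rare_key", 1)] * 10

NUM_SALTS = 4

# phase 1: aggregate by (key, salt) so the hot key is split across buckets
salted = Counter()
for key, value in records:
    salted[(key, random.randrange(NUM_SALTS))] += value

# phase 2: strip the salt and combine the partial aggregates
totals = Counter()
for (key, _salt), partial in salted.items():
    totals[key] += partial

# totals["hot_key"] == 1000, but no single salted group had to hold all 1000 rows
```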


3:10PM, Philip Krenn
This was one of the talks I awaited most. An Elastic advocate was telling us what new features were introduced in Elasticsearch v6 and beyond. The presentation started with a nice bit of trivia I was not aware of: Elasticsearch started when a husband wanted to help his wife with recipe management, so he began to automate the search process.
Talking about ES6, several features were introduced. ES5 was silently failing when extra parameters were passed; the newer version returns meaningful errors that help fix the problem. Moreover, we have an upgrade assistant, allowing rolling upgrades without the need to stop the Elastic cluster. We also have a transaction log, useful when a node was unavailable. We will need to get rid of types, too; soon we will have only indices. This follows the Lucene idea a bit more closely, because document types were always artificial from the Lucene index perspective. And finally we will have a chance to change sharding settings without reindexing.

What I liked most about this presentation: it was fully interactive, and the speaker demoed nearly everything he mentioned during the talk. Good job, keep up work like that, guys! Devs want to see code.

Performance tests

4.20PM by Andrzej Ludwikowski
When can we say a performance test is OK? When it simulates the production environment closely. We need monitoring to start with performance work, and monitoring should cover various areas of development: JVM, app, server metrics etc. To analyze monitoring data, we'd better first understand how the app behaves when it is healthy. Logging is important; Graylog was suggested, while ELK was not considered as a whole (rather Logstash itself). JVM profiling was mentioned: if you claim you're a senior developer, it is a must. Why?

„Without data you’re just another person with an opinion”.

Mathematics is something we should care about: usually avg and median are used, but percentiles should be considered. What tools may we use to execute performance tests? Gatling and JMeter (heavier). Why Gatling is better: it uses a programming language and consumes less memory. We were shown some samples of the Gatling DSL. It is nice in terms of error management, as it will handle even distributed-messaging scenario delays. In the cloud era we would like to test distributed systems too, and there are web services allowing you to run tests from the cloud against the cloud (e.g. …).
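The point about averages versus percentiles is easy to demonstrate in plain Python (`statistics.quantiles` is in the standard library from Python 3.8; the latency numbers are made up for illustration):

```python
import statistics

# response times in ms: mostly fast, with a few slow outliers
latencies = [100] * 95 + [2000] * 5

mean = statistics.mean(latencies)                 # 195.0 -- looks "fine"
median = statistics.median(latencies)             # 100   -- what a typical user sees
p99 = statistics.quantiles(latencies, n=100)[98]  # 2000  -- what the unlucky 1% sees

# the mean hides the outliers entirely; the high percentiles expose them
```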

Quantum computers

by Christian Vorhemus, 11 AM, Friday

First we were given a quiz by Christian: how fast are quantum computers compared to classical ones on classical sorts of data? The answer is: they are NOT faster. So the presentation we were given was meant to help us think and demystify quantum possibilities. On the other hand, we expected to receive a nice piece of knowledge, and we were satisfied.

From the very beginning Christian introduced all of the building blocks and explained why searching for an element in a list has complexity O(√n), much less than the O(n) of classical computing. However, some debt comes along with the technical improvement: we have to face quantum problems like tunnelling. So how to build a circuit? The gates are a bit different from classical logic elements. The speaker tried to show the Hadamard, X, and CNOT gates, explaining how they work; however, I found it hard to follow, as my math background is not so strong.

Stream processing in telco

by Marcin Próchnik, 11AM, Friday

This was one of the talks I awaited most, as it covers my niche: data processing and streaming. As I have some experience with Kafka+Flink, I was eager to hear what issues they have as a team working with streams and why Nussknacker came into existence. So basically it was a story about Nussknacker, a framework written by the guys from TouK, serving as a nice GUI for building aggregations and an alerting system based on Kafka and Flink. Pretty nice tool.

What are guys in the telco business domain interested in? There are several use cases where stream processing brings added value. First, we have marketing. It is good to reach the customer with fully tailored offers, helping him get what he wants, e.g. when travelling in some area outside the country he can be offered a good roaming deal (from what I understood). Secondly, we could be interested in fraud detection, as someone could be using services while breaking the rules at the same time. That fraud detection can be implemented with aggregations over time windows.
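Such windowed aggregation is easy to sketch outside Flink (a toy tumbling-window count in plain Python; the event times and the threshold are made up for illustration):

```python
from collections import defaultdict

# (user, event_time_seconds) -- e.g. call-setup events from the network
events = [("alice", 3), ("alice", 7), ("alice", 9),
          ("bob", 12), ("alice", 61), ("bob", 65)]

WINDOW = 60  # tumbling window size in seconds

# count events per user per window
counts = defaultdict(int)
for user, ts in events:
    window_start = (ts // WINDOW) * WINDOW
    counts[(user, window_start)] += 1

# flag (user, window) pairs at or above a threshold as suspicious
THRESHOLD = 3
suspicious = [key for key, n in counts.items() if n >= THRESHOLD]
# suspicious == [("alice", 0)] -- three events in the first minute
```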

The challenges they faced were pretty understandable. If you have lots of jobs defined on top of Flink you can have a thread overflow 🙂 and these JVM threads have to be managed somehow. Another challenge is data enrichment: after ingesting data we may need to fetch data from other sources. If Kafka Streams had existed when they started working on the solution, who knows whether the tool would have the same engine it has now.

Overall feeling

Some years ago I lost my enthusiasm for JVM technologies. However, after this conference I understood that it would be great to 1) learn Scala and 2) dig into JVM internals.

1st: Scala.

This is because I would like to do better data processing, and both Spark and Kafka natively use Scala, so why not follow the best practices when using these tools? Scala has a somewhat different attitude toward programming, so I guess I could be a better software engineer afterwards, having another Pokemon in my toolbelt.

2nd: JVM.

I was touched by a remark heard during the conference that we should have some knowledge of the JVM. I must admit I have not been zealous about it lately, acting as a client of the machine rather than a machine-keeper. But I have to grow up: we now maintain a lot of Java tools in production, and working as a production-responsible developer I have to make the right decisions based not on my opinions but on facts.

Geecon, day 1 (Thu, 10 May 2018)

Geecon is one of the biggest events in the Polish IT calendar. This time I re-visited it and wanted to share my thoughts so as not to forget the things I learned. The following post sums up the 1st day.


10:00 AM, Pete Larson
The conference started with two „soft" talks. The first brought to my mind that it is us, the developers, who are primarily responsible for code quality. If we insist enough on some solutions, our managers (the so-called „business") can learn how important it is to use the right tool for the task. And the whole quality comes from a team cooperating well. To do more, learn from your colleagues and teach them.


10:30 AM, Cliff Click
What happens if someone comes and breaks your flow, shouting at you? The only thing we can work on then is our reaction. If we had to cope with a problem, it is good to wonder what lesson we can learn. There is a nice saying:

„Harvest the best, toss the rest"

And if asked some difficult questions, it is better to prepare a response beforehand. You should be self-confident when talking about money, too. That happens in our IT world pretty often, huh? The good part is to acknowledge that salary-related discussions are a kind of fight, a fight with rules, but still, don't forget you're fighting. So know your value and do not surrender.


11:30 AM, by Peter Palaga
First we were introduced to the differences between source and binary dependencies. Usually, working with Maven, you use only binary, already-compiled dependencies. With the tool introduced by Peter you can use source dependencies, meaning you can use git repository revisions in your project. It can work with Maven or Gradle, but we are advised to use the Maven version. Basically srcdeps intercepts calls to the Maven repo, and if there is a version like „*-SRC-commitid" it goes to the git repo looking for that revision and compiles it on the fly.

Continuous deployment

1:10 PM, by Marcin Grzejszczak
In the end of the day we as devs are being payed for a value we deliver. Delivering has to happen regularly, otherwise we have to face consequences. There are many changes we are introducing so there can be more errors after the deploy. In the ideal world, deployment should happen even without us being conscious of it. But we can safely go into this direction if we care for backward compatibility. This is the time where contract test come to a stage. Producer defines a contract, then tests are generated automatically to check if producer „do not lie”. We were shown a implementation of such a paradigm using spring specs. Having it as a part of your build process we can learn quickly (fail fast) if someone breaks the contract. This is the time where PaaS shines, as with it creating new environments is a piece of cake. It also encourages you to put your infra as a code, thus making it fully reproducible everywhere. Marcin showed that using CloudFoundry, that can be even run on local PC. It moves me to intensify my work on K8S as we have Openshift at my place. The difficult part in maintaing changes is especially having DB rollbacks. The ideal solution would be to avoid is whatsoever. And at the end we faced controversial opinion on removing E2E at all, if we have business allowance for so, and Test Pyramid shape is close to the perfect one. Gonna test Bats as tool for testing bash scripts.


by Jarosław Ratajski

Problem: many companies are using the JVM, so why not use Haskell on the JVM? Follow the possibilities you have at your workplace and stop fighting against them; this is why I am happy to have Docker at my desk 🙂 Using Eta as the compiler, which links *.hs code into a fat runnable, we can run Haskell anywhere. Jarosław showed that by writing a quicksort implementation. Haskell is built by Cabal; Etlas is used for Eta. It contains some imports from C languages, which makes it a little dirty. We learned that you can even become an Eta developer and contributor easily, because it is hard to find developers fluent in that language.

Static vs Dynamic typing

by Thomas Enebo
Thomas first of all nicely introduced himself, as a professional programmer of around 30 years. We had a tricky quiz, asking us to guess whether a language is statically or dynamically typed. It is really hard to tell just by looking at a code snippet. There is some ambiguity across the many definitions of what exactly a dynamically vs statically typed language is.

„If the compile phase is the only check of your prod app's quality, you are doing it wrong."

It is the testing culture that helps you maintain high-quality code; even the TDD acronym was mentioned during the talk. From Thomas's experience you can see he was able to examine lots of languages, analyzing their dynamic-ness and static-ness. The dynamic vs static difference is clear when it comes to IDEs: most „dynamic users" tend to choose bare text editors. Static pros: when the codebase is bigger, static typing is helpful, as you can treat it as a kind of documentation.

Continuous deployment – CD with Travis (Thu, 18 Jan 2018)

„Continuous deployment" is not just a buzzword. It concerns everyone who writes code that can be deployed somewhere. If I want my code to be clicked through by someone other than me, feverishly pointing a browser at localhost, I have to make it available out there on the web. You can deploy an application by hand, but that is not what this post is about. Automation is your sister, and Travis is your brother. Below is how I deployed a Flask app to Heroku via Travis. The app's code is here on GitHub. The rest is silence.

My app

For my own needs I am building a few small services; I don't know if they are micro, but they are not big. A very simple front (an API with one endpoint) and data processing in Python underneath. Python because it is lightweight, Python because I like it and I don't feel like creating classes when I don't need them. And a web app in Flask is dead simple. I wrote about that before in this little mini-project.


Travis is a tool that builds things. If you have a GitHub account, it is enough to log in O-auth-omatically via GitHub and pick your project. Here is an excerpt of the configuration:

Your project must have a file like this in its root directory; Travis will read and interpret it. The configuration above carries the following information:

  • where the code lives
  • that it has to be tested
  • where to deploy it
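A minimal .travis.yml along these lines might look as follows (a hedged sketch; the Python version, test command and Heroku app name are placeholders of mine, not the original file):

```yaml
language: python
python:
  - "3.6"
install:
  - pip install -r requirements.txt
script:
  - pytest                 # run the tests on every push
deploy:
  provider: heroku
  api_key:
    secure: "<encrypted with `travis encrypt`>"
  app: my-flask-app        # placeholder Heroku app name
```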

After every push to GitHub the whole pipeline is run, so that at the end, if the tests pass, the application is deployed to the target environment. Nice, isn't it? No manually crafted *.sh scripts run on the environment through PuTTY, no worrying about tests. You just write code and push. That's the life. That life is shown in a sample screenshot here:


Heroku is just one of many services where, for a small cost (or even for free, though of course the app is then put to sleep and is not suitable for customers), you can host an application in any technology. For Heroku to detect the Flask app and run it, a key had to be generated with the Heroku CLI.
Then comes the configuration of gunicorn, which is one of the runners for Python web applications.
We put it in a Procfile, interpreted by Heroku.
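Such a Procfile is typically a one-liner (a sketch; `app:app` assumes the Flask module and object are both called `app`, which is my placeholder, not the original project's layout):

```
web: gunicorn app:app
```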

An app like this runs and sticks its cheerful head out to the world. Through the Heroku console you load system variables, monitor logs and other devops wonders. Watching logs, ah, watching..


It can be done. You can set up Continuous Deployment with Travis quickly. Tests and deploy wrapped up in nothing more than your git push. Want to convince yourself? Read the post above.

Next steps

Such a pipeline is quite modest, though; it presents a single level of testing. Ideally there would be several: run unit tests separately, then integrate with other parts of the system, maybe on Docker containers created just for the test. But wonders like that belong on OpenShift.


4developers w Łodzi (Thu, 16 Nov 2017)

The 4developers conference in Łódź is already history, but what a great history it is. It was also a thrilling one for me because of my debut on stage. What are the conclusions? Which presentations inspired me? Read on.

There was stress

In my presentation about ES, what worried me most was whether people would like it and whether everything would suddenly stop working. Luckily, Elasticsearch held up during the demo and everything went like clockwork (well, maybe one slip with Logstash). I am happy with how it went, because it fit the database track nicely, naturally complementing the time-series and geospatial talks. A great experience, I recommend it.

It will get better

You learn all your life. Next time: less text and more pictures. I also noticed with other speakers that numerical evidence (e.g. „ES searches 3 times faster than Postgres") works better than a general statement (e.g. „ES is a fast search engine"). Since Kibana works because of its visualizations, my presentations will also work better with more color and less text. And what did the other speakers talk about? Briefly, my impressions.

The JVM magician

A gentleman in a black gown named Jarek Ratajski talked about the whole history of annotations in the Java world. Once upon a time you wrote Java code and it simply compiled. And now? There is much more here than classes and functions. For a junior developer, entering the world of heavily annotated Spring is hard, and it requires explaining the intricacies of @Every @Annotation @OnTheirPath. Jarek recommends annotations that are hints for the compiler or for JPA; otherwise we get medieval treatises that nobody reads anyway.


This presentation showed that Java will now be released much more often than before. But unfortunately, don't expect that you can use it in every company project: a large share of popular libraries are not compatible yet. Still, Java's modularity looks very attractive. Backed by examples from Tomasz Adamczewski of Idemia.

Geospatial HANA

During Vitaliy's talk I finally learned why every plane flies the long way round. Due to the cylindrical projection of the Earth (the Mercator projection) those routes are distorted, which is why a plane flies along the so-called Great Circle Route. In the presentation I could learn how to create and use the advanced geodesic/geographic capabilities of SAP HANA. A pity it is all SQL and you have to create objects with a Select.

IoT and InfluxDB

The Internet of Things is many machines talking to each other. These machines generate events. Where to store them? That question was answered by Ivan Vaskevytch in his treatise on the Influx database. It is dedicated to fast computations over time-based events. Written in Go, so it runs fast. And it beats my friend ES hands down when it comes to aggregations over time. Nice thing. There was a live demo of Kraków pollution measurements.

Can the server hear you

This presentation was the most fascinating one for me. NLP is not only the neuro-linguistic programming known from psychology; above all, it is natural language processing. As humans we have indeed been given the great gift of understanding words. Teaching it to a machine takes a lot of effort. Machine learning and executing commands is a task we humans handle brilliantly from the day we are born, but a machine is not a human. According to the speaker, Jacek Jagieła, text understanding is the future of data processing, and soon GUIs for controlling e.g. server rooms will disappear. I tried DialogFlow, but honestly it did not go too smoothly for me.


A very interesting conference. A nice choice of several tracks. The only thing missing was cool gadgets in the form of mugs and t-shirts 🙂

AWS Lambda part 4 – running from Android (Thu, 12 Oct 2017)

The power of AWS Lambda becomes clear when you think about what and who can invoke it. You can obviously run it from the AWS Console. But the real value of this solution shows when you can call your existing microservice from a device or by hitting some HTTP endpoint. The last time I used Android, there was no IntelliJ yet and emulators were heavier than my PC could endure.

Let’s start with Android

Android Studio made it easy to start my first Android activity. When I typed: Android project, and chose the first Android SDK that came to mind, a window and a project with many predefined files were created for me. The blank activity was ready to be run and tried on my device! The Gradle-backed app and device deploy scripts work so fast compared to the previous versions.

I played a little with the basic components and prepared easy input/output operations like reading inputs and displaying outputs in Snackbars.

public class MainActivity extends AppCompatActivity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        Button button = (Button) findViewById(;  // resource id assumed

        button.setOnClickListener((View view) ->
                Snackbar.make(view, "You clicked me, Wallet Navigator!", Snackbar.LENGTH_INDEFINITE)
                        .show());
    }
}

It works! So why not connect to Amazon now?

Android SDK

To work with Amazon from Android you use the SDK. Amazon gives you a nice tutorial that can be found there.

The whole variety of factories is available right after declaring the dependencies in your project. Please have a look at the Gradle snippet from the properties of my project:

dependencies {
    compile 'com.amazonaws:aws-android-sdk-core:2.2.+'
    compile 'com.amazonaws:aws-android-sdk-lambda:2.2.+'
}

Basics of Lambda Android

To run the app you have to take care of all the permissions on the Android and AWS sides (mentioned later on). Then you decide which function will be invoked using the @LambdaFunction annotation. The event types are sent using simple POJO objects with getters and setters.

public interface AddOperation {
    @LambdaFunction(functionName = "HelloAws")
    String addOperation(String jsonOperation);
}

Then this function can be used with an invoker like here:

CognitoCachingCredentialsProvider credentialsProvider = new CognitoCachingCredentialsProvider(
        getApplicationContext(),
        "IDENTITY_POOL_ID",     // Cognito identity pool id (placeholder)
        Regions.EU_CENTRAL_1);  // Region
LambdaInvokerFactory factory = new LambdaInvokerFactory(
        getApplicationContext(), Regions.EU_CENTRAL_1, credentialsProvider);
AddOperation myInterface =;

That’s all. Now I will focus on the permissions.


Android permissions

The project can only run when all the policies are satisfied and you are safely guarded by all of the Android and AWS security officials. While developing the sample Android invoker application, the biggest issue was satisfying all of the security requirements.

In an Android app using API 23+ (Android 6+), it is no longer enough to declare all the permissions up front in the manifest file as it was done previously; permissions can / should be requested at runtime. Special permissions are needed because the Amazon SDK uses the Internet.

AWS permissions

To use the Lambda from the outside I had to play with Cognito roles. I was forced to create a federated identity pool allowing unauthenticated access from the outside. Then this Cognito pool id is passed to the invoker factory.

I was so happy when my Lambda was finally executed from within the Android application. The Android instrumented tests were a good help for me; e.g. when a test like the one below is executed on the emulator or your device, you are sure that this part of the flow will work, as the integration is tested here.

public class LambdaTest {

    @Test
    public void lambdaCanBeExecuted() {
        Context appContext = InstrumentationRegistry.getTargetContext();
        LambdaOperationDao dao = new LambdaOperationDao(appContext,
                appContext.getSharedPreferences(MainActivity.PREFS_NAME, 0));
        Operation transaction = new Operation(1.2, "Desc", null);
        OperationResult operationResult = dao.saveOperation(transaction);
        assertTrue(operationResult.getAmount_after() != null);
    }
}



I was happy to create a sample one-activity Android application and invoke the AWS Lambda function with parameters. I can easily record all of my spendings using my cell phone. All of the progress is available in my GitHub project HERE.

AWS Lambda part 3 – IO with Files using S3 (Thu, 21 Sep 2017)

Previously I wrote my first Lambda (part 1) and connected it to the Dynamo DB data store (part 2). Today I am going to use the default file system of AWS – S3. Again, the boto3 library is very handy.

Business case

To keep it simple and treat DynamoDB as a historical table only, I decided to make calculations based on state written to a file. To calculate the current state of the wallet I subtract the operation amount from the last total. And I will keep it in a simple JSON file in my bucket.
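That state update can be sketched in plain Python (the JSON field name `total` is my illustrative assumption, not taken from the original code):

```python
import json

def apply_operation(state_json, amount):
    """Subtract an operation amount from the last total; return the new state JSON."""
    state = json.loads(state_json)
    state["total"] = state["total"] - amount
    return json.dumps(state)

# previous state, as it would be stored as JSON in the S3 bucket
previous = json.dumps({"total": 100.0})
updated = apply_operation(previous, 12.5)
# json.loads(updated)["total"] == 87.5
```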

Reading from S3

Before I get to the basic operations, let me mention that there are no folders in the AWS file system; we have buckets here. These are the basic containers for S3 files. Here is how we can easily read a file from a bucket using a Pythonic Lambda.

Writing to S3

Writing is nothing more complicated than using boto3.resource again. Play with it on your AWS landing page.


Again, we want to keep all of our fun safe and clean. So before running, you will need to set permissions for the user running the Lambda to use this file system. The following policy, AmazonS3FullAccess, is a silver bullet for all S3 operations.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}


Of course this is only the tip of the iceberg when it comes to the S3 file system. It can accept any files, and provides versioning and triggers that can be leveraged for any processing I can imagine. But what about saving a simple JSON file? Works.

AWS Lambda, part 2. Saving data in DynamoDB (Thu, 14 Sep 2017)

Last time I ran a bare Lambda function. It was fun, but no persistence was touched. Now, if we are going to track real expenses, we have to persist data somewhere. One of the solutions made for that is a DynamoDB table.

Preparing table

The following steps will help you repeat the process of creating the table.  For the sake of simplicity I decided to create one table called operation.

The only constant you need to provide is the future ID of the table; I have chosen what follows.

As DynamoDB is a kind of NoSQL database, all other fields will be added at runtime.

Lambda can write to DynamoDB

The main goal is to give the Lambda function the ability to read/write „rows” in the table.  In Python it is as easy as using the boto3 library, which provides the IO operations.  Below you have an example of writing an item:

import uuid
import boto3

def save(new_record):
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('operation')
    # short random id for the new row
    new_record['OperationId'] = str(uuid.uuid4())[0:6]
    table.put_item(Item=new_record)

Because AWS is such a security-aware environment, you must not forget that our function is not yet ready to modify the real table.  If we run it now, we will receive an AccessDeniedException.

What we need to do is assign the correct policies to the user invoking the method.  Here is an example of what I have added. The built-in AmazonDynamoDBFullAccess seems to be enough to provide full access for your role/user.

Testing it

Let us run all of these machines.  Every time I change my code I help myself with the scripts I have prepared: one is used to move the code to the AWS server, the other is used to run the function.  After running it with parameters I see entries in DynamoDB.

Voila, works!


Lambda using Python with the boto3 library allows me to easily deal with AWS persistence – DynamoDB!

]]> 0
My First Lambda – playing with AWS, part 1 Thu, 24 Aug 2017 13:00:49 +0000 Read more My First Lambda – playing with AWS, part 1]]> Reading through the whole internet lately, I encountered the AWS hype.  I remember my first contact with these Amazon services back in 2015, but I somehow skipped it and forgot about my account.  Things changed when I learned that this portal allows you to host not only machines and real Linux systems but also Lambdas.  The code is here, check my Github.

There has always been a small project in my head: to have all my money spending recorded in a nicer way than a Google sheet.  And using some Android money apps was not a solution.  As a dev, I deserve something more!


I am not going to provide here the N-th AWS tutorial; the goal is to record my steps as I start to play with AWS.

Lambda is cool

Why is it cool?  Because Lambda is a microservice.  It means you don’t care where it is running or how.  You pay only for processing time.  If it is a web app, you don’t care about the time spent waiting for user requests.

Write my Lambda!


To start playing with Lambdas you have to have AWS account.  Nothing hard to do.  You will find how to do it in tutorials.

The code

Our first Lambda can be super stupid simple.

def my_handler(event, context):
    print('Someone is invoking me!')
    message = 'Hello {} {}!'.format(event['first_name'],
                                    event['last_name'])
    return {
        'message': message
    }
Create Lambda, CLI and GUI

There are several ways of creating a Lambda.  You can utilize some templates available online.  The easiest way is to use your own role, but the suggested way is to create a separate role for it.


You’d better first get the AWS CLI, using pip for instance – Python is the best language to write Lambdas in anyway:

pip install awscli

Then configure it for your role.  You will need access keys from this page to do it.

Then, knowing your IAM user, you can configure your CLI:

aws configure --profile <your_profile_here>

Create lambda GUI

It is possible to create a Lambda using the GUI, i.e. the Web Console. There you will find some useful tools to execute and test your code.

Create lambda CLI

Having the CLI configured, it is even easier to create a Lambda with the CLI:

aws lambda create-function \
--region eu-central-1 \
--function-name HelloAws \
--zip-file fileb://$2 \
--role $1  \
--handler wallet.my_handler \
--runtime python3.6 \
--timeout 15 \
--memory-size 512

In this github repo I prepared some pre-defined scripts for uploading.  You can treat them as inspiration to make your own „continuous deployment pipeline”.

Move that house!

So it is time for a test!  You can invoke your Lambda, as usual, using the Web console or the nice black-and-white CLI.
For the CLI way you just aws lambda invoke your piece of code:

aws lambda invoke \
--invocation-type RequestResponse \
--function-name HelloAws \
--region eu-central-1 \
--log-type Tail \
--payload '{"first_name":"Lukasz", "last_name":"Kuczyński"}' \
--profile $AWS_USER output.txt

and here is the screen with the results of the previous operation.

Is it working? Prove me!

You can verify that the Lambda has actually run when invoking it from the CLI.  A known good practice is to redirect the output of the function to a file, as shown in the previous example.

For any developer, the way to communicate with your backend program is to read the logs.  It is possible here on the AWS platform, too.  And there is no need to use any specific loggers to accomplish it: in the case of Python apps, Cloudwatch picks up any print statement.

I encountered one issue several times when trying to browse Cloudwatch logs.  If your invoker’s role has no access to Logs (IAM policies), you will not be able to browse them, or no logs are recorded.  So be careful to execute and browse logs with the FULL Logs access policy templates, or to have a policy similar to the following:

    "Version": "2012-10-17",
    "Statement": [
            "Action": [
            "Effect": "Allow",
            "Resource": "*"

or you can attach just one of the following:

This project will be developed here thanks to Github.

Next we are going to read and write to the DB, in part #2.  Stay tuned.

]]> 0
Elasticsearch understands Polish Thu, 10 Aug 2017 07:00:27 +0000 Read more Elasticsearch understands Polish]]> Elasticsearch has an official Polish plugin: Stempel.  I managed to test it and check it out.  And although the Polish analyzer leaves a little to be desired, it does a pretty good job.  The use case?  We index and search an RSS feed with job offers from a well-known Polish portal.

Solution architecture

I want to search RSS entries.  To do that I have to aggregate them somewhere.  The first element of every pipeline is Data Ingestion.  In my case I will start with my little single-board computer.  It will diligently collect entries into a file.  I will then pull that file and index it in Elasticsearch – a brand-name search engine.  And since the entries are in Polish, I will evaluate the Polish analyzer.

Hard-working Raspberry

Why?  Because I did not know Amazon and AWS yet 🙂  Yes, yes, I know, we have buckets in S3 and automatic triggers that wake up sleeping Lambdas.  I know it can be done differently.  But that Raspberry of mine is lying around all dusty, so I think to myself: "Let's give the fellow some work!"

Collecting the data

The data is aggregated in a CSV file.  Just a popular data format.  It is convenient because Python handles such files nicely – writing and reading are dead simple – and using some more advanced libraries you can later analyze the data, for example in Pandas.  The data is read by a simple and handy reader called feedparser.  Here is an excerpt from the code:

import feedparser

def main():
    d = feedparser.parse(url)
    rss_updated = d.feed.updated
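The fetched entries end up in the CSV file mentioned above; here is a sketch of the appending step (the helper function is my own, though the field order matches the DictReader used later in the post):

```python
import csv
import os
import tempfile

def append_entries(path, entries):
    # append each RSS entry as one CSV row: title, link, created_at, text
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        for e in entries:
            writer.writerow([e['title'], e['link'], e['created_at'], e['text']])

path = os.path.join(tempfile.gettempdir(), 'rss_offers.csv')
append_entries(path, [{'title': 'Python dev', 'link': 'http://example.com/1',
                       'created_at': '2017-08-10', 'text': 'praca z Elasticsearch'}])
```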

Then we download the data to the computer.  And index it.  Right – but into what?  Into Elasticsearch.

Elasticsearch and the Polish installation

I have been using Elasticsearch as a Docker image for years (one and a half).  It has never let me down 🙂 Just remember to expose the volumes externally, because otherwise 1) you will lose the saved indices, 2) your image will bloat.  Here is an example of how I run it:

version: '2'

services:
  kibana:
    image: kibana
    ports:
      - 5601:5601
    links:
      - elasticsearch
    environment:
      - ELASTICSEARCH_URL=http://elasticsearch:9200

  elasticsearch:
    image: luk/polski_elasticsearch
    ports:
      - 9200:9200
    volumes:
      - /var/rss_esdata:/usr/share/elasticsearch/data
    environment:
      - bootstrap.memory_lock=true
      - xpack.monitoring.enabled=false
      - xpack.graph.enabled=false
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"
    cap_add:
      - IPC_LOCK
    ulimits:
      memlock:
        soft: -1
        hard: -1
    mem_limit: 8g

"But this is not the official image" – you will object.  Oh right, I forgot.  My Elasticsearch is slightly customized.  I installed the Polish plugin – the Stempel analyzer.  Because we will be indexing Polish texts.

FROM elasticsearch
MAINTAINER luk
RUN elasticsearch-plugin install analysis-stempel

Once you have it installed (you can also run this on your own instance), check whether it works.  I even made a script that checks it:

curl "localhost:9200/_analyze?analyzer=polish&text=buraki"

In response you should receive a token with the value burak.
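The same check can be done from Python with only the standard library (the host assumes the default port mapping from the compose file above; the container must be running for a real call):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def analyze(text, host='http://localhost:9200'):
    # ask the polish analyzer for the tokens of the given text
    qs = urlencode({'analyzer': 'polish', 'text': text})
    with urlopen('{}/_analyze?{}'.format(host, qs)) as resp:
        return [t['token'] for t in json.load(resp)['tokens']]

# analyze('buraki') should return ['burak'] once the Stempel plugin is installed
```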


By default it is convenient to feed Elasticsearch using dedicated solutions like Filebeat or Logstash, and that is how it originally looked here.  Logstash gives you a csv filter, which parses a CSV file into a Python-dict-like document. However, for volumes that are not collected so often, you can use simple Python libraries, as I did in the indexing script:

from csv import DictReader
from elasticsearch import Elasticsearch, helpers

# client pointing at the local Elasticsearch
es = Elasticsearch()

def gen_bulk_actions():
    with open(path) as f:
        reader = DictReader(f, fieldnames=['title','link','created_at','text'])
        for elem in reader:
            yield {
                "_op_type": "index",
                "_index": "infopraca_rss",
                "_type": "offer",
                "_source": dict(elem)
            }

helpers.bulk(es, actions=gen_bulk_actions())


And now an E2E test: let's generate a query that uses the analyzer and asks for the dream job, e.g. one with Python and Elasticsearch (Java is fine too):
POST infopraca_rss/offer/_search
{
    "query": {
        "bool": {
            "must": [
                {"match": {"text": "elasticsearch"}}
            ],
            "should": [
                {"match": {"text": "java"}},
                {"match": {"text": "python"}}
            ]
        }
    },
    "highlight": {
        "fields": {
            "text": {}
        }
    }
}

The result uses scoring mechanisms, which allow the best-matching hits to be picked – relevance is precisely the point of full-text search, as nicely described on the page.


Elasticsearch provides a simple and good analyzer that you can hook up to any data source. Here it was done for an RSS feed. The data is fetched by a Python program running on a RaspberryPi. The "application" code is of course on Github.

]]> 0