445 words about your body, BigData and Grandma’s cake


There are many buzzwords around us. One of them is BigData. There are people who would die in the name of big data; others say: "this is not applicable here, nor in 90% of other cases" or "we have always had databases and no one was in such a rush".

Personally, I (don’t) know what to think about it. My data-mining/processing experience is not enough to be sure, but I strongly believe that big data is something natural to a human being from the very beginning of his/her life.

3 V’s in your life

There was a time when I read about the 3 V’s of BigData. Maybe there are 4 or 5, who knows! What do they mean? Complex stuff, blah blah blah. But wait, isn’t that what each of you – including you, reading this text – actually does every day? We are created in such a way that we have a natural ability to process big data.


Man. Have you ever been to a big city? Walking down the street, there are so many data generators. You read titles on posters, listen to the music. Wait, there is radio news being broadcast; oh, that’s... interesting... Your phone rings and you start to chat with your friend. Information is flowing like a vast river, fed by many torrents.


Your friend is talking to you on the phone; neurons are very busy in your brain. He is telling stories. You start to recall facts. Each thought and every word you hear makes a pathway somewhere under your skull. There are maaaany such connections. Can you remember the taste of your favorite cake made by grandma? I can.


At the same time you, Mr Human, are obliged to draw conclusions and make decisions! No time for being lazy. Walking down the same street in a crowded city, you hear someone yelling: "STOP!" You stop. You were approaching the road and didn’t notice a car driving fast. Someone saved your life. But why were you able to survive, lucky guy? Because out of the data noise you were able to capture the important words and translate them into a command crucial to you. You did it even while having a lot of other data to process at the same time: you were talking to your friend on the phone, looking at the sky, smelling traffic fumes and thinking about what to tell your boss, because you are late.

Just do it

Don’t think it is just a buzzword. Just try to do it. It will increase your scientific value while at the same time adding to the business value your boss cares about. Convince him. I am going to head in the direction of data. Python was my experience in 2016. I started with Spark the same year. Elasticsearch has been here for ages. Just do it.

NLP made easy, Natural Language Processing

You may sometimes underestimate your skills. Understanding text is a complex task.

You are someone!

To analyze text in your mind you have to understand how things are joined together.
Language. While reading the 2nd chapter of Taming Text I felt a little as if I were back at school, attending lessons in my native Polish language. As kids we were told what the parts of speech are. We were told about lexical categories. We were told what an adjective is, what a noun is, etc. But we were NOT told that there are some super-extra automatic tools that can recognize them for us. And that is not all: as humans we can naturally work with bigger chunks of text: phrases. And we know that in the sentence

[code lang=”text”]
Sun shines at people walking by the river.

walking by the river is an important phrase attached to the noun people and has little or no connection with Sun.
Morphology. Words are like trees. Every word has its root. The lexeme, or root form, is the word without its ending. Stemming is a technique that helps you get rid of endings. Why so? Usually, when searching in English, we are interested in the broader scope of a word. A girl looking for a skirt will not feel offended when the search engine provides her with an article about skirtS.
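To make the idea concrete, here is a toy suffix-stripping stemmer in Python. This is only a sketch of the principle: real stemmers such as Porter’s handle far more cases and exceptions.

```python
def simple_stem(word):
    """Toy stemmer: strip a few common English endings.
    A rough sketch of the idea, nothing like a full Porter stemmer."""
    for suffix in ("ing", "es", "s", "ed"):
        # only strip when a reasonably long stem remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# both forms collapse to the same root, so a search for "skirt" also matches "skirts"
print(simple_stem("skirts"))
print(simple_stem("walking"))
```

This is exactly why the girl gets results for skirtS: both the query and the document terms are reduced to the same stem before matching.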

Automation! NLP!

Java is not the only language with these capabilities, but Taming Text uses Java as its default programming language, so I have explored Java libraries. And they are really cool: the Lucene libraries, Apache OpenNLP, Tika.
Just do it!


Tika is the best. The guys who made Tika are great. Using it you can do nearly anything – I mean extracting RAW text from a PowerPoint presentation, PDF documentation, an Excel sheet or a Word document. Apache provides a usable runnable jar here that you just have to run, and voilà. I made myself a batch file that I used to extract documents I needed in the Carrot2 IDE. Here you can find it.

Processing made easy


All the following samples require the Maven dependencies below in a Java Maven project (the artifact version is an assumption; use the one matching your setup):

[code lang=”xml”]
<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.6.0</version>
</dependency>

To process your text, first you make a list of tokens. Tokenization is possible using, for example, the SimpleTokenizer available in the OpenNLP library. Having tokens, the game begins!

[code lang=”java”]
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokenArray = tokenizer.tokenize(text);
List<String> tokens = Arrays.asList(tokenArray);
return tokens;
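For comparison, a rough Python equivalent can be sketched with a simple regex (this is my own approximation, not OpenNLP):

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens,
    roughly imitating a simple rule-based tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Sun shines at people walking by the river."))
```

Note how the final period comes out as its own token – exactly the behavior you want before POS tagging.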

Part of speech assignment (POS)

There is a great tool in OpenNLP that I see big potential in. Why? I think we can draw conclusions about a text from the way it is written: the number of nouns, adjectives and so on. What is more, in effective text mining it could be wise to analyse only some parts of speech.

[code lang=”java”]
List<String> tokens = tokenizer.getTokens(text);
String[] tokenArray = tokens.toArray(new String[0]);
InputStream is = this.getClass().getClassLoader().getResourceAsStream("en-pos-maxent.bin");
POSModel model = new POSModel(is);
POSTaggerME tagger = new POSTaggerME(model);
String[] tags = tagger.tag(tokenArray);
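The "draw conclusions from tag counts" idea can be sketched in Python. The input here is a hypothetical list of Penn-Treebank-style tags, the kind a tagger like the one above emits:

```python
from collections import Counter

def tag_profile(tags):
    """Group fine-grained Penn tags (NN, NNS, JJ, VBZ, ...) into coarse
    classes and count them: a crude stylistic fingerprint of a text."""
    coarse = {"N": "noun", "J": "adjective", "V": "verb"}
    return Counter(coarse.get(t[0], "other") for t in tags)

# tags as a POS tagger might emit them for "Sun shines at people walking by the river"
tags = ["NN", "VBZ", "IN", "NNS", "VBG", "IN", "DT", "NN"]
print(tag_profile(tags))
```

A text with an unusually high adjective ratio reads very differently from a verb-heavy one, and a profile like this makes that measurable.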


To understand a sentence it is important not only to differentiate parts of speech but also to understand the role each word plays in the sentence. A verb describing an action of the subject in a complex clause has a different goal than the main predicate. Again, OpenNLP comes to the rescue.

[code lang=”java”]
InputStream isParser = this.getClass().getClassLoader().getResourceAsStream("en-parser-chunking.bin");
ParserModel model = new ParserModel(isParser);
Parser parser = ParserFactory.create(model);
InputStream isSentence = this.getClass().getClassLoader().getResourceAsStream("en-sent.bin");
SentenceModel sentenceModel = new SentenceModel(isSentence);
SentenceDetector sentenceDetector = new SentenceDetectorME(sentenceModel);
String[] sentences = sentenceDetector.sentDetect(text);

List<ParsedSentence> parsed = new ArrayList<ParsedSentence>();

for (String sentence : sentences) {
    Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
    parsed.add(new ParsedSentence(sentence, Arrays.asList(parses)));
}


Having your text parsed, you can draw conclusions and play with it. The whole test-like project is available on github. Enjoy!

Working in a text mine – Taming

Lately there is one book I have really enjoyed reading: Taming Text. Why do I think the topic is worth mentioning? We have never had a time like now. Information is all around us. Data is cheap and we can find it everywhere. Do you want to mold your data and convert it into information? Up to you. How to automate it? What if we could tame, subdue text... That is where data/text mining comes in.

For whom

At the beginning I was reading about the beast called text. Who works with text? You are a scientist – that’s you. A journalist? Of course. A student reading books, preparing for exams? Yes. A developer or IT worker who wants to improve her/his skills? Voilà. Want to analyze your system’s behavior? Go for it. Everyone who asks questions and looks for answers – that is you.

You need it

There are a few facts & questions that make me realize I need it too. How many times have I looked for information in the company’s documentation; how many documents and mails do I receive, and how much of it am I able to read and comprehend? It is true: the world had produced around 1.8 zettabytes of data by 2011. Imagine how big an effort is required to find useful information in it. IDC research in 2009 gave the following result:

The average time spent on finding information in a middle-sized IT-related company is around 9 hours a week

We could multiply these numbers by the number of people and, considering the money they earn, spot a perfect room for improvement. These costs can be reduced.
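A quick back-of-envelope calculation makes the point. The head-count, hourly rate and working weeks below are made-up numbers; only the 9 hours/week comes from the study above:

```python
# assumptions (hypothetical, except the 9 hours/week from the IDC figure)
hours_per_week = 9
employees = 50
hourly_rate_eur = 30
weeks_per_year = 48

yearly_cost = hours_per_week * employees * hourly_rate_eur * weeks_per_year
print(f"search time costs about {yearly_cost} EUR per year")
```

Even with modest assumptions, shaving a fraction off that search time pays for a lot of tooling.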


Understanding text is hard. It is unique to you, Mr Human, that you are able to read and comprehend text, e.g. this one... From an informatics perspective you have complex abilities to process textual data: analyzing words, stopwords, understanding syntax, fighting with inflection (in some languages), getting what the subject of a sentence is and what the goal of the author is. Implementing that automatically requires effort. Effort and work that can only be done using some tools.


Of course you can hardly find a programmer who doesn’t have some tool for any IT-related problem. The book I mentioned at the beginning also tells about some tools helping to tame the text: Solr, Mahout, OpenNLP. But I would like to humbly mention my favorite tools, not mentioned in the book: elastic Elasticsearch (based on the same engine as Solr) and fast Spark. Languages? Python, Java. Maybe R? One thing is most important: use the right tools for the right tasks.

Be elastic. Search.

Why am I elasticsearch’ing

What does it mean to do IT? What is information? Information is not data. It is data that has been crunched and digested, so we can learn something from it. Do you have some text PDFs or Word documents that make you sick every time you try to search them? No more! Behold the tool.

the tool

Elasticsearch is my hobby, and not only as part of the ELK stack. It can easily be made to work with any data. This data is in your files. Underneath there is information.


Here in this repo you will find a doc searcher. You can run it locally. Make your documents searchable. How does it work?

Step #1 Run engine – Elastic

Note: from Elasticsearch version 5 there is an embedded ingest mechanism, which I have not tested yet; the following uses the older plugin approach.

Elasticsearch will accept all docs. There is a great library written in Java for attachments – Tika. Based on it, the guys made the fancy mapper-attachments plugin that can be found here. Follow the instructions matching your Elasticsearch version to install it using the good old bin\plugin install. After a successful plugin installation you should have a list of at least one plugin:

[code lang=”bash”]
>> bin\plugin.bat list
Installed plugins in D:\p\elasticsearch-2.4.0\plugins:
– mapper-attachments

Step #2 Index docs – Python

You could paste all of the document contents manually using Sense or Postman, but that would be a little troublesome. That’s why automation is good. So I wrote a Python script that can do it. Python 3 is needed. You just run the Python app and by default it indexes all contents of the folder files_to_index into your local ES instance. The Python library for that is very cool; you just type the following and it opens up a connection, happy and ready for any query.

[code lang=”python”]
from elasticsearch import Elasticsearch
HOST = 'http://localhost:9200'
es = Elasticsearch([HOST])
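The indexing itself needs a running ES instance, but building the payload for Elasticsearch’s bulk API can be sketched as a pure function (the index and type names below are my own assumptions, not taken from the repo):

```python
import json

def bulk_index_payload(docs, index="docs", doc_type="doc"):
    """Build the newline-delimited JSON body for Elasticsearch's _bulk
    endpoint: an action line followed by the document source, per document."""
    lines = []
    for i, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": i}}))
        lines.append(json.dumps(doc))
    # the bulk body must end with a trailing newline
    return "\n".join(lines) + "\n"

payload = bulk_index_payload([{"file": "a.pdf", "content": "hello"}])
# es.bulk(body=payload) would then send everything in one request
print(payload)
```

Batching documents this way is much faster than one HTTP request per file.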

Step #3 Browse your docs – Browser

Let us show the indexed docs! I have made a little JS browser. You can also check how it looks with scoring for each document. Just run web/view.html and type anything into the input on this ultra-simple HTML page.

right tool for the right task

If you only have a hammer, you tend to see every problem as a nail.

Elasticsearch is not a database but a search engine. It doesn’t come without cost – indexes with words from all documents can take some space. But whenever you encounter an issue with search, please consider a search tool.

CodeEurope – a subjective report from Warsaw

Foreword

As I write this, CodeEurope Warsaw 2016 is already history. But since not much time has passed (hours, in fact), I will write something while it is fresh.



An early hour. The Wrocław roads were as empty as ever. That made getting to the station much easier. Train departure at 5:30. Full luxury, as always on Polish railways: warm air-conditioned carriages, coffee on board, a laptop with a "long battery" and a power socket under the seat – the EIC train is the enabler of my dev growth!

First impressions and expectations

The National Stadium in Warsaw is not just a week-long ice rink: on the 2nd, conference, floor the whole retinue of the small Polish IT world had set up camp, with all the gentlemen and ladies nicely advertising their companies, which in the end turn out to be the same sad corporations or immature startups as the previous entry in your CV 😛
Never mind them; the first talk, which I am late for, is about to start, because the perfect train turns out not to be so perfect after all: despite the "express" in its name it is 15 minutes late. Out of breath, I rush into the 1st talk. A particularly anticipated one. I was most curious about the musings of the colleague from Amazon/Google on search engines, and about the famous Crockford holding forth on the dark sides of JS.


Let me describe in a few words what I remember from some of the talks.

Searching and scraping – O’Neill

In the fascinating So, how do Google, Bing and Yahoo work? Mr O’Neill described how search engines work. The indexes these search-engine folks operate on are really BIIG data. I was not aware how much work is actually done by people, who manually rate and categorize texts. Machine learning sometimes needs the human factor. I don’t know if you realize it, but you support search engines too: with every set of words you teach the mechanism that these words are related, and thanks to that your successor will have it easier.

Do not underestimate the power of Search

After the talk I approached Allen to ask why he had not mentioned my hobby-horse: Elasticsearch. It turns out it is used there too, for browsing historical data. The talk ended with very interesting tips on building a web scraper yourself. The only thing I missed were the legal questions that arise around privacy policies and the terms of use of websites. All in all, I am now motivated to finish my "scrape and index" project (into ES, of course).

Java 8 – Nihil Novi

I expected a bit more from Jakub Marchwicki’s 12 things about Java 8, but then I may have gone in with overly high expectations. Still, I liked the phrase

Java is a grown-up language now

because since streams appeared we no longer have to tell it how, only what is to be done, and it already knows how to handle it. It is nice that we can now focus more on business tasks – Jakub showed this well with an example of analyzing a CSV file.

The new JS is no longer that JS (this JS) – ECMAScript 6

The overview of the most important features of the new JS was a foretaste of the fancy dinner (Crockford’s talk) that everyone was already waiting for. In FutureJs, Jakub Waliński presented a few tidbits of the new (and/or) better JavaScript, or rather ECMAScript. I have particular hopes for getting rid of the eternal var problem in favour of let and const, which will once and for all solve the problem of vars behaving strangely – from a Java programmer’s point of view. We also no longer have to worry about the very ambiguous this, now that we have arrow functions. Time to make peace with Babel and stop writing in the oh-so-ancient ES5.

A brave Czech from Cambridge quickly analyses big data from Stack Overflow

When a woman gives a talk I always expect something nice. Some variety. A soft, human look at programming. But I did not expect this: the at first glance unassuming Evelina Gabasova pressed us all into our seats, analysing Stack Overflow data with F# before our very eyes. It was the most fluent and most interesting talk I have seen so far, ever. By the way, it is really impressive that in F# you can analyse data from many data sources with such ease.

What father cannot do and mother does not know, an F# template will do

In literally a few lines Evelina analysed HTML and JSON before our eyes, running mathematical operations and matching on them, resulting in advanced statistics. We learned, among other things, which languages programmers use "after hours", which concepts (tags) are related to each other, and which aggregations of concepts can be found on Stack. And it really worked; all the relations she found were perfectly justified. And as if that weren’t enough, she imported functions from R because that still wasn’t enough for her, and compiled it all to JS to make pretty visualisations. Pure professionalism plus GREAT delivery: fluent and very interestingly told. Something to model oneself on. Recommended.

Doug’s black-and-white truth – no hope for JS

Douglas Crockford, known from many books on JavaScript, was the cherry on top of this dev-day. He started with a charming introduction to the charming rule of a certain Japanese lady, KonMari, who recommends decluttering on a grand scale. You should keep at home only those things for which you can answer yes to the key question

Does it spark joy?

If it does not produce positive feelings in you: into the trash. And that is exactly how Mr Douglas bluntly treated the following topics:

  • why we need the tab character in ASCII (a leftover from typewriters)
  • var vs let and const
  • null and undefined (the creator of the idea of NULL in Java still has nightmares to this day)
  • functions that return different results on successive calls (contrary to mathematics)
  • generators from ES6
  • try catch finally as an attempt to hide GOTO
  • the optional {} in if
  • JS mathematics (0/0, 0*NaN, 0.1+0.2 != 0.3)
  • the unfortunate naming of throw

Overall he would like to write the language from scratch, but he thinks it is good that ES6 came along. It fixes things a little. But not much.
The whole presentation, in complete darkness with black-and-white Matrix slides, made an impression. We left like Neo.


Booting Spring with Spring Boot

What the following is not

This is not a copy of any tutorial. This is just my feeling about why I see Spring Boot as a tool that every Web/Java developer has to know about – at the very least that it exists.

Easy springing!

Have you ever had such a feeling when starting a new Spring project?

How was I supposed to start…
What do I need to start all this stuff

Thinking of all the WebApplicationInitializers and all of Spring’s stuff all around can make it hard when just a little project is to be started. With Spring Boot you have everything packed in one box. Just start and look! There is even a Tomcat engine inside that can run all this stuff, a full MVC web app. Just one big fat jar, a runnable one.

Fast start

How to start? You have tons of tutorials. But basically this is how I start...


Personally I like to initialize my app with an initializer, or rather Initializr. Do you have a better one? Please leave a comment.

Code and be happy

To be honest, coding a controller is nothing more than creating one @Controller class with a @GetMapping, and voilà. All clients’ requests can now be served by your small WebAppServer. Does that sound like a microservice to you? I bet you are right.

Test before! or after, or .. anytime

Whether you believe in TDD or not, testing is important. Sometimes you just want to integration-test something. And you’ll do it better with lots of Spring test tools, @SpringBootTest for instance...

Dockerize that!

Having one fat jar, it is easy to start it anywhere. Especially when I recall some little projects: using Docker, it is much easier to build a Docker image with Java only than one with Tomcat and deploy into it. Of course that is possible too, but there are more Java-ready images on Docker Hub than Tomcat images.
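A minimal Dockerfile for such a fat jar could look like the sketch below (the base-image tag and the jar name are assumptions; use whatever your Maven build actually produces):

```dockerfile
# assumed Java 8 base image; pick the tag that matches your build
FROM openjdk:8-jre-alpine
# the jar name is hypothetical - copy your own Spring Boot fat jar here
COPY target/myapp-0.0.1-SNAPSHOT.jar /app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "/app.jar"]
```

That is the whole deployment: no application server to configure, because the server lives inside the jar.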

Make dev happy

Have you ever ‚sys-outed’ all the beans of your project? Or wondered how to reach all the request mappings and list them? ‚Sysouting’ never again. Now everything is just under one project – the actuator.

Lots of hots, where are NOTs?

This is not a silver bullet. I would think twice before putting it up as the gateway of a big production project – of course many will argue with that. But talking about a smaller project? Building Spring-like proofs of concept? The best choice, I think.

Trello for project management and brainstorming


Managing project

Are you developing a new project for yourself and don’t want to use JIRA for it? I felt some mess when I was making a tool just for me and there was no ticket-like system I could use.

Don’t waste it

Or maybe during your day you encounter some great ideas that need many work-days you do not currently have? Don’t waste that. Write it down. Somewhere in time there may be a need to do something with it.

Short story about one project

I had to do some project for myself. It was some Google API, PDF-printing stuff with a little business logic inside. I needed to do some rewriting and fixing on a monthly basis. But every time I approached the project, man... I didn’t really know what I should do and which tasks were most important. Why? It was written down nowhere. I have noticed that, in my case, if something is not written down then my brain tries to pretend that... the problem doesn’t exist 😀 After writing all the tasks to be done into some categories (card lists) in Trello, I categorized the tasks to do. There are some tasks waiting in the backlog, there is a list of bugs that need to be fixed, some spikes that need investigation. And if I have some time I can always take care of the nice-to-have stuff. After getting something done I archive it, taking great pleasure as it is removed from my board!

Why I have several idea boards

As I am a guy interested in several areas of IT, I know there may come a time when I will need an idea. To move on. To evolve. But tell me, where do all these shining ideas come from? Out of nothing. When? Often unexpectedly. Often when there is no time to do anything, because you are at work, jogging, or in the kitchen doing the dishes. But the idea can be cool and should be saved. So I have several boards that save my ideas. About what?
* IoT: how to use my RaspPi
* Elastic: where can I use it even more
* web ideas: what uService ideas could bring millions of dollars if I start to implement them
Start having your personalized Trello boards, or „any other supplier” boards, to implement your life right now.

Internet radio on Orange PI

With any microcomputer you have many options. What about Internet radio? Basically I chose node.js and a Debian music player, mplayer. How about me presenting just a few glimpses here? Maybe you would like to copy that? No problem. Just follow the steps here.


I assume you have a PC with Internet and an OrangePi or similar Debian-powered microcomputer. Disclaimer: I have not tested this on computers other than the OrangePiPc, but I suppose there will be no problem with that.

Node and express way of web app

We want to have a web application quickly. A cool way of making it fast is using express and its generator, which creates an app in a flash. A few commands and we have a web app running, proudly speaking to us at localhost:3000: „Welcome to express”. The rest of the app, the part made by me, can be found on my github.

Prepare Mediaplayer

Our node radio app will communicate via a fifo file. This fifo file has to be created, and mplayer will use that file to accept commands.
[code lang=”bash”]
apt-get install mplayer
Then make the fifo file and run mplayer. Mplayer will use this fifo as the communication channel.
[code lang=”bash”]
mkfifo /tmp/mplayer_fifo
mplayer -slave -quiet -input file=/tmp/mplayer_fifo -idle
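Commands are then just lines written into that fifo. A Python sketch of the sending side could look like this (loadfile and pause are real mplayer slave-mode commands; the stream URL is only an example, and with a real fifo mplayer must already be reading from it):

```python
def send_command(fifo_path, command):
    """Append one slave-mode command line to mplayer's fifo (or any file)."""
    with open(fifo_path, "a") as fifo:
        fifo.write(command + "\n")

# e.g. send_command("/tmp/mplayer_fifo", "loadfile http://example.com/stream.mp3")
# e.g. send_command("/tmp/mplayer_fifo", "pause")
```

The node app in the repo does essentially the same thing: it turns HTTP requests into lines written to the fifo.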

Radio app

On my super-microcomputer I had to install nodejs to get it all running.
[code lang=”bash”]
apt-get update
curl -sL https://deb.nodesource.com/setup_4.x | sudo -E bash -
sudo apt-get install -y nodejs
Here we are; you can test whether you can node any *.js file. Or not, if you don’t want to. Then you can download my project and run it.
[code lang=”bash”]
git clone https://github.com/pishon/radiogaga.git
cd radiogaga
npm install
npm start
Then go to localhost:3000 and you should see a web interface where you can send commands to your radio.


It’s troublesome to always do this manually. So there is a bash script prepared that turns on the radio, makes the fifo file and starts the node app. Having that script, place it in /etc/init.d.


No sound, man!

If you have problems with sound on the OrangePi(PC), ensure the right card is selected as default; this setting can be found in
[code lang=”bash”]
nano /etc/asound.conf
Your sound can also be muted; then go to alsamixer and ensure your LineOut is not muted – usually it is marked as „MM” at the bottom of the sound bar. After changing the settings you have to save them.
[code lang=”bash”]
sudo alsactl store



First contact with Apache Spark

How was my first contact with Apache Spark? To be honest, it was not a piece of cake.
There were several issues. First, after downloading Spark you have to manually download Scala and set up the Scala environment variables to make it live.
Second, I had an issue with JVM memory; after sbt package and setting _JAVA_OPTIONS to some -Xmx value it was solved and Spark started. But later the JVM memory error returned, so I just changed the JVM from jdk7u79 to jdk8 and... no more tears.

Sample use-case

count the words

I like learning by example. So we have a simple use case: count the words on a webpage. To start with, I just copied the data from the page to a text file. What to do? First make your first RDD,
load it via

[code lang=”scala”]
val dlines = sc.textFile("C://path_to_file//site.txt")

check whether it has any data and print the first 10 lines

[code lang=”scala”]
dlines.take(10).foreach(println)

Now we may want to split the words in each line by spaces

[code lang=”scala”]
val word_arrays = dlines.map(ln=>ln.split(" "))

Having this strange array of arrays, let’s flatten it, making one array of all the words

[code lang=”scala”]
val words = word_arrays.collect().flatMap(y=>y)
val dwords = sc.parallelize(words)

With an RDD made up of words it is possible to have fun. To count the words we pair each word with the number one. After that we accumulate it all with the magic reduceByKey.

[code lang=”scala”]
val words_nr = dwords.map( s=>(s,1))
val counts = words_nr.reduceByKey((a, b) => a + b)

We have all the counts, but there is a mess. Let us sort and see which words are most popular on the site.

[code lang=”scala”]
scala> val sorted = counts.sortBy(k=>k._2, false)
scala> sorted.take(20).foreach(println)
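For comparison, the same map / reduceByKey / sort pipeline boils down to a few lines of plain Python (a sketch, no Spark needed; Spark earns its keep when the data no longer fits on one machine):

```python
from collections import defaultdict

def word_counts(lines):
    """split -> flatten -> (word, 1) -> reduce by key -> sort by count,
    mirroring the Spark pipeline above."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split(" "):
            counts[word] += 1  # the reduceByKey((a, b) => a + b) step
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(word_counts(["big data is big", "data is fun"])[:3])
```

Seeing the two side by side makes it clear that the RDD operations are just the familiar map/flatten/group steps, distributed.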


Further nice things

Of course, what is here is not everything; you can also nicely communicate with databases or use Hadoop-dedicated Parquet files. There is a rich number of add-ons, among which MLlib, used in machine learning, is worth mentioning.