DajSiePoznać Skaucie

Goal | Aim

Dear diary. What I miss in many apps is that they cannot be changed: there is no way to write plugins for them. I want to build a notification system. With plugins. It will be a scout. It searches while you drink coffee, eat Chinese takeout and wait. Cool, right? And I will be writing in Polish.


445 words about your body, BigData and Grandma’s cake

Buzzwords

There are many buzzwords around us. One of them is BigData. There are people who would die in the name of BigData; others say: „this is not applicable here, nor in 90% of other cases” or „we have always had databases and no one was in such a rush”.

Personally I (don’t) know what to think about it. My data-mining and data-processing experience is not enough to be sure, but I strongly believe that big data is something natural to human beings from the very beginning of their lives.

3 V’s in your life

Some time ago I read about the 3 V’s of BigData. Maybe there are 4 or 5, who knows! What do they mean? Complex stuff, blah blah blah. But wait, isn’t this what each one of us, including you reading this text, actually does every day? We are built in such a way that we have a natural ability to process big data.

Variety

Man, have you ever been to a big city? Walking down the street you pass so many data generators. You read the titles on posters, you listen to music. Wait, there is a news broadcast on the radio, oh, that’s.. interesting.. Your phone rings and you start chatting with your friend. Information flows like a vast river fed by many torrents.

Volume

Your friend is talking to you on the phone and the neurons in your brain are very busy. He is telling stories. You start to recall facts. Each thought and every word you hear makes a pathway somewhere under your skull. There are maaaany such connections. Can you remember the taste of your favorite cake made by grandma? I can.

Velocity

At the same time you, Mr. Human, are obliged to draw conclusions and make decisions! No time for being lazy. Walking down the same street in the crowded city you hear someone yelling: „STOP”! You stop. You were approaching the road and didn’t notice a car driving fast. Someone saved your life. But why were you able to survive, lucky guy? Because out of the data noise you were able to capture the important word and translate it into a command that was crucial to you. You did it even with a lot of other data to process at the same time: you were talking to a friend on the phone, looking at the sky, smelling the traffic fumes and thinking about what to tell your boss because you are late.

Just do it

Don’t think of it as a buzzword. Just try it. It will increase your scientific value while serving the business value your boss cares about. Convince him. I am going to head in the direction of data. Python was my experience in 2016. I started with Spark in the same year. Elasticsearch has been around for ages. Just do it.

NLP made easy, Natural Language Processing

You may sometimes underestimate your skills. Understanding text is a complex task.

You are someone!

To analyze text in your mind you have to understand how things are joined together.
Language. While reading the 2nd chapter of Taming Text I felt a little as if I were back at school, attending Polish language lessons (Polish is my native language). As kids we were told what the parts of speech are. We were told about lexical categories. We were told what an adjective is, what a noun is, etc. But we were NOT told that there are some super-extra automatic tools that can recognize them for us. And that is not all: as humans we naturally work with bigger chunks of text, phrases. And we know that in the sentence

[code lang=”text”]
Sun shines at people walking by the river.
[/code]

walking by the river is an important phrase for the noun people and has little or no connection with Sun.
Morphology. Words are like trees. Every word has its root. The lexeme, or root form, is the word without its ending. Stemming is a technique that helps you get rid of endings. Why? Usually when searching in English we are interested in the broader scope of a word. A girl looking for a skirt will not feel offended when a search engine gives her an article about skirtS.
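To make the stemming idea concrete, here is a minimal plain-Java sketch. It is not the Porter stemmer or any library implementation: the class name `NaiveStemmer` and the short suffix list are made up for illustration, and a real stemmer applies many more rules.

```java
import java.util.Arrays;
import java.util.List;

// Naive suffix stripper, for illustration only.
// Real stemmers (for example Porter's) use far more rules than this.
class NaiveStemmer {

    // Hypothetical, hand-picked suffix list; longer suffixes are checked first.
    private static final List<String> SUFFIXES = Arrays.asList("ing", "ed", "es", "s");

    static String stem(String word) {
        String lower = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            // Keep at least three characters so short words like "is" stay untouched.
            if (lower.endsWith(suffix) && stemLength >= 3) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        for (String word : Arrays.asList("skirt", "skirts", "Skirts", "walking", "is")) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

With this sketch both "skirt" and "skirts" end up as the same index term, which is exactly why the search for one can return articles about the other.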

Automation! NLP!

Java is not the only language with these capabilities. But Taming Text uses Java as its default programming language, so I explored the Java libraries. And they are really cool: Lucene, Apache OpenNLP, Tika.
Just do it!

Extract

Tika is the best. The guys who made Tika are great. With it you can really do anything. I mean extracting RAW text from a PowerPoint presentation, PDF documentation, an Excel sheet or a Word document. Apache provides a ready-to-use runnable jar here that you just have to run, and voilà. I made myself a batch script that I used to extract the documents I needed in the Carrot2 IDE. Here you can find it.

Processing made easy

Tokenize

All the following samples require this Maven dependency in your Java Maven project

[code lang=”xml”]
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.7.1</version>
</dependency>
[/code]

To process your text you first make a list of tokens. Tokenization is possible with, for example, the SimpleTokenizer available in the OpenNLP library. Once you have tokens the game begins!

[code lang=”java”]
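// Needs opennlp.tools.tokenize.Tokenizer, opennlp.tools.tokenize.SimpleTokenizer, java.util.Arrays, java.util.List
// SimpleTokenizer splits on character classes and needs no model file.
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokenArray = tokenizer.tokenize(text);
List<String> tokens = Arrays.asList(tokenArray);
return tokens;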
[/code]
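If you want to see what tokenization boils down to without pulling in any library, here is a rough plain-Java approximation. It is a sketch under my own assumptions, not OpenNLP's implementation: the class name `RegexTokenizerSketch` and the regular expression are made up, and the results will differ from a library tokenizer in corner cases such as repeated punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java approximation of character-class tokenization (illustration only).
class RegexTokenizerSketch {

    // A token is a run of letters, a run of digits, or one other non-space character.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line; the final period becomes a token of its own.
        for (String token : tokenize("Sun shines at people walking by the river.")) {
            System.out.println(token);
        }
    }
}
```

Note that punctuation is kept as separate tokens instead of being thrown away; later steps such as sentence detection and tagging need it.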

Part of speech assignment (POS)

There is a great tool in OpenNLP that I see big potential in. Why? I think we can draw conclusions about a text from the way it is written: the number of nouns, adjectives and so on. What is more, in effective text mining it could be wise to analyse only some parts of speech.

[code lang=”java”]
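// tokenizer is the wrapper around the tokenizing code from the previous step
List<String> tokens = tokenizer.getTokens(text);
// toArray() without an argument returns Object[], so casting it to String[] fails at runtime
String[] tokenArray = tokens.toArray(new String[0]);
// The pre-trained English model has to be on the classpath
InputStream is = this.getClass().getClassLoader().getResourceAsStream("en-pos-maxent.bin");
POSModel model = new POSModel(is);
POSTaggerME tagger = new POSTaggerME(model);
// tags[i] is the part-of-speech tag of tokenArray[i]
String[] tags = tagger.tag(tokenArray);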
[/code]
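Once tokens and tags sit in two parallel arrays, counting parts of speech or keeping only some of them takes a few lines of plain Java. The sketch below assumes Penn Treebank-style tags (NN, NNS, VBZ and so on); the tags in `main` are written by hand for illustration and are not actual tagger output, and the class name `PosFilterSketch` is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Works on parallel token/tag arrays such as those returned by a POS tagger.
class PosFilterSketch {

    // Counts how often each tag occurs, in order of first appearance.
    static Map<String, Integer> countTags(String[] tags) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    // Keeps tokens whose tag starts with the prefix, e.g. "NN" covers NN, NNS, NNP and NNPS.
    static List<String> keepByTagPrefix(String[] tokens, String[] tags, String prefix) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith(prefix)) {
                kept.add(tokens[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = {"Sun", "shines", "at", "people", "walking", "by", "the", "river", "."};
        // Hand-written tags for illustration, not real tagger output.
        String[] tags = {"NNP", "VBZ", "IN", "NNS", "VBG", "IN", "DT", "NN", "."};
        System.out.println(countTags(tags));
        System.out.println(keepByTagPrefix(tokens, tags, "NN"));
    }
}
```

Filtering by tag prefix is a cheap way to feed only nouns, or only verbs, into whatever text mining comes next.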

Parsing

To understand a sentence it is important not only to tell the parts of speech apart but also to understand the role each word plays in the sentence. A verb describing the action of a subject inside a complex clause serves a different purpose than the main predicate. Again OpenNLP comes to the rescue.

[code lang=”java”]
// Load the pre-trained English parser model from the classpath
InputStream isParser = this.getClass().getClassLoader().getResourceAsStream("en-parser-chunking.bin");
ParserModel model = new ParserModel(isParser);
Parser parser = ParserFactory.create(model);

// Split the text into sentences first; the parser works on one sentence at a time
InputStream isSentence = this.getClass().getClassLoader().getResourceAsStream("en-sent.bin");
SentenceModel sentenceModel = new SentenceModel(isSentence);
SentenceDetector sentenceDetector = new SentenceDetectorME(sentenceModel);
String[] sentences = sentenceDetector.sentDetect(text);

// ParsedSentence is not part of OpenNLP; it is assumed to be a simple holder class from this project
List<ParsedSentence> parsed = new ArrayList<ParsedSentence>();

for (String sentence : sentences) {
    // ParserTool lives in opennlp.tools.cmdline.parser; 1 = keep only the best parse
    Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
    parsed.add(new ParsedSentence(sentence, Arrays.asList(parses)));
}
[/code]

Summary

With your text parsed you can draw conclusions and play with it. The whole test-like project is available on GitHub. Enjoy!