NLP made easy, Natural Language Processing

You may sometimes underestimate your skills. Understanding text is a complex task.

You are someone!

To analyze text in your mind you have to understand how things are joined together.
Language. While reading the 2nd chapter of Taming Text I felt a little as if I were back at school, attending Polish lessons (Polish is my native language). As kids we were taught the parts of speech and the lexical categories: what an adjective is, what a noun is, and so on. But we were NOT told that there are automatic tools that can recognize them for us. And that is not all: as humans we naturally work with bigger chunks of text, namely phrases. We know that in the sentence

[code lang=”text”]
Sun shines at people walking by the river.
[/code]

"walking by the river" is an important phrase for the noun "people" and has little or no connection with "Sun".
Morphology. Words are like trees: every word has its root. The lexeme, or root form, is the word without its ending. Stemming is a technique that helps you get rid of endings. Why do that? Usually when searching in English we are interested in the broader scope of a word. A girl looking for a skirt will not feel offended when the search engine gives her an article about skirtS.
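To make the skirt/skirtS idea concrete, here is a toy suffix stripper in Python. It is only a sketch: the suffix list and the names `naive_stem` and `matches` are made up for this example, and a real stemmer (for instance the Porter algorithm) handles far more cases.

```python
# Toy suffix stripper: an illustration only, far cruder than a real stemmer.
SUFFIXES = ("ing", "es", "s")  # hypothetical, deliberately tiny suffix list


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def matches(query: str, document_word: str) -> bool:
    """Two words match when they reduce to the same stem."""
    return naive_stem(query) == naive_stem(document_word)


print(naive_stem("skirts"))        # skirt
print(matches("skirt", "skirtS"))  # True
```

A search engine would apply the same reduction to both the query and the indexed words, which is why the article about skirtS is found.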

Automation! NLP!

Java is not the only language with these capabilities. But Taming Text uses Java as its default programming language, so I explored the Java libraries. And they are really cool: the Lucene libraries, Apache OpenNLP, Tika.

Just do it!

Tika is the best, and the guys who made it are great. With it you can extract RAW text from almost anything: a PowerPoint presentation, PDF documentation, an Excel sheet or a Word document. Apache provides a runnable jar here that you just have to run, and voilà. I made myself a batch script that I used to extract the documents I needed in the Carrot2 IDE. Here you can find it.

Processing made easy


All the following samples require these Maven dependencies in your Java Maven project:

[code lang=”xml”]

To process your text, first turn it into a list of tokens. Tokenization can be done with a tokenizer such as SimpleTokenizer from the OpenNLP library. Once you have tokens, the game begins!

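[code lang=”java”]
import java.util.*;
import opennlp.tools.tokenize.*;

// SimpleTokenizer is rule-based, so no model file is needed
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokenArray = tokenizer.tokenize(text);
List<String> tokens = Arrays.asList(tokenArray);
return tokens;
[/code]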
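To get a feel for what a tokenizer returns without setting up a Java project, here is a rough Python sketch. It is not OpenNLP's algorithm, just a regular expression that separates runs of word characters from punctuation; the name `simple_tokenize` is made up for this example.

```python
import re
from typing import List

# Runs of word characters, or single non-space punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text: str) -> List[str]:
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Sun shines at people walking by the river.")
print(tokens)
# ['Sun', 'shines', 'at', 'people', 'walking', 'by', 'the', 'river', '.']
```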

Part of speech assignment (POS)

There is a great tool in OpenNLP that I see big potential in. Why? I think we can draw conclusions about a text from the way it is written: the number of nouns, adjectives and so on. What is more, for effective text mining it could be wise to analyse only some parts of speech.

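[code lang=”java”]
import java.io.InputStream;
import java.util.List;
import opennlp.tools.postag.*;

List<String> tokens = tokenizer.getTokens(text);
// toArray(new String[0]) yields a String[]; a plain toArray() returns Object[]
String[] tokenArray = tokens.toArray(new String[0]);
// the model file must be on the classpath; loading it may throw IOException
InputStream is = this.getClass().getClassLoader().getResourceAsStream("en-pos-maxent.bin");
POSModel model = new POSModel(is);
POSTaggerME tagger = new POSTaggerME(model);
String[] tags = tagger.tag(tokenArray); // tags[i] belongs to tokenArray[i]
[/code]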
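Once you have two parallel arrays like `tokenArray` and `tags`, counting parts of speech or keeping only some of them is a few lines of work. The Python sketch below assumes Penn Treebank-style tags; the tags are assigned by hand for illustration, so a real tagger's output may differ.

```python
from collections import Counter

# Hand-assigned Penn Treebank-style tags, for illustration only.
tokens = ["Sun", "shines", "at", "people", "walking", "by", "the", "river", "."]
tags = ["NNP", "VBZ", "IN", "NNS", "VBG", "IN", "DT", "NN", "."]

# How often does each part of speech occur?
tag_counts = Counter(tags)

# Keep only the nouns (Penn Treebank noun tags all start with "NN").
nouns = [token for token, tag in zip(tokens, tags) if tag.startswith("NN")]

print(tag_counts.most_common(1))  # [('IN', 2)]
print(nouns)                      # ['Sun', 'people', 'river']
```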


To understand a sentence it is important not only to tell the parts of speech apart but also to understand the role each word plays in the sentence. A verb describing the action of a subject inside a longer sequence plays a different role than the main predicate. Again OpenNLP comes to the rescue.

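[code lang=”java”]
import java.io.InputStream;
import java.util.*;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.*;
import opennlp.tools.sentdetect.*;

// load the parser and sentence-detector models from the classpath
InputStream isParser = this.getClass().getClassLoader().getResourceAsStream("en-parser-chunking.bin");
ParserModel model = new ParserModel(isParser);
Parser parser = ParserFactory.create(model);
InputStream isSentence = this.getClass().getClassLoader().getResourceAsStream("en-sent.bin");
SentenceModel sentenceModel = new SentenceModel(isSentence);
SentenceDetector sentenceDetector = new SentenceDetectorME(sentenceModel);
String[] sentences = sentenceDetector.sentDetect(text);

// ParsedSentence is not an OpenNLP class; presumably a holder class from this project
List<ParsedSentence> parsed = new ArrayList<ParsedSentence>();

for (String sentence : sentences) {
    // ask for the single best parse of each sentence
    Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
    parsed.add(new ParsedSentence(sentence, Arrays.asList(parses)));
}
[/code]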

Once your text is parsed you can draw conclusions and play with it. The whole test project is available on GitHub. Enjoy!

Why should a Javer know some Python?

I am a Javer: during work time I use Java syntax along with the whole Java micro- and macro-world. But for some reason I started to learn Python. The reason? There are two people around me who use it (at work and unofficially), and to cooperate with them I had to learn it.
While making a little Python project I noticed that there are some things I love in Python compared to Java, and some I miss.

Why I miss Java when writing Python

Static typing

As a dev I learned to type statically. I was static in Delphi, I am static in Java. I prefer to be static. I feel safer that way.

Virtual methods

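Writing a piece of code that follows OOP rules and design patterns may not be easy in Python. There is no `virtual` or `abstract` keyword and no compiler to check that a subclass implements what it should, so you will not write virtual-function-like code with the safety you are used to in Java. The Strategy pattern can be hard.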
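One possible way to get close to Java-style virtual methods is the standard `abc` module. The sketch below is only an illustration with made-up class names (`DiscountStrategy`, `NoDiscount`, `HalfPrice`); note that the check happens when an object is created, not at compile time.

```python
from abc import ABC, abstractmethod


class DiscountStrategy(ABC):
    """Hypothetical strategy interface, comparable to a Java interface."""

    @abstractmethod
    def apply(self, price: float) -> float:
        """Return the price after the discount."""


class NoDiscount(DiscountStrategy):
    def apply(self, price: float) -> float:
        return price


class HalfPrice(DiscountStrategy):
    def apply(self, price: float) -> float:
        return price / 2


def checkout(price: float, strategy: DiscountStrategy) -> float:
    # The call is dispatched at runtime to whichever subclass was passed in.
    return strategy.apply(price)


print(checkout(100.0, NoDiscount()))  # 100.0
print(checkout(100.0, HalfPrice()))   # 50.0

# Unlike javac, Python complains only when the incomplete class is instantiated.
try:
    DiscountStrategy()
except TypeError as error:
    print("cannot instantiate:", error)
```

The type hint on `strategy` is documentation only; Python does not enforce it at runtime, which is exactly the safety net a Java developer may miss.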

What I love in Python

Lack of static typing

Yes, the same reason why I prefer Java. But think about it: you just write. Your variable is exactly what it holds. You easily bring all your ideas to life. You just type list=[] and that is a list! You type a = {'a':1} and you have a map (in Pythonish: 'dict'). Life is easier, but don't forget about the consequences!
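A small sketch of both sides of that coin. The variable names are made up for the example; the point is that nothing stops a name from being rebound to another type, and the mistake shows up only when the offending line runs.

```python
# Names are not bound to a type, so rebinding to another type is allowed.
items = []           # a list
items.append(1)
mapping = {"a": 1}   # a dict, Python's counterpart of a Java Map

items = "oops"       # now a str; nothing complains at this point

# The consequence: the mistake surfaces only when the line actually runs.
try:
    items.append(2)
except AttributeError as error:
    print("runtime failure:", error)
```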

Ease with daily tasks

For tasks like JSON, file IO, networking and other everyday chores, Python is simply less verbose. Any examples? Here we go.

JSON made simple

No more thinking about how to materialize your ideas as a file. You can easily turn a dict or a list of objects into JSON, or save it directly to a file:

[code lang=”python”]
import json

with open(output, 'w', encoding='utf-8') as f:
    json.dump(list_of_objects, f, ensure_ascii=False, indent=4)
[/code]
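For a version you can run as-is, here is a minimal round trip. The sample data and the temporary file path are assumptions made for this example; `json.dumps` gives you a string, `json.dump` writes to a file, and `json.load` reads it back.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for list_of_objects.
list_of_objects = [{"city": "Łódź", "tags": ["nlp", "python"]}]

# json.dumps returns a string; ensure_ascii=False keeps non-ASCII text readable.
as_text = json.dumps(list_of_objects, ensure_ascii=False)

# json.dump writes straight to an open file instead.
output = os.path.join(tempfile.mkdtemp(), "objects.json")
with open(output, "w", encoding="utf-8") as f:
    json.dump(list_of_objects, f, ensure_ascii=False, indent=4)

# Reading the file back gives equal Python objects.
with open(output, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == list_of_objects)  # True
```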

A file is something you open

Nothing is easier than working with a file. What do you need to read or write one? Just open it:

[code lang=”python”]
# 'a' appends; pass 'r' to read or 'w' to overwrite
f = open(path, 'a')
[/code]

When opening, just pass the mode (append/read/write) and act!
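The three modes side by side, as a small sketch. The temporary file path is an assumption for the demo; the `with` block is used so that each file is closed automatically.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # throwaway file for the demo

with open(path, "w", encoding="utf-8") as f:   # 'w': create or overwrite
    f.write("first line\n")

with open(path, "a", encoding="utf-8") as f:   # 'a': append at the end
    f.write("second line\n")

with open(path, "r", encoding="utf-8") as f:   # 'r': read (the default mode)
    lines = f.read().splitlines()

print(lines)  # ['first line', 'second line']
```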


No more being forced to think about the immortal IOException (Python has no checked exceptions). Networking is just a matter of making a request, like this:

[code lang=”python”]
import requests
from lxml import html  # cssselect() also needs the cssselect package

page = requests.get(self.url)
page_tree = html.fromstring(page.content)
element_nodes = page_tree.cssselect(self.element_parent)
[/code]

Above you just GET the URL, take the response content, and convert it into DOM elements that are easy to traverse. Poetry.

No superheroes

I am not trying to say that Python rules. What I want to remember is that knowing only one language keeps you in a tight spot; a wider perspective is much more attractive. There are no super-languages, so just take what is good and use it in your project. Before that, take some time to learn at least the foundations of the languages available.