NLP made easy, Natural Language Processing

You may sometimes underestimate your skills. Understanding text is a complex task.

You are someone!

To analyze text in your mind you have to understand how things are joined together.
Language. While reading 2nd chapter of Taming text I felt little as I were back on my school, attending language lessons of Polish langauage (my native). As kids we were told what are parts of speech. We were told about lexical categories. We were told what is adjective, what is noun, etc. But we were told NOT that there are some super-extra automatic tools that can recognize it for us. But that is not all, as humans we can naturally work with bigger chunks of text: phrases. And we know that in sentence

[code lang=”text”]
Sun shines at people walking by the river.

walking by the river is important phrase for noun people and has no or little connection with Sun.
Morphology. Words are like trees. Every word have its root. Lexeme or root form is word without end-ing. Stemming is technique helping you to get rid of endings. Why so? Usually when searching in English we are interested in broader scope of word. Girl looking for skirt will not feel offended when search engine provide her article about skirtS.

Automation! NLP!

Java is not the only language having these possibilities. But we have java as TextTaming default programming language so I have explored Java libraries. And these are really cool. Lucene libraries, Apache OpenNLP, Tika.
Just do it!


Tika is the best. Guys who made Tika are great. Using that you can really do anything. I mean extracting RAW text from PowerPoint presentation, PDF documentation, Excel sheet and Word Document. Apache provides you with usable runnable jar here that you have just to run and voila. I have made for myself batch that I used to extract documents that I needed to use in Carrot2 IDE. Here you can find it.

Processing made easy


All following samples require included maven dependencies in java maven project

[code lang=”xml”]

To process your text first you make list of tokens. Tokenization is possible using such StandardTokenizer as available in OpenNLP library. Having tokens the game begins!

[code lang=”java”]
Tokenizer tokenizer = SimpleTokenizer.
String[] tokenArray = tokenizer.tokenize(text);
List<String> tokens = Arrays.asList(tokenArray);
return tokens;

Part of speech assignment (POS)

There is great tool in OpenNLP I can see big potential in. Why? I think we can draw conclusions about text from the way it is written, number of Nouns, Adjectives and so. What is more in effective text mining it could be wise to analyse only some part of speech only?

[code lang=”java”]
List<String> tokens = tokenizer.getTokens(text);
String[] tokenArray = (String[]) tokens.toArray();
InputStream is = this.getClass().getClassLoader().getResourceAsStream("en-pos-maxent.bin");
POSModel model = new POSModel(is);
POSTaggerME tagger = new POSTaggerME(model);
String[] tags = tagger.tag(tokenArray);


To understand sentence it is important not only to differentiate part of speech but also understand what is role of words in sentence. There is different goal of verb describing action of subject in complex sequence and main predicate. Again OpenNLP comes with the rescue.

[code lang=”java”]
InputStream isParser = this.getClass().getClassLoader().getResourceAsStream("en-parser-chunking.bin");
ParserModel model = new ParserModel(isParser);
Parser parser = ParserFactory.create(model);
InputStream isSentence = this.getClass().getClassLoader().getResourceAsStream("en-sent.bin");
SentenceModel sentenceModel = new SentenceModel(isSentence);
SentenceDetector sentenceDetector = new SentenceDetectorME(sentenceModel);
String[] sentences = sentenceDetector.sentDetect(text);

List<ParsedSentence> parsed = new ArrayList<ParsedSentence>();

for (String sentence : sentences) {
Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
parsed.add(new ParsedSentence(sentence, Arrays.asList(parses)));


Having your text parsed you can draw conclusions, play with it. Whole testlike project available at github. Enjoy!

  • Maciej

    Hi! Your example is short and sweet but it lacks conclusions about what can be done next. It would be great if you could provide some examples of how those parsed sentences can be used in target applications.