Lately there is one book I really enjoy reading: Taming Text. Why do I think the topic is worth mentioning? We have never lived in times like these before. Information is all around us. Data is cheap and we can find it everywhere. Do you want to mold your data and convert it into information? That is up to you. But how do you automate it? What if we could tame, subdue text? That is where data/text mining comes in.
At the beginning I was reading about the beast called text. Who works with text? Are you a scientist? That's you. A journalist? Of course. A student reading books and preparing for exams? Yes. A developer or IT worker who wants to improve her/his skills? Voilà. Do you want to analyze your system's behavior? Go for it. Anyone who asks questions and looks for answers: this is you.
You need it
There are a few facts and questions that made me realize I need it too. How many times have I looked for information in my company's documentation? How many documents and mails do I receive, and how much of that am I able to read and comprehend? It is true: the world had produced around 1.8 zettabytes of data by 2011. Imagine how much effort is required to find useful information in it. IDC research from 2009 gave the following result:
The average time spent on finding information in a mid-size IT-related company is around 9 hours a week
We could multiply this number by the number of people and, considering the money they earn, spot perfect room for improvement. These costs can be reduced.
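To make that multiplication concrete, here is a minimal sketch in Python. Only the 9 hours a week comes from the figure quoted above; the function name, the headcount and the hourly cost are made-up values for illustration, so treat the result as a back-of-the-envelope estimate and nothing more.

```python
# Hours per week spent looking for information (the IDC figure quoted above).
HOURS_SEARCHING_PER_WEEK = 9


def weekly_search_cost(employees, hourly_cost):
    """Estimate what a company pays per week for time spent searching."""
    return HOURS_SEARCHING_PER_WEEK * employees * hourly_cost


# Hypothetical company: 100 people, each costing 30 currency units per hour.
example_cost = weekly_search_cost(employees=100, hourly_cost=30.0)
print(example_cost)
```

Plug in your own headcount and rates to see how quickly the number grows.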
Understanding text is hard. It is unique to you, Mr. Human, that you are able to read and comprehend text, e.g. this one… From a computer science perspective you have complex abilities to process textual data: analyzing words and stopwords, understanding syntax, coping with inflection (in some languages), and working out what the subject of a sentence is and what the author's goal is. Implementing that automatically requires effort, and that work can only be done with the help of some tools.
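Just to give a feeling of how much hides behind "analyzing words and stopwords", here is a rough sketch in plain Python. It is not taken from the book: the stopword list is a tiny hand-made one, and `naive_stem` is a hypothetical helper that only strips a plural "s", a very crude stand-in for real handling of inflection.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip a trailing plural 's' (a crude stand-in for real stemming)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def term_counts(text):
    """Count the remaining terms after tokenizing, filtering and stemming."""
    tokens = remove_stopwords(tokenize(text))
    return Counter(naive_stem(t) for t in tokens)


counts = term_counts("The tools tame the text, and the text is a beast.")
print(counts.most_common())
```

Even this toy version already has to make decisions (what is a word, which words to ignore, how to merge word forms), and it only works for simple English input. That is exactly why proper tools are worth using.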
Of course you can hardly find a programmer who doesn't have some tool for every IT-related problem. The book I mentioned at the beginning also covers some tools that help to tame text: Solr, Mahout, OpenNLP. But I would like to humbly mention my favorite tools, not mentioned in the book: the elastic Elasticsearch (based on the same engine as Solr) and the fast Spark. Languages? Python, Java. Maybe R? One thing matters most: use the right tools for the right tasks.