Classifying Unstructured Text – Deterministic and Machine Learning Approaches – Christian Winkler & Stephanie Fischer
Text is one of the most widely used forms of communication and is ubiquitous on the Internet. Social networks like Facebook and Twitter consist mainly of unstructured text; the same is true for content-driven websites.
Humans grasp the meaning of text easily; for computers this is much more difficult. Used correctly, however, computers can help humans tremendously in structuring and classifying huge amounts of text. This “symbiosis” can help humans work more efficiently, reduce repetitive work, and make use of the uncovered structure.
Our talk starts with visualizations that give us ideas for how to classify texts automatically. We will then demonstrate that manual intervention is sometimes necessary and how its results can serve as a basis for machine learning. This helps significantly in classifying more complicated cases.
As software tools we use R, Apache Solr, D3.js, and several NLP and ML tools from the Apache Software Foundation (ASF).