Text Mining: An Example in R

Word clouds are more than just fancy images – sometimes…

Do you sometimes come across word clouds and think, “Strategy consultants really have too much time”? Often that is probably true, and word clouds are just put together at random by dedicated followers of fads and fashion, but they can also be based on statistics. In that case, the technique behind them is called text mining: the word clouds are created from the frequency with which words appear in a given text, or in a collection of texts.

I applied text mining (using R) to my recent research paper on inflation forecasting to illustrate how this works — the resulting word cloud looks quite nice:

 

[Image: word cloud, “Text Mining in R”]

 

A brief overview of text mining tools in R

A word cloud like this is relatively easy to create using the tm package in R, which is designed for text mining tasks. A few things matter if you want to create your own word cloud from a text of yours:

  • Ideally you have the text in plain text format. If you only have PDF files (or HTML code), it also works, but it is a bit more tedious. For PDF, you can transform the PDF file to plain text using R, or manually, but if you work with a collection of texts rather than just one, an automated procedure is crucial. If you use the linked method, it is necessary to have Rtools installed and to add it to the system path variable (both of these things you should do anyway).
  • You don’t want your word cloud crowded with insignificant words like “do” or “and”. This can be avoided using so-called stop words, which are removed in the data preprocessing step. You should also remove punctuation and numbers. Luckily, the package comes with functions for these things: removePunctuation, removeNumbers, and removeWords. Personally, I don’t find the stop list provided through the package extensive enough and rather work with the ones provided here. Another useful preprocessing step is transforming the entire text to lower case.
  • The final preprocessing step is stemming: words with the same root should count as one word. For example, if your text contains “walk”, “walked”, and “walking”, you wouldn’t want them to be counted separately. This can be done with the function stemDocument from tm, which relies on the package SnowballC for the actual stemming. It requires some adjustments to lead to good results, but is a great start.
  • Once your text is preprocessed, you can compute a frequency table of all the words used and let the package wordcloud (and the function of the same name) create your word cloud for you.
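
The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact script behind my word cloud: the two sample sentences are made up, the built-in English stop list of tm stands in for the more extensive list I prefer, and the packages tm, SnowballC, and wordcloud are assumed to be installed.

```r
# Sketch only: assumes tm, SnowballC and wordcloud are installed.
library(tm)
library(wordcloud)

# Made-up sample text standing in for the real document. For a plain text
# file you could use readLines() on your own file instead.
text <- c(
  "Inflation forecasts improved in 2012, and forecasting models were compared.",
  "The forecast errors of the models are walking, walked, walk: 3 forms."
)

corpus <- VCorpus(VectorSource(text))

# Preprocessing: lower case, punctuation, numbers, stop words, whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Stemming comes last, so that stop words are matched in their full form.
corpus <- tm_map(corpus, stemDocument)  # uses SnowballC under the hood

# Frequency table: sum each term's counts over all documents.
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Draw the word cloud, most frequent words in the centre.
wordcloud(names(freq), freq, min.freq = 1, random.order = FALSE)
```

After stemming, “walk”, “walked”, and “walking” are counted as the single term “walk”. To use a custom stop list, pass your own character vector to removeWords in place of stopwords("english").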

That’s it for today — things get a bit more interesting when you do text mining using a collection of texts rather than just a single one, in which case you can emphasise interesting features using hierarchical clustering and often work with sparse frequency matrices. I am currently working on an application for this using a large set of texts collected via web scraping in R and will soon post about it.
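
To give a rough idea of what that looks like, here is a minimal sketch under stated assumptions: the four mini-documents are invented for illustration, and the sparsity threshold and the number of clusters are arbitrary choices, not values from my application.

```r
# Sketch only: assumes the tm package is installed.
library(tm)

# Invented mini-collection: two texts on forecasting, two on text mining.
docs <- c(
  "inflation forecast inflation model",
  "inflation forecast central bank",
  "word cloud text mining",
  "text mining word frequency"
)

corpus <- VCorpus(VectorSource(docs))

# Document-term matrix, stored in a sparse format by tm.
dtm <- DocumentTermMatrix(corpus)

# Drop terms that are absent from more than 70% of the documents
# (here: terms that occur in only one of the four texts).
dtm <- removeSparseTerms(dtm, sparse = 0.7)

# Hierarchical clustering on Euclidean distances between documents.
hc     <- hclust(dist(as.matrix(dtm)))
groups <- cutree(hc, k = 2)  # k = 2 is an arbitrary choice for this example
```

Calling plot(hc) draws the dendrogram. With real collections the dense as.matrix() conversion can become large, which is one reason to prune sparse terms first.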

If you experience any problems with text mining in R, or have any other questions about the techniques, leave a comment!
