Text Mining with R

This is my first time participating in a reading group. Initially there are only three of us in the group. This is a trial with a view to running a reading group within the Johannesburg R user groups.

I initially proposed we do one chapter a week. On each due date we’ll send a brief email on what we thought of the chapter. Proposed dates:

#  Chapter name                                            Complete chapter by
1  The tidy text format                                    01-Feb-19
2  Sentiment analysis with tidy data                       08-Feb-19
3  Analyzing word and document frequency: tf-idf           15-Feb-19
4  Relationships between words: n-grams and correlations   22-Feb-19
5  Converting to and from non-tidy formats                 01-Mar-19
6  Topic modeling                                          08-Mar-19
7  Case study: comparing Twitter archives                  15-Mar-19
8  Case study: mining NASA metadata                        22-Mar-19
9  Case study: analyzing usenet text                       29-Mar-19

My summary:

Preface

Julia Silge and David Robinson created the tidytext package in 2016. They found that using tidy data principles could make text mining tasks easier and more effective. Treating text as data frames allows us to manipulate, summarise and visualise characteristics of the text. This book is an introduction to tidytext, with relatively simple examples; what matters is the range of possible applications. The book also works through real-world text mining problems.

Outline

The book is divided into 3 sections:

  • Chapters 1 to 4: introduction to the tidy text format and the tidyverse tools that allow informative analysis of text structure.
  • Chapters 5 and 6: we won’t always have tidy text; these chapters cover converting between tidy and non-tidy formats.
  • Chapters 7 to 9: several case studies that apply the tidy text approaches covered earlier.

Topics this book does not cover

This book is an introduction to the tidy text mining framework. More detail on NLP (Natural Language Processing) in R can be found in the CRAN Task View on Natural Language Processing.

  • Clustering, classification, and prediction: techniques for applying machine learning to text. Only topic modelling is covered in this book.
  • Word embedding: text analysis that maps words to vectors, used to examine linguistic relationships between words and to classify text; widely applied in machine learning algorithms. Not covered in this book.
  • More complex tokenization: many other ways of tokenising exist for specific applications. tidytext uses the tokenizers package, which itself wraps a variety of tokenizers with a consistent interface.
  • Languages other than English: some people have used tidytext with other languages. Not covered here.

About this book

The focus is on code, not equations. No prior text mining experience is needed, but the reader is assumed to be familiar with dplyr, ggplot2 and the %>% (pipe) operator in R. If you don’t have this background, R for Data Science is recommended.
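
For anyone new to the pipe, a minimal illustration using the built-in mtcars data (this example is mine, not from the book):

library(dplyr)

# %>% passes the left-hand result in as the first argument of the
# right-hand function, so the steps read top to bottom
mtcars %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg))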

Using code examples

In the interest of space, source code for similar visualisations is not repeated. The full source code used to generate the book can be found in its GitHub repo.

Acknowledgements

Technical Reviewers:

References

My takeaways

What have you learned

  • This book is an introduction to tidytext.
  • Some familiarity with the tidyverse is assumed. If you don’t have this background R for Data Science is recommended.
  • The NLP (Natural Language Processing) field is large; this book covers only a small portion. NLP has many applications in machine learning.
  • This book and all the R packages used are built by the community.

How does this apply to your life

  • I’ve followed a number of this book’s contributors / technical reviewers on Twitter.

1 The tidy text format

Tidy data has a specific structure:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

We define tidy text as a table with one-token-per-row. For tidy text mining, the token is most often a single word, but it can also be an n-gram, sentence or paragraph. The tidytext package provides functionality to tokenise by these common units of text.
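
As a quick sketch of how those different units are specified (the token argument values come from the tidytext documentation; the sample text here is made up):

library(dplyr)
library(tidytext)

sample_df <- tibble(text = "Some sample text. Another sentence here.")

# single words (the default token)
sample_df %>% unnest_tokens(word, text)

# two-word n-grams
sample_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

# whole sentences
sample_df %>% unnest_tokens(sentence, text, token = "sentences")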

By keeping data in a tidy format, users can easily transition between tidyverse packages.

The tidytext package doesn’t expect the user to keep text data in a tidy form at all stages of an analysis. This allows, for example, a workflow where the data is imported, filtered and processed using the tidyverse, then converted into a document-term matrix for machine learning. The model output can then be converted back into a tidy form to be interpreted and visualised with ggplot2.
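
A minimal sketch of that round trip (the toy counts are made up; cast_dtm() needs the tm package installed, and the LDA step mentioned in the comment would come from the topicmodels package):

library(dplyr)
library(tidytext)

tidy_words <- tibble(document = c("a", "a", "b", "b"),
                     word     = c("cat", "dog", "dog", "fish"))

# tidy one-token-per-row data -> document-term matrix for modelling
dtm <- tidy_words %>%
  count(document, word) %>%
  cast_dtm(document, word, n)

# and back: tidy() converts a DTM (or a fitted model, e.g.
# topicmodels::LDA(dtm, k = 2)) into a tidy table for ggplot2
tidy(dtm)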

1.1 Contrasting tidy text with other data structures

Different ways text is often stored in text mining approaches (a small sketch follows the list):

  • String: text stored as plain character vectors
  • Corpus: raw strings annotated with additional metadata and details
  • Document-term matrix: a sparse matrix describing a collection (i.e. corpus) of documents, with one row for each document and one column for each term. The value in the matrix is typically a word count or tf-idf.
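
A small sketch contrasting the three structures, assuming the tm package (tidytext itself is not needed here):

library(tm)

# string: a plain character vector
text <- c("some random", "strings")

# corpus: the same strings wrapped with metadata
corpus <- VCorpus(VectorSource(text))
meta(corpus[[1]])

# document-term matrix: one row per document, one column per term
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)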

1.2 The unnest_tokens function

# string: a plain character vector
text <- c("some random",
          "strings")
text

# put the character vector into a data frame (tibble)
library(dplyr)
text_df <- tibble(line = 1:2, text = text)
text_df

A tibble is a modern class of data frame within R, available in the dplyr and tibble packages. For a tidy text analysis we need to convert the tibble so that it has one-token-per-document-per-row.

A token is a meaningful unit of text, most often a word. Tokenisation is the process of splitting text into tokens.

library(tidytext)

# split each line of text into one word per row; unnest_tokens
# lowercases and strips punctuation by default
text_df %>% unnest_tokens(word, text)

1.3 Tidying the works of Jane Austen

1.4 The gutenbergr package

1.5 Word frequencies

1.6 Summary

My takeaways

What have you learned

How does this apply to your life

2 Sentiment analysis with tidy data

2.1 The sentiments dataset

2.2 Sentiment analysis with inner join

2.3 Comparing the three sentiment dictionaries

2.4 Most common positive and negative words

2.5 Wordclouds

2.6 Looking at units beyond just words

2.7 Summary

3 Analyzing word and document frequency: tf-idf

3.1 Term frequency in Jane Austen’s novels

3.2 Zipf’s law

3.3 The bind_tf_idf function

3.4 A corpus of physics texts

3.5 Summary

4 Relationships between words: n-grams and correlations

4.1 Tokenizing by n-gram

4.2 Counting and correlating pairs of words with the widyr package

4.3 Summary

5 Converting to and from non-tidy formats

5.1 Tidying a document-term matrix

5.2 Casting tidy text data into a matrix

5.3 Tidying corpus objects with metadata

5.4 Summary

6 Topic modeling

6.1 Latent Dirichlet allocation

6.2 Example: the great library heist

6.3 Alternative LDA implementations

6.4 Summary

7 Case study: comparing Twitter archives

7.1 Getting the data and distribution of tweets

7.2 Word frequencies

7.3 Comparing word usage

7.4 Changes in word use

7.5 Favorites and retweets

7.6 Summary

8 Case study: mining NASA metadata

8.1 How data is organized at NASA

8.2 Word co-ocurrences and correlations

8.3 Calculating tf-idf for the description fields

8.4 Topic modeling

8.5 Summary

9 Case study: analyzing usenet text

9.1 Pre-processing

9.2 Words in newsgroups

9.3 Sentiment analysis

9.4 Summary

10 References