Supervised machine learning for text analysis in R

Hvitfeldt, Emil

Supervised machine learning for text analysis in R / Emil Hvitfeldt and Julia Silge. - Boca Raton : CRC Press, 2022. - xix, 381 p.

I Natural Language Features

1. Language and modeling

Linguistics for text analysis

A glimpse into one area: morphology

Different languages

Other ways text can vary

Summary

2. Tokenization

What is a token?

Types of tokens

Character tokens

Word tokens

Tokenizing by n-grams

Lines, sentence, and paragraph tokens

Where does tokenization break down?

Building your own tokenizer

Tokenize to characters, only keeping letters

Allow for hyphenated words

Wrapping it in a function

Tokenization for non-Latin alphabets

Tokenization benchmark

Summary

3. Stop words

Using premade stop word lists

Stop word removal in R

Creating your own stop words list

All stop word lists are context-specific

What happens when you remove stop words

Stop words in languages other than English

Summary

4. Stemming

How to stem text in R

Should you use stemming at all?

Understand a stemming algorithm

Handling punctuation when stemming

Compare some stemming options

Lemmatization and stemming

Stemming and stop words

Summary

5. Word Embeddings

Motivating embeddings for sparse, high-dimensional data

Understand word embeddings by finding them yourself

Exploring CFPB word embeddings

Use pre-trained word embeddings

Fairness and word embeddings

Using word embeddings in the real world

Summary

II Machine Learning Methods

6. Regression

A first regression model

Building our first regression model

Evaluation

Compare to the null model

Compare to a random forest model

Case study: removing stop words

Case study: varying n-grams

Case study: lemmatization

Case study: feature hashing

Text normalization

What evaluation metrics are appropriate?

The full game: regression

Preprocess the data

Specify the model

Tune the model

Evaluate the modeling

Summary

7. Classification

A first classification model

Building our first classification model

Evaluation

Compare to the null model

Compare to a lasso classification model

Tuning lasso hyperparameters

Case study: sparse encoding

Two class or multiclass?

Case study: including non-text data

Case study: data censoring

Case study: custom features

Detect credit cards

Calculate percentage censoring

Detect monetary amounts

What evaluation metrics are appropriate?

The full game: classification

Feature selection

Specify the model

Evaluate the modeling

Summary

III Deep Learning Methods

8. Dense neural networks

Kickstarter data

A first deep learning model

Preprocessing for deep learning

One-hot sequence embedding of text

Simple flattened dense network

Evaluation

Using bag-of-words features

Using pre-trained word embeddings

Cross-validation for deep learning models

Compare and evaluate DNN models

Limitations of deep learning

Summary

9. Long short-term memory (LSTM) networks

A first LSTM model

Building an LSTM

Evaluation

Compare to a recurrent neural network

Case study: bidirectional LSTM

Case study: stacking LSTM layers

Case study: padding

Case study: training a regression model

Case study: vocabulary size

The full game: LSTM

Preprocess the data

Specify the model

Summary

10. Convolutional neural networks

What are CNNs?

Kernel

Kernel size

A first CNN model

Case study: adding more layers

Case study: byte pair encoding

Case study: explainability with LIME

Case study: hyperparameter search

The full game: CNN

Preprocess the data

Specify the model

Summary

IV Conclusion

Text models in the real world

Appendix

A Regular expressions

Literal characters

Meta characters

Full stop, the wildcard

Character classes

Shorthand character classes

Quantifiers

Anchors

Additional resources

B Data

Hans Christian Andersen fairy tales

Opinions of the Supreme Court of the United States

Consumer Financial Protection Bureau (CFPB) complaints

Kickstarter campaign blurbs

C Baseline linear classifier

Read in the data

Split into test/train and create resampling folds

Recipe for data preprocessing

Lasso regularized classification model

A model workflow

Tune the workflow

ISBN: 9780367554194


Subject: Computational linguistics - Statistical methods

Call number: 006.35 / HVI
