
Supervised machine learning for text analysis in R

By: Hvitfeldt, Emil
Contributor(s): Silge, Julia
Material type: Text
Publication details: Boca Raton : CRC Press, 2022
Description: xix, 381 p.
ISBN:
  • 9780367554194
Subject(s):
DDC classification:
  • 006.35 HVI
Summary: A guide to building supervised machine learning models for text data in R, organized in four parts: natural language features (language and modeling, tokenization, stop words, stemming, and word embeddings); machine learning methods (regression and classification, with case studies on preprocessing choices and evaluation); deep learning methods (dense, long short-term memory, and convolutional neural networks); and a conclusion on text models in the real world. Appendices cover regular expressions, the data sets used throughout the book, and a baseline linear classifier. The full table of contents follows.

I Natural Language Features

1. Language and modeling

Linguistics for text analysis

A glimpse into one area: morphology

Different languages

Other ways text can vary

Summary

2. Tokenization

What is a token?

Types of tokens

Character tokens

Word tokens

Tokenizing by n-grams

Lines, sentence, and paragraph tokens

Where does tokenization break down?

Building your own tokenizer

Tokenize to characters, only keeping letters

Allow for hyphenated words

Wrapping it in a function

Tokenization for non-Latin alphabets

Tokenization benchmark

Summary
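
As an illustration of the tokenization units listed above (words, characters, n-grams), a minimal sketch using the tokenizers package; this is not the book's own code, and the example sentence is arbitrary:

    library(tokenizers)

    text <- "Far down in the forest grew a pretty little fir tree."

    # Word tokens: the most common unit for text models
    tokenize_words(text)

    # Character tokens, keeping only letters
    tokenize_characters(text, strip_non_alphanum = TRUE)

    # Bigrams: n-grams of two consecutive words
    tokenize_ngrams(text, n = 2)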

3. Stop words

Using premade stop word lists

Stop word removal in R

Creating your own stop words list

All stop word lists are context-specific

What happens when you remove stop words

Stop words in languages other than English

Summary
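
A hedged sketch of stop word removal with a premade list, assuming the stopwords and tokenizers packages (the book's own examples may differ):

    library(tokenizers)
    library(stopwords)

    # The Snowball list is one of several premade English stop word lists
    snowball <- stopwords::stopwords("en", source = "snowball")

    tokens <- tokenize_words("the fir tree wanted so much to grow tall")[[1]]

    # Keep only the tokens that are not on the stop word list
    tokens[!tokens %in% snowball]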

4. Stemming

How to stem text in R

Should you use stemming at all?

Understand a stemming algorithm

Handling punctuation when stemming

Compare some stemming options

Lemmatization and stemming

Stemming and stop words

Summary
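
A minimal sketch of stemming in R with the SnowballC package (Porter stemmer); the word list is an arbitrary illustration:

    library(SnowballC)

    words <- c("trees", "running", "jumped", "organization")

    # The Porter stemmer collapses related word forms to a shared stem,
    # which is not always a dictionary word
    wordStem(words)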

5. Word Embeddings

Motivating embeddings for sparse, high-dimensional data

Understand word embeddings by finding them yourself

Exploring CFPB word embeddings

Use pre-trained word embeddings

Fairness and word embeddings

Using word embeddings in the real world

Summary
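
A sketch of loading pre-trained word embeddings, assuming the textdata package (which prompts to download the GloVe vectors on first use); illustrative only, not the book's code:

    library(textdata)
    library(dplyr)

    # 100-dimensional GloVe vectors trained on 6 billion tokens
    glove6b <- embedding_glove6b(dimensions = 100)

    # Each row pairs a token with a dense numeric vector (columns d1 ... d100)
    glove6b %>% filter(token == "complaint")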

II Machine Learning Methods

6. Regression

A first regression model

Building our first regression model

Evaluation

Compare to the null model

Compare to a random forest model

Case study: removing stop words

Case study: varying n-grams

Case study: lemmatization

Case study: feature hashing

Text normalization

What evaluation metrics are appropriate?

The full game: regression

Preprocess the data

Specify the model

Tune the model

Evaluate the modeling

Summary
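
To illustrate the kind of tidymodels workflow this chapter builds, a hedged sketch of a regularized linear regression on text features; `opinions` is a hypothetical data frame with a `text` column and a numeric `year` outcome (the book models the year of US Supreme Court opinions):

    library(tidymodels)
    library(textrecipes)

    # Preprocess: tokenize, keep the 1,000 most frequent tokens, weight by tf-idf
    rec <- recipe(year ~ text, data = opinions) %>%
      step_tokenize(text) %>%
      step_tokenfilter(text, max_tokens = 1000) %>%
      step_tfidf(text)

    # A lasso (L1-regularized) linear regression via glmnet
    lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) %>%
      set_engine("glmnet")

    wf <- workflow() %>%
      add_recipe(rec) %>%
      add_model(lasso_spec)

    lasso_fit <- fit(wf, data = opinions)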

7. Classification

A first classification model

Building our first classification model

Evaluation

Compare to the null model

Compare to a lasso classification model

Tuning lasso hyperparameters

Case study: sparse encoding

Two class or multiclass?

Case study: including non-text data

Case study: data censoring

Case study: custom features

Detect credit cards

Calculate percentage censoring

Detect monetary amounts

What evaluation metrics are appropriate?

The full game: classification

Feature selection

Specify the model

Evaluate the modeling

Summary
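
A similar sketch for classification, with the lasso penalty tuned by resampling; `complaints` is a hypothetical data frame with a `text` column and a two-level factor `product` (the book classifies CFPB consumer complaints):

    library(tidymodels)
    library(textrecipes)

    set.seed(1234)
    split <- initial_split(complaints, strata = product)
    complaints_train <- training(split)
    folds <- vfold_cv(complaints_train, strata = product)

    rec <- recipe(product ~ text, data = complaints_train) %>%
      step_tokenize(text) %>%
      step_tokenfilter(text, max_tokens = 500) %>%
      step_tfidf(text)

    lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
      set_engine("glmnet")

    wf <- workflow() %>% add_recipe(rec) %>% add_model(lasso_spec)

    # Tune the regularization penalty across a grid, resampled on the folds
    res <- tune_grid(wf, folds, grid = grid_regular(penalty(), levels = 20))
    show_best(res, metric = "roc_auc")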

III Deep Learning Methods

8. Dense neural networks

Kickstarter data

A first deep learning model

Preprocessing for deep learning

One-hot sequence embedding of text

Simple flattened dense network

Evaluation

Using bag-of-words features

Using pre-trained word embeddings

Cross-validation for deep learning models

Compare and evaluate DNN models

Limitations of deep learning

Summary
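
A minimal keras sketch of the flattened dense network described in this chapter, assuming text has already been one-hot encoded into integer sequences of length 30 over a 20,000-word vocabulary (these sizes are illustrative assumptions):

    library(keras)

    model <- keras_model_sequential() %>%
      layer_embedding(input_dim = 20000 + 1, output_dim = 12, input_length = 30) %>%
      layer_flatten() %>%
      layer_dense(units = 32, activation = "relu") %>%
      layer_dense(units = 1, activation = "sigmoid")

    # Binary outcome (e.g., whether a Kickstarter campaign was funded)
    model %>% compile(
      optimizer = "adam",
      loss = "binary_crossentropy",
      metrics = c("accuracy")
    )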

9. Long short-term memory (LSTM) networks

A first LSTM model

Building an LSTM

Evaluation

Compare to a recurrent neural network

Case study: bidirectional LSTM

Case study: stacking LSTM layers

Case study: padding

Case study: training a regression model

Case study: vocabulary size

The full game: LSTM

Preprocess the data

Specify the model

Summary
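
A corresponding sketch with an LSTM layer in place of the flattened dense network; vocabulary size, embedding width, and unit counts are again illustrative assumptions:

    library(keras)

    model <- keras_model_sequential() %>%
      layer_embedding(input_dim = 20000 + 1, output_dim = 32) %>%
      layer_lstm(units = 32, dropout = 0.4, recurrent_dropout = 0.4) %>%
      layer_dense(units = 1, activation = "sigmoid")

    model %>% compile(
      optimizer = "adam",
      loss = "binary_crossentropy",
      metrics = c("accuracy")
    )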

10. Convolutional neural networks

What are CNNs?

Kernel

Kernel size

A first CNN model

Case study: adding more layers

Case study: byte pair encoding

Case study: explainability with LIME

Case study: hyperparameter search

The full game: CNN

Preprocess the data

Specify the model

Summary
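
A sketch of a one-dimensional convolutional network for text, where the kernel size sets how many consecutive tokens each filter spans; all sizes are illustrative assumptions:

    library(keras)

    model <- keras_model_sequential() %>%
      layer_embedding(input_dim = 20000 + 1, output_dim = 16) %>%
      layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu") %>%
      layer_global_max_pooling_1d() %>%
      layer_dense(units = 64, activation = "relu") %>%
      layer_dense(units = 1, activation = "sigmoid")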

IV Conclusion

Text models in the real world

Appendix

A Regular expressions

Literal characters

Meta characters

Full stop, the wildcard

Character classes

Shorthand character classes

Quantifiers

Anchors

Additional resources
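
A few of the regular expression building blocks named above, shown with base R; the patterns and strings are arbitrary examples:

    x <- c("cat", "cart", "carrot", "dog")

    grepl("ca.t", x)                    # full stop: the wildcard, any single character
    grepl("^ca[rt]+", x)                # anchor ^, a character class, and a + quantifier
    grepl("\\d+", c("call 555", "hi"))  # shorthand character class for digits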

B Data

Hans Christian Andersen fairy tales

Opinions of the Supreme Court of the United States

Consumer Financial Protection Bureau (CFPB) complaints

Kickstarter campaign blurbs

C Baseline linear classifier

Read in the data

Split into test/train and create resampling folds

Recipe for data preprocessing

Lasso regularized classification model

A model workflow

Tune the workflow

