TY - BOOK
AU - Hvitfeldt, Emil
AU - Silge, Julia
TI - Supervised machine learning for text analysis in R
SN - 9780367554194
U1 - 006.35
PY - 2022///
CY - Boca Raton
PB - CRC Press
KW - Computational linguistics - Statistical methods
N2 - I Natural Language Features 1. Language and modeling Linguistics for text analysis A glimpse into one area: morphology Different languages Other ways text can vary Summary 2. Tokenization What is a token? Types of tokens Character tokens Word tokens Tokenizing by n-grams Lines, sentence, and paragraph tokens Where does tokenization break down? Building your own tokenizer Tokenize to characters, only keeping letters Allow for hyphenated words Wrapping it in a function Tokenization for non-Latin alphabets Tokenization benchmark Summary 3. Stop words Using premade stop word lists Stop word removal in R Creating your own stop words list All stop word lists are context-specific What happens when you remove stop words Stop words in languages other than English Summary 4. Stemming How to stem text in R Should you use stemming at all? Understand a stemming algorithm Handling punctuation when stemming Compare some stemming options Lemmatization and stemming Stemming and stop words Summary 5. Word Embeddings Motivating embeddings for sparse, high-dimensional data Understand word embeddings by finding them yourself Exploring CFPB word embeddings Use pre-trained word embeddings Fairness and word embeddings Using word embeddings in the real world Summary II Machine Learning Methods Regression A first regression model Building our first regression model Evaluation Compare to the null model Compare to a random forest model Case study: removing stop words Case study: varying n-grams Case study: lemmatization Case study: feature hashing Text normalization What evaluation metrics are appropriate?
The full game: regression Preprocess the data Specify the model Tune the model Evaluate the modeling Summary Classification A first classification model Building our first classification model Evaluation Compare to the null model Compare to a lasso classification model Tuning lasso hyperparameters Case study: sparse encoding Two class or multiclass? Case study: including non-text data Case study: data censoring Case study: custom features Detect credit cards Calculate percentage censoring Detect monetary amounts What evaluation metrics are appropriate? The full game: classification Feature selection Specify the model Evaluate the modeling Summary III Deep Learning Methods Dense neural networks Kickstarter data A first deep learning model Preprocessing for deep learning One-hot sequence embedding of text Simple flattened dense network Evaluation Using bag-of-words features Using pre-trained word embeddings Cross-validation for deep learning models Compare and evaluate DNN models Limitations of deep learning Summary Long short-term memory (LSTM) networks A first LSTM model Building an LSTM Evaluation Compare to a recurrent neural network Case study: bidirectional LSTM Case study: stacking LSTM layers Case study: padding Case study: training a regression model Case study: vocabulary size The full game: LSTM Preprocess the data Specify the model Summary Convolutional neural networks What are CNNs? 
Kernel Kernel size A first CNN model Case study: adding more layers Case study: byte pair encoding Case study: explainability with LIME Case study: hyperparameter search The full game: CNN Preprocess the data Specify the model Summary IV Conclusion Text models in the real world Appendix A Regular expressions A Literal characters A Meta characters A Full stop, the wildcard A Character classes A Shorthand character classes A Quantifiers A Anchors A Additional resources B Data B Hans Christian Andersen fairy tales B Opinions of the Supreme Court of the United States B Consumer Financial Protection Bureau (CFPB) complaints B Kickstarter campaign blurbs C Baseline linear classifier C Read in the data C Split into test/train and create resampling folds C Recipe for data preprocessing C Lasso regularized classification model C A model workflow C Tune the workflow
ER -