Computational text analysis: a survival guide

Jeremi Ochab & Artjoms Šeļa

You may have heard that computational text analysis for digital humanities and cultural analytics is fun. We assure you it’s not: it’s a grim endeavour that can go wrong very quickly and requires patience, expertise and responsible choices. In this two-week course, we offer a comprehensive survival guide to multivariate text analysis with R. We start with the basics of counting words and spend a lot of time on fundamentals: text representation, calculation of differences and similarities, vector manipulations, and unsupervised and supervised methods of text classification. We will guide you through the user-friendly interface of the `stylo` software to introduce important concepts and operations. Then, we will show you how to expand on that: understand the workflow, design your own research, discuss real-world studies and run simple replication experiments. By the end of the course, you will be able to pursue research questions like:

  • Which textual features can betray an author’s or translator’s identity?
  • What unconscious elements of language reflect the author’s education, gender, religious background, and social or historical conditions?
  • What elements of style are affected by literary period, genre, and topic?
  • What are the textual relationships between books or authors?

Expanded description of the two weeks:

In the first week, you will learn the fundamental methodology for text analysis: turning texts into numbers that represent (or hope to represent) their different aspects, like style, authorship, theme or genre. In this first part, we focus on stylometry and authorship attribution as the oldest tradition of quantitative text analysis, and talk about the historical background of these methods and the assumptions and biases they introduce. We will work through the basics of unsupervised and supervised text analysis and their question-driven applications in computational literary studies. Here the course focuses on an accurate conceptual understanding of the methodology and illustrates it through the graphical user interface of the text analysis software `stylo`, which requires no coding skills.

However, all off-the-shelf software tends to prioritize some practices over others and makes a lot of invisible choices that push researchers to perceive possibilities as laws. In the second week, we will work our way through the illusion by taking apart `stylo` and demonstrating the whole text analysis pipeline step by step, also showing how to minimally reproduce it yourself with R and a `tidytext` workflow. We will use this disassembled workflow to talk about critically important problems: representing uncertainty and making results explainable. The former will require a conversation about statistics, the latter one about the nature of classification and feature importance. The hands-on nature of the second part of the workshop will make you familiar with the R programming language along the way, even if pure coding won’t be the focus. At the very end of the course, participants will work in groups to replicate a real-world computational study or have an opportunity to work on their own projects.
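
To give a flavour of what "taking the pipeline apart" can look like, here is a minimal sketch of the usual steps (tokenise, count, select the most frequent words, standardise, measure distances). It is an illustration under stated assumptions, not the course material and not how `stylo` is implemented: it uses base R only rather than `tidytext` so that it runs without extra packages, the three texts are invented, and the distance is computed in the spirit of Burrows's Delta (mean absolute difference of z-scores).

```r
# Toy corpus: three invented snippets (hypothetical data, for illustration only)
texts <- c(
  a1 = "the cat sat on the mat and the dog sat on the rug",
  a2 = "the cat and the dog sat on the mat by the door",
  b1 = "a storm of great fury broke upon a town of great age"
)

# 1. Tokenise: lowercase, then split on anything that is not a letter
tokenise <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[tokens != ""]
}
tokens <- lapply(texts, tokenise)

# 2. Select the n most frequent words across the whole corpus
n_mfw <- 5
corpus_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
mfw <- names(corpus_counts)[seq_len(n_mfw)]

# 3. Build a texts-by-words matrix of relative frequencies
rel_freq <- t(sapply(tokens, function(tk) {
  table(factor(tk, levels = mfw)) / length(tk)
}))

# 4. Standardise each word (column) to z-scores
z <- scale(rel_freq)

# 5. Manhattan distance on z-scores, divided by the number of words
delta <- dist(z, method = "manhattan") / n_mfw
print(round(as.matrix(delta), 2))
```

Every numbered step hides a choice (how to tokenise, how many words to keep, how to scale, which distance to use), and these are exactly the kinds of decisions that a graphical interface makes on your behalf.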

Week 1 prerequisites: none

Week 2 prerequisites: basics of text analysis / some familiarity with the `stylo` library

Week 1: Fundamentals

Day 1. Sorrows of software set-up and introduction to crimes of literary computation

Day 2. Torture of text representations and multidimensional misery

Day 3. Unsupervised and network analysis abyss

Day 4. Classification carnage

Day 5. Agony of application: authorship, genre, gender, poetic meters

Week 2: Research workflow

Day 6. Despair of deconstructing the programmatic pipelines

Day 7. Excruciatingly explainable results and features

Day 8. Sampling slaughterhouse and bootstrap bloodbath

Day 9. Revenge of replication studies and own projects’ purgatory

Day 10. Frightful future of text analysis and Q&A

Technical requirements:

Before the workshop, please have installed:

  • R and RStudio [https://posit.co/download/rstudio-desktop/] (Mac users also need XQuartz)
  • a simple text editor (for .txt and .csv files), such as Sublime Text, which is available for Windows, Linux and Mac
  • a zip/unzip program for managing compressed folders (most computers have one by default; 7-Zip and WinZip are options for Windows)