Small Forays into Word Clouds and Text Mining

This week: Harriet Jacobs’s Incidents in the Life of a Slave Girl

(Note: After I had taught and written about this, Jacob Heil (@dr_heil) led me to Ryan Cordell’s more developed work with word clouds and Paul Fyfe’s “How to Not Read a Victorian Novel.” My next foray will benefit enormously from their work.)

Each Friday we have an hour in the computer lab. Today, we went to Wordle.net and spent half of the class selecting chapters from Harriet Jacobs’s Incidents in the Life of a Slave Girl to create “beautiful word clouds.” I am charmed by word clouds, and I think that they provide an engaging and unintimidating way to introduce data visualization–its affordances and its pitfalls–to students, even those who are struggling with how to use “Ctrl V” or who don’t know why we have to use Firefox as our browser.

I relied on Ted Underwood’s blog post “Where to Start with Text Mining” to introduce students to the purpose of text mining. I knew we wouldn’t go anywhere near the primary research he and his colleagues are doing. But giving students a sense of these categories of text mining helped them see what “big data” can do in the humanities. Underwood’s list pointed the way: Categorize documents. Contrast the vocabulary of different corpora. Trace the history of particular features (words or phrases) over time. Cluster features that tend to be associated in a given corpus of documents (aka topic modeling). Entity extraction. Visualization.

So we began with visualization. No one in the room had heard of “wordle”–which surprised me. And so we went to UNC’s website, where a full-text version of the book was available on a single page, and I asked them to select “all” and copy the text into the wordle window. The wordle we all came up with, more or less, looked like this:

We quickly saw that “Page” was not a word that would help us to interpret Jacobs’s book, and this gave us a very clear idea of the methodological problems that can come up with text mining: not every word in a text carries the same kind of meaning, and so finding ways to filter out words such as “Page” is key.

I walked them through the quick process of “find” and “replace” in Word, such that we got rid of the offending “Page,” and a new wordle emerged:

A wordle using the text of _Incidents_ after all 297 occurrences of the word “Page” were removed.
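For readers who would rather script this step than use find and replace, here is a minimal sketch in Python of the same idea: count the words, but skip the ones on a "stop list." This is not how Wordle itself works internally. The stop list, the function name `word_counts`, and the sample sentence are all my own placeholders, not text from the book.

```python
import re
from collections import Counter

# Hypothetical stop list: a handful of very common English words plus
# "page", the token we removed by hand in class. A real list would be longer.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "i", "was", "that", "my", "page"}

def word_counts(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens, skipping anything in stop_words."""
    # Letters plus straight or curly apostrophes count as part of a word;
    # digits and punctuation are treated as separators.
    tokens = re.findall(r"[a-z'’]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Placeholder sample, not a quotation from Incidents.
sample = "Page 12 My children were dear to me. Page 13 The children slept."
counts = word_counts(sample)
print(counts.most_common(1))
```

Passing an empty set as `stop_words` brings “page” back into the counts, which is a quick way to see how much a single unfiltered word can distort the picture.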

Everyone now saw that “children” was the most frequent word in the book. While “Flint” and “master” and “slave” were also prevalent, the word “children” supported what we had been discussing about the book all week: Jacobs had a rhetorical purpose in mind. She wanted to build a bridge between herself and her white audience, and being first a child and then a mother helped her to do this.

I then asked students to spend the rest of the class selecting portions of the book–chapters that had intrigued or interested them–and to see how creating a wordle supported their initial interpretations. We all wrote up our comments in the live forum in our LMS.

During class, one student found a text by Sojourner Truth, created a wordle from it, and reported on the prevalence of “God” (a word that appears smaller in the wordle of Jacobs’s text).

A wordle from Sojourner Truth’s narrative.
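A caution about comparing two wordles by eye: the size of a word reflects how it ranks within its own text, and the two texts are not the same length. One hedged way to check a comparison like the student's is to compute a rate per 1,000 words. The sketch below assumes that approach. The function name `relative_frequency` is hypothetical, and the two short strings are placeholders, not passages from either narrative.

```python
import re
from collections import Counter

def relative_frequency(text, word):
    """Return occurrences of `word` per 1,000 word tokens in `text`."""
    tokens = re.findall(r"[a-z'’]+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return 1000 * Counter(tokens)[word.lower()] / len(tokens)

# Placeholder snippets standing in for two full texts.
text_a = "god is good and god is near"
text_b = "the children and the master and god"

print(relative_frequency(text_a, "God"))
print(relative_frequency(text_b, "God"))
```

With the full texts pasted in place of the placeholders, the two numbers can be compared directly, whatever the lengths of the books.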

I think that I will use this wordle exercise again–and maybe extend it to the ngram viewer so that we can look more closely at relationships among words over time.

Definitely a useful introduction–very small steps, just enough to intrigue the students but not overwhelm them. Something to build on this term.

Doing DH at the CC

Doc McGrail's Blog about Digital Humanities at the Community College
