Literature from a distance

By: Erin O'Rourke
On: September 19, 2020
In: Ted Underwood
Tagged: English Literature, Libraries, Machine Learning, Reading, Ted Underwood

[Image: a library with a woman seated on a bench in the center]

Readers today have a wealth of information available about any given work, from its publication date and its author’s biography to its genre and details about previous editions and formats. On top of that, nearly any book seems to be available through online retailers like Amazon, through public or university libraries, or as an e-book. With all this information at our fingertips, it was surprising to learn how much there still is to know about collections of written works spanning only the past few centuries.

On Friday, Ted Underwood, professor of Information Sciences and English at the University of Illinois, spoke to participants in the Mellon Sawyer Seminar, answering questions about how data, and the absence of data, shape his work.

Underwood’s area of expertise is distant reading: drawing conclusions about large collections of written work by analyzing metadata and other relatively objective features. As Underwood has described in Digital Humanities Quarterly, distant reading existed in literary studies long before computers were equipped to help with it. When researchers first gained access to computational methods like optical character recognition, which makes the full text of a work searchable, they applied them to a variety of problems and soon learned which kinds of questions computers are suited to answer. People, not computers, remain far better at closely reading a single text or the works of one author to characterize them, and at answering “why” questions, such as why an author made a specific literary choice.
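To make the contrast with close reading concrete, here is a minimal sketch of the kind of counting distant reading typically begins with: tracking how often a word appears across a corpus, grouped by decade of publication. The corpus.csv file, its year and path columns, and the example word are all hypothetical; they stand in for whatever metadata and full text a researcher actually has.

```python
import csv
from collections import Counter

def frequency_by_decade(metadata_csv, word):
    """Relative frequency of `word` per decade across a corpus.

    Assumes a hypothetical metadata file with one row per volume,
    holding a publication `year` and a `path` to its plain text.
    """
    counts = Counter()   # occurrences of `word` per decade
    totals = Counter()   # total tokens per decade, for normalization
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            decade = (int(row["year"]) // 10) * 10
            with open(row["path"], encoding="utf-8") as text:
                tokens = text.read().lower().split()
            counts[decade] += tokens.count(word.lower())
            totals[decade] += len(tokens)
    # Normalizing keeps long volumes from dominating the trend line.
    return {d: counts[d] / totals[d] for d in sorted(totals)}

# e.g. frequency_by_decade("corpus.csv", "heart")
```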

In his most recent project, Underwood and his team worked with the HathiTrust Digital Library, a collection of seventeen million volumes gathered from more than sixty research libraries. With a library that large, it is hard to imagine a question one couldn’t answer given enough time, but Underwood made it clear that even such an extensive collection has its limitations. This brought us to the question of who and what is absent from the data Underwood considers. Because most of the contributing libraries are located in the US, Canada, and Europe, the inclusion of international works correlates with prestige: an award-winning novel from another country is likely to be included, but popular literature from that country is not. Certain genres, such as children’s and juvenile fiction and newspaper fiction, are also underrepresented. Another issue, less specific to computational methods, is how far conclusions generalize to the broader population: fiction writing is hardly a perfect representation of humanity, or even of the United States, with regard to gender and race. A final, more technical challenge is that many books published after 1923 remain under copyright. To make these books usable in projects like Underwood’s, the library lets users look up word frequencies on a given page, just not the text word-for-word.
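That copyright restriction shapes what analysis looks like in practice: for an in-copyright volume, a researcher sees a bag of per-page word counts rather than running text. The sketch below uses an invented page-count structure to show why that representation still supports frequency-based work; HathiTrust’s real extracted-features files are richer JSON with different field names.

```python
from collections import Counter

# Invented per-page word counts for a single in-copyright volume;
# the words and numbers are illustrative, not real HathiTrust data.
pages = [
    {"the": 41, "heart": 3, "she": 12},   # page 1
    {"the": 38, "heart": 1, "he": 15},    # page 2
    {"the": 44, "she": 9, "letter": 2},   # page 3
]

# Aggregating pages yields volume-level counts. Word frequencies
# survive this representation; word order and sentences do not,
# which is exactly why it can be shared for copyrighted books.
volume = Counter()
for page in pages:
    volume.update(page)

total = sum(volume.values())
print({word: round(n / total, 3) for word, n in volume.most_common(3)})
```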

With all of these factors considered, Underwood reiterated: you can’t model away bias or the limitations of a collection. What he can do is put those collections to careful use: finding words and phrases that correlate with the gender of the character being described, to trace how the characterization of gender in literature has changed over time; using machine learning to detect the genre of a piece of writing; and examining what kinds of backgrounds the authors in a large body of literature come from. Understanding the models behind these questions, readers come to see that in literature, shifting human definitions of categories like genre matter more than inherent qualities of the works. Informed by knowledge like this, humanists can be prepared to ask different questions.
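As a rough illustration of the machine-learning side of this work, here is a toy bag-of-words genre classifier. The two genres, the four training snippets, and the choice of scikit-learn’s logistic regression are all assumptions made for the example; Underwood’s published genre models are trained on whole volumes with far more care.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented miniature training set: bag-of-words features, genre labels.
texts = [
    "the detective examined the body and questioned the witness",
    "the inspector traced the weapon to the empty station",
    "her heart ached as she read his letter by the window",
    "they danced all evening and he asked for her hand",
]
labels = ["detective", "detective", "romance", "romance"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)       # rows = texts, columns = word counts
model = LogisticRegression().fit(X, labels)

# The model predicts from word frequencies alone, ignoring word order;
# its coefficients show which words pull toward which genre.
test = vectorizer.transform(["the witness saw the inspector"])
print(model.predict(test))                # -> ['detective']
```

Even at this toy scale, the point survives: what the model learns is a statistical profile of how people have applied a label, not some essence of the genre itself.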

Arriving at conclusions like these requires a team with a specific skill set. Underwood sought a dual appointment between the Information Sciences and English departments at the University of Illinois partly so he could collaborate with researchers trained in quantitative methods like computer programming and statistics. That divide in technical skills between departments seems less than ideal: Underwood’s projects focus on literature, and literature students would benefit greatly from training opportunities within their own department to work on projects like these. In the question-and-answer portion of the talk, Underwood and several seminar participants identified strategies for bringing more quantitative methods into humanities degree programs: department-specific methods courses; sending students to a related department, like information science or statistics, for introductory classes; and even a course on the subject of evidence, co-taught by statisticians and humanists and covering both quantitative and qualitative methods. Perhaps the most distinctive approach proposed at the Friday seminar was designing degree programs with a lighter core and encouraging deeper dives into subfields: a student would take an introductory quantitative course within their home discipline, followed by more detailed courses in other departments. Underwood emphasized the importance of offering such courses in graduate school; restricting research roles to students who already arrive with technical skills would only introduce new biases into the field.

Underwood’s talks and published work challenge readers to expand their sense of what methods literary scholarship can employ, what questions those methods can answer, and how universities should adapt their course offerings to a changing landscape of research. It was great to hear from him at this talk, and I look forward to the rest of the series.
