New Languages for NLP

The goal of this curriculum is to introduce humanist scholars to the key tools and methods of Natural Language Processing and to help them address inequalities in scholarship created by unequal access to technology. The curriculum is aimed specifically at scholars working on under-resourced languages who would like to use their linguistic and domain expertise to create assets for NLP tools.

Learning objectives

  • one
  • two
  • three

Prerequisites

There are no specific prerequisites for this curriculum.

Context

Every day when we read the news, we are reminded of how many of today’s global conflicts are rooted in a deep misunderstanding of local histories and cultures. While language barriers have always been an issue for historians, journalists and scholars, 21st-century researchers also face the issue of scale: in our era of “big data,” the cultural record is expanding at an astounding rate of 2.5 million terabytes per day; this is more information than a person could ever read, let alone analyze, in a lifetime. Making sense of so much text requires natural language processing, or NLP: a combination of linguistics and machine learning that allows computers to handle human text and speech efficiently and accurately.

The problem is that most NLP tools are designed for globally dominant languages, English first among them. But what if you’re speaking or studying Yiddish? Or what if you’re a journalist or scholar working on history, politics or literature written in Old Chinese or Ottoman Turkish? There are very few tools out there that can help you locate, extract and visualize information in these languages.

With this curriculum, we would like to help humanities scholars and students unlock the door to new knowledge in a greater range of world languages, so that scholars from around the globe can do cutting-edge research in their own idiom.

The lack of linguistic diversity in NLP today means that entire communities of speakers and scholars are left out, while resources, opportunities and knowledge in dominant languages continue to be amplified. The materials and experience gained through our project can empower a new generation of scholars for whom language is no longer a barrier to new knowledge.

The curriculum is based on the NEH-funded Advanced Institute in Digital Humanities and a series of workshops held online and at Princeton University in 2021 and 2022, in which we worked with 10 teams interested in using technology to access texts, ideas and cultures in non-dominant languages.

What to expect

The curriculum consists of 9 courses. Each course builds on the knowledge and terminology presented in the previous course.

In Introduction to NLP for Humanists, we’ll introduce you to the key concepts and workflows used when applying computational methods to the study of language. This will be a non-technical, conceptual introduction, which should give you a better sense of what NLP is about, what kinds of things can be done, and what kinds of tools and assets you would need in order to start using NLP in your own research.

In Practical Introduction to spaCy, we will get you started with one such tool called spaCy. We’ll walk you through concrete examples of how to search for textual patterns in English texts, how to xxx and how to xxxx.
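
To give a first taste of what such pattern searches look like in practice, here is a minimal sketch using spaCy’s rule-based Matcher. It assumes spaCy v3 and the small English pipeline en_core_web_sm are installed; the pattern and example sentence are illustrative, not part of the course materials.

```python
import spacy
from spacy.matcher import Matcher

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Match an adjective followed by any form of the noun "language",
# e.g. "dominant languages".
pattern = [{"POS": "ADJ"}, {"LEMMA": "language"}]
matcher.add("ADJ_LANGUAGE", [pattern])

doc = nlp("Most NLP tools are built for a few dominant languages.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "dominant languages"
```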

In Machine Learning in NLP, we’ll delve deeper into the specifics of machine-learning approaches so that xxxxx.

In spaCy Architecture, we’ll help you understand how spaCy works on the inside and what kinds of assets are needed if you want to start using spaCy for currently unsupported languages.
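
As a rough illustration of that starting point, the sketch below (assuming spaCy v3) builds a blank pipeline containing only a tokenizer, which is more or less where an unsupported language begins; “xx” is spaCy’s generic multi-language code, and the component names are standard spaCy factories rather than language-specific assets.

```python
import spacy

# A blank pipeline: tokenizer only, no trained components.
nlp = spacy.blank("xx")
doc = nlp("Texts in an under-resourced language start out as plain tokens.")
print([token.text for token in doc])

# Trained components such as a named-entity recognizer can be added,
# but they remain empty until trained on annotated data.
nlp.add_pipe("ner")
print(nlp.pipe_names)  # -> ['ner']
```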

In Cadet and Inception, we’ll introduce you to two further software packages which should help you annotate texts and prepare them for machine learning and subsequent use in spaCy.

In Model Training, we will show you how to use annotated texts (produced with Cadet and Inception) to create language models that can be used by spaCy to process new texts.
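
As a rough sketch of what that hand-off looks like in spaCy v3, the snippet below packs annotated examples into the binary .spacy format that spaCy’s training command expects. The example sentence, entity offsets and file names are invented placeholders, not output from Cadet or Inception.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# (text, [(start_char, end_char, label), ...]) pairs from an annotation tool
annotated = [
    ("Princeton hosted the institute in 2021.", [(0, 9, "ORG")]),
]

for text, entities in annotated:
    doc = nlp(text)
    doc.ents = [doc.char_span(start, end, label=label)
                for start, end, label in entities]
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

# Training itself is then run from the command line, for example:
#   python -m spacy train config.cfg --output ./output \
#       --paths.train train.spacy --paths.dev dev.spacy
```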

In Applications of Models, we’ll show you how to use language models in spaCy, as well as how to adapt existing models to new texts or even language variants.
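
Applying a trained pipeline then looks something like the sketch below. The path output/model-best is the default location written by spacy train; any installed pipeline name (for example en_core_web_sm) would work the same way, and the example sentence is ours, not from the course.

```python
import spacy

# Load a trained pipeline from disk (or by installed package name).
nlp = spacy.load("output/model-best")

doc = nlp("The institute brought together teams working on Yiddish and Ottoman Turkish.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```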

Finally, in Next Steps we’ll chart out some of the paths that you may want to take upon completing this curriculum.

Resources