New Languages for NLP

The goal of this curriculum is to introduce humanist scholars to the key tools and methods of Natural Language Processing and to help them address inequalities in scholarship created by unequal access to technology. The curriculum is aimed specifically at scholars working on under-resourced languages who would like to use their linguistic and domain expertise to create assets for NLP tools.

Learning objectives

  • one
  • two
  • three

Prerequisites

There are no specific prerequisites for this curriculum.

Context

Every day when we read the news, we are reminded of how many of today’s global conflicts are rooted in a deep misunderstanding of local histories and cultures. While language barriers have always been an issue for historians, journalists and scholars, 21st-century researchers also face the issue of scale: in our era of “big data,” the cultural record is expanding at an astounding rate of 2.5 million terabytes per day; this is more information than a person could ever read, let alone analyze, in a lifetime. Making sense of so much text requires natural language processing, or NLP: a combination of linguistics and machine learning that allows computers to handle human text and speech efficiently and accurately.

The problem is that most NLP tools are designed for globally dominant languages, English first among them. But what if you’re speaking or studying Yiddish? Or what if you’re a journalist or scholar working on history, politics or literature written in Old Chinese or Ottoman Turkish? There are very few tools out there that can help you locate, extract and visualize information in these languages.

With this curriculum, we would like to help humanities scholars and students unlock the door to new knowledge in a greater range of world languages, so that scholars from around the globe can do cutting-edge research in their own idiom.

The lack of linguistic diversity in NLP today means that entire communities of speakers and scholars are left out, while resources, opportunities and knowledge in dominant languages continue to be amplified. The materials and experience gained through our project can empower a new generation of scholars for whom language is no longer a barrier to new knowledge.

The curriculum is based on the NEH-funded Advanced Institute in Digital Humanities and a series of workshops held online and at Princeton University in 2021 and 2022, in which we worked with 10 teams interested in using technology to access texts, ideas and cultures in non-dominant languages.

What to expect

The curriculum consists of 9 courses. Each course builds on the knowledge and terminology presented in the previous course.

In Introduction to NLP for Humanists, we’ll introduce you to the key concepts and workflows used when applying computational methods to the study of language. This will be a non-technical, conceptual introduction, which should give you a better sense of what NLP is about, what kinds of things can be done, and what kinds of tools and assets you would need in order to start using NLP in your own research.

In Practical Introduction to spaCy, we will get you started with one such tool called spaCy. We’ll walk you through concrete examples of how to search for textual patterns in English texts, how to xxx and how to xxxx.
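
To give a first taste of what such pattern searches look like in practice, here is a minimal sketch using spaCy’s rule-based Matcher. It assumes spaCy v3 and the small English pipeline en_core_web_sm are installed; the pattern and example sentence are illustrative, not part of the course materials.

```python
import spacy
from spacy.matcher import Matcher

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Match an adjective followed by any form of the noun "language",
# e.g. "dominant languages".
pattern = [{"POS": "ADJ"}, {"LEMMA": "language"}]
matcher.add("ADJ_LANGUAGE", [pattern])

doc = nlp("Most NLP tools are built for a few dominant languages.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # -> "dominant languages"
```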

In Machine Learning in NLP, we’ll delve deeper into the specifics of machine-learning approaches so that xxxxx.

In spaCy Architecture, we’ll help you understand how spaCy works on the inside and what kinds of assets are needed if you want to start using spaCy for currently unsupported languages.
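
As a rough illustration of that starting point, the sketch below (assuming spaCy v3) builds a blank pipeline containing only a tokenizer, which is more or less where an unsupported language begins; “xx” is spaCy’s generic multi-language code, and the component names are standard spaCy factories rather than language-specific assets.

```python
import spacy

# A blank pipeline: tokenizer only, no trained components.
nlp = spacy.blank("xx")
doc = nlp("Texts in an under-resourced language start out as plain tokens.")
print([token.text for token in doc])

# Trained components such as a named-entity recognizer can be added,
# but they remain empty until trained on annotated data.
nlp.add_pipe("ner")
print(nlp.pipe_names)  # -> ['ner']
```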

In Cadet and Inception, we’ll introduce you to two further software packages which should help you annotate texts and prepare them for machine learning and subsequent use in spaCy.

In Model Training, we will show you how to use annotated texts (produced with Cadet and Inception) to create language models that can be used by spaCy to process new texts.
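
As a rough sketch of what that hand-off looks like in spaCy v3, the snippet below packs annotated examples into the binary .spacy format that spaCy’s training command expects. The example sentence, entity offsets and file names are invented placeholders, not output from Cadet or Inception.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# (text, [(start_char, end_char, label), ...]) pairs from an annotation tool
annotated = [
    ("Princeton hosted the institute in 2021.", [(0, 9, "ORG")]),
]

for text, entities in annotated:
    doc = nlp(text)
    doc.ents = [doc.char_span(start, end, label=label)
                for start, end, label in entities]
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

# Training itself is then run from the command line, for example:
#   python -m spacy train config.cfg --output ./output \
#       --paths.train train.spacy --paths.dev dev.spacy
```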

In Applications of Models, we’ll show you how to use language models in spaCy, as well as how to adapt existing models to new texts or even language variants.
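
Applying a trained pipeline then looks something like the sketch below. The path output/model-best is the default location written by spacy train; any installed pipeline name (for example en_core_web_sm) would work the same way, and the example sentence is ours, not from the course.

```python
import spacy

# Load a trained pipeline from disk (or by installed package name).
nlp = spacy.load("output/model-best")

doc = nlp("The institute brought together teams working on Yiddish and Ottoman Turkish.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```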

Finally, in Next Steps we’ll chart out some of the paths that you may want to take upon completing this curriculum.

Resources