CLS-INFRA Training School on Data and Annotation

Lisanne van Rossum; Maarten Janssen; Silvie Cinková; Justin Tonra; Ciara L Murphy; Michal Křen; Václav Cvrček

Summer School

Univerzita Karlova, Praha
7-9 June 2022

Session 1: Intro - Information extraction from the Shakespeare Drama Corpus
Session 2: XML, TEI, and TEITOK I
Session 3: Universal Dependencies – Morphology
Session 4: Base CQL
Session 5: Metadata
Session 6: Advanced CQL
Session 7: Statistics
Session 8: Universal Dependencies – Syntax
Session 9: Named-Entity Recognition and bulk editing
Session 10: Tree queries - Grew

Intro - Information extraction from the Shakespeare Drama Corpus

We use the Shakespeare Drama Corpus to show you how to extract information with corpus-linguistic methods - which questions you can ask and what answers you can expect. This is a general overview; you will learn details of the individual tools and methods in the following sessions.

Session 1 - Introduction

Download the slides
(PDF)

Speaker for this session

Silvie Cinková
Silvie's background is German and Swedish philology. She entered academia dreaming of a lifetime among medieval manuscripts - until a lecture on corpus linguistics and a text-search demo threw her off that path! Silvie has since joined the Charles University Institute of Formal and Applied Linguistics (Faculty of Mathematics and Physics). She has been captivated by the power of linguistic annotation and query languages, and they have even inspired her to learn some statistics and coding.

Download the complete session synthesis
(PDF)

XML, TEI, and TEITOK I

This session introduces the basic principles of XML markup and presents the Text Encoding Initiative (TEI-XML) guidelines as an encoding standard for digital editions and textual corpora. The students produce a valid TEI-XML document and upload it to TEITOK, our web-based platform for viewing, creating, and editing corpora with both rich textual markup and linguistic annotation.
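As a rough illustration of what such a document looks like, the following Python sketch builds a minimal TEI-XML file with the standard library. The function name `build_minimal_tei` and the sample texts are assumptions made for this example, not material from the session. The sketch only guarantees well-formed XML with the header elements TEI expects (`titleStmt`, `publicationStmt`, `sourceDesc`); it does not validate against the TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # the TEI namespace


def build_minimal_tei(title: str, paragraph: str) -> ET.Element:
    """Build a TEI tree with a header (fileDesc) and a one-paragraph body."""
    ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace

    def q(tag: str) -> str:
        return f"{{{TEI_NS}}}{tag}"

    tei = ET.Element(q("TEI"))

    # teiHeader/fileDesc with its three expected children
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Unpublished teaching example."
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = "Born digital."

    # text/body holds the actual content
    text = ET.SubElement(tei, q("text"))
    body = ET.SubElement(text, q("body"))
    ET.SubElement(body, q("p")).text = paragraph
    return tei


doc = build_minimal_tei("Sample play excerpt", "To be, or not to be, that is the question.")
xml_string = ET.tostring(doc, encoding="unicode")
print(xml_string)
```

Real validation would require an XML editor or validator with the TEI schema loaded; that step is outside this sketch.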

Session 2 - XML, TEI and TEITOK

Download the slides
(PDF)

Speaker for this session

Maarten Janssen
With a background in computational linguistics, Maarten has been involved in many corpus projects. Over the course of time he has developed the TEITOK environment, which is intended to allow linguists to build, maintain, and improve their own corpus without the need for extensive computational skills. Maarten is currently employed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, at Charles University in Prague.

Download the complete session synthesis
(PDF)

Universal Dependencies – Morphology

This session presents UDPipe, an NLP tool to analyse texts in more than seventy languages. UDPipe works with Universal Dependencies, a framework for consistent grammar annotation across human languages. We particularly focus on its morphological annotation scheme, explaining the individual part-of-speech labels, as well as the more fine-grained morphological features, using English examples. A separate session deals with the syntactic markup.
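Universal Dependencies annotation is usually exchanged in the CoNLL-U format, where each token is one line of ten tab-separated columns. The sketch below reads one such line and unpacks the part-of-speech label and the morphological features. The token line is hand-written for illustration, not actual UDPipe output, and the helper names are assumptions of this example.

```python
# The ten CoNLL-U columns, in order
CONLLU_COLUMNS = ["id", "form", "lemma", "upos", "xpos",
                  "feats", "head", "deprel", "deps", "misc"]


def parse_feats(feats: str) -> dict:
    """Turn 'Number=Sing|Person=3' into {'Number': 'Sing', 'Person': '3'}."""
    if feats == "_":  # underscore marks an empty column
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_token_line(line: str) -> dict:
    """Split one CoNLL-U token line into named columns."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    token = dict(zip(CONLLU_COLUMNS, values))
    token["feats"] = parse_feats(token["feats"])
    return token


# Hand-written example line for the English verb form "loves"
LINE = ("2\tloves\tlove\tVERB\tVBZ\t"
        "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_")

token = parse_token_line(LINE)
print(token["form"], token["upos"], token["feats"])
```

The part-of-speech label sits in the `upos` column, while the finer-grained features (here tense, person, number) live in `feats` as attribute-value pairs.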

Session 3 - Universal Dependencies – Morphology

Download the slides
(PDF)

Speaker for this session

Silvie Cinková

Base CQL

CQL (Corpus Query Language, developed in the 1990s) is the de facto standard in the field, used by most current corpus query tools. We start by explaining regular expressions to the students and gradually add diverse restrictions within a single-token query. We demonstrate the searches in KonText, the corpus manager developed and maintained by the Institute of the Czech National Corpus.
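To give a feel for single-token queries, the Python sketch below imitates what a CQL query such as `[lemma="love" & pos="VERB"]` does: every condition is a regular expression that must match the whole attribute value, and conditions joined by `&` must all hold. This is only an emulation on a toy token list, not how a corpus manager is implemented; the attribute name `pos` and the sample tokens are assumptions, since attribute names differ between corpora.

```python
import re

# Toy annotated tokens; attribute names are assumptions for this example
tokens = [
    {"word": "She", "lemma": "she", "pos": "PRON"},
    {"word": "loves", "lemma": "love", "pos": "VERB"},
    {"word": "lovely", "lemma": "lovely", "pos": "ADJ"},
    {"word": "lovers", "lemma": "lover", "pos": "NOUN"},
    {"word": "loved", "lemma": "love", "pos": "VERB"},
    {"word": "love", "lemma": "love", "pos": "NOUN"},
]


def match_token(token: dict, conditions: dict) -> bool:
    """True if every attribute fully matches its regular expression."""
    return all(re.fullmatch(pattern, token[attr]) is not None
               for attr, pattern in conditions.items())


def search(tokens: list, conditions: dict) -> list:
    """Return the word forms of all tokens satisfying the conditions."""
    return [t["word"] for t in tokens if match_token(t, conditions)]


# CQL: [word="lov.*"]
print(search(tokens, {"word": "lov.*"}))
# CQL: [word="love"]  (no wildcard, so only the exact form matches)
print(search(tokens, {"word": "love"}))
# CQL: [lemma="love" & pos="VERB"]
print(search(tokens, {"lemma": "love", "pos": "VERB"}))
```

Using `re.fullmatch` rather than `re.search` mirrors the whole-value matching of CQL: `love` does not find `lovely` unless the pattern is widened to `love.*`.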

Session 4 - Base CQL

Speaker for this session

Michal Křen
Michal, a mathematician by background, did his Ph.D. in linguistics at Charles University in Prague, Czech Republic. Building, editing, annotating, and querying text corpora, as well as maintaining, customizing, and developing corpus tools, is his daily business at the Institute of the Czech National Corpus – and so is research! Michal has matured along with the institute to serve his turn as department head, and he has extensive experience in teaching corpus linguistics and basic linguistic programming to humanities students.

Download the complete session synthesis
(PDF)

Metadata

This session explains the TEI-XML header structure, its relation to the document body, good practice and its implementation in TEITOK. It prepares the students for setting up their own corpus of philological texts with complex headers and genre-specific text metadata. The students receive guidance to create their own document headers.
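The relation between header and body can be illustrated by reading metadata fields out of a TEI header with XPath-like paths. The sketch below uses Python's standard library on an invented sample document; the field selection, the placement of genre in `textClass/keywords/term`, and the names `FIELDS` and `read_metadata` are assumptions of this example rather than a description of TEITOK's configuration.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document: metadata in teiHeader, content in text/body
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Hamlet</title>
        <author>William Shakespeare</author>
      </titleStmt>
      <publicationStmt><p>Teaching example.</p></publicationStmt>
      <sourceDesc><p>Hypothetical source description.</p></sourceDesc>
    </fileDesc>
    <profileDesc>
      <textClass>
        <keywords><term type="genre">tragedy</term></keywords>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text><body><p>Who's there?</p></body></text>
</TEI>"""

# Metadata field name -> path inside the document (relative to <TEI>)
FIELDS = {
    "title": "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
    "author": "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author",
    "genre": ("tei:teiHeader/tei:profileDesc/tei:textClass/"
              "tei:keywords/tei:term[@type='genre']"),
}


def read_metadata(xml_text: str) -> dict:
    """Look up each configured field; missing fields come back as None."""
    root = ET.fromstring(xml_text)
    metadata = {}
    for name, path in FIELDS.items():
        node = root.find(path, NS)
        metadata[name] = node.text if node is not None else None
    return metadata


metadata = read_metadata(SAMPLE)
print(metadata)
```

Keeping the paths in one table makes it easy to add genre-specific fields later without touching the lookup code.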

Session 5 - Metadata

Download the slides
(PDF)

Speaker for this session

Maarten Janssen

Advanced CQL

This session is a continuation of the Base CQL session. It draws on queries about individual tokens and proceeds to queries on a sequence of tokens, introducing the concept of group referencing and metadata scope (e.g. within a single sentence). It also introduces the aggregation and filtering functionalities of KonText, the corpus manager used in the demonstration.
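The step from single tokens to token sequences can be imitated in a few lines of Python. The sketch below emulates a query like `[pos="ADJ"] [pos="NOUN"] within <s/>`: the pattern is a list of single-token conditions that must match consecutive tokens, and a match may not cross a sentence boundary. The sample sentences, the attribute name `pos`, and the function names are assumptions of this example; it does not cover group referencing or KonText's aggregation features.

```python
import re

# Two toy sentences; the sentence boundary matters for the query below
sentences = [
    [{"word": "The", "pos": "DET"}, {"word": "night", "pos": "NOUN"},
     {"word": "was", "pos": "AUX"}, {"word": "dark", "pos": "ADJ"}],
    [{"word": "Clouds", "pos": "NOUN"}, {"word": "hid", "pos": "VERB"},
     {"word": "the", "pos": "DET"}, {"word": "pale", "pos": "ADJ"},
     {"word": "moon", "pos": "NOUN"}],
]


def match_token(token: dict, conditions: dict) -> bool:
    """Single-token test: every attribute fully matches its regex."""
    return all(re.fullmatch(pattern, token[attr]) is not None
               for attr, pattern in conditions.items())


def find_sequences(sentences: list, pattern: list) -> list:
    """Find consecutive tokens matching the pattern inside one sentence."""
    hits = []
    for sentence in sentences:
        # Slide a window of the pattern's length over the sentence only
        for start in range(len(sentence) - len(pattern) + 1):
            window = sentence[start:start + len(pattern)]
            if all(match_token(tok, cond) for tok, cond in zip(window, pattern)):
                hits.append(" ".join(tok["word"] for tok in window))
    return hits


# CQL: [pos="ADJ"] [pos="NOUN"] within <s/>
hits = find_sequences(sentences, [{"pos": "ADJ"}, {"pos": "NOUN"}])
print(hits)
```

Without the sentence restriction, the adjective ending the first sentence and the noun opening the second would form a spurious hit ("dark Clouds"); searching each sentence separately excludes it.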

Session 6 - Advanced CQL

Speaker for this session

Michal Křen

Download the complete session synthesis
(PDF)

Statistics

This session explains statistical considerations on frequency in corpus linguistics, mainly the statistical significance and effect size of a difference between two frequency counts. It also introduces several quantitative stylistic metrics, such as different flavors of lexical richness, descriptivity vs. narrativity, thematic concentration, and thematic weights of individual words. Students learn about the online tools Calc and QuitaUp, which calculate these metrics automatically.
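The distinction between significance and effect size can be made concrete with a small calculation. The Python sketch below compares the frequency of one word in two corpora using the log-likelihood statistic (a common significance test for frequency counts) and the ratio of relative frequencies (a simple effect size). Both measures are choices made for this example and are not necessarily the ones presented in the session; the counts are invented.

```python
import math


def per_million(count: int, corpus_size: int) -> float:
    """Relative frequency in instances per million tokens."""
    return count / corpus_size * 1_000_000


def log_likelihood(count1: int, size1: int, count2: int, size2: int) -> float:
    """Log-likelihood (G2) for one word observed in two corpora."""
    total = count1 + count2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    g2 = 0.0
    for observed, expected in ((count1, expected1), (count2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def frequency_ratio(count1: int, size1: int, count2: int, size2: int) -> float:
    """Effect size: how many times more frequent the word is in corpus 1."""
    return per_million(count1, size1) / per_million(count2, size2)


# Invented counts: 120 hits in 100,000 tokens vs. 80 hits in 100,000 tokens
g2 = log_likelihood(120, 100_000, 80, 100_000)
ratio = frequency_ratio(120, 100_000, 80, 100_000)
print(f"G2 = {g2:.2f}, ratio = {ratio:.2f}")
# With one degree of freedom, G2 above 3.84 corresponds to p < 0.05
print("significant at 0.05:", g2 > 3.84)
```

The two numbers answer different questions: G2 says whether the difference is likely to be due to chance, while the ratio says how large the difference is. With very large corpora, even tiny differences become significant, which is why an effect size is reported alongside the test.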

Session 7 - Statistics

Download the slides
(PDF)

Speaker for this session

Václav Cvrček
Václav studied Czech studies, linguistics and phonetics at the Faculty of Arts, Charles University, and is fluent in related statistics and computing. In his dissertation in the field of mathematical (corpus) linguistics, he focused on the issue of language regulation and linguistic interventions in language development. He is currently a researcher at the Institute of the Czech National Corpus, Faculty of Arts, Charles University, where he is head of department. Václav has been teaching humanities students for many years, in subjects such as lexicology, statistics, and corpus linguistics.

Universal Dependencies – Syntax

This is a continuation of the Universal Dependencies – Morphology session. It explains the principles of dependency grammar and its UD flavor, touching upon the interplay between linguistic form and function, as well as ambiguity and vagueness in linguistic annotation. We then elaborate on the most common syntactic labels and their typical usage.
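In a UD analysis, every token points to its syntactic head and carries a label for that relation. The Python sketch below reads a small, hand-annotated CoNLL-U sentence and lists the dependents of the root with their labels. The sentence and the helper names are assumptions of this example; only the `HEAD` and `DEPREL` columns of the format are used for the syntax.

```python
# Hand-annotated example sentence in CoNLL-U (tab-separated columns)
SENTENCE = """# text = Juliet loves Romeo.
1\tJuliet\tJuliet\tPROPN\t_\tNumber=Sing\t2\tnsubj\t_\t_
2\tloves\tlove\tVERB\t_\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\t_
3\tRomeo\tRomeo\tPROPN\t_\tNumber=Sing\t2\tobj\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_sentence(block: str) -> list:
    """Read the token lines of one sentence into dictionaries."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges and empty nodes
            continue
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),   # 0 means the token is the root
            "deprel": cols[7],
        })
    return tokens


def dependents(tokens: list, head_id: int) -> list:
    """Return (relation, word form) pairs for all children of a token."""
    return [(t["deprel"], t["form"]) for t in tokens if t["head"] == head_id]


tokens = read_sentence(SENTENCE)
root = next(t for t in tokens if t["head"] == 0)
print(root["form"], dependents(tokens, root["id"]))
```

Here the verb is the root, the subject attaches to it as `nsubj`, the object as `obj`, and the full stop as `punct`, which are among the most frequent UD relation labels.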

Session 8 - Universal Dependencies – Syntax

Download the slides
(PDF)

Speaker for this session

Silvie Cinková

Named-Entity Recognition and bulk editing

In this session, students gain a basic overview of the state of the art in Named-Entity Recognition and Entity Linking (referring from a linguistic entity to an external knowledge base). They learn about the main Entity Linking authorities, such as Wikidata and VIAF. They get hands-on experience with TEITOK’s manual entity annotation module. Finally, they get acquainted with TEITOK’s bulk editing module.

Session 9 - Named-Entity Recognition and bulk editing

Download the slides
(PDF)

Speaker for this session

Maarten Janssen

Tree queries - Grew

This session teaches the students the foundations of Grew, a declarative tree query language, working with its implementation in TEITOK. We demonstrate the power of tree querying by searching the syntactically parsed Shakespeare corpus for salient semantic participants of a verb, showing which grammatical constructions to look for and how to implement the search in Grew.
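The idea behind a declarative tree query can be sketched in Python: the pattern names some nodes, states constraints on their features, and lists the labelled edges that must connect them. The code below emulates a Grew-style pattern roughly of the form `pattern { V [upos=VERB, lemma="love"]; V -[nsubj]-> S; V -[obj]-> O }` on one hand-annotated sentence. It is a brute-force illustration under assumed data structures, not Grew's implementation; the function name `match_pattern` and the sample tokens are invented for this example.

```python
from itertools import permutations

# Hand-annotated dependency tree for "Juliet loves Romeo"
tokens = [
    {"id": 1, "form": "Juliet", "lemma": "Juliet", "upos": "PROPN",
     "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "loves", "lemma": "love", "upos": "VERB",
     "head": 0, "deprel": "root"},
    {"id": 3, "form": "Romeo", "lemma": "Romeo", "upos": "PROPN",
     "head": 2, "deprel": "obj"},
]


def match_pattern(tokens: list, nodes: dict, edges: list) -> list:
    """Find all assignments of distinct tokens to the pattern nodes.

    nodes: pattern node name -> required feature values
    edges: (governor name, relation label, dependent name) triples
    """
    names = list(nodes)
    results = []
    # permutations() assigns a different token to each pattern node
    for combo in permutations(tokens, len(names)):
        binding = dict(zip(names, combo))
        features_ok = all(binding[name].get(key) == value
                          for name, feats in nodes.items()
                          for key, value in feats.items())
        edges_ok = all(binding[dep]["head"] == binding[gov]["id"]
                       and binding[dep]["deprel"] == label
                       for gov, label, dep in edges)
        if features_ok and edges_ok:
            results.append({name: tok["form"] for name, tok in binding.items()})
    return results


# Who loves whom? V is the verb, S its subject, O its object
nodes = {"V": {"upos": "VERB", "lemma": "love"}, "S": {}, "O": {}}
edges = [("V", "nsubj", "S"), ("V", "obj", "O")]
matches = match_pattern(tokens, nodes, edges)
print(matches)
```

Trying every assignment is only feasible for short sentences; it is used here because it keeps the correspondence between the pattern and the matching logic easy to follow.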

Session 10 - Tree queries - Grew

Download the slides
(PDF)

Speaker for this session

Silvie Cinková

Organisation