Sociolinguistics has always been an empirical field. The availability of large amounts of data has opened up new possibilities, but it has also raised (methodological) challenges. Recent advances in machine learning have produced promising approaches for gaining new insights and corroborating received wisdom.
In this talk, Hovy gives a brief introduction to a method called embeddings and shows several applications of it. Embeddings are a new way of representing words (a direct implementation of the distributional hypothesis by Firth) as points in a multi-dimensional vector space. This is not unlike arranging word magnets on a fridge. Each word’s position is determined by its contextual similarity to all other words, so that semantically and syntactically related words end up grouped together. The resulting vector representations of words have turned out to capture a variety of latent factors, from lexical semantics to syntax to socio-demographic aspects to societal attitudes.
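The distributional idea behind embeddings can be illustrated with a minimal sketch. It is not the method presented in the talk: it swaps in plain co-occurrence counts for learned dense vectors, and the toy corpus, the window size, and the function names `cooccurrence_vectors` and `cosine` are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus; real applications use millions of sentences.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "the senate passes the bill",
    "the senate debates the law",
]

vecs = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_senate = cosine(vecs["cat"], vecs["senate"])
print(f"cat vs dog:    {sim_cat_dog:.3f}")
print(f"cat vs senate: {sim_cat_senate:.3f}")
```

In this sketch, "cat" and "dog" come out as more similar to each other than "cat" and "senate" because they share more context words. The gap is small because the frequent word "the" appears next to nearly everything, which is one reason practical approaches reweight raw counts or learn dense vectors instead.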
The ease of use and the range of applications make embeddings a valuable tool for further research in (computational) sociolinguistics. Hovy shows how they capture regional variation at an intra- and interlingual level, how they distinguish varieties and linguistic resources, and how they allow for the assessment of changing societal norms and associations.
After viewing this lecture, learners should be able to:
- Understand how machine learning can contribute to the ongoing scholarship of sociolinguistics, particularly language variation and change.
- Summarise the method of ‘embedding’ as a new way of representing corpus data in relation to socio-demographic aspects and societal attitudes.
- Apply the embedding methodology as a means to capture and distinguish international and regional language variation at varying levels.