Corpus linguistics and the use of language technologies in lexicography

COURSE DESCRIPTION

Programme:

Comparative Studies of Ideas and Cultures (3rd level)

Module:

Course code: P057

Year of study: not assigned to a specific year

Course principal:

Assoc. Prof. Darja Fišer, Ph.D.

ECTS: 6

Workload: lectures 20 hours, seminar 10 hours, individual work 150 hours

Course type: general elective

Languages: Slovene

Learning and teaching methods: lectures, discussion classes

Course Syllabus

Prerequisites:

There are no specific prerequisites. However, prior knowledge of basic linguistic theories, general lexicology, and grammar is recommended.

Content (Syllabus outline)

1. Introduction

a. Humanities and computer science;

b. Formal and corpus linguistics;

c. Dictionaries and other digital manuals;

d. Language resources and technologies.

2. Corpus linguistics

a. Purpose, definition, historical development;

b. Types of corpora with representative examples;

c. Corpus labels;

d. The use of concordance software;

e. Regular expressions.

3. Language resources management

a. Standards and open source;

b. Character encoding sets;

c. XML standard;

d. TEI guidelines;

e. Linguistic annotations.

4. Creation and development of language resources

a. The setting up of a language corpus;

b. Licenses, copyright and privacy laws;

c. Corpus annotation;

d. Annotation environment;

e. Crowdsourcing;

f. Rule-based programs;

g. Machine learning.

Seminar classes

Seminar classes accompany the lectures and complement them by exploring individual chapters covered by the course syllabus. The main objective is to provide the student with the opportunity to conduct individual research on the selected topic. This also involves individual study of relevant reference works and independent use of language resources (especially language corpora). The results of the student’s own research are analysed as part of group work in seminars.

Cross-curricular integration

Basic computer knowledge and the ability to think logically are necessary for successful participation in the course. Within the module, the course is linked to the courses "Lexicology, lexicography, contemporary grammar" and "Historical lexicology, historical lexicography and historical grammar".

Readings

The following list contains basic reference works. A series of additional readings for individual lectures and/or seminars will be supplied subsequently.

Erjavec, Tomaž. 2013: Korpusi in konkordančniki na strežniku nl.ijs.si. Slovenščina 2.0, 1/1, pp. 24-49. http://www.trojina.org/slovenscina2.0/arhiv/2013/1/Slo2.0_2013_1_03.pdf.
Finlayson, Mark A., Erjavec, Tomaž. Overview of annotation creation: processes and tools. In: Ide, Nancy M. (ed.), Pustejovsky, James (ed.). Handbook of linguistic annotation. Amsterdam: Springer. 2017, pp. 167-192. https://arxiv.org/abs/1602.05753
Fišer, Darja, Ljubešić, Nikola, Erjavec, Tomaž. The Janes project: language resources and tools for Slovene user generated content. Language resources and evaluation. 2020, vol. 54, pp. 223-246. https://rdcu.be/7RX4
Fišer, Darja, Ljubešić, Nikola. Distributional modelling for semantic shift detection. International journal of lexicography, ISSN 0950-3846, June 2019, vol. 32, no. 2, pp. 163-183.
Gorjanc, Vojko, Fišer, Darja 2013: Korpusna analiza. 2nd, revised and expanded ed. Ljubljana: Znanstvena založba Filozofske fakultete.
Logar, Nataša et al. 2012: Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Sporazumevanje series. Ljubljana: Trojina, zavod za uporabno slovenistiko: Fakulteta za družbene vede. https://knjigarna.fdv.si/i_578_korpusi-slovenskega-jezika-gigafida-kres-ccgigafida-in-cckres-gradnja-vsebina-uporaba
CLARIN.SI research infrastructure: http://www.clarin.si/
XML standard: http://en.wikipedia.org/wiki/XML
TEI Guidelines: https://tei-c.org/

Objectives and competences

Computer-aided research on language material and its presentation is an integral part of contemporary lexicography and grammaticography. The course is therefore devoted to the digital humanities, with special emphasis on corpus linguistics and Slovenian language technologies. Over the past few years both research areas have made substantial progress, resulting in a wide range of available language corpora (e.g. the reference corpus Gigafida, the spoken language corpus GOS, and the two IMP corpora of historical Slovene), accompanied by a large number of (online) tools for linguistic annotation, such as lemmatizers and morphosyntactic and syntactic annotation tools. The main objective of the course is to equip students with sufficient knowledge to use the available corpora and other tools independently in contemporary linguistic research and in the development of new language technologies.
The course covers three thematic fields: corpus linguistics, language resource management and linguistic annotation. The first part presents the apparatus offered by contemporary concordance programs (concordances, frequency lists, keywords, and collocations), which requires a basic understanding of regular expressions, corpus annotations and specifications, and the functionalities of the available corpora and concordancers for Slovene.
A deeper insight into corpora, digital dictionary databases and formal descriptions of language models requires basic computer knowledge of both the encoding and the structuring of textual data. The most widely used standard for character encoding is Unicode, in which most of the world’s writing systems can be represented, while XML, as a metalanguage, serves the annotation of semi-structured data. XML makes it possible to define schemas (several standardised schema models are in use) that specify the vocabulary of annotations and the relations among them for individual types of documents.
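The concordancer functions named above (concordance and frequency list, queried with a regular expression) can be illustrated with a minimal Python sketch. The sample text, the function names `tokenize` and `kwic`, and the context window of two tokens are assumptions made for illustration only; they do not describe any particular concordancer used in the course.

```python
import re
from collections import Counter

# Tiny invented sample text (not taken from any real corpus).
TEXT = (
    "The corpus contains written texts. A spoken corpus contains "
    "transcripts. Corpora are annotated with lemmas."
)


def tokenize(text):
    """Split text into lowercase word tokens (the regex is Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, pattern, window=2):
    """Return (left context, keyword, right context) for each token fully matching the regex."""
    regex = re.compile(pattern)
    hits = []
    for i, tok in enumerate(tokens):
        if regex.fullmatch(tok):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


tokens = tokenize(TEXT)

# Frequency list: how often each word form occurs in the sample.
frequencies = Counter(tokens)

# Concordance: one regular expression covers both "corpus" and "corpora".
hits = kwic(tokens, r"corp(us|ora)")
for left, keyword, right in hits:
    print(f"{left:>20} [{keyword}] {right}")
```

The single pattern `corp(us|ora)` retrieves both the singular and the plural form, which is the kind of query a regular expression makes possible in a concordancer.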
Text encoding and linguistic annotation in the humanities generally follow the TEI Guidelines (Text Encoding Initiative Guidelines), which support the encoding of a highly diverse range of texts in any language and are used by the majority of Slovene language corpora. The course covers the basics of Unicode, XML, XML schemas and TEI, giving students a good foundation for the confident use of standards and guidelines. The last series of lectures is devoted to approaches and methods in the development of language resources, predominantly language corpora, involving text collection, data processing and manual annotation. Automated annotation is examined in more detail, with particular emphasis on machine learning, which in the last few years has proved to be the most successful method of linguistic annotation.
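As a rough illustration of how Unicode, XML and TEI-style annotation fit together, the following Python sketch reads a small annotated sentence with the standard library. The sentence, its lemmas and the part-of-speech values are invented for this example, and the fragment is only TEI-like: it uses the TEI namespace with `w` and `pc` elements carrying `lemma` and `pos` attributes, but it is not taken from any actual Slovene corpus.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample sentence; attribute values are illustrative assumptions.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="učenec" pos="NOUN">Učenci</w>
  <w lemma="brati" pos="VERB">berejo</w>
  <w lemma="knjiga" pos="NOUN">knjige</w>
  <pc>.</pc>
</s>"""

sentence = ET.fromstring(SAMPLE)

# Collect (word form, lemma, part of speech) for every word element.
annotations = [
    (w.text, w.get("lemma"), w.get("pos"))
    for w in sentence.findall("tei:w", TEI_NS)
]
for form, lemma, pos in annotations:
    print(form, lemma, pos, sep="\t")

# Unicode assigns each character a code point; UTF-8 stores it as bytes.
c_caron = "\u010d"  # LATIN SMALL LETTER C WITH CARON
utf8_bytes = c_caron.encode("utf-8")
print(f"U+{ord(c_caron):04X} -> {utf8_bytes.hex()}")
```

The punctuation element is skipped because only `w` elements are selected, which shows how annotation vocabulary lets a program distinguish words from other tokens.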

Intended learning outcomes

Detailed familiarity with the apparatus of corpus linguistics and the practical ability to use the resources;
Managing language resources;
Principles of manual and automated linguistic annotation;
Specialised knowledge in information technology.

Learning and teaching methods:

Types of learning/teaching:

Frontal teaching
Independent student work
e-learning

Teaching methods:

Explanation
Conversation/discussion/debate
Case studies

Assessment

Long written assignments (80 %),
Final examination (written/oral) (20 %).

Lecturer’s references

FIŠER, Darja, LJUBEŠIĆ, Nikola, ERJAVEC, Tomaž. The Janes project: language resources and tools for Slovene user generated content. Language resources and evaluation, ISSN 1574-020X, 2020, vol. 54, no. 1, pp. 223-246, illus.
FIŠER, Darja, LJUBEŠIĆ, Nikola. Distributional modelling for semantic shift detection. International journal of lexicography. June 2019, vol. 32, no. 2, pp. 163-183, illus., tables. ISSN 0950-3846. https://academic.oup.com/ijl/advance-article/doi/10.1093/ijl/ecy011/5051703.
GORJANC, Vojko, FIŠER, Darja. Twitter in razmerja moči: diskurzna analiza kampanj ob referendumu za izenačitev zakonskih zvez v Sloveniji. Slavistična revija: časopis za jezikoslovje in literarne vede. [Print ed.]. Oct.-Dec. 2018, vol. 66, no. 4, pp. 473-495, illus. ISSN 0350-6894. https://srl.si/ojs/srl/article/view/2018-4-1-5.
MILIČEVIĆ, Maja, LJUBEŠIĆ, Nikola, FIŠER, Darja. Nestandardno zapisivanje srpskog jezika na Tviteru: mnogo buke oko malo odstupanja?. Anali Filološkog fakulteta. 2017, vol. 29, no. 2, pp. 111-136, illus. ISSN 0522-8468. http://doi.fil.bg.ac.rs/pdf/journals/analiff/2017-2/analiff-2017-29-2-8.pdf.
