Corpus linguistics and the use of language technologies in lexicography
Comparative Studies of Ideas and Cultures (3rd level)
Module:
Course code: 57
Year of study: no specific year
Workload: lectures 60 hours, seminar 30 hours
Course type: general elective
Learning and teaching methods: lectures, discussion classes
Content (Syllabus outline)
a. Humanities and computer science;
b. Formal and corpus linguistics;
c. Dictionaries and other digital manuals;
d. Language resources and technologies.
2. Corpus linguistics
a. Purpose, definition, historical development;
b. Types of corpora with representative examples;
c. Corpus labels;
d. The use of concordance software;
e. Regular expressions.
3. Language resources management
a. Standards and Open source;
b. Character encoding sets;
c. XML standard;
d. TEI guidelines;
e. Linguistic annotations.
4. Creation and development of language resources
a. The setting up of a language corpus;
b. Licenses, copyright and privacy laws;
c. Corpus annotation;
d. Annotation environment;
f. Rule-based programmes;
g. Machine learning.
Seminar classes accompany the lectures and complement them by exploring individual chapters covered by the course syllabus. The main objective is to provide the student with the opportunity to conduct individual research on the selected topic. This also involves individual study of relevant reference works and independent use of language resources (especially language corpora). The results of the student’s own research are analysed as part of group work in seminars.
Basic computer knowledge and the ability to think logically are necessary for successful participation in the course. Within the module, the course is cross-referenced with “Lexicology, lexicography, contemporary grammar” and “Historical lexicology, historical lexicography and historical grammar”.
The following list contains basic reference works. A series of additional readings for individual lectures and/or seminars will be supplied subsequently.
- Erjavec, Tomaž. 2013: Korpusi in konkordančniki na strežniku nl.ijs.si. Slovenščina 2.0, 1/1, pp. 24-49. http://www.trojina.org/slovenscina2.0/arhiv/2013/1/Slo2.0_2013_1_03.pdf.
- Finlayson, Mark A., Erjavec, Tomaž. Overview of annotation creation: processes and tools. In: Ide, Nancy M. (ed.), Pustejovsky, James (ed.). Handbook of linguistic annotation. Amsterdam: Springer. 2017, pp. 167-192. https://arxiv.org/abs/1602.05753
- Fišer, Darja, Ljubešić, Nikola, Erjavec, Tomaž. The Janes project: language resources and tools for Slovene user generated content. Language resources and evaluation. 2020, vol. 54, pp. 223-246. https://rdcu.be/7RX4
- Fišer, Darja, Ljubešić, Nikola. Distributional modelling for semantic shift detection. International journal of lexicography, ISSN 0950-3846, June 2019, vol. 32, no. 2, pp. 163-183.
- Gorjanc, Vojko, Fišer, Darja. 2013: Korpusna analiza. 2nd, revised and expanded ed. Ljubljana: Znanstvena založba Filozofske fakultete.
- Logar, Nataša et al. 2012: Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Zbirka Sporazumevanje. Ljubljana: Trojina, zavod za uporabno slovenistiko: Fakulteta za družbene vede. https://knjigarna.fdv.si/i_578_korpusi-slovenskega-jezika-gigafida-kres-ccgigafida-in-cckres-gradnja-vsebina-uporaba
- CLARIN.SI research infrastructure: http://www.clarin.si/
- XML standard: http://en.wikipedia.org/wiki/XML
- TEI Guidelines: https://tei-c.org/
Objectives and competences
Computer-aided research on language material and its presentation are an integral part of contemporary lexicography and grammaticography. The course is therefore devoted to the digital humanities, with special emphasis on corpus linguistics and Slovenian language technologies. Over the past few years both research areas have made substantial progress, resulting in a wide range of available language corpora (e.g. the reference corpus Gigafida, the spoken-language corpus GOS and the two IMP corpora of historical Slovene), accompanied by a large number of (online) tools for linguistic annotation, such as lemmatizers and morphosyntactic and syntactic annotation tools. The main objective of the course is to equip students with sufficient knowledge to use the available corpora and other tools independently in contemporary linguistic research and in the development of new language technologies.
The course will cover three thematic fields: corpus linguistics, language resource management and linguistic annotation. The first part will present the apparatus offered by contemporary concordance programs (concordances, frequency lists, keywords and collocations). Using it requires a basic understanding of regular expressions, of corpus annotations and specifications, and of the functionalities of the available corpora and concordancers for Slovene. A deeper insight into corpora, digital dictionary databases and formal descriptions of language models requires basic computer knowledge of how textual data are encoded and structured. The most widely used standard for character encoding is Unicode, in which most of the world’s writing systems can be represented, while XML, as a metalanguage, serves for the annotation of semi-structured data. XML makes it possible to define schemas (of which several standardised models are in use) that specify the vocabulary and the mutual relations of annotations for individual types of documents.
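As an illustration of the concordancer apparatus described above, the following is a minimal Python sketch of a keyword-in-context (KWIC) concordance driven by a regular expression, together with a simple frequency list. The sample text, the function names `tokenize` and `kwic`, and the context width are assumptions made for this example only; real concordancers work on annotated corpora and offer far richer queries.

```python
import re
from collections import Counter

# Toy text invented for this example; a real corpus would be far larger.
TEXT = (
    "The corpus is large. A reference corpus is balanced. "
    "Historical corpora are smaller than the reference corpus."
)

def tokenize(text):
    """Split text into lowercase word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())

def kwic(tokens, pattern, width=3):
    """Return (left context, keyword, right context) for every token
    that matches the regular expression in full."""
    regex = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if regex.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = tokenize(TEXT)

# One regular expression covers both the singular and the plural form.
hits = kwic(tokens, r"corp(us|ora)")
for left, keyword, right in hits:
    print(f"{left:>30} [{keyword}] {right}")

# Frequency list: how often each word form occurs in the toy text.
freq = Counter(tokens)
print(freq.most_common(3))
```

The alternation `corp(us|ora)` shows why regular expressions matter for corpus queries: a single pattern retrieves several word forms that a plain string search would miss.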
Text encoding and linguistic annotation in the humanities generally follow the TEI Guidelines (Text Encoding Initiative Guidelines), which support the encoding of a highly diverse range of texts in any language and are used by the majority of Slovene language corpora. The course will cover the basics of Unicode, XML, XML schemas and TEI, giving students a good foundation for the confident future use of standards and guidelines. The last series of lectures will be devoted to various approaches and methods in the development of language resources, predominantly language corpora, involving text collection, data processing and manual annotation. Automated annotation will be examined in more detail, with particular emphasis on machine learning, which in the last few years has proved to be the most successful method of linguistic annotation.
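To make the interplay of Unicode, XML and TEI mentioned above more concrete, here is a minimal Python sketch that reads word-level annotations from a small TEI-style fragment using only the standard library. The fragment is invented for illustration and is not a complete, valid TEI document (it has no header); the sentence and the `pos` values are assumptions, and actual Slovene corpora use their own tagsets and richer markup.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: one sentence (<s>) with annotated words (<w>)
# and a punctuation character (<pc>).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="računalnik" pos="NOUN">Računalnik</w>
      <w lemma="brati" pos="VERB">bere</w>
      <w lemma="besedilo" pos="NOUN">besedilo</w>
      <pc>.</pc>
    </s>
  </p></body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Collect (word form, lemma, part of speech) for every <w> element.
rows = [
    (w.text, w.get("lemma"), w.get("pos"))
    for w in root.iterfind(".//tei:w", TEI_NS)
]
for form, lemma, pos in rows:
    print(f"{form}\t{lemma}\t{pos}")

# Unicode: the letter "č" is one code point, stored as two bytes in UTF-8.
letter = rows[0][0][2]
print(letter, f"U+{ord(letter):04X}", unicodedata.name(letter))
print(len(letter.encode("utf-8")), "bytes in UTF-8")
```

The XML markup keeps the running text and its annotations together, while Unicode ensures that characters such as "č" are represented unambiguously regardless of platform.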
Intended learning outcomes
- Detailed familiarity with the apparatus of corpus linguistics and the practical ability to use the resources;
- Managing language resources;
- Principles of manual and automated linguistic annotation;
- Specialised knowledge in information technology.
Assessment: long written assignments (80 %), final examination (written/oral) (20 %).