LVK2018

The Balanced Corpus of Modern Latvian

LVK2018 is a representative 10-million-word corpus of contemporary Latvian. It is an extended version of LVK2013, based on slightly modified versions of the design criteria applied to the previous corpora in the LVK series. LVK2018 is designed as a general-language, representative, and balanced corpus that aims to cover the variety of existing texts in estimated proportions. The corpus consists of five sections: journalism (60%), fiction (20%), scientific texts (10%), legal texts (8%), and parliamentary transcripts (2%).
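As a rough illustration of what these proportions imply, the sketch below converts the stated percentages into approximate word counts per section. It assumes a nominal total of exactly 10,000,000 words and that the percentages apply to word counts; the actual section sizes in LVK2018 may differ, and the names `TOTAL_WORDS`, `SECTION_PERCENT`, and `section_targets` are hypothetical, introduced only for this example.

```python
# Nominal corpus size in words. The description says "10 million";
# the exact figure is an assumption made for this illustration.
TOTAL_WORDS = 10_000_000

# Section proportions in percent, as stated in the corpus description.
SECTION_PERCENT = {
    "journalism": 60,
    "fiction": 20,
    "scientific": 10,
    "legal": 8,
    "parliamentary transcripts": 2,
}


def section_targets(total_words, percents):
    """Return the approximate number of words per section."""
    if sum(percents.values()) != 100:
        raise ValueError("section percentages must sum to 100")
    # Integer arithmetic keeps the results exact for whole percentages.
    return {name: total_words * pct // 100 for name, pct in percents.items()}


targets = section_targets(TOTAL_WORDS, SECTION_PERCENT)
for name, words in targets.items():
    print(f"{name}: {words:,} words")
```

Under these assumptions, journalism would account for about 6 million words and parliamentary transcripts for about 200,000.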

Publication to be cited:
K. Levane-Petrova
Līdzsvarotais mūsdienu latviešu valodas tekstu korpuss, tā nozīme gramatikas pētījumos (The Balanced Corpus of Modern Latvian, its role in grammar studies)
Language: Meaning and Form, 10, 131-146, 2019
Corpus size: 10M words (12M tokens)
Development period: 2016–2018
Developers: Institute of Mathematics and Computer Science, University of Latvia (UL)
Funding: The European Regional Development Fund (1.1.1.1/16/A/219); Latvian Language Agency
CLARIN: http://hdl.handle.net/20.500.12574/11

Other publications:
R. Dargis, K. Levane-Petrova, I. Poikans
Lessons Learned from Creating a Balanced Corpus from Online Data
IOS Press, 2020