LVK2018

The Balanced Corpus of Modern Latvian

LVK2018 is a representative 10-million-word corpus of contemporary Latvian. It is an extended version of LVK2013 and follows slightly modified versions of the design criteria applied to the earlier corpora in the LVK series. LVK2018 is designed as a balanced, representative general-language corpus that aims to cover the variety of existing texts in estimated proportions. It consists of five sections: journalism (60%), fiction (20%), scientific texts (10%), legal texts (8%), and parliamentary transcripts (2%).
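The section proportions can be turned into rough word counts with simple arithmetic. The sketch below is only an illustration: it assumes that the percentages apply to word counts and that the total is the nominal 10 million words, so the actual section sizes in the released corpus may differ. The names `TOTAL_WORDS`, `SECTION_SHARES`, and `approximate_section_sizes` are hypothetical and not part of any corpus tooling.

```python
# Nominal corpus size in words, taken from the description (an approximation).
TOTAL_WORDS = 10_000_000

# Section shares in percent, as listed in the corpus description.
SECTION_SHARES = {
    "journalism": 60,
    "fiction": 20,
    "scientific": 10,
    "legal": 8,
    "parliamentary transcripts": 2,
}


def approximate_section_sizes(total_words, shares):
    """Return an approximate word count per section using integer arithmetic."""
    return {name: total_words * pct // 100 for name, pct in shares.items()}


# Assumption: percentages refer to words, not tokens or documents.
sizes = approximate_section_sizes(TOTAL_WORDS, SECTION_SHARES)

for name, words in sizes.items():
    print(f"{name}: about {words:,} words")
```

Under these assumptions, the smallest section (parliamentary transcripts) would still hold roughly 200,000 words.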

Citation
Publication
K. Levāne-Petrova
Līdzsvarotais mūsdienu latviešu valodas tekstu korpuss, tā nozīme gramatikas pētījumos (The Balanced Corpus of Modern Latvian, its role in grammar studies)
Language: Meaning and Form, 10, 131-146, 2019
Data
K. Levāne-Petrova, R. Darģis
The Balanced Corpus of Modern Latvian (LVK2018)
CLARIN-LV digital library, 2018
http://hdl.handle.net/20.500.12574/11
Corpus size: 10M words (12M tokens)
Development period: 2016–2018
Developers: Institute of Mathematics and Computer Science, University of Latvia
Funding: The European Regional Development Fund (1.1.1.1/16/A/219); Latvian Language Agency
CLARIN: http://hdl.handle.net/20.500.12574/11
Other publications
R. Darģis, K. Levāne-Petrova, I. Poikāns
Lessons Learned from Creating a Balanced Corpus from Online Data
IOS Press, 2020