LRK2013

Latvian Speech Recognition Corpus

The corpus consists of two parts: an orthographically annotated corpus containing 100 hours of transcribed audio data and a phonetically annotated corpus containing approximately 4 hours of phonetically transcribed audio data. Metadata files in XML format describe the speakers (age, gender, education, Latvian language proficiency, etc.) as well as noise levels and speech styles. The corpus is mainly used for developing speech recognition software and is not publicly available.

Publication to be cited:
M. Pinnis, I. Auzina, K. Goba
Designing the Latvian speech recognition corpus
2014
Corpus size: 100 hours (1.1M tokens)
Development period: 2013
Developers: Institute of Mathematics and Computer Science UL, Tilde, LETA
Funding: European Regional Development Fund (KC/2.1.2.1.1/10/01/001, project No. 2.9)
Homepage: http://runa.korpuss.lv/
Other publications:
I. Auzina, M. Pinnis, R. Dargis
Comparison of rule-based and statistical methods for grapheme to phoneme modelling
IOS Press, 2014
A. Znotins, K. Polis, R. Dargis
Media monitoring system for Latvian radio and TV broadcasts
2015