
LATE-media

The corpus comprises audio recordings of media broadcasts together with their orthographic transcripts. The transcripts follow the orthography of Standard Latvian, including its punctuation conventions.

Corpus size: 50 hours (433 000 tokens)
Data period: 2015–2020
Development period: 2021–...
Developers: Institute of Mathematics and Computer Science, University of Latvia
Funding: State Research Programme "Letonika – Fostering a Latvian and European Society" (VPP-LETONIKA-2021/1-0006)