BolsuTolka: Speech Corpus (Common Voice 17.0)

The speech corpus consists of Latgalian sentences read aloud by speakers of different Latgalian dialects. The data was collected through the Mozilla Common Voice platform. Part-of-speech tagging and lemmatization in this Latgalian corpus were done manually.

Corpus size: 24 hours (130k tokens)
Data period: 2023–2024
Development period: 2024
Developers: Rezekne Academy of Technologies; Institute of Mathematics and Computer Science UL; Institute of Literature, Folklore and Art UL; Latvian Open Technologies Association
Funding: EU Recovery and Resilience Facility "Language Technology Initiative"; State Research Programme "Digital Humanities" (VPP-IZM-DH-2022/1-0002)