Tīmeklis2007

Latvian Web Corpus 2007

The Latvian Web Corpus 2007 contains 700,000 Latvian webpages published before 2005. The corpus is automatically annotated.

Publication to be cited:
J. Dzerins and K. Dzonsons. Harvesting national language text corpora from the Web. 2007.
Corpus size: 99M words (123M tokens)
Development period: 2006–2007
Developers: Institute of Mathematics and Computer Science UL
Funding: Research and Development of the Semantic Web Technologies for Latvia (SemTi-Kamols)
CLARIN: http://hdl.handle.net/20.500.12574/46