Data source: CROSBI

hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene (CROSBI ID 45032)

Book chapter | original scientific paper

Ljubešić, Nikola ; Erjavec, Tomaž. hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene // Text, Speech and Dialogue, Lecture Notes in Computer Science / Ivan Habernal and Vaclav Matousek (eds.). Berlin, Heidelberg: Springer, 2011. pp. 395-402

Statement of responsibility

Ljubešić, Nikola ; Erjavec, Tomaž

English

hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene

Web corpora have become an attractive source of linguistic content, yet are for many languages still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built using a modified standard “Web as Corpus” pipeline having in mind the limited amount of available web data. The modifications are described in the paper, focusing on the content extraction from HTML pages, which combines high precision of extracted language content with a decent recall. The paper also investigates text-types of the acquired corpora using topic modeling, comparing the two corpora among themselves and with ukWaC.

web corpus, Croatian, Slovene, topic modeling

Chapter details

pp. 395-402

published

Book details

Text, Speech and Dialogue, Lecture Notes in Computer Science

Ivan Habernal and Vaclav Matousek

Berlin, Heidelberg: Springer

2011.

978-3-642-23537-5

Related field

Information and communication sciences