Bilingual Lexicon Extraction from Comparable Corpora for Closely Related Languages

Fišer, Darja; Ljubešić, Nikola

Data source: CROSBI

Bilingual Lexicon Extraction from Comparable Corpora for Closely Related Languages (CROSBI ID 581489)

Conference paper in proceedings | original scientific paper | international peer review

Fišer, Darja ; Ljubešić, Nikola. Bilingual Lexicon Extraction from Comparable Corpora for Closely Related Languages // Proceedings of the International Conference Recent Advances in Natural Language Processing 2011. Hisarya: RANLP 2011 Organising Committee, 2011. pp. 125-131

Responsibility details

Authors

Fišer, Darja ; Ljubešić, Nikola

Basic information in the original language
Basic information in other languages

Language

English

Title

Bilingual Lexicon Extraction from Comparable Corpora for Closely Related Languages

Abstract

In this paper we present a knowledge-light approach to extract a bilingual lexicon for closely related languages from comparable corpora. While in most related work an existing dictionary is used to translate context vectors, we take advantage of the similarities between languages instead and build a seed lexicon from words that are identical in both languages and then further extend it with context-based cognates and translations of the most frequent words. We also use cognates for reranking translation candidates obtained via context similarity and extract translation equivalents for all content words, not just nouns as in most related work. The results are very encouraging, suggesting that other similar languages could benefit from the same approach. By enlarging the seed lexicon with cognates and translations of the most frequent words and by cognate-based reranking of translation candidates we were able to improve the average baseline precision from 0.592 to 0.797 on the mean reciprocal rank for the ten top-ranking translation candidates for nouns, verbs and adjectives with a 46% recall on the gold standard of 1000 random entries from a traditional dictionary.
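
The steps named in the abstract (a seed lexicon of identical words, cognate-based reranking of context-similarity candidates, and evaluation by mean reciprocal rank over the top ten candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of `difflib.SequenceMatcher` as a stand-in cognate measure, the linear `weight` mixing, and the toy words and scores are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def seed_lexicon(source_vocab, target_vocab):
    """Seed entries: word forms that are identical in both vocabularies."""
    return {w: w for w in set(source_vocab) & set(target_vocab)}


def cognate_score(a, b):
    """String similarity in [0, 1]; an assumed stand-in for a cognate measure."""
    return SequenceMatcher(None, a, b).ratio()


def rerank(source_word, candidates, weight=0.5):
    """Re-rank (target_word, context_similarity) pairs.

    The final score mixes context similarity with cognate similarity;
    the linear mix and the default weight are assumptions.
    """
    scored = [
        (target, (1 - weight) * ctx + weight * cognate_score(source_word, target))
        for target, ctx in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [target for target, _ in scored]


def mean_reciprocal_rank(ranked_lists, gold, top_n=10):
    """Average of 1/rank of the first correct candidate within the top_n."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for word, ranking in ranked_lists.items():
        for rank, candidate in enumerate(ranking[:top_n], start=1):
            if candidate in gold.get(word, ()):
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Toy data, invented for illustration only.
seed = seed_lexicon(["voda", "mleko"], ["voda", "mlijeko"])
reranked = rerank("mleko", [("voda", 0.60), ("mlijeko", 0.55)])
print(seed)      # {'voda': 'voda'}
print(reranked)  # ['mlijeko', 'voda']
```

In the toy call, context similarity alone would rank `voda` first; mixing in string similarity promotes the cognate `mlijeko`, which is the effect the abstract attributes to cognate-based reranking.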

Keywords

comparable corpora; lexicon extraction; closely related languages

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Contribution details

Pages

125-131.

Year of publication

2011.

Publication status

published

Host publication details

Title

Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Publisher

Hisarya: RANLP 2011 Organising Committee

Conference details

Conference

Recent Advances in Natural Language Processing 2011

Type of participation

lecture

Conference dates

12.09.2011-14.09.2011

Conference venue

Hisar, Bulgaria

Related records

Related persons

Nikola Ljubešić (CroRIS ID: 4119; MBZ: 272820) (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Related projects

Hrvatska rječnička baština i hrvatski europski identitet (result of work on the project)

Field

Information and communication sciences

Links

aclweb.org