Standardizing Tweets with Character-level Machine Translation

Ljubešić, Nikola; Erjavec, Tomaž; Fišer, Darja

Data source: CROSBI

Standardizing Tweets with Character-level Machine Translation (CROSBI ID 55173)

Book chapter | original scientific paper

Ljubešić, Nikola ; Erjavec, Tomaž ; Fišer, Darja Standardizing Tweets with Character-level Machine Translation // Computational Linguistics and Intelligent Text Processing / Gelbukh, Alexander (ed.). Berlin: Springer, 2014. pp. 164-175

Responsibility details

Authors

Ljubešić, Nikola ; Erjavec, Tomaž ; Fišer, Darja

Basic data in the original language
Basic data in other languages

Language

English

Title

Standardizing Tweets with Character-level Machine Translation

Abstract

This paper presents the results of the standardization procedure of Slovene tweets that are full of colloquial, dialectal and foreign-language elements. With the aim of minimizing the human input required, we produced a manually normalized lexicon of the most salient out-of-vocabulary (OOV) tokens and used it to train a character-level statistical machine translation system (CSMT). Best results were obtained by combining the manually constructed lexicon and CSMT as fallback, with an overall improvement of 9.9% on all tokens and 31.3% on OOV tokens. Manual preparation of data in a lexicon manner has proven to be more efficient than normalizing running text for the task at hand. Finally, we performed an extrinsic evaluation where we automatically lemmatized the test corpus taking as input either original or automatically standardized wordforms, and achieved 75.1% per-token accuracy with the former and 83.6% with the latter, thus demonstrating that standardization has significant benefits for downstream processing.
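
The combination described in the abstract (look a token up in the manually built lexicon, otherwise fall back to a second component) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy lexicon entries and the `collapse_repeats` fallback are hypothetical. In particular, `collapse_repeats` is only a placeholder standing where a trained CSMT system would be called, and the sketch sends every token missing from the lexicon to the fallback, which may differ from how the paper selects tokens. The accuracy helper is a generic per-token comparison, not the paper's lemmatization evaluation.

```python
import re
from typing import Callable, Dict, List


def collapse_repeats(token: str) -> str:
    """Toy placeholder for the CSMT fallback (NOT a translation model):
    squeeze any character repeated three or more times down to one."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def standardize(
    tokens: List[str],
    lexicon: Dict[str, str],
    fallback: Callable[[str], str] = collapse_repeats,
) -> List[str]:
    """Lexicon lookup first; tokens missing from the lexicon go to the fallback."""
    standardized = []
    for token in tokens:
        key = token.lower()
        if key in lexicon:
            standardized.append(lexicon[key])
        else:
            standardized.append(fallback(token))
    return standardized


def per_token_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of positions where the predicted form equals the gold form."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy data, not taken from the paper.
tokens = ["jst", "tud", "grem", "jaaaa"]
toy_lexicon = {"jst": "jaz", "tud": "tudi"}
gold = ["jaz", "tudi", "grem", "ja"]

standardized = standardize(tokens, toy_lexicon)
print(standardized)
print(per_token_accuracy(tokens, gold), per_token_accuracy(standardized, gold))
```

Passing the fallback as a parameter keeps the lexicon step independent of whatever model handles the remaining tokens, so a real character-level system could replace the placeholder without changing `standardize`.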

Keywords

twitterese, standardization, character-level machine translation

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Chapter details

Pages

164-175

Publication status

published

Book details

Book in which the chapter was published

Computational Linguistics and Intelligent Text Processing

Editors

Gelbukh, Alexander

Publisher

Berlin: Springer

Year of publication

2014

ISBN

978-3-642-54903-8

Related entities

Related persons

Nikola Ljubešić (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Field

Information and communication sciences

Links

link.springer.com