N-gram overlap in automatic detection of document derivation

Bosanac, Siniša; Štefanec, Vanja

Data source: CROSBI

N-gram overlap in automatic detection of document derivation (CROSBI ID 43784)

Book chapter | original scientific paper

Bosanac, Siniša ; Štefanec, Vanja. N-gram overlap in automatic detection of document derivation // INFuture2011: The Future of Information Sciences - Information Sciences and e-Society / Billenness, Clive ; Hemera, Annette ; Mateljan, Vladimir et al. (eds.). Zagreb: Odsjek za informacijske i komunikacijske znanosti Filozofskog fakulteta Sveučilišta u Zagrebu, 2011. pp. 373-382

Responsibility details

Authors

Bosanac, Siniša ; Štefanec, Vanja

Basic data in the original language
Basic data in other languages

Language

English

Title

N-gram overlap in automatic detection of document derivation

Abstract

Establishing the authenticity and independence of a document in relation to other documents is not a new problem, but it has gained importance in the era of hyperproduction of e-text. There is an increased need for automatic methods of determining the originality of documents in a digital environment. N-gram overlap is one of several methods proposed in the literature and is used in a variety of systems for automatic identification of text reuse. Although the method itself is simple, determining the n-gram length that serves as a good indicator of text reuse is a more complex issue. We assume that the optimal n-gram length is not the same for all languages but depends on properties of the particular language, such as morphological typology and syntactic features. The aim of this study is to find the optimal n-gram length for determining document derivation in the Croatian language. Potential areas of application of the results include automatic detection of plagiarism in academic and student papers, citation analysis, information flow tracking, and event detection in online texts.
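To make the idea of n-gram overlap concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes word-level n-grams, a simple regex tokenizer, and a containment score (the share of a candidate document's distinct n-grams that also occur in a source document); the function names `word_ngrams` and `containment` and the two sample sentences are hypothetical and chosen only for illustration.

```python
import re


def word_ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, Unicode-aware tokens)."""
    # Assumption: tokens are runs of word characters; the record does not
    # specify how the study tokenized its texts.
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(candidate, source, n):
    """Share of the candidate's distinct n-grams that also occur in the source."""
    cand = word_ngrams(candidate, n)
    if not cand:
        # Candidate shorter than n words: no n-grams to compare.
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


# Hypothetical sample texts: the candidate reuses most of the source wording.
source = "the quick brown fox jumps over the lazy dog"
candidate = "a quick brown fox jumps over a sleeping dog"

# Compare the score across several n-gram lengths.
for n in range(1, 5):
    print(n, round(containment(candidate, source, n), 2))
```

For this pair of sentences the score falls as n grows, which illustrates why the choice of n-gram length matters: short n-grams also match on incidental shared vocabulary, while long n-grams miss reuse that has been lightly edited.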

Keywords

document derivation, text reuse, n-gram overlap, automatic plagiarism detection, string metrics

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Chapter details

Pages

373-382.

Publication status

published

Book details

Book in which the chapter was published

INFuture2011: The Future of Information Sciences - Information Sciences and e-Society

Editors

Billenness, Clive ; Hemera, Annette ; Mateljan, Vladimir ; Banek Zorica, Mihaela ; Stančić, Hrvoje ; Seljan, Sanja

Publisher

Zagreb: Odsjek za informacijske i komunikacijske znanosti Filozofskog fakulteta Sveučilišta u Zagrebu

Year of publication

2011.

ISBN

978-953-175-408-8

Related records

Related persons

Vanja Štefanec (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Field

Information and communication sciences, Philology

Links

infoz.ffzg.hr