Citation-Based Plagiarism Detection
Wu Yishan
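Issue 8 of JASIST for 2014 published an article by three scholars at the University of California, Berkeley: Citation-Based Plagiarism Detection: Practicality on a Large-Scale Scientific Corpus. The first author, Bela Gipp (who specializes in information retrieval and semantic analysis), and the second author, Norman Meuschke (who specializes in plagiarism detection), are both in the Department of Statistics. The third author, Ms. Corinna Breitinger (who specializes in plagiarism detection and user interface design), is with the SciPlore research group, which the first author also heads.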
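Traditional plagiarism detection relies on character-level similarity, so it misses disguised plagiarism: rewording the original with new vocabulary and sentence structures, passing off a translated article as one's own, appropriating someone else's ideas, and so on. Our own journal, the Journal of the China Society for Scientific and Technical Information (《情报学报》), once received a submission that presented a translation as the author's own work. Fortunately, one of our senior editors spotted it with a sharp eye and tracked down the plagiarized English original, which left no room for doubt.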
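To make the limitation concrete, here is a minimal sketch of a character-based comparison, assuming a Jaccard overlap of character trigrams as the similarity measure. The function names and the choice of trigrams are assumptions made for illustration; the post does not say which character-based methods the article benchmarks.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Share of character n-grams that two texts have in common (0.0 to 1.0).

    Hypothetical helper for illustration only, not a method named in the article.
    """
    grams_a = char_ngrams(text_a, n)
    grams_b = char_ngrams(text_b, n)
    if not grams_a and not grams_b:
        return 0.0  # nothing to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    original = "Traditional detectors compare strings of characters."
    near_copy = "Traditional detectors compare strings of characters!"
    paraphrase = "Older tools look for matching runs of letters."

    # A near-verbatim copy scores high; a paraphrase shares few trigrams.
    print(jaccard_similarity(original, near_copy))
    print(jaccard_similarity(original, paraphrase))
    # A translation into another script shares no trigrams at all.
    print(jaccard_similarity("剽窃检测", "plagiarism detection"))
```

A measure of this kind rewards copied strings only, which is why a paraphrase or a translation can pass through it unnoticed.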
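The citation-based plagiarism detection method (CbPD) that the three authors present can detect semantic similarity even when the wording is dissimilar. The article uses a collection of 185,000 biomedical papers and detects plagiarized text with varying degrees of disguise within it. A comparison against two traditional detection methods shows that, for heavily disguised plagiarism, CbPD ranks the suspicious documents in closer agreement with the actual situation. CbPD is also computationally more efficient.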
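The idea of comparing where citations fall, instead of comparing words, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it represents each document as the ordered list of sources it cites and measures overlap with a longest common subsequence. The function names and the reference keys are hypothetical.

```python
def longest_common_citation_sequence(cites_a, cites_b):
    """Length of the longest common subsequence of two ordered citation lists."""
    rows, cols = len(cites_a), len(cites_b)
    # table[i][j] holds the LCS length of cites_a[:i] and cites_b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if cites_a[i - 1] == cites_b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def citation_order_similarity(cites_a, cites_b):
    """LCS length normalized by the shorter citation list (0.0 to 1.0).

    The normalization is an assumption of this sketch.
    """
    shorter = min(len(cites_a), len(cites_b))
    if shorter == 0:
        return 0.0
    return longest_common_citation_sequence(cites_a, cites_b) / shorter


if __name__ == "__main__":
    # Hypothetical reference keys, in the order they are cited in each text.
    doc_a = ["smith2001", "lee2005", "chen2010", "kumar2012"]
    doc_b = ["smith2001", "garcia1999", "lee2005", "kumar2012"]
    doc_c = ["kumar2012", "chen2010", "lee2005", "smith2001"]

    print(citation_order_similarity(doc_a, doc_b))  # 0.75: three shared sources in the same order
    print(citation_order_similarity(doc_a, doc_c))  # 0.25: same sources, reversed order
```

Because only reference identifiers are compared, the score does not change when the surrounding text is paraphrased or translated, which is the property the post attributes to CbPD.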
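Text that an algorithm flags as suspicious must be verified by human judgment before any conclusion is drawn. The study finds that combining CbPD with traditional detection methods makes the subsequent manual verification less laborious.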
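The article includes an example of detecting translated text. Regrettably, the example is a text translated from English into Chinese.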
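Readers interested in the method can, besides reading the paper, visit the authors' website dedicated to it: http://www.sciplore.org/projects/citation-based-plagiarism-detection. An example of suspected plagiarism in a Chinese text is visible right on the site's home page.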
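Their team, the SciPlore research group, was founded in 2008. Besides plagiarism detection, they have five other projects, including co-citation proximity analysis and document metadata extraction.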
The abstract of the article follows:
The automated detection of plagiarism is an information retrieval task of increasing importance as the volume of readily accessible information on the web expands. A major shortcoming of current automated plagiarism detection approaches is their dependence on high character-based similarity. As a result, heavily disguised plagiarism forms, such as paraphrases, translated plagiarism, or structural and idea plagiarism, remain undetected. A recently proposed language-independent approach to plagiarism detection, Citation-based Plagiarism Detection (CbPD), allows the detection of semantic similarity even in the absence of text overlap by analyzing the citation placement in a document's full text to determine similarity. This article evaluates the performance of CbPD in detecting plagiarism with various degrees of disguise in a collection of 185,000 biomedical articles. We benchmark CbPD against two character-based detection approaches using a ground truth approximated in a user study. Our evaluation shows that the citation-based approach achieves superior ranking performance for heavily disguised plagiarism forms. Additionally, we demonstrate CbPD to be computationally more efficient than character-based approaches. Finally, upon combining the citation-based with the traditional character-based document similarity visualization methods in a hybrid detection prototype, we observe a reduction in the required user effort for document verification.