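After years of accumulated research on Ancient Chinese, the Chinese information processing research team at Nanjing Normal University, led by Professor Xiaohe Chen (陈小荷), has released an Ancient Chinese corpus: the word-segmented, part-of-speech-tagged corpus of Zuozhuan (《左传》). The corpus contains about 180,000 characters, uses a self-designed set of 17 part-of-speech tags, and has gone through four rounds of proofreading. A paper reporting lexical analysis experiments on it was published as early as 2010 (石民, 李斌, 陈小荷. 基于CRF的先秦汉语分词标注一体化研究. 中文信息学报, 2010年第2期), and many colleagues have since emailed us to request the data. We have therefore formally released the corpus through the Linguistic Data Consortium (LDC) for use by the research community. The price is $50, which is very low; we make no profit from it, and the fee only covers the platform's operating costs.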
The Ancient Chinese Corpus (ACC) V1.0 contains the word-segmented, POS-tagged text of Zuozhuan (an ancient Chinese classical history book). It has about 180,000 Chinese characters and about 195,000 segmentation units (including words and punctuation marks). It is divided into two parts: training data (166,138 words) and test data (28,131 words). The POS tag set has 17 tags. The details of the tag set are shown in table 1.
The Ancient Chinese Corpus project began at Nanjing Normal University in 2009. The project's goal is to provide a large, part-of-speech-tagged Ancient Chinese corpus. This first delivery, ACC 1.0, contains only one book, Zuozhuan. We will continue to release more data.
There are two text files in this release, containing 268 paragraphs and 10,560 lines. Each line is one sentence or one speaker's utterance. Paragraphs are separated by one empty line. Each word is tagged with its part of speech, and tokens are separated by spaces.
Example: 夏/n 四月/t ,/w 費伯/nr 帥/v 師/n 城/v 郎/ns 。/w
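The line format shown in the example can be read with a few lines of Python. This is only a minimal sketch, assuming every token has the form word/tag and that the tag follows the last slash; `parse_line` is a hypothetical helper name and is not part of the release.

```python
def parse_line(line):
    """Split one corpus line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last "/", so a word containing "/" stays intact
        word, sep, tag = token.rpartition("/")
        if not sep:
            # Assumption: a token without a slash is kept with a missing tag
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# The example line from this post
sample = "夏/n 四月/t ,/w 費伯/nr 帥/v 師/n 城/v 郎/ns 。/w"
pairs = parse_line(sample)
words = [word for word, _ in pairs]
tags = [tag for _, tag in pairs]
```

Here `words` holds the segmented text and `tags` the matching part-of-speech labels, which is the usual input shape for training or evaluating a tagger.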
We designed the POS tag set, which has 17 tags, shown in table 1. Users may refer to the following paper or Chinese book for further information.
The data is provided in UTF-8 encoding. All files were automatically verified and manually checked.
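Since the files are UTF-8 text with paragraphs separated by one empty line, they could be loaded roughly as follows. This is a sketch under stated assumptions: `read_paragraphs` is a hypothetical helper, the file name in the comment is a placeholder, and the demo text is the example line from this post split up purely to illustrate the layout.

```python
import io


def read_paragraphs(lines):
    """Group an iterable of lines into paragraphs, using empty lines as separators."""
    paragraphs, current = [], []
    for raw in lines:
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            # An empty line closes the current paragraph
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


# Placeholder text standing in for a corpus file. With a real file, use e.g.:
#   with open("train.txt", encoding="utf-8") as f:  # hypothetical file name
#       paragraphs = read_paragraphs(f)
demo = io.StringIO("夏/n 四月/t ,/w\n費伯/nr 帥/v 師/n\n\n城/v 郎/ns 。/w\n")
paragraphs = read_paragraphs(demo)
```

Passing the encoding explicitly avoids depending on the platform default, which is not UTF-8 on every system.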
• Xiaohe Chen, Minxuan Feng, Runhua Xu, et al. Information Processing of Pre-Qin Chinese. World Publishing Corporation, Beijing, 2013. (陈小荷, 冯敏萱, 徐润华, 等. 先秦文献信息处理. 世界图书出版公司, 2013)
• Bin Li, Minxuan Feng, Xiaohe Chen. Corpus Based Lexical Statistics of Pre-Qin Chinese. Lecture Notes in Computer Science, Volume 7717, 2013, pp. 145-153.
Please view the following sample file:
example.txt
This work was supported in part by the Ministry of Education of China (16YJC740034) and the National Social Science Foundation of China (15ZDB127).
We will continue to release more annotated data of Ancient Chinese.