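Corpus-Based Translation Studies + Cognitive Space Sharing http://blog.sciencenet.cn/u/carldy Exploring new approaches to translation studies and reflecting on research into language and cognition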


The ZJU Corpus of Translational Chinese (ZCTC)

Posted 2012-4-16 09:13 | Personal category: corpus-based studies | System category: Research notes | Tags: Chinese, translated, ZCTC

Enclosed here is the introduction to the ZCTC corpus, which can shed new light on translation studies and on the linguistic features of translated Chinese.
The information comes from the following website by Richard Xiao (Prof. Xiao Zhonghua):
http://www.lancs.ac.uk/fass/projects/corpus/ZCTC/

The ZJU Corpus of Translational Chinese (ZCTC)

Since the 1990s, the rapid development of the corpus-based approach in linguistic investigation in general, and the development of multilingual corpora in particular, have brought even more vigour into descriptive translation studies. As Laviosa (1998) observes, "the corpus-based approach is evolving, through theoretical elaboration and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation." Presently, corpus-based Descriptive Translation Studies (DTS) has primarily been concerned with describing translation as a product, by comparing corpora of translated and non-translational native texts in the target language, especially translated and native English. The majority of product-oriented translation studies attempt to uncover evidence to support or reject the so-called "translation universal" (TU) hypotheses that are concerned with features of translational language as the "third code" of translation  (Frawley 1984), which is supposed to be different from both source and target languages.

At present, a large proportion of product-oriented translation studies are based on the Translational English Corpus (TEC), which was designed specifically for studying English translated from a range of source languages. This is perhaps the only publicly available corpus of translational language. Most of the pioneering and prominent studies of translational English have been based on this corpus, and they have so far focused on syntactic and lexical features of translated and original English texts. Such studies have provided evidence to support the hypothesised "translation universals" (TUs) in translated English, e.g. simplification, explicitation, sanitisation, and normalisation.

However, the term "translation universal" is highly debatable in the literature. Research on the features of translational language has so far been confined largely to English and closely related European languages, and the translation universals proposed so far have been identified on the basis of translational English, mostly translated from European languages. It is therefore possible that such linguistic features are not "universal" cross-linguistically but rather specific to English and/or the genetically related languages that have been investigated. Clearly, if the reported features of translational language are to be generalised as translation "universals", the language pairs involved must not be restricted to English and closely related languages. Evidence from "genetically" distinct language pairs such as English and Chinese is undoubtedly more convincing.

The ZJU Corpus of Translational Chinese (ZCTC) is created exactly with this aim in mind. It is designed as a translational counterpart of the Lancaster Corpus of Mandarin Chinese (LCMC), a one-million-word balanced corpus representing native Mandarin Chinese. The LCMC and ZCTC corpora have been built by following comparable sampling criteria and the same sampling techniques, and they have been processed using the same tools to ensure maximum comparability.

The ZCTC corpus was created as part of our ongoing project (07.2007-02.2010), A Corpus-Based Quantitative Study of Translational Chinese in English-Chinese Translation, which is funded by the China National Foundation of Social Sciences (grant reference 07BYY011).


Richard Xiao

2008

1. Corpus design

The ZJU Corpus of Translational Chinese (ZCTC) was created with the explicit aim of studying the features of translated Chinese in relation to non-translated native Chinese. It is modelled on the Lancaster Corpus of Mandarin Chinese (LCMC), a one-million-word balanced corpus designed to represent written Mandarin Chinese.

Both the LCMC and ZCTC corpora sample five hundred 2,000-word text chunks from fifteen written text categories published in China, so that each corpus amounts to about one million words. The text categories covered in the two corpora, together with their respective proportions, are given below:

Genre label   Genre                                          Number of samples   Proportion
A             Press: Reportage                               44                  8.8%
B             Press: Editorial                               27                  5.4%
C             Press: Review                                  17                  3.4%
D             Religious writing                              17                  3.4%
E             Skill / trade / hobby                          38                  7.6%
F             Popular lore                                   44                  8.8%
G             Biography and essay                            77                  15.4%
H             Miscellaneous (report and official document)   30                  6.0%
J             Science (academic prose)                       80                  16.0%
K             General fiction                                29                  5.8%
L             Mystery and detective fiction                  24                  4.8%
M             Science fiction                                6                   1.2%
N             Adventure fiction                              29                  5.8%
P             Romantic fiction                               29                  5.8%
R             Humour                                         9                   1.8%
Total                                                        500                 100.0%
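The arithmetic behind this sampling frame can be sketched in a few lines of Python. This is only an illustration: the names SAMPLES, SAMPLE_SIZE, proportion and nominal_words are hypothetical, and the figures are copied from the table above rather than read from the corpus files.

```python
# Number of 2,000-word samples per genre, copied from the design table.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}
SAMPLE_SIZE = 2000  # nominal words per sample

TOTAL_SAMPLES = sum(SAMPLES.values())


def proportion(genre):
    """Share of the corpus (in percent) taken up by one genre."""
    return 100 * SAMPLES[genre] / TOTAL_SAMPLES


def nominal_words(genre):
    """Nominal word count of a genre if every sample had exactly 2,000 words."""
    return SAMPLES[genre] * SAMPLE_SIZE


NOMINAL_CORPUS_SIZE = sum(nominal_words(g) for g in SAMPLES)
```

With these figures the fifteen genres add up to 500 samples and a nominal corpus size of one million words, which is the design target stated above.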

Since the LCMC corpus was designed as a Chinese match for the FLOB / Frown corpora of British / American English, with the specific aim of comparing and contrasting English and Chinese, it followed the sampling period of FLOB / Frown and sampled written Mandarin Chinese within three years around 1991. While it was relatively easy to find texts of native Chinese published in this sampling period, it would be much more difficult to get access to translated Chinese texts of some categories published in this time frame, especially in electronic format. This pragmatic consideration of data collection forced us to modify the LCMC model slightly when we built the ZJU Corpus of Translational Chinese, extending the sampling period by a decade, i.e. to 2001. This extension has been particularly useful because the popularisation of the Internet and online publication in the 1990s made it much easier to access a large amount of digitised text. Readers should bear this modification in mind when interpreting results based on a comparison of the LCMC and ZCTC corpora. Those who are interested in potential change in Mandarin Chinese during this decade are advised to use the UCLA Written Chinese Corpus, which models LCMC but samples texts one decade apart.

While English is the source language of the vast majority of the text samples included in the ZCTC corpus, we have also included a small number of texts translated from other languages to mirror the reality of the world of translations in China.

As Chinese is written as running strings of characters without white spaces delimiting words, the number of tokens in a text can only be known once the text has been tokenised (see corpus annotation). The text chunks were therefore collected at the initial stage using our best estimate, based on previous experience, of the ratio between the number of words and the number of characters (1:1.67). Only textual data was included, with graphs and tables in the original texts replaced by placeholders. A text chunk included in the corpus can be a sample from a large text (e.g. an article or a book chapter) or an assembly of several small texts (e.g. for the press categories). When parts of large texts were selected, an attempt was made to achieve a balance between initial, medial and final samples. Once the texts had been tokenised, a computer program was used to cut large texts to approximately 2,000 tokens while keeping the final sentence complete. As a result, while some text samples may be slightly longer than others, they are typically around 2,000 words. The table below compares the actual numbers of tokens in different genres as well as their corresponding percentages in the ZCTC and LCMC corpora.*
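The two steps described above, estimating how many characters to collect and then cutting a tokenised text near 2,000 tokens without breaking the final sentence, might look roughly like the following Python sketch. It is not the program used to build the corpus: the function names are hypothetical, the 1:1.67 estimate is read here as about 1.67 characters per word, and the set of sentence-final marks is an assumption based on the punctuation listed for the ew tag.

```python
CHARS_PER_WORD = 1.67  # assumed reading of the 1:1.67 estimate
SENTENCE_FINAL = ("。", "；", "？", "！")  # assumed sentence-final marks


def estimate_characters(target_words=2000):
    """Rough number of characters to collect for a sample of target_words."""
    return round(target_words * CHARS_PER_WORD)


def cut_sample(tokens, target=2000):
    """Return the shortest prefix with at least `target` tokens that
    ends on a sentence-final mark, so the last sentence stays complete."""
    if len(tokens) <= target:
        return list(tokens)
    for i in range(target - 1, len(tokens)):
        if tokens[i] in SENTENCE_FINAL:
            return list(tokens[: i + 1])
    # No sentence end after the target position: keep the whole text.
    return list(tokens)


# Toy example with a target of 4 tokens instead of 2,000.
toy = ["我", "来", "。", "他", "也", "来", "。", "好"]
toy_cut = cut_sample(toy, target=4)
```

Because the cut is only made at a sentence boundary, a sample produced this way can run a little past the target, which is consistent with samples being slightly longer than 2,000 words.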

Genre label   ZCTC tokens   Percentage   LCMC tokens   Percentage
A             88186         8.63         89201         8.74
B             54171         5.30         54432         5.33
C             34100         3.34         34354         3.36
D             35139         3.44         35199         3.45
E             76681         7.51         77484         7.59
F             89675         8.78         89823         8.80
G             155601        15.23        156433        15.32
H             60352         5.91         60983         5.97
J             168736        16.52        162856        15.95
K             60540         5.93         60183         5.89
L             48924         4.79         49244         4.82
M             12267         1.20         12367         1.21
N             59042         5.78         60197         5.90
P             59033         5.78         59665         5.84
R             19072         1.87         18643         1.83
Total         1021449       100.00       1021064       100.00

*Note: The number of tokens given here for the Lancaster Corpus of Mandarin Chinese (LCMC) may be different from earlier releases, because this edition of LCMC has been retagged using ICTCLAS2008, which was used to tag the ZCTC corpus.
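As a rough check on the statement that samples are typically around 2,000 words, the token totals can be divided by the number of samples in each genre. The Python sketch below does this for three ZCTC genres only; the dictionary and function names are hypothetical, and the figures are copied from the two tables in this section.

```python
# (tokens in ZCTC, number of samples) for three genres, copied from the tables.
ZCTC_GENRES = {
    "A": (88186, 44),
    "J": (168736, 80),
    "M": (12267, 6),
}


def mean_sample_length(genre):
    """Average number of tokens per sample in one genre, to one decimal."""
    tokens, samples = ZCTC_GENRES[genre]
    return round(tokens / samples, 1)


MEAN_LENGTHS = {genre: mean_sample_length(genre) for genre in ZCTC_GENRES}
```

For these three genres the averages come out at roughly 2,004, 2,109 and 2,045 tokens per sample, so academic prose (J) runs somewhat longer than the nominal sample size.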

2. Corpus annotation

The ZCTC corpus is annotated using ICTCLAS2008, the latest release of the Chinese Lexical Analysis System developed by the Institute of Computing Technology, Chinese Academy of Sciences. This annotation tool, which relies on a large lexicon and a Hierarchical Hidden Markov Model, integrates word tokenisation, named entity identification, unknown word recognition, and part-of-speech tagging. ICTCLAS2008 has been reported to achieve a precision rate of 98.54% for word tokenisation. The latest open tests have also given encouraging results, with a precision rate of 98.13% for tokenisation and 94.63% for part-of-speech tagging. The application programming interface (API) of ICTCLAS2008 is publicly available at www.ictclas.org, while a compiled program is available at www.corpus4u.org.

In order to ensure maximum comparability, a new release of the LCMC corpus (version 2.0) has been produced, retagged using the same tool. The part-of-speech tagset applied to the ZCTC and the new release of LCMC is as follows.

a         adjective

ad       adverbial use of adjective

ag       adjectival morpheme

an        nominal use of adjective

al         adjectival formulaic expression

b          modifier (non-predicate noun modifier)

bg        noun modifier morpheme

bl         noun modifying formulaic expression

c          conjunction

cc        coordinating conjunction

d         adverb

dg       adverbial morpheme

dl        adverbial formulaic expression

e         interjection

ew      sentence-final punctuation (full stop, semi-colon, question mark, exclamation mark)

f         space word

h        prefix

k        suffix

m       numeral and quantifier

mg     numeral and quantifier morpheme

mq     numeral-classifier

n        noun

ng      nominal morpheme

nl       nominal formulaic expression

nr       person name

nr1     Chinese surname

nr2     Chinese first name

nrf      transliterated foreign person name

nrj      Japanese name

ns      place name

nsf     transliterated foreign place name

nt       organisation name

nz      other proper noun

o        onomatopoeia

p        preposition

pba    preposition ba

pbei   preposition bei

q        classifier

qt       temporal classifier

qv       verbal classifier

r         pronoun

rg       pronominal morpheme

rr       personal pronoun

ry       interrogative pronoun

rys     place interrogative pronoun

ryt      temporal interrogative pronoun

ryv      verbal interrogative pronoun

rz       deictic pronoun

rzs      place pronoun

rzt       temporal pronoun

rzv      verbal pronoun

s         place word

t         time word

tg       time word morpheme

u        auxiliary

ude1   的

ude2   地

ude3   得

udeng 等

udh      的话

uguo   过

ule      了

ulian   连

uls      来说、来讲、而言、说来

usuo   所

uyy     一样、一般、似的、般

uzhe   着

uzhi    之

v        verb

vd      adverbial use of verb

vf       directional verb

vg      verbal morpheme

vi       intransitive verb

vl       verbal formulaic expression

vn      nominal use of verb

vshi   是

vx      pro-verb

vyou  有

w       symbols and punctuations

wb     percentage and permille signs: % and ‰ of full length; % of half length

wd     full or half-length comma: ,,

wj      full stop of full length: 。

wky   closing brackets: ) 〕  ] } 》  】 〗 〉of full length;  ) ] } > of half length

wkz   opening brackets: ( 〔  [  {  《 【  〖 〈 of full length; ( [ { < of half length

wn     full-length enumeration mark: 、

wp    dash: ——  --  —— -  of full length; ---  ---- of half length

ws    full-length ellipsis: ……  …

wt     full or half-length exclamation mark: !of full length; ! of half length

wyy  full-length single or double closing quote: ” ’ 』

wyz  full-length single or double opening quote: “ ‘ 『

x      non-word character string

y      particle

z      descriptive word
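For frequency comparisons it is often convenient to collapse the fine-grained tags above into their top-level categories. In the tagset as listed, each subcategory tag begins with the letter of its parent category (for example nr1 under n, ude1 under u, wkz under w), so a minimal Python sketch can use the first letter. The function name is hypothetical, and grouping ew with the punctuation category w is an assumption made here, not something the tagset description states.

```python
# Assumption: "ew" (sentence-final punctuation) is counted as punctuation "w".
OVERRIDES = {"ew": "w"}


def coarse_tag(tag):
    """Map a fine-grained tag such as 'nr1' to its top-level category 'n'."""
    tag = tag.strip()
    if not tag:
        raise ValueError("empty part-of-speech tag")
    return OVERRIDES.get(tag, tag[0])


EXAMPLES = {tag: coarse_tag(tag) for tag in ["nr1", "ude1", "pba", "wkz", "vshi", "ew"]}
```

Whether to merge categories in this way depends on the research question; studies of individual particles such as 的 or 了 would keep the fine-grained tags.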

3. Corpus markup

The ZCTC corpus is marked up in Extensible Markup Language (XML), which is compliant with the Corpus Encoding Standard (CES). Each of the 500 data files has two parts: a corpus header and a body.

The cesHeader gives general information about the corpus (publicationStmt) as well as specific attributes of the text sample (fileDesc). Details in the publicationStmt element include the name of the corpus in English and Chinese, authors, distributor, availability, publication date, and history. The fileDesc element shows the original title(s) of the text(s) from which the sample was taken, the individuals responsible for sampling and text processing, the project that created the corpus file, date of creation, language usage, writing system, character encoding, and mode of channel.

The body part of the corpus file contains the textual data, which is marked up for structural organisation such as paragraphs (p) and sentences (s). Sentences are consecutively numbered for easy reference. Part-of-speech annotation is also given in XML, with the POS attribute of the w element indicating its part-of-speech category (see corpus annotation for the tagset).
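A body marked up in this way can be read with a standard XML parser. The Python sketch below uses the standard library on a small made-up fragment, not on a real corpus file: the p, s and w elements and the POS attribute follow the description above, while the root element name text, the sentence-number attribute n, and the sample sentences are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment; the <text> root and the n attribute are assumed.
SAMPLE = """<text>
  <p>
    <s n="1"><w POS="rr">我</w><w POS="v">喜欢</w><w POS="n">翻译</w><w POS="wj">。</w></s>
    <s n="2"><w POS="rr">他</w><w POS="vshi">是</w><w POS="n">译者</w><w POS="wj">。</w></s>
  </p>
</text>"""


def read_sentences(xml_text):
    """Return one list of (word, tag) pairs per s element."""
    root = ET.fromstring(xml_text)
    return [
        [(w.text, w.get("POS")) for w in s.iter("w")]
        for s in root.iter("s")
    ]


def tag_counts(sentences):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for sentence in sentences for _, tag in sentence)


sentences = read_sentences(SAMPLE)
counts = tag_counts(sentences)
```

Tag counts of this kind, normalised by corpus size, are the usual starting point for comparing translated and native texts.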

The XML markup of the ZCTC corpus is well-formed and has been validated using Altova XMLSpy 2008. The XML elements of the corpus are defined in the Document Type Definition.

The ZCTC corpus is encoded in Unicode, applying the Unicode Transformation Format 8-Bit (UTF-8), which is a lossless encoding for Chinese while keeping the XML files at a minimum size.
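The size argument can be illustrated with a single tagged word. In the Python sketch below, the markup characters are ASCII and take one byte each in UTF-8, while each Chinese character takes three bytes. UTF-16 is used only as a point of comparison and is not mentioned in the corpus description; the sample string is invented.

```python
# One invented tagged word: 15 ASCII markup characters plus 3 Chinese characters.
tagged = '<w POS="n">语料库</w>'

utf8_bytes = tagged.encode("utf-8")
utf16_bytes = tagged.encode("utf-16-le")  # little-endian, without a byte order mark

utf8_size = len(utf8_bytes)    # 15 * 1 + 3 * 3 bytes
utf16_size = len(utf16_bytes)  # 18 * 2 bytes

# Decoding returns the original string, i.e. nothing is lost in the round trip.
round_trip_ok = utf8_bytes.decode("utf-8") == tagged
```

Because every word in the corpus is wrapped in ASCII markup, the markup tends to outweigh the Chinese text itself, which is why UTF-8 keeps heavily annotated files compact.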

4. Data sources

The tables linked in this section list the sources of the text samples included in the ZCTC corpus. Each sample has approximately 2,000 words. Samples composed of short texts can have multiple sources. The following bibliographic details are given where such information is available and has been recorded in data collection: sample ID, title, author/source, translator, publisher/journal, year/volume, sample position, and URL.

A) Press reportage (44 text samples)

B) Press editorial (27 text samples)

C) Press review (17 text samples)

D) Religious writing (17 text samples)

E) Skill / trade / hobby (38 text samples)

F) Popular lore (44 text samples)

G) Biography and essay (77 text samples)

H) Miscellaneous - reports and official documents (30 text samples)

J) Science - academic prose (80 text samples)

K) General fiction (29 text samples)

L) Mystery and detective fiction (24 text samples)

M) Science fiction (6 text samples)

N) Adventure fiction (29 text samples)

P) Romantic fiction (29 text samples)

R) Humour (9 text samples)







https://blog.sciencenet.cn/blog-331736-559775.html
