Since the 1990s, the rapid development of the corpus-based approach in linguistic investigation in general, and the development of multilingual corpora in particular, have brought even more vigour into descriptive translation studies. As Laviosa (1998) observes, "the corpus-based approach is evolving, through theoretical elaboration and empirical realisation, into a coherent, composite and rich paradigm that addresses a variety of issues pertaining to theory, description, and the practice of translation." Presently, corpus-based Descriptive Translation Studies (DTS) has primarily been concerned with describing translation as a product, by comparing corpora of translated and non-translational native texts in the target language, especially translated and native English. The majority of product-oriented translation studies attempt to uncover evidence to support or reject the so-called "translation universal" (TU) hypotheses that are concerned with features of translational language as the "third code" of translation (Frawley 1984), which is supposed to be different from both source and target languages.
At present, a large part of product-oriented translation studies has been based on the Translational English Corpus (TEC), which was designed specifically for the purpose of studying English translated from a range of source languages. This is perhaps the only publicly available corpus of translational language. Most of the pioneering and prominent studies of translational English have been based on this corpus, and they have so far focused on syntactic and lexical features of translated and original English texts. Such studies have provided evidence to support the hypotheses of "translational universals" (TUs) in translated English, e.g. simplification, explicitation, sanitisation, and normalisation.
However, the term "translational universal" is highly debatable in the literature. Research into the features of translational language has so far been confined largely to English and closely related European languages, and the translational universals proposed so far have been identified on the basis of translational English (mostly translated from European languages). It is therefore possible that such linguistic features are not "universal" cross-linguistically but rather specific to English and/or the genetically related languages that have been investigated. Clearly, if the features of translational language that have been reported are to be generalised as translation "universals", the language pairs involved must not be restricted to English and closely related languages. Evidence from "genetically" distinct language pairs such as English and Chinese is undoubtedly more convincing.
The ZJU Corpus of Translational Chinese (ZCTC) is created exactly with this aim in mind. It is designed as a translational counterpart of the Lancaster Corpus of Mandarin Chinese (LCMC), a one-million-word balanced corpus representing native Mandarin Chinese. The LCMC and ZCTC corpora have been built by following comparable sampling criteria and the same sampling techniques, and they have been processed using the same tools to ensure maximum comparability.
The ZCTC corpus has been created as part of our ongoing project (07.2007-02.2010), A Corpus-Based Quantitative Study of Translational Chinese in English-Chinese Translation, which is funded by the China National Foundation of Social Sciences (grant reference 07BYY011).
Richard Xiao
2008
1. Corpus design
The ZJU Corpus of Translational Chinese (ZCTC) is created with the explicit aim of studying the features of translated Chinese in relation to non-translated native Chinese. It is modelled on the Lancaster Corpus of Mandarin Chinese (LCMC), a one-million-word balanced corpus which was designed to represent written Mandarin Chinese.
Both the LCMC and ZCTC corpora have sampled five hundred 2,000-word text chunks from fifteen written text categories published in China, with each corpus amounting to one million words. The text categories covered in the two corpora, together with their respective proportions, are given below:
Genre label | Genre | Number of samples | Proportion |
A | Press: Reportage | 44 | 8.8% |
B | Press: Editorial | 27 | 5.4% |
C | Press: Review | 17 | 3.4% |
D | Religious writing | 17 | 3.4% |
E | Skill / trade / hobby | 38 | 7.6% |
F | Popular lore | 44 | 8.8% |
G | Biography and essay | 77 | 15.4% |
H | Miscellaneous (report and official document) | 30 | 6.0% |
J | Science (academic prose) | 80 | 16.0% |
K | General fiction | 29 | 5.8% |
L | Mystery and detective fiction | 24 | 4.8% |
M | Science fiction | 6 | 1.2% |
N | Adventure fiction | 29 | 5.8% |
P | Romantic fiction | 29 | 5.8% |
R | Humour | 9 | 1.8% |
Total | | 500 | 100.0% |
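The proportions in the table follow directly from the sample counts. The short Python sketch below recomputes them as a consistency check; the variable and function names are illustrative only and are not part of the corpus distribution.

```python
# Sample counts per genre, copied from the table above.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

# Nominal length of one text chunk, as stated in the text.
WORDS_PER_SAMPLE = 2000


def proportions(samples):
    """Return each genre's share of all samples as a percentage (one decimal)."""
    total = sum(samples.values())
    return {genre: round(100 * n / total, 1) for genre, n in samples.items()}


total_samples = sum(SAMPLES.values())
nominal_size = total_samples * WORDS_PER_SAMPLE
shares = proportions(SAMPLES)

for genre, share in shares.items():
    print(f"{genre}: {SAMPLES[genre]} samples, {share}%")
print(f"Total: {total_samples} samples, about {nominal_size} words")
```

The counts add up to 500 samples, which at a nominal 2,000 words each gives the one-million-word design size.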
Since the LCMC corpus was designed as a Chinese match for the FLOB / Frown corpora of British / American English, with the specific aim of comparing and contrasting English and Chinese, it followed the sampling period of FLOB / Frown and sampled written Mandarin Chinese within three years around 1991. While it was relatively easy to find texts of native Chinese published in this sampling period, it would have been much more difficult to get access to translated Chinese texts of some categories, especially in electronic format, published in this time frame. This pragmatic consideration of data collection forced us to modify the LCMC model slightly when we built the ZJU Corpus of Translational Chinese by extending the sampling period by a decade, i.e. to 2001. This extension has been particularly useful because the popularisation of the Internet and online publication in the 1990s has made it possible and easier to access a large amount of digitised text. Readers should bear this modification in mind when they interpret results based on a comparison of the LCMC and ZCTC corpora. Those who are interested in potential change in Mandarin Chinese during this decade are advised to use the UCLA Written Chinese Corpus, which models LCMC but samples texts one decade apart.
While English is the source language of the vast majority of the text samples included in the ZCTC corpus, we have also included a small number of texts translated from other languages to mirror the reality of the world of translations in China.
As Chinese is written as running strings of characters without white spaces delimiting words, the number of tokens in a text can only be known once the text has been tokenised (see corpus annotation). The text chunks were therefore collected at the initial stage using our best estimate, based on previous experience, of the ratio between the number of words and the number of characters (1:1.67). Only textual data was included, with graphs and tables in the original texts replaced by placeholders. A text chunk included in the corpus can be a sample from a large text (an article, a book chapter, etc.) or an assembly of several small texts (e.g. for the press categories). Where parts of large texts were selected, an attempt was made to achieve a balance between initial, medial and final samples. Once the texts had been tokenised, a computer program was used to cut large texts to approximately 2,000 tokens while keeping the final sentence complete. As a result, while some text samples may be slightly longer than others, they are typically around 2,000 words. The table below compares the actual numbers of tokens in different genres as well as their corresponding percentages in the ZCTC and LCMC corpora.*
Genre label | ZCTC | Percentage | LCMC | Percentage |
A | 88186 | 8.63 | 89201 | 8.74 |
B | 54171 | 5.30 | 54432 | 5.33 |
C | 34100 | 3.34 | 34354 | 3.36 |
D | 35139 | 3.44 | 35199 | 3.45 |
E | 76681 | 7.51 | 77484 | 7.59 |
F | 89675 | 8.78 | 89823 | 8.80 |
G | 155601 | 15.23 | 156433 | 15.32 |
H | 60352 | 5.91 | 60983 | 5.97 |
J | 168736 | 16.52 | 162856 | 15.95 |
K | 60540 | 5.93 | 60183 | 5.89 |
L | 48924 | 4.79 | 49244 | 4.82 |
M | 12267 | 1.20 | 12367 | 1.21 |
N | 59042 | 5.78 | 60197 | 5.90 |
P | 59033 | 5.78 | 59665 | 5.84 |
R | 19072 | 1.87 | 18643 | 1.83 |
Total | 1021449 | 100.00 | 1021064 | 100.00 |
*Note: The number of tokens given here for the Lancaster Corpus of Mandarin Chinese (LCMC) may be different from earlier releases, because this edition of LCMC has been retagged using ICTCLAS2008, which was used to tag the ZCTC corpus.
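The two sampling steps described in this section (estimating chunk size in characters before tokenisation, then cutting to about 2,000 tokens without breaking the final sentence) can be sketched in Python as follows. This is not the program actually used to build the corpus: the function names are hypothetical, the set of sentence-final marks is an assumption based on the description of the ew tag in the tagset, and punctuation marks are assumed to count as tokens.

```python
# Assumed sentence-final marks (full-width full stop, semi-colon,
# question mark, exclamation mark); not taken from the corpus tools.
SENTENCE_END = {"。", ";", "?", "!"}

CHARS_PER_WORD = 1.67   # estimate quoted in the text (1 word : 1.67 characters)
TARGET_TOKENS = 2000    # nominal sample size


def estimated_characters(n_words, chars_per_word=CHARS_PER_WORD):
    """Estimate how many characters to collect for a given number of words."""
    return round(n_words * chars_per_word)


def cut_sample(tokens, target=TARGET_TOKENS, sentence_end=SENTENCE_END):
    """Keep about `target` tokens, extending the cut to the end of the sentence."""
    if len(tokens) <= target:
        return list(tokens)
    end = target
    # Move the cut point forward until the last kept token closes a sentence
    # (or the text runs out).
    while end < len(tokens) and tokens[end - 1] not in sentence_end:
        end += 1
    return list(tokens[:end])


# Toy example with a tiny target so the behaviour is easy to follow.
toy = ["我", "喜欢", "书", "。", "他", "也", "喜欢", "。", "谢谢"]
print(estimated_characters(2000))
print(cut_sample(toy, target=5))
```

Because the cut is only ever moved forward, samples produced this way are never shorter than the target when enough text is available, which is consistent with some samples being slightly longer than 2,000 words.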
2. Corpus annotation
The ZCTC corpus is annotated using ICTCLAS2008, the latest release of the Chinese Lexical Analysis System developed by the Institute of Computing Technology, the Chinese Academy of Sciences. This annotation tool, which relies on a large lexicon and the Hierarchical Hidden Markov Model, integrates word tokenisation, named entity identification, unknown word recognition, and part-of-speech tagging. ICTCLAS2008 has been reported to achieve a precision rate of 98.54% for word tokenisation. The latest open tests have also given encouraging results, with a precision rate of 98.13% for tokenisation and 94.63% for part-of-speech tagging. The application programming interface (API) of ICTCLAS2008 is publicly available at www.ictclas.org while a compiled program is available at www.corpus4u.org.
In order to ensure maximum comparability, a new release of the LCMC corpus (version 2.0) has been produced, retagged using this same tool. The part-of-speech tagset applied to the ZCTC and the new release of LCMC is as follows.
a adjective
ad adverbial use of adjective
ag adjectival morpheme
an nominal use of adjective
al adjectival formulaic expression
b modifier (non-predicate noun modifier)
bg noun modifier morpheme
bl noun modifying formulaic expression
c conjunction
cc coordinating conjunction
d adverb
dg adverbial morpheme
dl adverbial formulaic expression
e interjection
ew sentence-final punctuation (full stop, semi-colon, question mark, exclamation mark)
f space word
h prefix
k suffix
m numeral and quantifier
mg numeral and quantifier morpheme
mq numeral-classifier
n noun
ng nominal morpheme
nl nominal formulaic expression
nr person name
nr1 Chinese surname
nr2 Chinese first name
nrf transliterated foreign person name
nrj Japanese name
ns place name
nsf transliterated foreign place name
nt organisation name
nz other proper noun
o onomatopoeia
p preposition
pba preposition ba 把
pbei preposition bei 被
q classifier
qt temporal classifier
qv verbal classifier
r pronoun
rg pronominal morpheme
rr personal pronoun
ry interrogative pronoun
rys place interrogative pronoun
ryt temporal interrogative pronoun
ryv verbal interrogative pronoun
rz deictic pronoun
rzs place pronoun
rzt temporal pronoun
rzv verbal pronoun
s place word
t time word
tg time word morpheme
u auxiliary
ude1 的
ude2 地
ude3 得
udeng 等
udh 的话
uguo 过
ule 了
ulian 连
uls 来说、来讲、而言、说来
usuo 所
uyy 一样、一般、似的、般
uzhe 着
uzhi 之
v verb
vd adverbial use of verb
vf directional verb
vg verbal morpheme
vi intransitive verb
vl verbal formulaic expression
vn nominal use of verb
vshi 是
vx pro-verb
vyou 有
w symbols and punctuation
wb percentage and permille signs: % and ‰ of full length; % of half length
wd full or half-length comma: ,,
wj full stop of full length: 。
wky closing brackets: ) 〕 ] } 》 】 〗 〉of full length; ) ] } > of half length
wkz opening brackets: ( 〔 [ { 《 【 〖 〈 of full length; ( [ { < of half length
wn full-length enumeration mark: 、
wp dash: —— -- —— - of full length; --- ---- of half length
ws full-length ellipsis: …… …
wt full or half-length exclamation mark: !of full length; ! of half length
wyy full-length single or double closing quote: ” ’ 』
wyz full-length single or double opening quote: “ ‘ 『
x non-word character string
y particle
z descriptive word
3. Corpus markup
The ZCTC corpus is marked up in Extensible Markup Language (XML) which is compliant with the Corpus Encoding Standard (CES). Each of the 500 data files has two parts: a corpus header and a body.
The cesHeader gives general information about the corpus (publicationStmt) as well as specific attributes of the text sample (fileDesc). Details in the publicationStmt element include the name of the corpus in English and Chinese, authors, distributor, availability, publication date, and history. The fileDesc element shows the original title(s) of the text(s) from which the sample was taken, individuals responsible for sampling and text processing, the project that created the corpus file, date of creation, language usage, writing system, character encoding, and mode of channel.
The body part of the corpus file contains the textual data, which is marked up for structural organisation such as paragraphs (p) and sentences (s). Sentences are consecutively numbered for easy reference. Part-of-speech annotation is also given in XML, with the POS attribute of the w element indicating its part-of-speech category (see corpus annotation for the tagset).
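As a rough illustration of how this markup can be read, the Python sketch below parses a small hand-made fragment and counts part-of-speech tags. Only the p, s and w elements and the POS attribute come from the description above; the wrapping text element, the n attribute used for the sentence number, and the example sentences with their tags are assumptions made for the sake of the example and may differ from the actual corpus files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made fragment, NOT copied from the corpus. The <text> wrapper and the
# "n" attribute on <s> are hypothetical; p, s, w and POS follow the manual.
SAMPLE = """<text>
  <p>
    <s n="1"><w POS="rr">他</w><w POS="pbei">被</w><w POS="v">批评</w><w POS="ule">了</w><w POS="wj">。</w></s>
    <s n="2"><w POS="rr">我</w><w POS="pba">把</w><w POS="n">书</w><w POS="v">卖</w><w POS="ule">了</w><w POS="wj">。</w></s>
  </p>
</text>"""


def tagged_sentences(xml_string):
    """Yield (sentence number, [(word, POS), ...]) for every s element."""
    root = ET.fromstring(xml_string)
    for s in root.iter("s"):
        yield s.get("n"), [(w.text, w.get("POS")) for w in s.iter("w")]


sentences = list(tagged_sentences(SAMPLE))
# Frequency of each part-of-speech tag across all sentences.
tag_counts = Counter(pos for _, words in sentences for _, pos in words)

for number, words in sentences:
    print(number, " ".join(f"{word}/{pos}" for word, pos in words))
print(tag_counts.most_common())
```

Tag frequencies obtained in this way (for example for pba and pbei) can then be compared between the ZCTC and LCMC files, since both corpora share the same tagset.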
The XML markup of the ZCTC corpus is well-formed and has been validated using Altova XMLSpy 2008. The XML elements of the corpus are defined in the Document Type Definition.
The ZCTC corpus is encoded in Unicode, applying the Unicode Transformation Format 8-Bit (UTF-8), which is a lossless encoding for Chinese while keeping the XML files at a minimum size.
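One plausible way to see the size argument is that the XML tags themselves are ASCII, which UTF-8 stores in one byte per character, whereas each Chinese character takes three bytes. The Python sketch below compares UTF-8 with UTF-16 on a single made-up annotated word; the example line is hypothetical and the comparison is only indicative, since real files vary in their ratio of markup to text.

```python
# A made-up annotated word in the style of the corpus markup.
LINE = '<w POS="n">语料库</w>'

utf8_bytes = LINE.encode("utf-8")
utf16_bytes = LINE.encode("utf-16-le")  # little-endian, no byte order mark

utf8_size = len(utf8_bytes)
utf16_size = len(utf16_bytes)

# Decoding returns exactly the original string, i.e. the encoding is lossless.
round_trip_ok = utf8_bytes.decode("utf-8") == LINE

print(f"characters: {len(LINE)}")
print(f"UTF-8 bytes: {utf8_size}")
print(f"UTF-16 bytes: {utf16_size}")
print(f"lossless round trip: {round_trip_ok}")
```

For heavily annotated text, where markup characters outnumber Chinese characters, UTF-8 comes out smaller in this comparison.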
4. Data sources
The tables linked in this section list the sources of the text samples included in the ZCTC corpus. Each sample has approximately 2,000 words. Samples composed of short texts can have multiple sources. The following bibliographic details are given where such information is available and has been recorded in data collection: sample ID, title, author/source, translator, publisher/journal, year/volume, sample position, and URL.
A) Press reportage (44 text samples)
B) Press editorial (27 text samples)
C) Press review (17 text samples)
D) Religious writing (17 text samples)
E) Skill / trade / hobby (38 text samples)
F) Popular lore (44 text samples)
G) Biography and essay (77 text samples)
H) Miscellaneous - reports and official documents (30 text samples)
J) Science - academic prose (80 text samples)
K) General fiction (29 text samples)
L) Mystery and detective fiction (24 text samples)
M) Science fiction (6 text samples)
N) Adventure fiction (29 text samples)
P) Romantic fiction (29 text samples)