Automatic Language Processing and Computational Linguistics
Over the past 10 years the government has spent, through various agencies, some $20 million on machine translation and closely related subjects (see Appendix 16). This is more than the government cost of translation for 1 year. Other moneys have been allocated to information retrieval, library automation, and programmed instruction.
Although techniques of machine construction and programming for time-shared operation have been developed with partial support from the government, the computer industry has spent its own resources in machine development, and expenditures in connection with automatic language processing have played a distinctly minor role in advances in computer hardware. Industry has also been responsible for the development of important techniques of computer justification and hyphenation of newsprint and related matters of composition (see Appendix 17), perhaps because the market was easy to determine.
As opposed to its small effect on computer hardware, work toward machine translation, together with the computational linguistic work that has grown out of it, has contributed significantly to computer software (programming techniques and systems). These contributions are discussed in considerable detail in Appendix 18.
By far the most important outcome of work toward machine translation has been its effect on linguistics, which is described in more detail in Appendix 19.
The advent of computational linguistics promises to work a revolution in the study of natural languages. A decade ago, most linguists believed that syntax had to do with word order, inflection, function words (e.g., prepositions and conjunctions), and intonation or punctuation. They also believed that most sentences uttered by native speakers in ordinary contexts were syntactically unambiguous. Today, they know that these two beliefs are mutually inconsistent. Their knowledge is the immediate result of computer parsing of ordinary sentences, using reasonable grammars as hitherto conceived and programs that expose all ambiguities under a fixed grammar.
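The point about programs that "expose all ambiguities under a fixed grammar" can be illustrated with a small sketch. The toy grammar, lexicon, and function name below are assumptions made up for illustration, not anything taken from the report; the technique is a standard CKY chart that counts distinct parse trees instead of merely accepting or rejecting a sentence.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees the grammar assigns to `words`."""
    n = len(words)
    # chart[(i, j, symbol)] = number of trees rooted in `symbol` spanning words[i:j]
    chart = defaultdict(int)
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1, symbol)] += 1
    # Fill wider spans from narrower ones, trying every split point k.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += chart[(i, k, left)] * chart[(k, j, right)]
    return chart[(0, n, start)]


if __name__ == "__main__":
    sentence = "I saw the man with the telescope".split()
    print(count_parses(sentence))
```

Under this toy grammar the prepositional phrase can attach either to the verb phrase or to the noun phrase, so a sentence that a reader takes as unambiguous receives more than one analysis. Larger grammars multiply this effect.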
Today there are linguistic theoreticians who take no interest in empirical studies or in computation. There are also empirical linguists who are not excited by the theoretical advances of the decade – or by computers. But more linguists than ever before are attempting to bring subtler theories into confrontation with richer bodies of data, and virtually all of them, in every country, are eager for computational support. The life's work of a generation ago (a concordance, a glossary, a superficial grammar) is the first small step of today, accomplished in a few weeks (next year, in a few days), the first of 10,000 steps toward an understanding of natural language as the vehicle of human communication.
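As a rough illustration of why a concordance became a matter of weeks or days, here is a minimal keyword-in-context sketch. The function name, the tokenization rule, and the sample text are assumptions for illustration only; a real concordance would also record source locations and handle much larger texts.

```python
import re


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog saw the cat."
    for left, word, right in concordance(sample, "cat", width=2):
        print(f"{left:>12} | {word} | {right}")
```

Each output row shows one occurrence of the keyword with a fixed window of neighboring words, which is the basic form of a printed concordance.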
The revolution in linguistics has not been solely a result of attempts at machine translation and parsing, but it is unlikely that the revolution would have been extensive or significant without these attempts.
We see that the computer has opened up to linguists a host of challenges, partial insights, and potentialities. We believe these can be aptly compared with the challenges, problems, and insights of particle physics. Certainly, language is second to no phenomenon in importance. And the tools of computational linguistics are considerably less costly than the multibillion-volt accelerators of particle physics. The new linguistics presents an attractive as well as an extremely important challenge.
There is every reason to believe that facing up to this challenge will ultimately lead to important contributions in many fields. A deeper knowledge of language could help
1. to teach foreign languages more effectively;
2. to teach about the nature of language more effectively;
3. to use natural language more effectively in instruction and communication;
4. to enable us to engineer artificial languages for special purposes (e.g., pilot-to-control tower languages);
5. to enable us to make meaningful psychological experiments in language use and in human communication and thought (unless we know what language is we do not know what we must explain); and
6. to use machines as aids in translation and in information retrieval.
However, the state of linguistics is such that excellent research, which has value in itself, is essential if linguistics is ultimately to make such contributions.
Such research must make use of computers. The data we must examine in order to find out about language is overwhelming both in quantity and in complexity. Computers give promise of helping us control the problems relating to the tremendous volume of data, and to a lesser extent the problems of data complexity. But, we do not yet have good, easily used, commonly known methods for having computers deal with language data.
Therefore, among the important kinds of research that need to be done and should be supported are (1) basic developmental research in computer methods for handling language, as tools for the linguistic scientist to use as a help to discover and state his generalizations, and as tools to help check proposed generalizations against data; and (2) developmental research in methods to allow linguistic scientists to use computers to state in detail the complex kinds of theories (for example, grammars and theories of meaning) they produce, so that the theories can be checked in detail.
Avenues to Improvement of Translation
We have already noted that, while we have machine-aided translation of general scientific text, we do not have useful machine translation. Further, there is no immediate or predictable prospect of useful machine translation.
We have noted that the important contributions of machine translation have been primarily to linguistics and secondarily to computer programming. We have noted that while translation itself is vital, needs for translation are being met by a small though capable activity. We find, however, that there are attractive opportunities for improvement in translation, and we urge work aimed at such improvement. We have noted the importance of quality in translations. We have noted that cost varies markedly with asserted quality.
It is important, therefore, to achieve some objective evaluation of accuracy and quality. Work toward practical useful tests, such as that described in Appendix 10, is of the greatest importance.
Machine aids may be an important adjunct to human or machine-aided translation. USAF Foreign Technology Division (FTD) figures show that production costs (assembly and reproduction of the final translations) are very high. It appears that delays in translated journals are attributable to production rather than to translation. Adoption of mechanized means of editing and production might be desirable (see Appendix 17). Here the main cost of research and development can best be borne by other, larger fields than translation.
Machine-aided translation may be an important avenue toward better, quicker, and cheaper translation. What machine-aided translation needs most is good engineering. What will help the human being most: special glossaries, dictionary look-up of some or all words in the text, or a rough translation such as that produced by FTD? How can the delays due to queues at many tandem steps be avoided? How can production costs be cut?
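One of the aids named above, dictionary look-up of words in the text, can be sketched in a few lines. The glossary entries, the function name, and the bracketed output format are all hypothetical choices for illustration. The sketch matches exact word forms only, so inflected forms would be missed without a morphological step, which it does not attempt.

```python
# Hypothetical three-entry Russian-to-English glossary; a real one would be far larger.
GLOSSARY = {
    "скорость": "velocity",
    "давление": "pressure",
    "реактор": "reactor",
}

PUNCTUATION = ".,;:!?()\"'"


def annotate(text, glossary=GLOSSARY):
    """Append a bracketed gloss after every word found in the glossary."""
    annotated = []
    for word in text.split():
        # Look the word up without surrounding punctuation, ignoring case.
        core = word.strip(PUNCTUATION).lower()
        gloss = glossary.get(core)
        annotated.append(f"{word} [{gloss}]" if gloss else word)
    return " ".join(annotated)


if __name__ == "__main__":
    print(annotate("Реактор: давление и скорость"))
```

The translator still reads the source text, but known technical terms arrive already glossed, which is the kind of modest, engineering-oriented aid the questions above point toward.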
Automatic character recognition is often mentioned as important to machine-aided translation. FTD figures indicate that automatic character recognition could slightly decrease the cost of the operation. Automatic character recognition work is being supported heavily in connection with several kinds of activity (information retrieval, post office, for example) where the financial savings through successful character recognition would be much greater than in machine-aided translation. Hence, character recognition should be adopted when and if it will save money, but research and development need not be supported in connection with machine translation.
Finally, how much should be spent on research and development toward improving translation? It would be unreasonable to spend extravagantly on a relatively small business that is doing the job satisfactorily.
The Committee cannot judge what the total annual expenditure for research and development toward improving translation should be. However, it should be spent hardheadedly toward important, realistic, and relatively short-range goals.
Recommendations
The Committee recommends expenditures in two distinct areas.
The first is computational linguistics as a part of linguistics: studies of parsing, sentence generation, structure, semantics, statistics, and quantitative linguistic matters, including experiments in translation, with machine aids or without. Linguistics should be supported as science, and should not be judged by any immediate or foreseeable contribution to practical translation. It is important that proposals be evaluated by people who are competent to judge modern linguistic work, and who evaluate proposals on the basis of their scientific worth.
The second area is improvement of translation. Work should be supported on such matters as
1. practical methods for evaluation of translations;
2. means for speeding up the human translation process;
3. evaluation of quality and cost of various sources of translations;
4. investigation of the utilization of translations, to guard against production of translations that are never read;
5. study of delays in the over-all translation process, and means for eliminating them, both in journals and in individual items;
6. evaluation of the relative speed and cost of various sorts of machine-aided translation;
7. adaptation of existing mechanized editing and production processes in translation;
8. the over-all translation process; and
9. production of adequate reference works for the translator, including the adaptation of glossaries that now exist primarily for automatic dictionary look-up in machine translation.
All such studies should be aimed at increasing the speed and decreasing the cost of translations and at specifying degrees of acceptable quality.