
ALPAC Black Book 7/9: Appendices 9-10

Posted 2013-10-6 16:43 | Personal category: 立委科普 | System category: Research notes | Tag: machine translation

Appendix 9

Cost Estimates of Various Types of Translation


Before attempting to determine the costs of various types of translation, it might be instructive to see what the costs would be for an operation that made no use of translations, that is, a system that utilized subject specialists who were also skilled in a second language.

Let us assume that we have an agency that employs 100 analysts and let us further assume the following:

1. that 50 of the analysts are competent in Russian in their subject field,

2. that each analyst earns $12,000 per year,

3. that each analyst reads 1,000 words of Russian per day in his work,

4. that each analyst works 220 days per year, and

5. that, therefore, the agency consumes a total of 11,000,000 Russian words a year.

Since the major effort in past work on machine translation (MT) has been to develop a program to translate Russian into English, let us now restrict our discussion to the 50 analysts who are proficient in Russian. Salaries for these 50 would amount to $600,000 per year. Other costs such as Social Security, annual and sick leave, and retirement could be calculated at approximately 33 1/3 percent of their gross salaries. Thus the cost for these analysts would be approximately $800,000 per year. Obviously, no duplication checks would be necessary to determine whether a translation of any given work was already in existence.

The Committee has no figures on the cost of maintaining facilities necessary for the making of checks to prevent the duplication of translation. If these costs could be determined and if they proved to be substantial, it might be the case that it would be more economical not to make duplication checks of documents less than some specific number of pages in length. In any event, the duplication checks would be superfluous for an agency employing persons proficient in a foreign language.


MAJOR COST ITEMS OF AN AGENCY UTILIZING 50 ANALYSTS PROFICIENT IN RUSSIAN


 

50 Analysts at $12,000 per annum ..................... $600,000
Direct cost overhead at 33 1/3 percent of the above .. 200,000
Duplication checks ................................... 0
Total ................................................ $800,000

Figured at 220 working days per analyst, the total volume of Russian read would amount to 11,000,000 words, or about $75 for each 1,000 words read.

Time lag after receipt of document ................... none
Total cost of translation ............................ 0
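The arithmetic behind the "about $75" figure can be checked in a few lines; a minimal Python sketch using the report's own assumptions (the variable names are ours):

```python
# Reading cost per 1,000 Russian words for bilingual analysts,
# using the assumptions listed above.
analysts = 50
salary = 12_000          # dollars per analyst per year
overhead = 1 / 3         # direct-cost overhead on gross salary
words_per_day = 1_000    # Russian words read per analyst per day
working_days = 220

total_cost = analysts * salary * (1 + overhead)        # 800,000 dollars
total_words = analysts * words_per_day * working_days  # 11,000,000 words
print(1000 * total_cost / total_words)                 # ~72.7, "about $75"
```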


MONOLINGUALS

If the 50 analysts could not read Russian and had to rely on translation, a number of possibilities exist for providing them with English translation. The agency could

1. employ in-house translators in the conventional manner,

2. employ translators using the dictation (or sight) method of translation,

3. employ contract translators,

4. utilize the services of JPRS,

5. provide the analysts with unedited "raw" MT output,

6. provide the analysts with postedited MT, or

7. use a system of machine-aided translation.

Throughout the subsequent discussion, the Committee has relied heavily on the cost figures developed by Arthur D. Little, Inc., and contained in An Evaluation of Machine-Aided Translation Activities at FTD [Contract AF 33 (657)-13616, May 1, 1965]. References to this study are indicated below by (ADL) followed by the appropriate page number.


IN-HOUSE TRANSLATORS

At the Foreign Technology Division, the in-house translators work at a rate of about 240 Russian words per hour (ADL, p. 29), yielding a daily output of approximately 2,000 words. Thus one translator can produce enough to keep two analysts in translations.

Since ADL estimates (ADL, p. 21) that the cost for in-house translation is $22.97 per 1,000 Russian words, the cost for 11,000,000 Russian words would be $252,670. We assume that direct costs for translator time ($5.60 per hour) were included in this figure. Other costs that must be included in this type of operation are those of space, equipment, recomposition, and proofreading and review.


MAJOR COSTS FOR IN-HOUSE HUMAN TRANSLATION


 

25 Translators' salaries and direct cost overhead ............ $252,670
Recomposition ($14.15 per 1,000 words; ADL, p. 21) ........... 155,650
Proofreading and review ($2.97 per 1,000 words; ADL, p. 21) .. 32,670
Duplication checks ........................................... ?
Total ........................................................ $440,990
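As a check, the line items follow from the per-1,000-word rates, and the total is their sum (it rounds to the $440,000 shown in the summary table at the end of this appendix); a minimal sketch:

```python
# In-house translation of 11,000,000 Russian words at the ADL rates.
units = 11_000                  # thousands of Russian words per year
translation = 22.97 * units     # 252,670
recomposition = 14.15 * units   # 155,650
proofreading = 2.97 * units     # 32,670
print(translation + recomposition + proofreading)   # 440,990
```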


IN-HOUSE TRANSLATION EMPLOYING DICTATION

The Committee's study described in Appendix 14 revealed that the average typing speed of the translator was only 18 words a minute and that typing took approximately 25 percent of the total time needed to produce the translation. It would seem then to be advantageous to use the translator for translating and to use trained typists to do the typing. One agency (see Appendix 1, page 35) found that on suitable texts (those with few graphics to be inserted), the daily output of the translator was doubled. A typist trained in the use of dictating equipment can type about 8,000 words of English per day. To convert this to the number of Russian words one must employ a factor of 1.35 English words per Russian word. Thus the 8,000 English words would represent 6,000 words of original Russian text. If the over-all output of the translator were to be increased by as little as 25 percent, his output would amount to 2,500 words per day. At this rate of output, only 20 translators would be needed instead of 25, and about eight typists would be needed to keep up with the output of the translators.
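The staffing arithmetic of this paragraph can be reproduced directly; a minimal sketch with the report's stated rates (variable names are ours):

```python
# Dictation-method staffing for a demand of 50,000 Russian words per day
# (11,000,000 words over 220 working days).
en_per_ru = 1.35                      # English words per Russian word
daily_demand_ru = 11_000_000 / 220    # 50,000 Russian words per day

translator_output = 2_000 * 1.25      # +25 percent over conventional = 2,500
typist_output_ru = 8_000 / en_per_ru  # ~5,926 Russian words ("about 6,000")

print(daily_demand_ru / translator_output)  # 20 translators
print(daily_demand_ru / typist_output_ru)   # ~8.4, "about eight" typists
```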

Although some savings are realized from this type of system, owing to the fact that typists are paid at about half the rate of translators, such savings are offset to some extent by the additional space and equipment required. It seems likely, however, that the use of this system would result in a more attractive product, the copy having been prepared by well-trained typists. Furthermore, the estimated increase of only 25 percent, upon which we have based our computations, may be unduly conservative. If this is so (and the Committee would like to see studies made to determine more accurately the actual advantages of various systems), the dictation method would be even more attractive.


CONTRACT TRANSLATION

Since contract translation costs vary widely, we will once more base our computations on data in the Arthur D. Little, Inc., report. The ADL team found that the cost per 1,000 Russian words was $24.57 for the translation process, $5.40 for insertion of graphics, and $2.97 for proofreading and review, or a total of $32.94 (ADL, p. 21).

The Committee has been told by a reliable and knowledgeable individual connected with the translation at FTD that the proofreading and review procedure was unnecessary since the translations produced by the contractor were of excellent quality. Trusting this individual's judgment, but at the same time being aware that the ADL report is a careful study of what practices were in force (regardless of their necessity or degree of efficiency) at FTD, the Committee conjectured that $1.50 per 1,000 Russian words, rather than $2.97, might be a reasonable cost for the proofreading and review procedure; therefore, our computation differs from the ADL study. It is a fact that contractors have a lower overhead than in-house translators, and it is hoped that the significance of this item will not be overlooked by the reader.

An annual production of 11,000,000 Russian words by contract would cost the using agency:

$270,270 for translation
$59,400 for graphics
$16,500 for proofreading and review
$346,170 total

Since the average document to be translated is about 8,000 (Russian) words in length (ADL, p. A-8), our hypothetical agency would have to handle and control only six or seven documents a day, and few or no additional personnel would be needed for this task. Thus the $346,170 estimated above would approximate the total cost.
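The itemized figures are simply the unit rates applied to the annual volume; a sketch (note the Committee's reduced $1.50 proofreading rate in place of ADL's $2.97):

```python
# Contract translation of 11,000,000 Russian words per year.
units = 11_000  # thousands of Russian words per year
rates = {"translation": 24.57, "graphics": 5.40, "proofreading": 1.50}
costs = {item: rate * units for item, rate in rates.items()}
print(costs)                 # 270,270 / 59,400 / 16,500
print(sum(costs.values()))   # 346,170
```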


THE JOINT PUBLICATIONS RESEARCH SERVICE (JPRS)

The JPRS (Appendix 3) utilizes subject matter specialists who work at home on a part-time, contract basis. Thus, JPRS is able to handle a large quantity of translations in many languages in many fields at low rates. Because it does handle a large quantity of translations, JPRS is able to charge the same price for all translations regardless of subject matter or language. The current price is $16 per 1,000 words of English. Applying the factor of 1.35 English words for each Russian word, one can see that 11,000,000 Russian words are the equivalent of 14,850,000 English words and that, therefore, the JPRS charge for such translation would amount to $237,600. Once again, as with any contract translation, the number of additional personnel would be minimal, and the cost above would be close to the true cost.
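Because JPRS prices by the 1,000 English words, the conversion factor does the work here; a minimal sketch:

```python
# JPRS cost for an annual volume of 11,000,000 Russian words.
ru_words = 11_000_000
en_words = ru_words * 1.35        # 14,850,000 English-word equivalents
print(16.00 * en_words / 1000)    # 237,600 dollars
# Equivalently, $16.00 * 1.35 = $21.60 per 1,000 Russian words.
```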


UNEDITED MACHINE TRANSLATION (MT)

The development of an MT program capable of producing translations of such a quality that they would be useful to the reader without requiring the intervention of a translator anywhere in the process has long been the goal of researchers in MT. As far as the Committee can determine, two attempts have been made to give analysts “raw” or unedited machine output. Neither proved to be satisfactory. The FTD experience is stated with admirable succinctness: “This [acceptance of postedited MT] marks a considerable change in attitude toward MT's which, in their earlier unedited form, were generally regarded as unsatisfactory” (ADL, p. F-5).

We have worked out a simple equation that shows how many dollars may be saved by using the unedited machine output.

Let

CH = cost of human translation (dollars per 1,000 words),
CM = cost of MT (dollars per 1,000 words),
W = loaded salary of the user of the translation (dollars per hour),
TH = reading time for human translation (hours per 1,000 words),
TM = reading time for MT (hours per 1,000 words),
N = number of people who read the translation, and
S = saving by MT (dollars per 1,000 words).

Then

S = CH − CM − WN(TM − TH).
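Operationally, the equation and its break-even point take only a few lines of code. A minimal sketch (the function and variable names are ours, not the report's):

```python
def saving(c_h, c_m, w, n, t_m, t_h):
    """S, the per-1,000-word saving from using raw MT instead of
    human translation: S = CH - CM - W*N*(TM - TH)."""
    return c_h - c_m - w * n * (t_m - t_h)

def break_even_readers(c_h, c_m, w, t_m, t_h):
    """The readership N at which S = 0; with more readers than this,
    the added reading time costs more than the cheaper translation saves."""
    return (c_h - c_m) / (w * (t_m - t_h))
```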

Presumably the saving would be greatest if the reader merely read machine print-out, referring to the untranslated original for figures and equations. Here the cost of machine output could best be compared, not with the cost of JPRS translations, but with the cost of dictated and uncorrected human translations, either voice on tape, or a typewritten transcription of the tape. As we have pointed out in Appendix 1, such translation can be carried out several times as fast as “full translation.”

Unfortunately, we do not know what the costs are for translations that are dictated but not typed. It would seem likely, however, that savings would be substantial, since there would be no costs (a) for typist-transcriptionists or (b) for recomposition. Whether the savings involved would be offset by increased difficulty of use by the analyst is not known. Although the analyst would not be presented with a written translation, he would at least be assured of having all the words translated, unlike the raw MT output.

Most translations are apparently read by more than one reader. According to one agency, the preparation of 175 copies of a translation for distribution is standard for documents that appeared originally in the open literature, and this distribution accounts for about 90 percent of the documents translated. For the remaining 10 percent (the classified documents), only one copy is prepared, but the requester has the privilege of making as many copies as he deems fit. Even more astonishing is the estimate of the Arthur D. Little, Inc., team that “about 615 members of the Air Force R & D community (40,000 members) would be expected to have a common interest in the average translated document” (ADL, p. F-9).

It was shown by John B. Carroll, in the study that he did for the Committee (see Appendix 10), that the average reader tested took twice as long to read raw MT as he did to read a human translation. The ADL team found that the average reading rate of those tested was 200 words per minute for well-written English (ADL, p. D-6) or 0.08 hr per 1,000 words. From these two studies we determined the reading rate for raw MT to be 100 words per minute or 0.16 hr per 1,000 words.

Raw MT should be compared, as has been mentioned, with an equally inelegant product. But the Committee has no idea of the cost of a comparable product or the time required to read (or listen to) it, and these factors are crucial in the calculation of savings according to our equation. Prudence demands that we compare raw MT with a product about which we have more certain knowledge concerning cost and reading rates even though such translations are of higher quality.

For the purposes of comparison, we have chosen the JPRS for the simple reasons that (1) it is relatively inexpensive and (2) the costs are known and stable. Applying our equation, we have

CH = $21.60 [the JPRS cost per 1,000 Russian words, obtained by applying the conversion factor of 1.35 to $16.00, the cost per 1,000 English words],

CM = $7.63 [input typing $4.09, machine costs $3.21, output typing $0.33 (ADL, p. 20)],

W = $10.00 [$12,000 salary per annum ÷ 220 working days = $60.00 per day; $60.00 + $20.00 direct costs (60/3) = $80.00 loaded salary per day; $80.00 ÷ 8 = $10.00 loaded salary per hour],

TH = 0.08,

TM = 0.16.

Utilizing the figures above, but varying N (the number of readers), we arrive at the savings made by the use of raw output.

If the number of readers is 1:

S = $21.60 − $7.63 − [(10 × 1)(0.16 − 0.08)]
S = $21.60 − $7.63 − $0.80
S = $13.17.

If the number of readers is 10: S = $5.97.

If the number of readers is 15: S = $1.97.

If the number of readers is 17: S = $0.37.

If the number of readers is 18: S = −$0.43.

If the number of readers is 20: S = −$2.03.

If the number of readers is 80: S = −$40.13.

If the number of readers is 175: S = −$127.03.

If the number of readers is 615: S = −$478.13.
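Reusing the saving and break_even_readers helpers sketched after the equation above, the whole list can be reproduced; note that the equation yields −$50.03 for 80 readers and −$126.03 for 175 readers, slightly different from the −$40.13 and −$127.03 printed here, which appear to be transcription slips:

```python
params = dict(c_h=21.60, c_m=7.63, w=10.00, t_m=0.16, t_h=0.08)
for n in (1, 10, 15, 17, 18, 20, 80, 175, 615):
    print(n, round(saving(n=n, **params), 2))

# Break-even readership: (21.60 - 7.63) / (10 * 0.08) = 17.46,
# i.e. between 17 and 18 readers, as noted below.
print(break_even_readers(**params))
```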

Obviously, the break-even point occurs between 17 and 18 readers. But we have seen that, in one agency at least, about 90 percent of the translations are distributed to 175 readers, whereas only 10 percent are prepared for a single reader. By simple computation it can be determined that whereas the use of JPRS for all translation would result in a loss of $14,487 (the savings forgone on the single-reader documents), the use of MT for all translation would result in a loss of $1,257,597 (the extra reading cost incurred on the widely distributed documents). It might be argued that MT is still economical when used to provide translations that are user-limited; but, since relatively few translations seem to be destined for use by fewer than 18 readers, the volume would probably be too small to warrant the maintenance of an elaborate computer facility with its attendant personnel.
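The two annual figures follow from weighting the per-unit savings by this 90/10 split; a sketch using the report's printed per-unit values:

```python
units = 11_000                 # thousands of Russian words per year
s_one, s_175 = 13.17, -127.03  # printed savings per 1,000 words, N = 1 and N = 175

print(0.10 * units * s_one)    # 14,487: saving forgone if JPRS is used even
                               # for the single-reader (classified) documents
print(0.90 * units * s_175)    # -1,257,597: loss if MT is used for the
                               # documents distributed to 175 readers
```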

To the Committee, machine output (such as that shown on pages 20-23) seems very unattractive. We believe that the only valid argument for its use would be a compelling economic argument. If it can be shown that the use of unedited machine output, taking proper account of increased reading time on the part of the readers, would result in worthwhile savings over efficient human translation of the most nearly comparable kind, then there is a cogent reason for using unedited MT. But, unless such a worthwhile saving can be convincingly demonstrated, we regard the use of unedited machine output as regressive and unkind to readers.

In considering the cost of producing unedited machine output we must use the real current cost. It is nice to think that savings may be made someday by using automatic character recognition, but actual savings should be demonstrated conclusively before machine output is inflicted on users in any operational manner.


POSTEDITED MACHINE TRANSLATION (MT)

To provide 11,000,000 words of postedited Russian-to-English MT per year would cost $397,980 [$36.18 per 1,000 Russian words (ADL, p. B-7)]. This estimate should be regarded as a very low one, since the ADL team did not include overhead costs (ADL, p. 3). ADL figures (ADL, p. E-5) that for 100,000 words per day, 44 individuals would be required: 14 for input typing, 1.6 for machine operation, 1.4 for output typing, and 28 for postediting. Since we are assuming a 50,000-word-per-day consumption, we will halve this estimate, giving a total of 22 personnel. The point the Committee would like to make in this connection is that, since 22 personnel would be required, 14 of whom (the posteditors) would have to be proficient in Russian, one might as well hire a few more translators and have the translations done by humans. Another, perhaps better, alternative would be to take part of the money spent on MT and use it either (1) to raise salaries in order to hire bilingual analysts, thus avoiding translation altogether, or (2) to teach the analysts Russian.
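A sketch of the cost and staffing arithmetic, halving ADL's 100,000-word-per-day staffing for the assumed 50,000-word day:

```python
# Postedited MT: annual cost and staffing (overhead excluded).
units = 11_000                # thousands of Russian words per year
print(36.18 * units)          # 397,980 dollars

staff_100k = {"input typing": 14, "machine operation": 1.6,
              "output typing": 1.4, "postediting": 28}
staff_50k = {role: n / 2 for role, n in staff_100k.items()}
print(sum(staff_50k.values()))  # ~22 personnel, 14 of them posteditors
```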


MACHINE-AIDED TRANSLATION (M-AT)

We will call M-AT any system of human translation that utilizes the computer to assist the translator and that was designed originally for such a purpose. A system such as that at the FTD might properly be called human-aided machine translation, since the postediting process was added after it became apparent that raw output was unsatisfactory and since humans are employed essentially to make up for the deficiencies of the computer output.

Specific costs for the two types of M-AT systems in operation (see Appendix 12 and Appendix 13) are not known to the Committee, but from the given figures that show the proportion of translator time saved, it is possible to make some rough estimates. Both the Federal Armed Forces Translation Agency and the European Coal and Steel Community indicate that a saving of about 50 percent of the translator's time could be expected from the use of a machine-aided system. Since translators' salaries constitute the largest item in the budget for a human-translation facility, such savings would probably be substantial. Input typing costs would not be as great as those at FTD, where the entire document to be translated is keypunched, since only the individual words or sentences with which the translator desires help are keypunched. Furthermore, the programming involved is relatively simple, and small, inexpensive computers are adequate.

The relatively modest increases in staff, equipment, and money necessary for the production of translator aids are likely to be offset by the increase in quality of the product. It is possible, therefore, that the savings of an M-AT system might approach 50 percent of the cost of translator salaries in a conventional human-translation system. If this estimate is sound, then the cost for an M-AT system to produce 11,000,000 words of Russian-to-English translation would be $314,655 ($126,335 for salaries, $155,650 for recomposition, $32,670 for proofreading and review).
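The $314,655 figure simply halves the translator-salary line of the in-house system while keeping the other line items; a minimal sketch:

```python
# M-AT estimate: half the in-house translator salary line, plus the
# unchanged recomposition and proofreading items.
salaries = 252_670 / 2   # 126,335
recomposition = 155_650
proofreading = 32_670
print(salaries + recomposition + proofreading)   # 314,655
```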


SUMMARY

Throughout our discussion of costs, we have been conscious of the fact that we were not in possession of all the necessary data. We present the following estimates with diffidence and would welcome any studies that would more precisely determine actual translation costs and quality, whether they affirm or deny the validity of our estimates.


ESTIMATES OF COSTS AND QUALITY FOR VARIOUS TYPES OF TRANSLATION

Type                             Quality          Cost for 11,000,000 Russian Words

In-house (conventional)          Good             $440,000
In-house (dictation)             Good             440,000 −
Contract                         Fair to good     350,000
JPRS                             Fair             240,000
Raw MT                           Unsatisfactory   80,000 +
Postedited MT                    Fair             400,000
M-AT                             Excellent        310,000
Analysts proficient in Russian                    0


CONCLUSION

Since no one can be proficient in all languages, there will always be a need for translation. Yet, publication is not evenly distributed among the some 4,000 languages of the world, and this is especially so in the areas of science and technology. Russian-to-English translation constitutes a large part of the total translation done in the United States, and there are no signs that this situation is likely to change radically in the foreseeable future. This being the case, the present policy of using monolingual analysts and providing them with translations year after year seems lacking in foresight, particularly since the time required for a scientist to learn a foreign language well enough to read an article in his own field of specialization is not very long, and since the facilities are available to train him.

In our hypothetical agency, the costs of providing fair and good translations were from 30 to 55 percent greater than the estimated costs of a facility using analysts proficient in Russian (an added $240,000 to $440,000 a year on top of the $800,000 in analyst salaries and overhead). To allow heavy users of Soviet literature to continue to rely on translations seems unwise.


Appendix 10

An Experiment in Evaluating the Quality of Translations


This experiment* was designed to lay the foundations for a standard procedure for measuring the quality of scientific translations, whether human or mechanical. There have been other experiments on this problem [e.g., G. A. Miller and J. G. Beebe-Center, Mechan. Transl. 3, 73 (1958); S. M. Pfafflin, Mechan. Transl. 8, 2 (1965)], but their methods for evaluating translations have been too laborious, too subject to arbitrariness in standards, or too lacking in reliability and/or validity to become generally accepted. The measurement procedure developed here gives promise of being amenable to refinement to the point where it will meet the requirements of relative simplicity and feasibility, fixed standards of evaluation, and high validity and reliability.

A detailed report of this experiment will be submitted for publication elsewhere; the present brief report will serve to indicate the general nature of the measurement procedure and some of the chief results.


THE MEASUREMENT PROCEDURE

It was reasoned that the two major characteristics of a translation are (a) its intelligibility, and (b) its fidelity to the sense of the original text. Conceptually, these characteristics are independent; that is, a translation could be highly intelligible and yet lacking in fidelity or accuracy. Conversely, a translation could be highly accurate and yet lacking in intelligibility; this would be likely to occur, however, only in cases where the original had low intelligibility.

Essentially, the method for evaluating translations employed in this experiment involved obtaining subjective ratings for these two characteristics (intelligibility and fidelity) of sentences selected randomly from a translation and interspersed in random order among other sentences from the same translation and also among sentences selected at random from other translations of varying quality. When a translation sentence was being rated for intelligibility, it was rated without reference to the original. “Fidelity” was measured indirectly: the rater was asked to gather whatever meaning he could from the translation sentence and then evaluate the original sentence for its “informativeness” in relation to what he had understood from the translation sentence. Thus, a rating of the original sentence as “highly informative” relative to the translation sentence would imply that the latter was lacking in fidelity.

All ratings were made by persons who were specially selected and trained for this purpose. There were two sets of raters. The first set of raters (called here “monolinguals” for convenience) consisted of 18 native speakers of English who had no knowledge of the language of the original (Russian, in this case). They were all Harvard undergraduates with high tested verbal intelligence and with good backgrounds in science. In rating “informativeness” these raters were provided with carefully prepared English translations of the original sentences, so that in effect they were comparing two sentences in English–one the sentence from the translation being evaluated, and the other the carefully prepared translation of the original.

The second set of raters (“bilinguals”) consisted of 18 native speakers of English who had a high degree of competence in the comprehension of scientific Russian. Their ratings of the intelligibility of the translation sentences may well have been influenced by their knowledge of the vocabulary  and syntax of Russian; at any rate, no attempt was made to prevent them from using such knowledge. To rate “informativeness,” they made a direct comparison between the translation sentence (in English) and the original version.

All ratings were made on nine-point scales that had been established by the writer prior to the experiment by an adaptation of a psychometric technique known as the method of equal-appearing intervals. Thus, points on these scales could be assumed to be equally spaced in terms of subjectively observed differences. In the case of the intelligibility scale, each of the nine points on the scale had a verbal description (see Table 4). The same was true of the “informativeness” scale except that verbal descriptions were omitted for a few of the points (see Table 5). In this way each degree on the scales could be characterized in a meaningful way. For example, point 9 on the intelligibility scale was described as follows: “Perfectly clear and intelligible. Reads like ordinary text; has no stylistic infelicities.” Point 5 (the midpoint of the scale): “The general idea is intelligible only after considerable study, but after this study one is fairly confident that he understands. Poor word choice, grotesque syntactic arrangement, untranslated words, and similar phenomena are present, but constitute mainly ‘noise’ through which the main idea is still perceptible.”



TABLE 4. Scale of Intelligibility

9–Perfectly clear and intelligible. Reads like ordinary text; has no stylistic infelicities.

8–Perfectly or almost clear and intelligible, but contains minor grammatical or stylistic infelicities, and/or mildly unusual word usage that could, nevertheless, be easily “corrected.”

7–Generally clear and intelligible, but style and word choice and/or syntactical arrangement are somewhat poorer than in category 8.

6–The general idea is almost immediately intelligible, but full comprehension is distinctly interfered with by poor style, poor word choice, alternative expressions, untranslated words, and incorrect grammatical arrangements. Postediting could leave this in nearly acceptable form.

5–The general idea is intelligible only after considerable study, but after this study one is fairly confident that he understands. Poor word choice, grotesque syntactic arrangement, untranslated words, and similar phenomena are present, but constitute mainly “noise” through which the main idea is still perceptible.

4–Masquerades as an intelligible sentence, but actually it is more unintelligible than intelligible. Nevertheless, the idea can still be vaguely apprehended. Word choice, syntactic arrangement, and/or alternative expressions are generally bizarre, and there may be critical words untranslated.

3–Generally unintelligible; it tends to read like nonsense but, with a considerable amount of reflection and study, one can at least hypothesize the idea intended by the sentence.

2–Almost hopelessly unintelligible even after reflection and study. Nevertheless, it does not seem completely nonsensical.

1–Hopelessly unintelligible. It appears that no amount of study and reflection would reveal the thought of the sentence.


TABLE 5. Scale of Informativeness

(This pertains to how informative the original version is perceived to be after the translation has been seen and studied. If the translation already conveys a great deal of information, it may be that the original can be said to be low in informativeness relative to the translation being evaluated. But if the translation conveys only a certain amount of information, it may be that the original conveys a great deal more, in which case the original is high in informativeness relative to the translation being evaluated.)

9–Extremely informative. Makes “all the difference in the world” in comprehending the meaning intended. (A rating of 9 should always be assigned when the original completely changes or reverses the meaning conveyed by the translation.)

8–Very informative. Contributes a great deal to the clarification of the meaning intended. By correcting sentence structure, words, and phrases, it makes a great change in the reader's impression of the meaning intended, although not so much as to change or reverse the meaning completely.

7–(Between 6 and 8.)

6–Clearly informative. Adds considerable information about the sentence structure and individual words, putting the reader “on the right track” as to the meaning intended.

5–(Between 4 and 6.)

4–In contrast to 3, adds a certain amount of information about the sentence structure and syntactical relationships; it may also correct minor misapprehensions about the general meaning of the sentence or the meaning of individual words.

3–By correcting one or two possibly critical meanings, chiefly on the word level, it gives a slightly different “twist” to the meaning conveyed by the translation. It adds no new information about sentence structure, however.

2–No really new meaning is added by the original, either at the word level or the grammatical level, but the reader is somewhat more confident that he apprehends the meaning intended.

1–Not informative at all; no new meaning is added, nor is the reader's confidence in his understanding increased or enhanced.

0–The original contains, if anything, less information than the translation. The translator has added certain meanings, apparently to make the passage more understandable.

PREPARATION OF TEST MATERIALS AND COLLECTION OF DATA

The measurement procedure was tested by applying it to six varied English translations (three human and three mechanical) of a Russian work entitled Mashina i Mysl' (Machine and Thought), by Z. Rovenskii, A. Uemov, and E. Uemova (Moscow, 1960). These translations were of five passages varying considerably in type of content. (All the passages selected for this experiment, with the original Russian versions, have now been published by the Office of Technical Services, U.S. Department of Commerce, Technical Translation TT 65-60307.) The materials associated with one of these passages were used for pilot studies and rater practice sessions; the experiment proper used the remaining four passages.

In preparing materials for the rating task, 36 sentences were selected at random from each of the four passages under study. Since six different translations were being evaluated, six different sets of materials were prepared (in two forms, one for the monolinguals and one for the bilinguals) in such a way that each set contained a different translation of a given sentence. In this way no rater evaluated more than one translation of a given sentence. Each set of materials was given to three monolinguals and to three bilinguals; thus, there were 18 monolinguals and 18 bilinguals. Each rater had 144 sentences to evaluate first for intelligibility and then for the informativeness of the original (or the standard translation of it) after the translation had been seen. The raters required three 90-min sessions to complete this task, dealing with 48 sentences in each session. The raters were not informed as to the source of the translations they were rating, although they were told that some had been made by machine.
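The assignment of materials is, in effect, a six-way rotation: each of the six sets pairs every sentence with a different one of the six translations, so no rater ever sees two translations of the same sentence. A minimal sketch of one such rotation (the modular indexing is our assumption, not necessarily the scheme the experimenters used):

```python
# Rotate 6 translations across 6 material sets for 144 sentences.
n_translations = 6
sentences = range(144)            # 36 sentences from each of 4 passages
material_sets = [
    [(s, (s + k) % n_translations) for s in sentences]
    for k in range(n_translations)
]
# material_sets[k] pairs sentence s with translation (s + k) mod 6, so
# within a set no sentence appears twice, and across sets each sentence
# appears under all six translations.
```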

Before undertaking this task, the raters attended a 1-hr session in which they were given instruction in the rating procedures and required to work through a 30-sentence practice set.

During the rendering of ratings for intelligibility, the raters held stopwatches on themselves to record the number of seconds it took them to read and rate each sentence.


RESULTS

The results of the experiment can be considered under two headings: (a) the average scores of the various translations, and (b) the variation in the scores as a function of differences in sentences, passages, and raters.

Table 6 gives the over-all mean ratings and time scores for the six translations, arranged in order of general excellence according to our data.

Consider first the mean ratings for intelligibility by the monolinguals. Translation 1, a published human translation that had presumably been carefully done, received the highest mean rating, 8.30, on the scale established in Table 4. But 8.30 is still appreciably different from the maximum possible mean rating of 9.00, and it is evident that not even this “careful” human translation was as good as one might have expected. Furthermore, the mean rating of Translation 1 is not significantly different from that of Translation 4 (8.21), a “quick” human translation made by rapid dictation procedures. The mean ratings of Translations 1 and 4 do, however, differ significantly from the mean rating (7.36) of Translation 2, another “quick” human translation. It may be concluded that the measurement procedure studied here is sensitive enough to differentiate among human translations.

A similar remark may be made about the sensitivity of this procedure to differences in the intelligibility of machine translations. Translations 7 and 5 were shown to be significantly more intelligible, on the average, than Translation 9.

Of most current interest, however, are the results having to do with the comparison of the human and the machine translations. Machine translations 7, 5, and 9 received mean ratings, respectively, of 5.72, 5.50, and 4.73. A scale value of 5 refers to a translation in which “the general idea is intelligible only after considerable study, but after this study one is fairly confident that he understands . . .” All these machine translations are significantly less intelligible, on the average, than any of the three human translations. As machine translations improve, it should be possible to scale them by the present rating procedure to determine how nearly they approach human translations in intelligibility.

The monolinguals' mean ratings on “informativeness” (reflecting the lack of fidelity of the translations) show an almost perfect inverse relationship to the mean ratings on intelligibility, and they differentiate the various translations in the same way and to the same extent. This result means that in practice, when ratings are averaged over sentences, passages, and raters, “intelligibility” and “fidelity” are very highly correlated. The detailed results of this study show that only in the case of a few particular sentences do the mean ratings of intelligibility and informativeness convey different information.

Furthermore, the mean reading times per sentence show almost precisely the same pattern of results as the ratings. In fact, the mean reading times are linearly related to the mean ratings, a result that supports the conclusion that the points on the rating scales are evenly spaced.

The results from the ratings by bilinguals contribute nothing more to the differentiation of the translations than is obtainable with the monolinguals' ratings. Bilinguals' intelligibility ratings of the translations are slightly (and significantly) higher, on the average, than those of the monolinguals, and correspondingly, their informativeness ratings are slightly lower. Yet, they took significantly longer to read and rate the sentences. Apparently their knowledge of Russian caused them to work harder on trying to understand the translations. One is inclined to give more credence to the results from the monolinguals because monolinguals are more representative of potential users of translations and are not influenced by knowledge of  the source language. It is also to be noted that the data from the monolinguals differentiate the translations to a somewhat greater extent than do the data from the bilinguals.

The results concerning the differences in ratings due to differences in sentences, passages, and raters can now be considered. (The detailed tables of these results are omitted here to save space.) The more important results may be summarized as follows:

1. The results do not differ significantly from passage to passage; that is, on the average the various passages from a given translation receive highly similar ratings. For intelligibility ratings, however, there is a small but significant interaction between translation and passage, indicating that translations are to some extent differentially effective for different types of content. (This interaction effect is present both for human and for machine translations.)

2. There is a marked variation among the sentences. In fact, as may be seen from Figure 1, there is some overlap between sentences from human translations and sentences from mechanical translations; in other words, some sentences translated by machine have higher ratings than some sentences translated by human translators, even though, on the average, the human-translated sentences are better than the machine-translated ones. These results imply that in order to obtain reliable mean ratings for translations, a fairly large sample of sentences must be rated.

3. Variation among raters is relatively small, but it is large enough to suggest that ratings should always be obtained from several raters, say at least three or four.


CONCLUSION

This experiment has established the fact that highly reliable assessments can be made of the quality of human and machine translations. In the case of the six particular translations investigated in the study, all the human translations were clearly superior to the machine translations; further, some human translations were significantly superior to other human translations, and some machine translations were significantly superior to other machine translations. On the whole, the machine translations were found to fall about at the midpoint of a scale ranging from the best possible to the poorest possible translation.


What is still needed, however, is a system whereby any translation can be easily and reliably assessed. The present experiment has determined the necessary parameters of such a system.


FIGURE 1. Frequency distribution of monolinguals' mean intelligibility ratings of the 144 sentences in each of six translations. Translations 1, 4, and 2 are human translations; Translations 7, 5, and 9 are machine translations.








https://blog.sciencenet.cn/blog-362400-730535.html

Previous: ALPAC Black Book 6/9: Appendices 1-8
Next: ALPAC Black Book 8/9: Appendices 11-15