[Japanese] Tanaka Corpus: a Japanese-English parallel corpus

    1
    Posted 2016-3-17 15:35:17
    Link: http://pan.baidu.com/s/1dEuvIGL Password: 5ii7

    The corpus format is as follows:

    For example, for the word 【新しい】, the corresponding example sentences in the corpus are:

    ¥1¥2450
    ¥2¥新しい法案では大気汚染を60%減少させることになっている。
    ¥3¥A new law is expected to cut air pollution by 60%.
    ¥1¥2451
    ¥2¥新しい法が成立した。
    ¥3¥A new low has come into existence.
    ¥1¥2452
    ¥2¥鉄道という新しい交通手段が開発された。
    ¥3¥A new means of communication was developed -the railway.
    ¥1¥2453
    ¥2¥新型だからといって旧型より良いとは限らない。
    ¥3¥A new model isn't necessarily any better than the older one.
    ¥1¥2454
    ¥2¥新しい月もでてきました…
    ¥3¥A new moon was coming up...
    ¥1¥2455
    ¥2¥市の中心地に新しい博物館が建造されつつある。
    ¥3¥A new museum is being built at the center of the city.
    ¥1¥2456
    ¥2¥新しいオイル・タンカーが進水した。
    ¥3¥A new oil tanker was launched.
    ¥1¥2457
    ¥2¥彼の潔白を証明する新しい証拠。
    ¥3¥A new piece of evidence to prove his innocence.
    ¥1¥2458
    ¥2¥新しい校長が学校を管理運営している
    ¥3¥A new principal is administering the school.
    ¥1¥2459
    ¥2¥新しい道路が建設中である。
    ¥3¥A new road is under construction.
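A minimal sketch of how records in this layout might be read, assuming each entry is three consecutive lines whose fields are marked 1 (ID), 2 (Japanese) and 3 (English), as in the excerpt above. The names `Pair` and `parse_records` are hypothetical, and the marker is matched as a yen sign, a full-width yen sign, or a backslash, on the assumption that the file's legacy encoding may render a backslash as a yen sign. The file's actual encoding is not stated in this post.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Pair(NamedTuple):
    sent_id: str
    japanese: str
    english: str


# Field marker: U+00A5, U+FFE5, or a backslash (assumption, see lead-in).
FIELD = re.compile(r"^\s*[\u00a5\uffe5\\]([123])[\u00a5\uffe5\\](.*)$")


def parse_records(lines: Iterable[str]) -> Iterator[Pair]:
    """Yield one Pair per complete 1/2/3 field group; skip other lines."""
    fields = {}
    for line in lines:
        m = FIELD.match(line.rstrip("\r\n"))
        if not m:
            continue  # blank lines or unrelated text
        tag, value = m.group(1), m.group(2).strip()
        if tag == "1":
            fields = {"1": value}  # field 1 always starts a new record
        else:
            fields[tag] = value
        if tag == "3" and {"1", "2", "3"} <= fields.keys():
            yield Pair(fields["1"], fields["2"], fields["3"])
            fields = {}


sample = [
    "\u00a51\u00a52456",
    "\u00a52\u00a5新しいオイル・タンカーが進水した。",
    "\u00a53\u00a5A new oil tanker was launched.",
]
for pair in parse_records(sample):
    print(pair.sent_id, pair.english)
```

Incomplete groups (for example a field 2 with no preceding field 1) are dropped rather than guessed at, which seems the safer choice for a corpus known to contain errors.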

    Note that the corpus does contain some duplicate example sentences.
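If the duplicates get in the way, something like the following could drop exact repeats. This is only a sketch: the function name `drop_exact_duplicates` is hypothetical, pairs are assumed to be already loaded as (Japanese, English) tuples, and only pairs that match exactly after trimming whitespace are treated as duplicates.

```python
from collections import Counter
from typing import Iterable, List, Tuple

SentencePair = Tuple[str, str]


def drop_exact_duplicates(
    pairs: Iterable[SentencePair],
) -> Tuple[List[SentencePair], Counter]:
    """Keep the first occurrence of each pair, preserving the original order.

    Also returns a Counter of the pairs that occurred more than once.
    """
    counts: Counter = Counter()
    unique: List[SentencePair] = []
    for ja, en in pairs:
        key = (ja.strip(), en.strip())
        counts[key] += 1
        if counts[key] == 1:
            unique.append(key)
    repeated = Counter({k: n for k, n in counts.items() if n > 1})
    return unique, repeated


pairs = [
    ("新しい道路が建設中である。", "A new road is under construction."),
    ("新しい法が成立した。", "A new low has come into existence."),
    ("新しい道路が建設中である。", "A new road is under construction. "),
]
unique, repeated = drop_exact_duplicates(pairs)
print(len(unique), sum(repeated.values()))
```

Sentences that differ only in spelling or kana/kanji usage are not caught by this; they would need some form of normalization first.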

    For an introduction to the corpus, see the following site:
    http://www.edrdg.org/wiki/index.php/Tanaka_Corpus

    Some excerpts follow:

    Introduction

    This page provides some brief documentation for the Tanaka Corpus of parallel Japanese-English sentences, and in particular the modification and editing that has been carried out to enable use of the corpus as a source of examples in the WWWJDIC dictionary server and other systems.
    The corpus was compiled by Professor Yasuhito Tanaka at Hyogo University and his students, as described in his Pacling2001 paper. At Pacling2001 Professor Tanaka released copies of the corpus, and stated that it is in the public domain. According to Professor Christian Boitet, Professor Tanaka did not think the collection was of a very good standard. (Sadly, Prof. Tanaka died in early 2003.)
    At the 2002 Papillon workshop in Tokyo, Professor Boitet included a copy of the corpus in a CD, distributed to participants, and suggested that it may serve as examples in a dictionary. Jim Breen realised it had the potential to be a source of example sentences in the WWWJDIC server. He edited, reformatted and indexed the corpus and linked it at the word level to the dictionary function in the server. (see below)
    The inclusion of the Corpus in the WWWJDIC server exposed it to a wide audience, and a number of other systems incorporated the corpus into their operation. It also began to be used in some research projects in natural language processing.
    In 2006 the Corpus was incorporated into the Tatoeba Project being developed by Trang Ho to provide a sentence-based multi-lingual resource. That project is now the "home" of the Corpus.

    Compilation

    Professor Tanaka's students were given the task of collecting 300 sentence pairs each. After several years, 212,000 sentence pairs had been collected.
    From inspection, it appears that many of the sentence pairs have been derived from textbooks, e.g. books used by Japanese students of English. Some are lines of songs, others are from popular books and Biblical passages.
    The original collection contained large numbers of errors, both in the Japanese and English. Many of the errors were in spelling and transcription, although in a significant number of cases the Japanese and English contained grammatical, syntactic, etc. errors, or the translations did not match at all.
    The original file can still be downloaded (see below.)


    Initial Modifications to the Corpus

    As mentioned above, the Tanaka Corpus was edited and adapted to be used within the WWWJDIC dictionary server as a set of example sentences associated with words in the dictionary. In order to adapt the corpus for this role, it was edited as follows:
    • an initial regularization of the punctuation of the Japanese and English sentences was carried out, then duplicate pairs were removed, reducing the original file from 210,000 pairs to 180,000 pairs;
    • sentences which differed only by differences in orthography (e.g. kana/kanji usage, okurigana differences), numbers, proper names, minor grammatical points such as plain/polite verb usage, etc. were reduced to single representative examples;
    • sentences where the Japanese consisted of a short Japanese statement in kana were removed;
    • sentences with spelling errors, kana-kanji conversion errors, etc. were corrected;
    • sentences where the English version did not match the Japanese were edited to make the two versions agree;
    • where the sentences contain gender-specific language or words, the English portion has been tagged with [M] or [F] respectively;
    • sentences where the Japanese was too garbled to derive a valid English equivalent were removed.
    The process described above has continued, and at present the edited corpus has just over 150,000 sentence pairs.
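Two of the steps in the excerpt lend themselves to a short sketch: regularizing text before comparison, and separating the [M]/[F] tag from the English sentence. The code below is an illustration only, not the procedure the corpus editors used. The function names are hypothetical, Unicode NFKC normalization is swapped in as a stand-in for "regularization of the punctuation", and the tag is assumed to appear literally as `[M]` or `[F]` somewhere in the English field.

```python
import re
import unicodedata
from typing import Optional, Tuple

# Matches a literal [M] or [F] tag plus any surrounding whitespace.
GENDER_TAG = re.compile(r"\s*\[([MF])\]\s*")


def regularize(text: str) -> str:
    """NFKC-normalize (e.g. full-width ASCII to ASCII) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_gender_tag(english: str) -> Tuple[str, Optional[str]]:
    """Return (sentence without tag, 'M' / 'F' / None)."""
    m = GENDER_TAG.search(english)
    if m is None:
        return english.strip(), None
    cleaned = GENDER_TAG.sub(" ", english).strip()
    return cleaned, m.group(1)


print(regularize("\uff21\uff22\u3000\uff11"))
print(split_gender_tag("A new road is under construction."))
```

NFKC is more aggressive than a punctuation-only cleanup (it also rewrites half-width katakana, for instance), so it is better suited to building a comparison key than to rewriting the stored sentences.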

    2
    Posted 2016-3-17 18:23:51
    Thanks for sharing, OP.

    3
    Posted 2016-3-19 13:39:25
    Thanks for sharing, OP.

    4
    Posted 2016-5-12 18:26:33
    Thanks for your post, it is very useful.

    5
    Posted 2016-5-12 19:00:44
    Thanks for the hard work, OP. Keep it up.

    6
    Posted 2021-2-27 19:30:02
    Thanks for sharing, OP!