Datasets Description
Introduction

With the application of deep learning to speech and natural language processing, the accuracy of speech recognition and machine translation is improving. We provide a dataset for machine translation containing over 10 million English-Chinese parallel sentence pairs. The data consists of conversational English extracted from English learning websites and movie subtitles. All parallel data has been checked by human annotators to ensure data size, domain relevance, and quality.

Training Set: 10,000,000 sentences
Validation (Simultaneous Interpretation) Set: 934 sentences
Validation (Machine Translation) Set: 8,000 sentences

Data Description

Each English-Chinese sentence pair consists of an English sentence and a Chinese sentence, where the Chinese sentence was translated from the English sentence by human annotators. The dataset contains two files: the Chinese file contains the Chinese sentences and the English file contains the corresponding English sentences. Sentences are matched one-to-one across the two files.
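The cross-file matching described above can be sketched in Python as follows. This is a minimal illustration, not the official loader: it assumes each file stores one sentence per line in UTF-8, and the file names `train.en` and `train.zh` as well as the function `read_parallel` are hypothetical. The demo writes two pairs from the Data Preview to temporary files so the sketch runs without the real download.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path


def read_parallel(en_path, zh_path, encoding="utf-8"):
    """Yield (english, chinese) pairs, matching line i of one file to line i of the other.

    Assumes one sentence per line (an assumption, not stated by the dataset page).
    """
    missing = object()  # sentinel marking the end of the shorter file
    with open(en_path, encoding=encoding) as en_file, \
            open(zh_path, encoding=encoding) as zh_file:
        lines = zip_longest(en_file, zh_file, fillvalue=missing)
        for line_no, (en, zh) in enumerate(lines, start=1):
            if en is missing or zh is missing:
                # A length mismatch would silently misalign every later pair.
                raise ValueError(f"files differ in length at line {line_no}")
            yield en.rstrip("\n"), zh.rstrip("\n")


# Demo on temporary files; the file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    en_path = Path(tmp) / "train.en"
    zh_path = Path(tmp) / "train.zh"
    en_path.write_text(
        "I dropped Henry at your office an hour ago.\n"
        "I just ended a five-month relationship an hour ago.\n",
        encoding="utf-8",
    )
    zh_path.write_text(
        "一小时前我开车送海瑞去了你办公室。\n"
        "一小时前我刚结束了一个长达五个月的恋爱关系。\n",
        encoding="utf-8",
    )
    pairs = list(read_parallel(en_path, zh_path))

for en, zh in pairs:
    print(en, "|", zh)
```

Raising an error on a length mismatch, rather than stopping at the shorter file, makes a truncated or corrupted download visible immediately.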


Data Preview

With fruit growing all year round, this is indeed a paradise for birds.
一年都有水果生长,这确实是鸟的天堂。

I dropped Henry at your office an hour ago.
一小时前我开车送海瑞去了你办公室。

Father and son, two bricklayers, are sitting in a cafe arguing about a car.
一对父子,都是泥水匠,他们坐在一家咖啡馆里为一辆汽车争吵不休。

A small, disciplined militia can not only hold out against a larger force but drive it back.
一小支训练有素的民兵不仅可以抵挡强大的军队还可以击退它。

I start to sweat when I worry about people noticing my sweat.
一担心别人发现我在出汗我就开始出汗了。

I fly to Florida a couple of times a year to visit the folks.
一年之中,我会飞去佛罗里达几次,去看望我的亲戚。

I just ended a five-month relationship an hour ago.
一小时前我刚结束了一个长达五个月的恋爱关系。

Downloads
Training Set (1G)
sha1sum: c9a53e3d5000b55b0abf1c509316705aa94d39f3
Validation Set (1.2M)
sha1sum: 087d410bb73eadb16afd000d4a85fc69ebb92a8d
Test A (223K)
sha1sum: bdb6c7f19aa4aa744d3094625f7864eafe1614e2
Test B (232K)
sha1sum: 1b49f9a5cd5ac0a62008d218727cae19d471ea2e
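A downloaded file can be checked against the SHA-1 values listed above. The sketch below uses Python's standard `hashlib` module; the helper names `sha1_of_file` and `verify_sha1` are hypothetical, and the local file name in the usage note is an assumption, since the page does not state the archive names.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for the 1G training set.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha1(path, expected_hex):
    """Return True if the file's SHA-1 matches the published value."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For example, if the training set were saved locally as `train.tar.gz` (a hypothetical name), `verify_sha1("train.tar.gz", "c9a53e3d5000b55b0abf1c509316705aa94d39f3")` would return `True` only for an intact download.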
Evaluation & Baselines
For evaluation and baseline code, please click here.

For any copyright-related inquiries, please contact hi@challenger.ai.