Chinese-English Parallel Sentences Extracted from Patents
| Item Name: | Chinese-English Parallel Sentences Extracted from Patents |
| Author(s): | Benjamin Tsou, Bin Lu, Kapo Chow |
| LDC Catalog No.: | LDC2016T22 |
| ISBN: | 1-58563-770-X |
| ISLRN: | 280-113-850-942-8 |
| DOI: | https://doi.org/10.35111/td6z-pv16 |
| Release Date: | October 19, 2016 |
| Member Year(s): | 2016 |
| DCMI Type(s): | Text |
| Data Source(s): | government documents |
| Application(s): | machine translation |
| Language(s): | English, Chinese |
| Language ID(s): | eng, zho |
| License(s): | Chinese-English Parallel Sentences Extracted from Patents Agreement (For-profit); LDC User Agreement for Non-Members |
| Online Documentation: | LDC2016T22 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Tsou, Benjamin, Bin Lu, and Kapo Chow. Chinese-English Parallel Sentences Extracted from Patents LDC2016T22. Web Download. Philadelphia: Linguistic Data Consortium, 2016. |
Introduction
Chinese-English Parallel Sentences Extracted from Patents was developed by Chilin (HK) Limited and contains 500,000 sentence pairs of Chinese-English parallel text. This resource is based on the training corpus and test sets developed for the Tokyo-based NTCIR 2009 & 2010 tasks on Patent Machine Translation.
Data
The sentences in this release were selected from a larger corpus of more than 300,000 Chinese-English parallel patents in different fields. Selection was based on a number of filtering parameters, including word alignment, sentence length and language modeling. They were then automatically segmented and aligned. All text is encoded as UTF-8.
Samples
Please view this Chinese sample and English sample.
Updates
None at this time.
Pricing
Not-for-profit organizations may license this data set for US$25.00 under the LDC Not-for-Profit Membership Agreement or under the LDC User Agreement for Non-Members for use in linguistic research, education and non-commercial technology development. For-profit organizations may license this data for US$5000, discounted to US$4000 for LDC for-profit members, under the Commercial License Agreement for Chinese-English Parallel Sentences Extracted from Patents (LDC2016T22).
Current fees in this catalog entry reflect those pertaining to a for-profit organization license. Not-for-profit organizations should contact LDC's Membership Office to license this data set.