This package contains two files of parallel bilingual Chinese-English text data:

1. train-500k-e.txt
2. train-500k-z.txt

Please note the following:

(1) The files have the following format:

        DOCID-SENTID<TAB>text

    i.e. a combined ID field, which is described in note (2) below, followed by a tab character (shown as <TAB> above), and finally the sentence content. train-500k-e.txt contains the English sentences. train-500k-z.txt contains the Chinese sentences.

(2) The DOCID is the patent publication ID, whereas the SENTID is the sentence ID, a unique identifier given to that sentence. These two identifiers are joined by a hyphen "-".

(3) The texts are in UTF-8 encoding.

(4) The sentences in the English and Chinese files are sorted in the same order, so that each sentence and its translation are at the same position in the two files. Corresponding sentences also have the same SENTID.

(5) The patent texts were segmented and aligned by automatic means, and selected according to a number of filtering parameters, including word alignment, sentence length, language modelling, etc.
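The format described in notes (1) to (4) can be read with a short script. The following Python code is a minimal sketch, not part of the package: the function names parse_line and read_parallel are hypothetical, and it assumes one sentence per line and a SENTID that contains no hyphen (so the combined ID is split at its last hyphen).

```python
from itertools import zip_longest


def parse_line(line):
    """Split one 'DOCID-SENTID<TAB>text' line into (docid, sentid, text)."""
    combined_id, tab, text = line.rstrip("\r\n").partition("\t")
    if not tab:
        raise ValueError(f"no tab separator in line: {line!r}")
    # Assumption: SENTID contains no hyphen, so split at the LAST hyphen.
    docid, hyphen, sentid = combined_id.rpartition("-")
    if not hyphen:
        raise ValueError(f"no hyphen in ID field: {combined_id!r}")
    return docid, sentid, text


def read_parallel(en_lines, zh_lines):
    """Yield (docid, sentid, english, chinese) for each aligned pair.

    en_lines and zh_lines are iterables of lines, e.g. two file objects
    opened with encoding="utf-8".
    """
    pairs = zip_longest(en_lines, zh_lines)
    for lineno, (en, zh) in enumerate(pairs, start=1):
        if en is None or zh is None:
            raise ValueError(f"files differ in length at line {lineno}")
        en_doc, en_sent, en_text = parse_line(en)
        _zh_doc, zh_sent, zh_text = parse_line(zh)
        # Note (4): corresponding sentences share the same SENTID.
        if en_sent != zh_sent:
            raise ValueError(
                f"SENTID mismatch at line {lineno}: {en_sent!r} vs {zh_sent!r}"
            )
        yield en_doc, en_sent, en_text, zh_text
```

To use it, open train-500k-e.txt and train-500k-z.txt with open(path, encoding="utf-8") and pass the two file objects to read_parallel. Reading both files in lockstep avoids loading them into memory, and the SENTID check gives an early warning if the two files are out of step.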