Hong Kong Laws Parallel Text
This FTP publication contains the Hong Kong Laws Parallel Text, produced by the
Linguistic Data Consortium (LDC), catalog number LDC2000T47, ISBN
1-58563-170-1. The Hong Kong Laws Parallel Text was obtained during January
1999 from http://www.justice.gov.hk, the bilingual website of the Department of
Justice of the Hong Kong Special Administrative Region (HKSAR) of the People's
Republic of China. The retrieved files have been processed and sentence
aligned.
We wish to thank the Hong Kong Special Administrative Region of the People's
Republic of China for granting the LDC permission to distribute this data to
the research community.
STRUCTURE OF THE DATA
This corpus is organized in nineteen parallel file pairs for a total of
thirty-eight files. Each parallel file pair is named hklaws.nn.[ec], where nn
is the file sequence number (01 to 19) and the extension gives the language:
c = Cantonese, e = English. Each file holds up to 2,000 sequentially numbered
sentences, each tagged with a sentence index and a sequence number as
described below, for a total of 37,807 sentence indices across all nineteen
file pairs. The sentence numbering spans the file
pairs such that the initial sentence index (in files hklaws.01.e and
hklaws.01.c) is "1", and the last sentence index (in files hklaws.19.e and
hklaws.19.c) is "37807". The sentence numbering establishes the sentence
parallelism; two sentences having the same index and sequence number are
purported to be parallel in content.
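As a minimal sketch of the naming scheme above, the following Python lists the
nineteen file pairs. The function name is hypothetical, and the zero-padded
two-digit nn is an assumption based on the names hklaws.01.e and hklaws.19.c
given above.

```python
def corpus_file_pairs(n_pairs=19):
    """Return (english, cantonese) file names for each parallel pair."""
    pairs = []
    for nn in range(1, n_pairs + 1):
        # Assumption: nn is zero-padded to two digits, as in hklaws.01.e.
        stem = f"hklaws.{nn:02d}"
        pairs.append((stem + ".e", stem + ".c"))
    return pairs


pairs = corpus_file_pairs()
# Show the total file count and the first and last pairs.
print(len(pairs) * 2, pairs[0], pairs[-1])
```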
Each sentence index may contain one or more sequentially numbered sentences,
and the corresponding English and Cantonese files contain the corresponding
sets of sentences. Within each sentence index, sequence numbers start at "1".
Together, the sentence index and the sequence number are sufficient to
uniquely identify a pair of parallel sentences. There are 313,659 sentences in
the corpus.
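The pairing rule can be sketched as follows. This is an illustration only: it
assumes the sentences have already been read into (index, sequence, text)
tuples by some reader that is not shown here, and the function name and toy
data are hypothetical.

```python
def pair_sentences(english, cantonese):
    """Pair sentences that share the same (index, sequence) key.

    Each input is an iterable of (sentence_index, sequence_number, text).
    Keys present on only one side are skipped.
    """
    zh_by_key = {(idx, seq): text for idx, seq, text in cantonese}
    pairs = []
    for idx, seq, text in english:
        key = (idx, seq)
        if key in zh_by_key:
            pairs.append((key, text, zh_by_key[key]))
    return pairs


# Toy data for illustration only; not taken from the corpus.
en = [(1, 1, "First sentence."), (1, 2, "Second sentence."), (2, 1, "Third.")]
zh = [(1, 1, "第一句。"), (1, 2, "第二句。"), (2, 1, "第三句。")]
for key, e_text, c_text in pair_sentences(en, zh):
    print(key, e_text, c_text)
```

Keying on the (index, sequence) tuple rather than on the index alone matters
because one sentence index may hold several sentences.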
Each sentence is of the form:
...
...
...
...
where "#" represents a one- to five-digit sentence index or sequence number.
Automatic sentence alignment was done at the LDC. Additional information is
available at the LDC web site, http://www.ldc.upenn.edu/Catalog (follow "by
year", then "2000", then "LDC2000T47").
The example.c and example.e files contain corresponding sample Cantonese and
English law files from the corpus.
The Cantonese files are encoded in BIG5, extended with user-defined characters
specified by the HKSAR. See http://www.info.gov.hk/gccs/ for details.
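One possible way to read the Cantonese files in Python is sketched below. It
uses the standard "big5hkscs" codec, which covers BIG5 plus the Hong Kong
Supplementary Character Set; whether that codec maps every user-defined
character in these files is an assumption, so undecodable bytes are replaced
with U+FFFD rather than raising an error. The function name is hypothetical.

```python
import os
import tempfile


def read_cantonese_file(path, encoding="big5hkscs"):
    """Read a BIG5-encoded text file, replacing undecodable bytes with U+FFFD."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return handle.read()


# Round-trip demonstration on a temporary file (not corpus data).
sample = "香港法例"
with tempfile.NamedTemporaryFile(delete=False, suffix=".c") as tmp:
    tmp.write(sample.encode("big5"))
try:
    print(read_cantonese_file(tmp.name) == sample)
finally:
    os.remove(tmp.name)
```

Counting occurrences of U+FFFD in the returned text gives a rough check on
how many characters the chosen codec could not map.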
COPYING AND DISTRIBUTION
Permission has been granted to the Linguistic Data Consortium to make and
distribute copies of the laws, press releases and news of Hong Kong Special
Administrative Region provided this copyright notice and permission notice are
distributed with all copies.
USAGE
Permission has been given to reproduce the laws, press releases, and/or news
articles from the Hong Kong Special Administrative Region Government website
for research and educational purposes.
This permission is granted for the mentioned purposes only and prior permission
must be granted by "The Government of the Hong Kong Special Administrative
Region" if the materials are to be used for any other purposes.
The files, extracts from the files, and translations of the files must not be
sold as part of any commercial software package, nor can they be incorporated
in any printed document without the specific permission of the copyright
holders.
COPYRIGHT
Portions Copyright (C) 1999, The Government of the Hong Kong Special
Administrative Region (HKSAR)