1. Publication title
   CALLHOME Mandarin Chinese Transcripts - XML Version

2. Authors
   Tony McEnery, Richard Xiao

3. Data type
   Text

4. Data source
   CALLHOME Mandarin Chinese Transcripts (LDC96T16)

5. Project
   Contrastive English and Chinese

6. Applications
   Because this annotated XML version retains all of the information encoded in the original release, it is suitable for every application of LDC96T16, and additionally for grammatical study of spoken Mandarin.

7. Language
   cmn - Chinese, Mandarin

8. Licensing conditions
   Same as CALLHOME Mandarin Chinese Transcripts (LDC96T16)

9. Funding agency and grant number
   The UK Economic and Social Research Council, RES-000-23-0553

10. Copyright
   Linguistic Data Consortium (LDC) 2008

11. Description of the corpus structure and data attributes
   Data type: text
   File formats: XML
   Character encoding: UTF-8
   Number of unique words: 6,895
   Total number of words: 300,767
   Number of data files: 120
   Size of data (uncompressed): 8.62 MB
   Contents of folders:
      Data: 120 data files in XML format
      Doc: user manual, part-of-speech tagset
      DTD: CALLHOME Mandarin DTD file

12. Quality control
   All 120 XML files have been checked for well-formedness and validated using XMLSpy v.2008. The DTD file was generated automatically by XMLSpy v.2008.

   This XML edition of the CALLHOME Mandarin Chinese Transcripts corpus is recommended for use with Xaira (XML Aware Indexing and Retrieval Architecture), an open source package released by Oxford University Computing Services and available at http://www.oucs.ox.ac.uk/rts/xaira/.

Tony McEnery
Richard Xiao
05 May 2008
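
Note on re-checking well-formedness: users who do not have XMLSpy can repeat the well-formedness part of the quality control step with a short script. The following is only a minimal sketch using the Python standard library, not the procedure used by the authors. The function name find_malformed and the assumption that the data files carry an .xml extension are illustrative. The standard-library parser checks well-formedness only and does not validate against the DTD.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed(data_dir):
    """Return (file name, error message) pairs for XML files that fail to parse.

    Hypothetical helper: assumes the data files end in ".xml".
    Checks well-formedness only; no DTD validation is performed.
    """
    problems = []
    for path in sorted(Path(data_dir).glob("*.xml")):
        try:
            # ET.parse raises ParseError when the file is not well-formed.
            ET.parse(str(path))
        except ET.ParseError as err:
            problems.append((path.name, str(err)))
    return problems
```

Calling find_malformed("Data") from the top-level directory of the release should return an empty list if every file in the Data folder parses cleanly.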