Japanese Business News Text Supplement

Item Name: Japanese Business News Text Supplement
Authors: Masato Kobayashi and Kevin Walker
LDC Catalog No.: LDC99T34
ISBN: 1-58563-143-4
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): information retrieval, language modeling
Language(s): Japanese
Language ID(s): jpn
Distribution: 1 CD
Member fee: $0 for 1999 members
Non-member Fee: N/A (Members Only)
Reduced-License Fee: N/A
Extra-Copy Fee: US $150.00
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Masato Kobayashi and Kevin Walker
Japanese Business News Text Supplement
Linguistic Data Consortium, Philadelphia

This corpus consists of newswire text from Nihon Keizai Shimbun, Inc. (NIKKEI), the largest Japanese daily financial newspaper, and Telerate, Inc. (formerly known as Dow Jones/Kyodo News Service), published primarily for managers of Japanese-owned corporations or Japanese employees working in North American financial institutions.

The Telerate portion constitutes all newswire text collected by the LDC between December 1994 and September 1998. The Telerate data collected from June 1995 to September 1998 serves as a supplement to the original publication.

All NIKKEI data was collected from December 1993 to November 1994 and is also available on the 1995 release of the Japanese Business News Text.

The data, including SGML tags, breaks down as follows.

           # of Files   Daily Average Size   Total Size
--------------------------------------------------------
NIKKEI            364                 514K        188MB
Telerate         1060                 336K        357MB

The NIKKEI text was received on nine-track magnetic tape. The original character encoding was EBCDIC; the text was converted to EUC, the encoding that the LDC uses for its Japanese publications.

The Telerate text was received via a digital transmission service installed at the LDC by Telerate. Custom software was written by the LDC to poll a central database and download articles individually. The character encoding is EUC.
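Because both portions are distributed in EUC, a reader working in a Unicode environment may want to transcode the files first. The following is a minimal sketch in Python, assuming that "EUC" here means EUC-JP (Python's `euc_jp` codec); the function name `euc_to_utf8` and the sample string are illustrative and not part of the corpus documentation.

```python
def euc_to_utf8(raw: bytes) -> bytes:
    """Decode EUC-JP bytes and re-encode them as UTF-8.

    Assumption: the corpus files use EUC-JP. Undecodable bytes are
    replaced with U+FFFD rather than raising an error.
    """
    text = raw.decode("euc_jp", errors="replace")
    return text.encode("utf-8")


# Illustrative sample only; not taken from the corpus.
sample = "日本経済新聞".encode("euc_jp")
converted = euc_to_utf8(sample)
```

Using `errors="replace"` keeps a long batch conversion running when a file contains a damaged byte sequence; switching to `errors="strict"` would surface such problems instead.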

LDC added SGML tags automatically in order to identify individual stories within the daily collections.
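Since the SGML tags mark story boundaries, a daily file can be split into individual stories with a simple pattern match. The sketch below is hedged: this description does not name the tags, so `<DOC>` and `</DOC>` are hypothetical placeholders that should be replaced with the tags documented on the CD, and `split_stories` is an illustrative helper.

```python
import re
from typing import List

# Hypothetical tag names: the actual SGML tags are not given in this
# description, so <DOC>...</DOC> is only a placeholder.
STORY_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def split_stories(daily_text: str) -> List[str]:
    """Return the text of each story found in one daily collection."""
    return [body.strip() for body in STORY_RE.findall(daily_text)]


# Illustrative input only; not taken from the corpus.
sample = "<DOC>\n記事その一\n</DOC>\n<DOC>\n記事その二\n</DOC>\n"
stories = split_stories(sample)
```

The input is expected to be already decoded text (a Python `str`), so the file would be read with the appropriate EUC codec before being passed in.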


The Reduced Licensing Fee for this corpus is US$150.