Czech Academic Corpus 2.0


Item Name: Czech Academic Corpus 2.0
Authors: Barbora Vidová Hladká, Jan Hajič, Jiří Hana, Jaroslava Hlaváčová, Jiří Mírovský, Jan Raab
LDC Catalog No.: LDC2008T22
ISBN: 1-58563-491-3
Release Date: Oct 17, 2008
Data Type: text
Data Source(s): broadcast news, news magazine, newswire
Application(s): cross-lingual information retrieval, information extraction, information retrieval, linguistic analysis, machine learning, machine translation, metadata extraction, natural language processing, topic detection and tracking
Language(s): Czech
Language ID(s): ces
Distribution: 1 CD
Member fee: $0 for 2008 members
Non-member Fee: US $300.00
Reduced-License Fee: US $150.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Barbora Vidová Hladká, et al. 2008. Czech Academic Corpus 2.0. Linguistic Data Consortium, Philadelphia.

Introduction

The Prague family of annotated corpora has a new member, the Czech Academic Corpus version 2.0 (CAC 2.0). CAC 2.0 consists of 650,000 words from various 1970s and 1980s newspapers, magazines and radio and television broadcast transcripts manually annotated for morphology and syntax.

The CAC 2.0 offers:

  • For linguists: language material reflecting the real usage of the language.
  • For computational linguists: tools and a considerable amount of data for natural language applications that are not feasible without morphological and syntactical text processing.
  • For TrEd annotation tool users: the possibility to use voice control for the tool.
  • For teachers and their students: an interesting didactic tool for practising Czech language morphology and syntax.

The CAC was created by a team from the Institute of the Czech Language, the Academy of Sciences of the Czech Republic, led by Marie Těšitelová, during the period from 1971 to 1985. The original purpose of the corpus was to build a frequency dictionary of the Czech language. Researchers were aware, however, that in order to make the CAC useful for future users, whether linguists or natural language processing systems developers, it was necessary to design annotation schemes and to develop tools that would add as much linguistic information as possible to the data. In 1996, the Prague Dependency Treebank (PDT), which provided morphological and syntactic analytic layers of annotation to certain Czech media data, was launched independently of the CAC. During the work on the PDT's second version, its researchers decided to transfer PDT's internal format and annotation scheme to the CAC with the goals of making the CAC and the PDT fully compatible and of integrating the CAC into the PDT. To that end, the CAC was manually annotated for morphology and syntax. CAC 2.0 adds the surface syntax annotation; in the terminology of the PDT, this annotation is called an analytical layer.

The following PDT resources are available from LDC: Prague Dependency Treebank 1.0 (LDC2001T10), Prague Dependency Treebank 2.0 (LDC2006T01), Prague Arabic Dependency Treebank 1.0 (LDC2004T23) and Prague Czech-English Dependency Treebank 1.0.

Annotation Description and Examples

The morphological layer of annotation attaches to each word token data that characterize its morphological properties: the lemma (the canonical form of a lexeme), the part of speech, and the morphological categories (case, number, tense, person, etc.). Formally, the part-of-speech class combines with the values of the morphological categories to form a morphological tag (or, simply, a tag). In CAC 2.0, tags follow the PDT design: they are strings of fixed length (15 positions) in which each position corresponds to a single category.

Example: The word form Prahu (a form of "Prague") is analysed as an affirmative (11th position) noun (1st and 2nd positions), feminine (3rd position), singular (4th position), and accusative (5th position). All other positions are filled with the symbol "-", which indicates that the morphological category is irrelevant for the given part of speech. For example, person and tense (8th and 9th positions) are not determined for nouns.

Examples of lemmas and tags of particular word forms

Word token   Lemma   Tag               Description
Prahu        Praha   NNFS4-----A----   Noun, feminine, singular, accusative, affirmative
123          123     C=-------------   Digit token
)            )       Z:-------------   Punctuation mark (right parenthesis)
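The positional design described above can be sketched in a few lines of Python. This is only an illustration, not a tool shipped with the corpus. The text confirms the positions for part of speech (1 and 2), gender (3), number (4), case (5), person (8), tense (9) and negation (11). The remaining position names are assumed from the PDT positional tagset, and the function name decode_tag is hypothetical.

```python
# Names of the 15 tag positions. Positions 1-5, 8, 9 and 11 are named in the
# text; the other names are ASSUMED from the PDT positional tagset.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Return only the categories that are relevant for this tag.

    A "-" at a position means the category does not apply to the
    given part of speech, so that position is left out of the result.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


if __name__ == "__main__":
    # The tag of the word form "Prahu" from the table above.
    print(decode_tag("NNFS4-----A----"))
```

For the tag of Prahu, the result keeps six categories (the two part-of-speech positions, gender, number, case and negation) and omits the nine positions filled with "-".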

The a-layer annotation assigns each word unit data characterising its syntactic features: its relation to the other sentence elements and its function in the sentence. Formally, the sentence relations are represented by a dependency tree.

Example: Syntactic annotation of the sentence Obecná odpověď na tuto otázku je sotva možná. (Lit.: A general response to this question is hardly possible.) Each word unit (word, number, punctuation mark) is represented by a single node in the resulting tree. Note that for technical reasons each tree is rooted in one extra node, so the tree in our example consists of 10 nodes: nine word units plus the technical root. The annotation approach builds on the tradition of the Prague linguistic school, where the predicate (usually a verb) is understood to be the centre of the sentence. The predicate is therefore placed as a direct daughter of the root. The final punctuation is also placed as a daughter of the root node. Two constituents of the sentence depend on the predicate: odpověď (answer) and možná (possible). Each node in the tree is annotated with the word form, lemma, morphological tag and analytic function. Looking at the node representing the word odpověď (answer), we can see that its form is a feminine noun in the nominative singular and that this unit is the subject of the sentence, which is expressed by the analytic function Subj.

Example of an a-layer annotation
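As a rough illustration of the tree described above, the following Python sketch stores each node with a pointer to its governing node. Only four attachments come from the text: the predicate je and the final punctuation hang on the technical root, and odpověď and možná hang on the predicate. The attachments of the remaining words, the Node class, and the "#ROOT" label are assumptions made for this sketch and may differ from the actual corpus annotation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    order: int      # position in the sentence; 0 is the technical root
    form: str       # word form
    parent: int     # order of the governing node; -1 for the technical root
    afun: str = ""  # analytic function, filled in only where the text names it


nodes = [
    Node(0, "#ROOT", -1),           # technical root (label is hypothetical)
    Node(1, "Obecná", 2),           # ASSUMED attachment
    Node(2, "odpověď", 6, "Subj"),  # from the text: depends on the predicate
    Node(3, "na", 2),               # ASSUMED attachment
    Node(4, "tuto", 5),             # ASSUMED attachment
    Node(5, "otázku", 3),           # ASSUMED attachment
    Node(6, "je", 0),               # from the text: predicate under the root
    Node(7, "sotva", 8),            # ASSUMED attachment
    Node(8, "možná", 6),            # from the text: depends on the predicate
    Node(9, ".", 0),                # from the text: punctuation under the root
]


def children(tree, order):
    """Return the word forms of all nodes directly governed by `order`."""
    return [node.form for node in tree if node.parent == order]


if __name__ == "__main__":
    print(children(nodes, 0))  # daughters of the technical root
    print(children(nodes, 6))  # dependents of the predicate
```

Storing a parent pointer per node keeps the representation flat, which matches the idea that every word unit corresponds to exactly one node.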

The main internal format of the CAC 2.0 treats the annotation layers separately: each layer of annotation of a document is stored in its own file. (In the CSTS format, by contrast, all layers of annotation are contained in one file.) In the CAC 2.0 there are therefore three files for every document: one for the w-layer, one for the m-layer and one for the a-layer. Keeping the layers in separate files does not prevent the units of the different layers from being interconnected. On the contrary, they are linked, as demonstrated later in this section.

The word layer does not reflect the segmentation of the text into sentences; this segmentation occurs on the m-layer. This means that, unlike the w-layer, the m-layer contains final punctuation. Additionally, the number of word tokens in the two layers may differ. The differences originate from joining an incorrectly split word into one word or, conversely, from dividing incorrectly connected words into several units. The m-layer should contain the correctly written text.

Example: The three following figures illustrate the interconnection of the w-layer and the m-layer. The correspondence between the word units in the two files is denoted by arrows. All three examples were deliberately taken from the CAC 2.0 so that the user can view the instances directly; the name of the document and the number of the sentence are provided for every sentence. Figure 2.2 illustrates a 1:1 correspondence between the layers: they do not differ except for the final punctuation. Figure 2.3 exemplifies the situation where a word token is inserted into the text because the year information was clearly missing. Since it is almost impossible for the corrector to supply the missing year, the symbol "#" is used; this symbol has no counterpart on the w-layer. In contrast, Figure 2.4 illustrates the situation where several m-layer units correspond to the same w-layer unit: the word unit pedagogicko-psychologické (E: psychological-pedagogical) has been divided into three separate units.

Figure 2.2. Technical interconnection of the w-layer and m-layer: No changes other than the final-sentence punctuation

Figure 2.3. Technical interconnection of the w-layer and m-layer: The insertion of a word token

Figure 2.4. Technical interconnection of the w-layer and m-layer: The division of a word token
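The kinds of correspondence between the two layers (1:1, insertion, division and joining) can be sketched in Python as follows. The representation is an assumption made for illustration and is not the corpus file format: each m-layer unit is a dictionary listing the identifiers of the w-layer units it was derived from. All identifiers are hypothetical, and so are the forms of the three units that pedagogicko-psychologické is split into.

```python
from collections import Counter


def classify(m_units):
    """Label each m-layer unit by how it relates to the w-layer.

    "inserted": no w-layer counterpart (e.g. the "#" placeholder)
    "joined":   built from several incorrectly split w-layer units
    "divided":  shares its single w-layer unit with other m-layer units
    "1:1":      exactly one w-layer unit, not shared with anything else
    """
    # Count how many m-layer units point at each w-layer unit.
    shared = Counter(ref for unit in m_units for ref in unit["w_refs"])
    labels = {}
    for unit in m_units:
        refs = unit["w_refs"]
        if not refs:
            labels[unit["id"]] = "inserted"
        elif len(refs) > 1:
            labels[unit["id"]] = "joined"
        elif shared[refs[0]] > 1:
            labels[unit["id"]] = "divided"
        else:
            labels[unit["id"]] = "1:1"
    return labels


# Hypothetical units; the ids and the exact split are assumptions.
m_layer = [
    {"id": "m1", "form": "pedagogicko", "w_refs": ["w1"]},
    {"id": "m2", "form": "-", "w_refs": ["w1"]},
    {"id": "m3", "form": "psychologické", "w_refs": ["w1"]},
    {"id": "m4", "form": "#", "w_refs": []},
    {"id": "m5", "form": "otázku", "w_refs": ["w2"]},
    {"id": "m6", "form": "odpověď", "w_refs": ["w3", "w4"]},
]

if __name__ == "__main__":
    for unit_id, label in classify(m_layer).items():
        print(unit_id, label)
```

Keeping the references on the m-layer side means an inserted unit is simply one with an empty reference list, so the w-layer data never needs to be modified.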

The interconnection between the a-layer and m-layer means that each m-layer word unit corresponds exactly to one node of the dependency tree on the a-layer, and vice versa. The only exception is the technical root, which has no counterpart on the m-layer.

Corpus Tools

CAC 2.0 contains the following tools:

  • Bonito: a corpus manager that searches CAC 2.0 texts.
  • LAW: a morphological annotations editor.
  • TrEd: a syntactical annotations editor.
  • Netgraph: a corpus viewer.
  • tool_chain: automatically processes Czech texts.

Content Copyright

Portions © 2004-2008 Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, © 1971-1985 The Institute of the Czech Language, Academy of Sciences of the Czech Republic, © 2008 Trustees of the University of Pennsylvania