----------------------------------------------------------------------
AMERICAN NATIONAL CORPUS SECOND RELEASE
09/07/05

Keith Suderman
Nancy Ide

CONTACT: anc@cs.vassar.edu
----------------------------------------------------------------------
Full documentation for the ANC Second Release is available at:
http://AmericanNationalCorpus.org/SecondRelease
----------------------------------------------------------------------
PLEASE NOTE: Periodic updates to the annotations, headers, and tools
in the ANC Second Release will be made freely available for download
from the ANC website (http://AmericanNationalCorpus.org/). Also,
additional annotations and derived data (e.g. frequency lists) will be
downloadable from the site as they become available. Because all
annotations and headers are linked to the original ANC data,
downloaded updates and additional annotations can be immediately used
with the Second Release data in this distribution.

To receive notices of newly-available and updated materials, please
join the ANC discussion list as described at
http://AmericanNationalCorpus.org/contact.html#discuss
----------------------------------------------------------------------
Table of Contents

1. Overall file structure
2. Structure and contents for data and annotation files
3. Installing the ANC Second Release
4. Using the ANC Second Release
   4.1. The ANC "merge" tool
   4.2. Using GATE and the ANC GATE plug-ins
----------------------------------------------------------------------

1. Overall File Structure
-------------------------

The ANC second release as distributed through the LDC contains a
single folder called "data". The data folder contains the complete ANC
Second Release in two sub-directories:

written : texts that were originally generated in written form. The
          data is included in sub-directories by genre as follows:

              fiction
              journal
              leisure
              letters
              newspaper
              non-fiction
              technical
              travel_guides

spoken  : transcriptions of spoken data.
          The spoken directory contains three sub-directories:

              academic_discourse
              face-to-face
              telephone

Sub-directories under each written and spoken genre directory identify
the contained data according to its source. Additional sub-directories
may exist below this level if more than one distinct body of data is
associated with the source.

Contents of the data and annotations are described below in (2),
Structure and contents for data and annotation files.


2. Structure and Contents for Data and Annotation Files
-------------------------------------------------------

At the lowest level of the directory hierarchy, individual texts and
their annotations are included in several files. All files except for
the primary data (.txt file) are encoded in UTF-8. Annotation files
provide linguistic information that is linked to the segments of the
primary data to which it applies.

File naming conventions are as follows:

[filename].txt         : the UTF-16 encoded data, as plain text with
                         no internal markup

[filename].anc         : the header for the text, including
                         information concerning provenance, domain,
                         and sub-domain

[filename]-logical.xml : logical markup for the document, including
                         structural information down to the level of
                         paragraph, as well as footnotes, titles, etc.

[filename]-s.xml       : sentence boundary markup for the document

[filename]-biber.xml   : automatically-generated tokenization plus
                         lemma and part-of-speech annotation using
                         Doug Biber's POS tagset

[filename]-hepple.xml  : automatically-generated tokenization plus
                         lemma and part-of-speech annotation using
                         the Penn POS tagset

[filename]-np.xml      : automatically-generated annotation
                         identifying noun phrases, produced by the
                         NP Chunker in GATE

[filename]-vp.xml      : automatically-generated annotation
                         identifying verb phrases, produced by the
                         VP Chunker in GATE

A full description of each file type and its contents is available at
http://AmericanNationalCorpus.org/SecondRelease/.
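
As an illustration of the naming conventions above, the following
Python sketch collects the header and annotation files that accompany
a given primary data file and reads the primary text. It is not part
of the ANC distribution: the function names companion_files and
read_primary_text are hypothetical, and the sketch assumes the .txt
file can be decoded by Python's "utf-16" codec (which uses a
byte-order mark if present and the platform's native byte order
otherwise).

```python
from pathlib import Path

# Annotation suffixes copied from the file naming conventions above.
ANNOTATION_SUFFIXES = [
    "-logical.xml",
    "-s.xml",
    "-biber.xml",
    "-hepple.xml",
    "-np.xml",
    "-vp.xml",
]


def companion_files(txt_path):
    """Map each suffix to its file, for the companion files that exist.

    Hypothetical helper: given [filename].txt, look in the same
    directory for [filename].anc and the [filename]-*.xml files.
    Missing files (e.g. a not-yet-released -biber.xml) are skipped.
    """
    txt_path = Path(txt_path)
    folder, stem = txt_path.parent, txt_path.stem
    found = {}
    for suffix in [".anc"] + ANNOTATION_SUFFIXES:
        candidate = folder / (stem + suffix)
        if candidate.exists():
            found[suffix] = candidate
    return found


def read_primary_text(txt_path):
    """Read the UTF-16 primary data without altering its characters.

    newline="" disables newline translation, so the returned string
    matches the stored text exactly.
    """
    with open(txt_path, encoding="utf-16", newline="") as handle:
        return handle.read()
```

For example, companion_files("some_text.txt") would return a
dictionary such as {".anc": ..., "-s.xml": ...} containing only the
files actually present next to some_text.txt (a placeholder name).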

Please note that some of the ANC Second Release data does not include
part-of-speech annotation with Doug Biber's POS tagset. The files
containing this annotation will be distributed via the ANC website as
soon as they are available.


3. Installing the ANC Second Release
------------------------------------

To install all or part of the ANC Second Release data, simply copy
the desired files to your hard disk.


4. Using the ANC Second Release
-------------------------------

The ANC is represented using "stand-off markup", whereby annotations
are contained in separate documents linked to the original data,
rather than being included in-line in the same file as the data
itself. The ANC Project has adopted the strictest form of this
approach: the primary data is regarded as "read-only", and the primary
data file includes no markup or annotations of any kind. Annotations
are contained in separate XML files that link specific annotations to
the appropriate portions of the primary data.

If annotation information is not required (e.g., for concordancing),
the plain text data files can be used "as is".

Many existing tools do not handle stand-off markup. The ANC Project
therefore provides two means to create a "merged" version of the ANC
and its annotations:

4.1. The simple way: The ANC "merge" tool (ANCTool)

The ANC merge tool is a freely-downloadable, platform-independent Java
application with a simple user interface that will produce a version
of the ANC data merged with annotations of the user's choosing. Full
instructions for downloading and using the ANCTool can be found at
http://AmericanNationalCorpus.org/tools/index.html

4.2. The GATE and ANC GATE Plug-ins

The GATE system[1] can merge stand-off annotations with primary data
and provides means to modify or add annotations to the data.
The ANC provides several plug-ins to the GATE system that enable
loading and merging the ANC data and annotations, and generating the
merged data in a variety of formats. While GATE use requires a measure
of computational expertise, once the ANC data is loaded into GATE the
user has access to the broad range of functions provided by GATE for
sophisticated annotation and analysis of corpus data.

Please see http://AmericannationalCorpus.org/GATE for more information
about using the ANC in the GATE system.

Please CONTACT anc@cs.vassar.edu with questions or problems.

[1] General Architecture for Text Engineering: http://gate.ac.uk

--------------------------------------------------------------------------
Copyright (c) 2005. American National Corpus Project. All rights reserved.