Annotated Gigaword API and Command Line Tools v1.0 - July 21, 2012 ------------------------------------------------------------------ This release includes a Java API and command line tools for reading the Annotated Gigaword dataset XML files. ------------------- Project Hosting : ------------------- For the latest version, go to: http://code.google.com/p/agiga ------------------- Command Line Tools: ------------------- The command line tools provide a convenient way to print human readable versions of the XML annotations. The entry point is edu.jhu.agiga.AgigaPrinter and it has the following usage. usage: java edu.jhu.agiga.AgigaPrinter where is one of: words (Words only, one sentence per line) lemmas (Lemmas only, one sentence per line) pos (Part-of-speech tags) ner (Named entity types) basic-deps (Basic dependency parses in CONNL-X format) col-deps (Collapsed dependency parses in CONNL-X format) col-ccproc-deps (Collapsed and propagated dependency parses in CONNL-X format) phrase-structure (Phrase structure parses) coref (Coreference resolution as SGML similar to MUC) stanford-deps (toString() methods of Stanford dependency parse annotations) stanford-phrase-structure (toString() method of Stanford phrase structure parses) for-testing-only (**For use in testing this API only**) and where is an .xml.gz file from Annotated Gigaword For example, to print part-of-speech tags for the file nyt_eng_199911.xml.gz, we could run: java -cp build/agiga-1.0.jar:lib/* edu.jhu.agiga.AgigaPrinter pos annotated_gigaword/nyt_eng_199911.xml.gz ------------------- Java API : ------------------- The Java API provides streaming access to the documents in the .xml.gz files. Two iterators are provided: StreamingDocumentReader and StreamingSentenceReader. Both of these take as input the path to an Annotated Gigaword file and an AgigaPrefs object. By default, the AgigaPrefs constructor will ensure that every annotation in the XML is read in and that the resulting objects are fully populated. However, by turning off certain options, it's possible to skip the reading and creation of objects corresponding to unused annotations. StreamingDocumentReader is an iterator over AgigaDocument objects. The AgigaDocument class gives access to the coreference resolution (via AgigaCoref objects) annotations and the sentences (via AgigaSentence objects). StreamingSentenceReader is an iterator over AgigaSentence objects. This bypasses the document level annotations such as coref and the document ids and provides direct access to the sentence annotations only. AgigaPrinter provides examples of how to use these iterators and set the AgigaPrefs object so that only the necessary annotations are read. Examples of how to use the Agiga objects can also be found in the AgigaDocument.write* and AgigaSentence.write* methods. ---------------------- One- vs. Zero-Indexing: ---------------------- In the XML, the sentences and tokens are given Ids that are one-indexed. However, we find it to be more convenient to work with zero-indexed **indices** in the Java API. Accordingly, the Java API does not provide access to these original Ids but instead provides access to indices. These indices are accessed via methods named get*Idx(), such as AgigaSentence.getIdx() and AgigaMention.getSentenceIdx() -- or AgigaToken.getIdx() and AgigaDependency.getGovIdx(). These indices also correspond to the ordered elements in the Lists used throughout the API. Of course, the original Ids from the XML can be recovered by adding one to the indices in the API. However, we didn't want to confuse the issue by providing API calls for both. ------------------- Building : ------------------- A build.xml is provided for building with Apache Ant. Example commands are below and should be run from the top level directory that contains the build.xml. # To compile: ant # To clean and compile ant clean compile # To build jars of classes and sources: ant jar