ACL Anthology Reference Corpus (ACL ARC)

README v20090801

- ABOUT

The ACL ARC is a corpus of Computational Linguistics papers.  It
covers most of the papers that appear on the ACL Anthology website up
to February 2007.

This is the second release of the ACL Corpus, which extends the
initial version (20080325) with additional sources of data for the
files in the original ACL ARC.  No new articles have been added to
this version (10,921 articles are represented, as in the initial
version).

- ANTHOLOGY STRUCTURE

The directory structure and naming convention in the ACL Anthology are
standardized and followed here.  An Anthology identifier (ID
henceforth) conforms to the regular expression:

 [A-Z]            [0-9]{2}      \-  [0-9]{4}   (spaces added for clarity)
 1 letter Series  2-digit Year      Offset

An ID fully specifies the location of resources related to the paper.
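
As an illustration only, the ID pattern above can be checked and split
into its three parts with a short Python sketch.  The function name
parse_anthology_id is hypothetical and is not part of the ACL ARC
distribution.

```python
import re

# Pattern from this README: 1-letter series, 2-digit year, hyphen,
# 4-digit offset (the clarifying spaces are removed).
ANTHOLOGY_ID = re.compile(r"([A-Z])([0-9]{2})-([0-9]{4})")


def parse_anthology_id(text):
    """Return (series, year, offset) as strings, or None if text is not an ID."""
    match = ANTHOLOGY_ID.fullmatch(text)
    if match is None:
        return None
    return match.groups()
```

The parts are kept as strings so that leading zeros in the year and
offset survive (for example "01" in W01-1018).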

- ACL ARC STRUCTURE

This version of the ACL ARC contains a Portable Document Format (PDF)
copy of each file in the Feb 2007 snapshot and a corresponding plain
text version of that file.  The ACL ARC also contains a more current,
Apr 2008 copy of the metadata used in the ACL Anthology as XML files.

Directory structure:
pdf/                    - PDFs of all files in the snapshot
   /anthology-PDF       - original PDFs from the Anthology
txt/                    - plain text directory
   /pdfbox-0.72         - pdfbox-converted versions of the PDF files,
                          in UTF-8 encoding where possible
   /omnipage/           - text output derived from running Omnipage 
                          version 15 or 16 on the articles in the ACL 
                          ARC.
            /xml        - text encoded in Omnipage's XML format, which
                          also encodes page position, font size and 
                          style.
            /normal     - text in reading order, saved as UTF-8.
            /formatted  - text saved in UTF-16 in an approximation of
                          the formatted page, with layout preserved.
                          Two-column text in this format appears in
                          two columns.
img/                    - image files that represent each printed page
                          of an article in the ACL ARC 
   /omnipage/           - images derived from the Omnipage 15 or 16
                          OCR software versions
            /png        - images from the Omnipage software saved in 
                          .png format.
metadata/               - metadata that describes each paper
        /anthology-XML  - XML files that hold the metadata.  Usually
                          collected from authors during the submission
                          process and used in the ACL Anthology to
                          construct the HTML browsing structure 
        /parscit-090316 -
interlink/              - metadata placeholder directory for citation
                          network data
intralink/              - metadata placeholder directory for citation
                          to reference string data (currently empty)
software/
        /parscit/

- ACL ARC STATISTICS

There are currently:

  10,921 PDF files in the pdf/anthology-PDF tree.
  10,921 txt files in the txt/pdfbox-0.72 tree.  These correspond
         exactly to the PDFs.
  13,551 files with metadata described in the metadata/anthology-XML
         tree.  The 10,921 files above are a proper subset of these.
         The remaining entries correspond to PDFs that have not yet
         been added to the ACL ARC since the February 2007 snapshot,
         and to front matter and author indices that are not proper
         scholarly articles.  Specific directories that have not
         been added yet include C98, D07, E97, N07, P07, W07 and
         any newer files from 08 and 09.
  84,542 pages in the PDF files.  pageList.txt gives the number of
         pages per file, per venue year, and per venue.
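
The file counts above can be checked against a local copy with a short
Python sketch.  The function name count_files is hypothetical, and the
.txt suffix for the pdfbox tree is an assumption; only the .pdf suffix
is confirmed by file names mentioned in this README.

```python
from pathlib import Path


def count_files(root, suffix):
    """Count regular files under root (recursively) whose names end with suffix."""
    return sum(1 for p in Path(root).rglob("*" + suffix) if p.is_file())
```

For example, count_files("pdf/anthology-PDF", ".pdf") would be expected
to return 10921 if the local tree matches the figures above.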

- EXCEPTIONS

With any large corpus come legacy problems, and the Anthology is no
exception.  The ACL ARC inherits some of these problems.

  - J79-xxxx: a few of these files are actually entire journal
    issues, with multiple articles.  The metadata for these papers
    lists all authors and titles in the title field.
  - Relation to the ACL Anthology Network: The work in the AAN uses a
    slightly different version of IDs, mostly corresponding to using
    the D series for EMNLP/WVLC (SIGDAT) conferences.  This is not
    used by the ACL Anthology, so we have retained the Anthology
    format.
  - A few files (e.g., W01-1018.pdf) were only in HTML format in the
    original Anthology.  They have been converted to PDF format for
    standardization.
  - A few files (e.g., W01-1310.pdf, which was an invited talk) do not
    have data other than the metadata available in the Anthology XML
    files.  We have created PDF and txt files that correspond to just
    this metadata.
  - Some documents which converted poorly have only the metadata and
    reference string data present (entered manually).

- PROCESSING EXCEPTIONS

  - Processing exceptions in incorporating the ACL Anthology Network
    (AAN) are listed in doc/annIncorporationErrors.txt.
  - Processing exceptions in Omnipage's extraction of text (formatted
    and in the normal reading order), PNG images and XML are
    documented in doc/omnipageExtractionErrors.txt.