ACL Anthology Reference Corpus (ACL ARC)
README v20090801

- ABOUT

The ACL ARC is a corpus of Computational Linguistics papers. It covers most* of the papers that appear on the ACL Anthology website up to February 2007.

This is the second release of the ACL ARC. It extends the initial version (20080325) with additional sources of data for the files in the original ACL ARC. No new articles have been added in this version (10,921 articles are represented, as in the initial version).

- ANTHOLOGY STRUCTURE

The directory structure and naming convention of the ACL Anthology are standardized and followed here. An Anthology identifier (ID henceforth) conforms to the regular expression:

    [A-Z]            [0-9]{2}      \-  [0-9]{4}    (spaces added for clarity)
    1 letter Series  2-digit Year      Offset

An ID fully specifies the location of resources related to the paper.

- ACL ARC STRUCTURE

This version of the ACL ARC contains the Portable Document Format (PDF) version of each file in the Feb 2007 snapshot and a corresponding plain text version of that file. The ACL ARC also contains a more current (Apr 2008) copy of the metadata used in the ACL Anthology, as XML files.

Directory structure:

pdf/ - PDFs of all files in the snapshot
    /anthology-PDF - original PDFs from the Anthology
txt/ - plain text directory
    /pdfbox-0.72 - pdfbox-converted versions of the PDF files, in UTF-8 encoding where possible
    /omnipage/ - text output derived from running Omnipage version 15 or 16 on the articles in the ACL ARC
        /xml - text encoded in Omnipage's XML format, which also encodes page position, font size and style
        /normal - text in reading order, saved as UTF-8
        /formatted - text saved in UTF-16 in an approximation of the formatted page, with layout preserved. Two-column text in this format will appear as two-column text.
img/ - image files that represent each printed page of an article in the ACL ARC
    /omnipage/ - images derived from the Omnipage 15 or 16 OCR software versions
        /png - images from the Omnipage software saved in .png format
metadata/ - metadata that describes the papers
    /anthology-XML - XML files that hold the metadata, usually collected from authors during the submission process and used in the ACL Anthology to construct the HTML browsing structure
    /parscit-090316 -
interlink/ - metadata placeholder directory for citation network data
intralink/ - metadata placeholder directory for citation to reference string data (currently empty)
software/
    /parscit/

- ACL ARC STATISTICS

There are currently:

10,921 PDF files in the pdf/anthology-PDF tree.

10,921 txt files in the txt/pdfbox-0.72 tree. These correspond exactly to the PDFs.

13,551 files with metadata described in the metadata/anthology-XML tree. The 10,921 files above are a proper subset of these. The additional files correspond to PDFs that have yet to be added to the ACL ARC since the February 2007 snapshot, and to front matter and author indices that are not proper scholarly articles. Specific directories that have not been added yet include C98, D07, E97, N07, P07, W07 and any newer files from 08 and 09.

84,542 pages in the PDF files. pageList.txt gives the number of pages per file, per venue year, and per venue.

- EXCEPTIONS

With any large corpus come legacy problems, and the Anthology is no exception. The ACL ARC inherits some of these problems.

- J79-xxxx: a few of these files are actually entire journal issues, with multiple articles. The metadata for these papers includes all authors and titles in the title field.
- Relation to the ACL Anthology Network (AAN): The work in the AAN uses a slightly different version of IDs, mostly corresponding to using the D series for EMNLP/WVLC (SIGDAT) conferences. This is not used by the ACL Anthology, so we have retained the Anthology format.
- A few files (e.g., W01-1018.pdf) were only in HTML format in the original Anthology. They have been converted to PDF format for standardization.
- A few files (e.g., W01-1310.pdf, which was an invited talk) do not have data other than the metadata available in the Anthology XML files. We have created PDF and txt files that correspond to just this metadata.
- Some documents that converted poorly have only the metadata and reference string data present (entered manually).

- PROCESSING EXCEPTIONS

- Processing exceptions in incorporating the ACL Anthology Network (AAN) are listed in doc/annIncorporationErrors.txt
- Processing exceptions in extracting text (formatted and in normal reading order), PNG images and XML with Omnipage are documented in doc/omnipageExtractionErrors.txt
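
- EXAMPLE: WORKING WITH ANTHOLOGY IDS

The ID convention described under ANTHOLOGY STRUCTURE can be checked mechanically. The following is a minimal sketch in Python, not part of the ACL ARC distribution. The function names parse_id and guess_path are hypothetical, and the <tree>/<series>/<series><year>/<ID>.<ext> layout used by guess_path is an assumption that should be verified against the actual directory contents before use.

```python
import re
from pathlib import PurePosixPath

# Pattern from the ANTHOLOGY STRUCTURE section, with the clarifying spaces removed.
ID_PATTERN = re.compile(r"([A-Z])([0-9]{2})-([0-9]{4})")


def parse_id(anthology_id):
    """Split an Anthology ID into (series, year, offset) strings."""
    match = ID_PATTERN.fullmatch(anthology_id)
    if match is None:
        raise ValueError(f"not an Anthology ID: {anthology_id!r}")
    return match.groups()


def guess_path(anthology_id, tree, extension):
    """Build a path for an ID under a tree such as pdf/anthology-PDF.

    ASSUMPTION: files are stored as <tree>/<series>/<series><year>/<ID>.<ext>.
    This README does not spell out the layout, so treat the result as a guess.
    """
    series, year, _offset = parse_id(anthology_id)
    return PurePosixPath(tree, series, series + year, f"{anthology_id}.{extension}")


print(parse_id("W01-1018"))
print(guess_path("W01-1018", "pdf/anthology-PDF", "pdf"))
```

Placeholders such as J79-xxxx are not valid IDs under this pattern, so parse_id raises ValueError for them.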