ACL Anthology Reference Corpus (ACL ARC)
README v20080325

This is the initial release of the ACL ARC. Please read the EXCEPTIONS carefully if you are going to use this corpus for bibliographic studies.

- ABOUT

The ACL ARC is a corpus of Computational Linguistics papers. It covers most* of the papers that appear on the ACL Anthology website up to February 2007.

- ANTHOLOGY STRUCTURE

The directory structure and naming convention of the ACL Anthology are standardized and followed here. An Anthology identifier (ID henceforth) conforms to the regular expression:

  [A-Z]            [0-9]{2}      \-  [0-9]{4}    (spaces added for clarity)
  1 letter Series  2-digit Year      Offset

  e.g., W01-0501, C92-2065

The ID also fully specifies the location of the PDF file and of the XML file containing the metadata that describes the paper. For the first example these are:

  W/W01/W01-0501.pdf  <== the PDF file for W01-0501
  W/W01/W01.xml       <== all of the metadata for all papers published in workshops in 2001

While there are exceptions (see EXCEPTIONS below), the Anthology generally follows some rules about these three components.

  Series - There are generally three types of series: journals, conferences and workshops. Currently only Computational Linguistics (and its former names) is indexed as a journal series (as "J"). Workshops are Special Interest Group (SIG) related (marked as "W"). All other series are conferences, and make up the bulk of the Anthology.

  Year - 65 (1965) is the first year in the ARC, and the years currently go up to 07 (2007).

  Identifier - This portion (the 4-digit offset) depends on the series type. For the J journal series, each 1000 represents an issue of the journal. For the conference series, each 1000 represents a different conference or a different volume in the same conference. Typically, for both conferences and journals, the numbering starts at 1000 and not 0000. For the W workshop series, each 100 represents a workshop, and the numbering typically starts at 100 and not 000.
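The ID rules above can be sketched in a few lines of Python. This is only an illustration, not part of the corpus tooling: the names `parse_anthology_id` and `ParsedId` are hypothetical, the paths are relative to whichever tree holds the Anthology layout, and the split into volume and paper number follows the "each 1000" / "each 100" rule of thumb, so it will not hold for the files listed under EXCEPTIONS. The four-digit year conversion assumes the 1965 to 2007 range stated above.

```python
import re
from typing import NamedTuple

# Pattern from the README: 1-letter series, 2-digit year, hyphen, 4-digit offset.
ANTHOLOGY_ID = re.compile(r"([A-Z])([0-9]{2})-([0-9]{4})")


class ParsedId(NamedTuple):
    series: str      # e.g. "W"
    year: int        # four-digit year, e.g. 2001
    volume: int      # issue / conference volume / workshop number
    paper: int       # position within that volume
    pdf_path: str    # e.g. "W/W01/W01-0501.pdf"
    xml_path: str    # e.g. "W/W01/W01.xml"


def parse_anthology_id(anthology_id: str) -> ParsedId:
    """Split an Anthology ID and derive the file locations it implies."""
    match = ANTHOLOGY_ID.fullmatch(anthology_id)
    if match is None:
        raise ValueError(f"not an Anthology ID: {anthology_id!r}")
    series, yy, offset = match.groups()

    # Assumption: two-digit years 65..99 are 19xx, everything else is 20xx.
    year = (1900 if int(yy) >= 65 else 2000) + int(yy)

    # Workshops ("W") are numbered in blocks of 100, other series in blocks of 1000.
    block = 100 if series == "W" else 1000
    volume, paper = divmod(int(offset), block)

    stem = series + yy
    return ParsedId(
        series=series,
        year=year,
        volume=volume,
        paper=paper,
        pdf_path=f"{series}/{stem}/{anthology_id}.pdf",
        xml_path=f"{series}/{stem}/{stem}.xml",
    )


if __name__ == "__main__":
    for example in ("W01-0501", "C92-2065"):
        print(parse_anthology_id(example))
```

For W01-0501 this yields workshop 5, paper 1, with the PDF at W/W01/W01-0501.pdf and the metadata in W/W01/W01.xml, matching the example in the text.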

- ACL ARC STRUCTURE

This version of the ACL ARC contains the Portable Document Format (PDF) version of each file in the Feb 2007 snapshot and a corresponding plain text version of that file. The ACL ARC also contains a more current, Apr 2008 copy of the metadata used in the ACL Anthology, as XML files.

Directory structure:

  pdf/               - PDFs of all files in the snapshot
    /anthology-PDF   - original PDFs from the Anthology
  txt/               - plain text directory
    /pdfbox-0.72/    - pdfbox converted versions of the pdf files, in UTF-8 encoding where possible
  metadata/          - metadata that describes the papers
    /anthology-XML   - XML files that hold the metadata. Usually collected from authors during the submission process and used in the ACL Anthology to construct the HTML browsing structure
  interlink/         - metadata placeholder directory for citation network data (currently empty)
  intralink/         - metadata placeholder directory for citation to reference string data (currently empty)

As shown above, there are five main directories that each hold a specific type of data, with subdirectories for different versions of that data. As the ACL ARC is only in its first release, most of these structures are trivial or empty, but they have been laid out this way to cater for future expansion. For example, because conversion from PDF to plain text is known to be difficult, we anticipate additional versions of the plain text extractions, hence the extra directory level within txt. Automatic methods may also generate metadata in future versions of the ACL ARC, hence the extra directory level within metadata.

- ACL ARC STATISTICS

There are currently:

  10,921 PDF files in the pdf/anthology-PDF tree.
  10,921 txt files in the txt/pdfbox-0.72 tree. These correspond exactly to the PDFs.
  13,551 files with metadata described in the metadata/anthology-XML tree. The 10,921 files above are a proper subset of this.
  The additional files correspond to PDFs that have not yet been added to the ACL ARC since the February 2007 snapshot, and to front matter and author indices, which are not proper scholarly articles. Specific directories that have not been added yet include C98, D07, E97, N07, P07 and W07.

- EXCEPTIONS

With any large corpus come legacy problems, and the Anthology is no exception. The ACL ARC inherits some of these problems.

  - J79-xxxx: a few of these files are actually entire journal issues, with multiple articles. The metadata for these papers includes all authors and titles in the title field.

  - Relation to the ACL Anthology Network: The work in the AAN uses a slightly different version of IDs, mostly corresponding to using the D series for EMNLP/WVLC (SIGDAT) conferences. This is not used by the ACL Anthology, so we have retained the Anthology format.

  - A few files (e.g., W01-1018.pdf) were only in HTML format in the original Anthology. They have been converted to PDF format for standardization.

  - A few files (e.g., W01-1310.pdf, which was an invited talk) do not have data other than the metadata available in the Anthology XML files. We have created PDF and txt files that correspond to just this metadata.

  - Some documents that converted poorly have only the metadata and reference string data present (entered manually).

- FUTURE

An update that includes citation network information (interlink) and citation to reference string information (intralink) is underway.