Main Documentation File for the JURIS Corpus
============================================

0. Introduction

This file describes the contents and organization of the JURIS text
corpus, with regard to the following points:

  - Organization of the corpus
  - File naming conventions
  - SGML formatting
  - Categories of documents in JURIS
  - Additional tables describing JURIS content


1. Organization of the corpus

The data contained on this two-CD-ROM set represents a release of the
JURIS (Justice Department Retrieval and Inquiry System) data collection
that has been made available to the Linguistic Data Consortium (LDC) by
the U.S. Department of Justice.

Each CD-ROM contains data files in SGML format that have been compressed
with the GNU Zip (gzip) utility, the Free Software Foundation's
compression tool, which is available for most computers and operating
systems. You can get gzip from MIT's GNU repository:

    ftp://prep.ai.mit.edu/pub/gnu/

and from many mirror sites. In the Microsoft Windows environment, the
WinZip utility from Nico Mak Computing Inc. can be used to uncompress
the files. You can download WinZip from the Web site:

    http://www.winzip.com

The total uncompressed size of the data is 3206.6 megabytes (MB).

Each CD-ROM contains a "doc" directory (containing documentation and
tables) and a "juris" directory (containing the compressed text files).
There are 1664 individual text files in the corpus: 1011 on the first
CD-ROM and 653 on the second.


2. File naming conventions

All text files are named according to the following pattern:

    jNNNN_II.gz

where:

    "j"   is a constant initial character
    NNNN  is a four-digit "file-set" number
    II    is a two-digit sequence number within the file-set
    "gz"  is a constant file name extension (indicating that the file
          has been compressed with "gzip")

The organization of data into "file-sets" was drawn directly from the
original form of the archive as provided to the LDC by the Department of
Justice.
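As an illustration only (it is not part of the corpus distribution), the
naming pattern above can be checked and split into its parts with a short
Python sketch. The function and class names are hypothetical, and the
file name used in the usage note is a made-up example, not necessarily a
file on the CD-ROMs.

```python
import re
from typing import NamedTuple, Optional

# Pattern from the description above: "j", four digits, "_", two digits, ".gz".
_NAME_RE = re.compile(r"j(\d{4})_(\d{2})\.gz")


class JurisName(NamedTuple):
    """Parts of a JURIS text file name (hypothetical helper type)."""
    file_set: str  # four-digit file-set number, kept as text to preserve leading zeros
    sequence: int  # two-digit sequence number within the file-set


def parse_juris_name(name: str) -> Optional[JurisName]:
    """Return the parts of a file name, or None if it does not fit the pattern."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        return None
    return JurisName(file_set=match.group(1), sequence=int(match.group(2)))
```

For example, parse_juris_name("j0012_03.gz") yields file-set "0012" and
sequence number 3, while a name that does not fit the pattern yields
None. Sorting the parsed names restores the original order of the pieces
within each file-set.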
The motivation for this partitioning of the data has not been fully
explained. The original archive consisted of 219 files named by distinct
four-digit numbers; these files ranged from less than 1 MB to nearly
70 MB in size. In order to make the data more accessible for research
use, we chose to divide the larger files into pieces, such that the
average file size is about 2 MB when uncompressed (the largest
uncompressed file is about 4.5 MB). Files were divided at document
boundaries, so all files contain whole documents. In cutting the larger
files into smaller pieces, we added the two-digit sequence number to the
file names to preserve the order of the pieces; if the original file was
not too large, we left it as a single file and added "_00" as the
two-digit sequence number in the file name.


3. SGML Formatting

The text files are all formatted using a set of SGML tags to mark
document boundaries and to mark major structural features within
documents. As with file organization, the markup is derived from the
document structures as provided by the Justice Department. There is a
functional Document Type Definition (DTD) file for use with an SGML
parsing utility (such as James Clark's "nsgmls" program and related
utilities, available from http://www.jclark.com/sp/index.htm); the DTD
file ("juris.dtd") is in the "doc" directory.

In summary, the following SGML structure is used in the texts:

All tags are presented one tag per line, and actual text content is kept
on separate lines. Each text file begins with a "<FILE>" tag (and ends
with "</FILE>"); the FILE unit contains one or more documents, each of
which is bounded by "<DOC>" and "</DOC>" tags. The other tags may occur
freely within each DOC unit, interspersed with text data.

The names and locations of these tags represent a simple and direct
"re-spelling" of typographic and structural markup that was found in the
original archive. As such, there may be some variability in the usage of
these tags, reflecting different practices among the people who created
the original database. Again, as with the partitioning of the data into
file-sets, the full meaning of the markup (e.g. the values provided with
the "" tags) has not been fully explained.

Within the text content of each file, there is frequent use of the
ampersand character ("&"). Since this character has special meaning for
SGML parsers, it has been systematically replaced in all files by the
common SGML entity reference "&amp;" -- for example, all occurrences of
"AT&T" in the original text have been rendered in this publication as
"AT&amp;T", and similarly for all other uses of "&".


4. Categories of documents in JURIS

There are a total of 694,667 document units in the corpus, and these can
be categorized (to some extent) with regard to their content. The
following is a partial list of categories and their descriptions (drawn
from one of the documents contained in the corpus):

* ADMINISTRATIVE LAW

    Published Comptroller General Decisions; Unpublished Comptroller
    General Decisions; Opinions of the Attorney General; Office of Legal
    Counsel (US Dept. of Justice Board of Contract Appeals; ADP Protest
    Report (Summary of ADP Procurement Protests before the GSBCA);
    Federal Labor Relations Authority Case Decisions; FLRA
    Administrative Law Judge Decisions; Federal Service Impasses
    Decisions; Decisions and Reports on Rulings of the Assistant Sec.
    of Labor for Labor Management Relations; Federal Labor Relations
    Council Rulings on Requests of the Asst. Sec. of Labor for Labor
    Management Relations; HUD Administrative Law Decisions; Merit System
    Protection Board Decisions; Decisions under Immigration and
    Nationality Laws; Environmental Protection Agency General Counsel
    Opinions; Equal Opportunity Commission Decisions; Equal Employment
    Opportunity Commission Policy Statements; US Office of Government
    Ethics Decisions; HHS Department Appeals Board Decisions.

* DEPARTMENT OF JUSTICE BRIEFS

    Office of the Solicitor General; Civil Division; Civil Division
    Trial; Environmental and Natural Resources Division; Tax Division
    Criminal Appellate; US Attorney's Offices; US Trustees' Offices.

* CASE LAW

    U.S. Supreme Court; Federal Reporter, 2nd Series; Court of Appeals
    Unpublished Decisions; Federal Supplement; Federal Rules Decisions;
    Atlantic 2nd Reporter (DC only); Bankruptcy Reporter; Courts of
    Military Review; Military Justice Reporter; Court of Claims.

* FREEDOM OF INFORMATION ACT

    FOIA Update Newsletter; DOJ Guide to the FOIA Case List
    Publications.

* FEDERAL REGULATIONS

    Code of Federal Regulations; Unified Agenda of Federal Regulations;
    Defense Acquisition Regulations.

* TREATIES AND OTHER INTERNATIONAL AGREEMENTS

    United States Treaties and Other International Agreements;
    Department of Defense Unpublished International Agreements.

* INDIAN LAW

    Opinions of the Solicitor (Dept. of Interior); Ratified Treaties;
    Unratified Treaties; Presidential Proclamations; Executive Orders
    and Other Orders Pertaining to Indians.

* IMMIGRATION AND NATURALIZATION LAW

    Decisions Under Immigration and Nationality Law; Title 8 - Code of
    Federal Regulations; Immigration Reform and Control Act of 1988,
    Legislative History; Equal Access to Justice Act, Legislative
    History.
* STATUTORY LAW

    Public Laws; United States Code; Executive Orders; Anti-Drug Abuse
    Act of 1988; Section-by-section analysis of anti-drug abuse act of
    1988; Criminal Division Handbook on CCCA; The Organic Laws of the
    United States.

* TAX LAW

    US Tax Court Decisions; US Board of Tax Appeals Decisions; Tax
    Division's Summons Enforcement Decisions; Tax Division's Tax
    Protester Case List; Tax Division's Criminal Tax Manual; Tax
    Division's Criminal Tax Indictment/Information Forms; Tax Division's
    Standardized Criminal Tax Jury Instructions; Tax Division's Criminal
    Section Newsletter; Tax Court Memorandum Decisions; IRS Cumulative
    Bulletin; Tax International Acts; IRS News Releases; IRS General
    Counsel Memoranda; IRS Actions on Decisions; IRS Technical
    Memoranda.

* MANUALS

    United States Attorney's Manual; United States Trustees' Manual;
    Federal Personnel Manual; Federal Acquisition Regulations; Federal
    Acquisition Circulars; Federal Travel Regulation; Federal
    Information Resources Management Regulation; Federal Property
    Management Regulations; Principles of Federal Appropriations Law;
    Justice Department Acquisition Regulation; Justice Property
    Management Regulations.

* DEPARTMENT OF JUSTICE WORKPRODUCTS

    Civil Division Monographs; Civil Division Torts Branch Handbook on
    damages under FTCA; Criminal Division Monographs; Criminal Division
    Forms; Criminal Division Guidelines for Drafting Indictments;
    Criminal Division Narcotics; Forfeiture, Prosecution Manual;
    Criminal Division Directory of Services; Asset Forfeiture Manuals;
    Obscenity Enforcement Reporter; Environmental and Natural Resources
    Division Monographs; US Sentencing Commission's Guidelines Manual;
    Sentencing Guidelines Updates.


5. Additional tables describing JURIS content

In preparing the corpus for publication, we have compiled two tables to
help summarize the content of the various file-sets.
Two sorts of tabulations were made: counting documents in each file-set
according to category of content, and counting documents according to
dates found in the document text.

* Tabulation of document categories

The tabulation of document categories is given in "j_categ.tbl"; this
table contains one line for each file-set, with four fields on each
line, separated by commas. The fields are as follows:

    file-set number,
    range of dates found within the file-set,
    abbreviated category code,
    additional commentary or document titles

The second, third and fourth fields are sometimes empty in the table,
indicating that the information was not readily extractable for the
given file-set.

The abbreviations of categories are as follows:

    AL    Administrative Law
    BR    Briefs
    CL    Case Law
    EX    Executive Orders
    FOIA  Freedom of Information Act and related documents
    FR    Federal Regulations
    IA    International Agreements
    IL    Indian Law
    PL    (?)
    RE    Regulations
    SL    Statutory Law
    TAX   Tax Law

(Note that there is not a perfect correspondence between these category
codes and the descriptions of categories in the previous section; but
most of the categories are accounted for here, and additional relations
may be found in the "additional commentary" field of the table.)

* Tabulation of document dates

In hopes of providing a better sense of the time epochs covered by the
various file-sets, we tabulated all occurrences of 4-digit numeric
strings in the text data whose values ranged between 1700 and 1990.
These occurrences were grouped into bins of about 20 years, on average,
and their frequency of occurrence is tabulated in the file
"j_dates.tbl". This table contains one line for each file-set, and 11
fields on each line, separated by commas and space characters. The first
field is the 4-digit file-set name, the second field shows the total
number of documents in that file-set, and the remaining fields show how
many documents contained an apparent reference to a year within each of
nine time ranges.
The first line of the table contains column headings, which give the time ranges for each column; the ranges are: 1700-1799, 1800-39, 1840-69, 1870-99, 1900-19, 1920-39, 1940-59, 1960-69, and 1970 forward. The last line of the table provides column totals.
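
As an illustration only, one data line of "j_dates.tbl" could be read
with a Python sketch like the one below. It assumes that the 11 fields
are separated by a comma followed by optional spaces; the exact layout
of the heading and totals lines is not covered, and the function and
class names are hypothetical.

```python
from typing import List, NamedTuple

# Time ranges in column order, as listed in the description above.
DATE_RANGES = [
    "1700-1799", "1800-39", "1840-69", "1870-99", "1900-19",
    "1920-39", "1940-59", "1960-69", "1970 forward",
]


class DateRow(NamedTuple):
    """One file-set line of the date table (hypothetical helper type)."""
    file_set: str      # 4-digit file-set name, kept as text
    total_docs: int    # total number of documents in the file-set
    counts: List[int]  # documents with an apparent year in each time range


def parse_date_row(line: str) -> DateRow:
    """Split one data line into its 11 fields (assumed comma-separated)."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 2 + len(DATE_RANGES):
        raise ValueError(f"expected 11 fields, found {len(fields)}")
    return DateRow(fields[0], int(fields[1]), [int(f) for f in fields[2:]])


def dominant_range(row: DateRow) -> str:
    """Return the time range with the highest document count in this row."""
    best = max(range(len(DATE_RANGES)), key=lambda i: row.counts[i])
    return DATE_RANGES[best]
```

With a made-up line such as "0001, 120, 0, 3, 5, 9, 14, 20, 31, 18, 97"
(these values are invented and do not come from the table),
parse_date_row returns file-set "0001" with 120 documents, and
dominant_range reports "1970 forward". The heading line and the totals
line should be skipped or handled separately before calling the parser.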