Penn Parsed Corpora of Historical English, version 2016

The Penn Parsed Corpora of Historical English (PPCHE) are running texts and text samples of British English prose from the earliest Middle English documents up to the First World War. The PPCHE include three subcorpora covering traditionally recognized periods of English:

- the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2)
- the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME)
- the Penn Parsed Corpus of Modern British English, second edition (PPCMBE2)

The texts come in three forms: plain text, part-of-speech tagged text, and syntactically annotated text. The annotations have all been carefully reviewed by expert human annotators for accuracy and consistency. (Remaining errors should be reported to beatrice AT sas DOT upenn DOT edu.) Each text also has an associated file with philological information.

The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language. They have also been used by computational linguists for domain adaptation. The corpora are available for non-commercial use in accordance with the Distribution Agreement.

The data in the three subcorpora are unchanged from the 2016 release, which was distributed on CD-ROM by Anthony Kroch, but the directory structure and names have been changed to conform to LDC conventions, as described below.

Each of the three subcorpora has its own directory and should be cited individually as follows:

Kroch, Anthony, and Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2), second edition, release 4, LDC2020T16. Web download file. Philadelphia: Linguistic Data Consortium.
Kroch, Anthony, Beatrice Santorini, and Lauren Delfs. 2004. The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), first edition, release 3, LDC2020T16. Web download file. Philadelphia: Linguistic Data Consortium.
Kroch, Anthony, Beatrice Santorini, and Ariel Diertani. 2016. The Penn Parsed Corpus of Modern British English (PPCMBE2), second edition, release 1, LDC2020T16. Web download file. Philadelphia: Linguistic Data Consortium.

Each of the three subcorpora has three data subdirectories with the plain and annotated text files (text, pos-tagged, parsed). The subcorpora's docs directories contain a general description of each subcorpus and a philological_info_files directory with further philological information for each text. The distribution also includes directories for the annotation guidelines and for a search program called CorpusSearch 2, which allows users to search the syntactically annotated corpora for syntactic structure (not just for words and word sequences). The directory for CorpusSearch 2 includes the Java code itself as well as documentation for installing and using the program.

Authors: Anthony Kroch, Beatrice Santorini, Ann Taylor, Ariel Diertani (Lauren Delfs)

Languages:
- Middle English (1100-1500) (enm), 20.2%
- Early Modern English (1500-1700) (eng), 31.3%
- Modern British English (1700-1914) (eng), 48.3%

Expected use of corpus: Linguistic research on historical English; domain adaptation for computational linguists

Collection procedure: The Penn Parsed Corpora of Historical English are based in part on the Helsinki Corpus of English Texts. The PPCME2 includes most of the Middle English texts of the Helsinki Corpus and adds some not included in that corpus. Details are available in the documentation for the PPCME2. The PPCEME includes all of the Early Modern English texts from the Helsinki Corpus as well as additional texts selected to give the same genre balance as the original Helsinki Corpus texts; the additional texts are twice the size of the original texts. The PPCMBE2 covers a later time period than that covered by the Helsinki Corpus, but the texts were selected to give the same genre balance as the Early Modern English part.
Data Format Specific Details: UTF-8 encoding. The parsed data are in Penn Treebank format (matching parentheses indicating syntactic structure, with unlabeled parentheses around each sentence token).
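
For users who want to read the parsed files without CorpusSearch 2, the bracketed format described above can be loaded with a few lines of Python. The following is a minimal sketch, not part of the distribution. The sample sentence, its ID value, and the function names parse_trees and words are made up for illustration, and the sketch assumes that words in the files never contain a literal parenthesis.

```python
import re

# A token is an opening parenthesis, a closing parenthesis,
# or any run of characters that is neither whitespace nor a parenthesis.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_trees(text):
    """Yield one nested (label, children) tuple per top-level bracketed tree."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "each node must start with '('"
        pos += 1
        # The wrapper around a sentence token is unlabeled, so its
        # first token is another "(" rather than a category label.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    while pos < len(tokens):
        yield parse_node()


def words(tree):
    """Return the leaves of a tree in order, skipping the ID node."""
    label, children = tree
    if label == "ID":
        return []
    out = []
    for child in children:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(words(child))
    return out


# Hypothetical sentence token in the style of the corpora (not taken from the data).
sample = """
( (IP-MAT (NP-SBJ (PRO I))
          (VBD saw)
          (NP-OB1 (D the) (N king))
          (. .))
  (ID EXAMPLE,1.1))
"""

trees = list(parse_trees(sample))
print(trees[0][1][0][0])  # label of the first constituent inside the wrapper
print(words(trees[0]))    # the words of the sentence token
```

Each tree is returned as a (label, children) pair in which the unlabeled wrapper has the empty string as its label. Unbalanced parentheses raise an error rather than being repaired, which is usually the preferable behavior when checking that a file was read correctly.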