Corpus of Law, Academic, and News
Item Name: | Corpus of Law, Academic, and News |
Author(s): | Ariana Negar Mohammadi |
LDC Catalog No.: | LDC2020T23 |
ISBN: | 1-58563-947-8 |
ISLRN: | 903-821-836-195-4 |
DOI: | https://doi.org/10.35111/wcbv-pj21 |
Release Date: | October 15, 2020 |
Member Year(s): | 2020 |
DCMI Type(s): | Text |
Data Source(s): | newswire, journal articles, legal documents |
Application(s): | discourse analysis, language teaching |
Language(s): | Persian |
Language ID(s): | fas |
License(s): | Corpus of Law, Academic, and News Agreement |
Online Documentation: | LDC2020T23 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Mohammadi, Ariana Negar. Corpus of Law, Academic, and News LDC2020T23. Web Download. Philadelphia: Linguistic Data Consortium, 2020. |
Introduction
Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news.
The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus consists of published academic abstracts from various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets spanning the period 2010-2020.
Data
The document and token counts are as follows: 48 legal documents, 88,170 tokens; 274 academic documents, 85,765 tokens; and 78 news documents, 101,055 tokens.
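The per-genre figures above can be cross-checked against the 400-document total given in the introduction. The short Python sketch below does that arithmetic and also derives a mean document length per genre. The counts are copied from this page; the totals and means are computed here and are not figures published with the corpus.

```python
# Counts copied from the corpus description: genre -> (documents, tokens).
COUNTS = {
    "legal": (48, 88_170),
    "academic": (274, 85_765),
    "news": (78, 101_055),
}


def summarize(counts):
    """Return total documents, total tokens, and mean tokens per document by genre."""
    total_docs = sum(docs for docs, _ in counts.values())
    total_tokens = sum(tokens for _, tokens in counts.values())
    mean_len = {genre: tokens / docs for genre, (docs, tokens) in counts.items()}
    return total_docs, total_tokens, mean_len


total_docs, total_tokens, mean_len = summarize(COUNTS)
print(f"documents: {total_docs}, tokens: {total_tokens}")
for genre, mean in mean_len.items():
    print(f"{genre}: {mean:.1f} tokens per document")
```

The three genres are close in token count but differ widely in document count, so the academic abstracts are on average much shorter than the legal and news documents.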
Each document carries metadata in the file's header, such as the specific text type, dates, and source, along with annotations marking the title and body paragraphs.
All documents are presented as UTF-8 encoded XML with internal DTDs.
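As a rough illustration of reading a file of this kind, the Python sketch below parses a UTF-8 XML document that has an internal DTD, a metadata header, a title, and body paragraphs. The element names (`doc`, `header`, `type`, `date`, `source`, `title`, `p`) and the sample content are assumptions made up for this example; the corpus's actual tag set is defined in its own DTDs and documentation and may differ.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL sample document. The tag names and values are invented for
# illustration and are not taken from the corpus's real DTD.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc [
  <!ELEMENT doc (header, title, p+)>
  <!ELEMENT header (type, date, source)>
  <!ELEMENT type (#PCDATA)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT source (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT p (#PCDATA)>
]>
<doc>
  <header><type>news</type><date>2015</date><source>example outlet</source></header>
  <title>عنوان نمونه</title>
  <p>پاراگراف اول.</p>
  <p>پاراگراف دوم.</p>
</doc>
"""


def read_document(xml_bytes):
    """Parse one document and return (metadata dict, title, list of paragraphs)."""
    # ElementTree parses the internal DTD subset but does not validate against it.
    root = ET.fromstring(xml_bytes)
    header = root.find("header")
    metadata = {child.tag: (child.text or "").strip() for child in header}
    title = (root.findtext("title") or "").strip()
    paragraphs = [(p.text or "").strip() for p in root.iter("p")]
    return metadata, title, paragraphs


# Pass bytes so the parser honors the UTF-8 encoding declaration itself.
metadata, title, paragraphs = read_document(SAMPLE.encode("utf-8"))
print(metadata)
print(title)
print(len(paragraphs), "body paragraphs")
```

To use this on real files, the tag names in `read_document` would need to be replaced with those declared in each file's DTD. A validating parser would be needed if the DTD should actually be enforced.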
Samples
Please view this sample (XML).
Updates
None at this time.