Corpus of Law, Academic, and News

Item Name: Corpus of Law, Academic, and News
Author(s): Ariana Negar Mohammadi
LDC Catalog No.: LDC2020T23
ISBN: 1-58563-947-8
ISLRN: 903-821-836-195-4
Release Date: October 15, 2020
Member Year(s): 2020
DCMI Type(s): Text
Data Source(s): newswire, journal articles, legal documents
Application(s): discourse analysis, language teaching
Language(s): Persian
Language ID(s): fas
License(s): Corpus of Law, Academic, and News Agreement
Online Documentation: LDC2020T23 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Mohammadi, Ariana Negar. Corpus of Law, Academic, and News LDC2020T23. Web Download. Philadelphia: Linguistic Data Consortium, 2020.


Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news.

The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus consists of published academic abstracts in various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets covering the period 2010-2020.


The document and token counts are as follows: 48 legal documents, 88,170 tokens; 274 academic documents, 85,765 tokens; and 78 news documents, 101,055 tokens.
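As a quick consistency check, the per-genre figures above can be summed and compared with the stated total of 400 documents. The following is a minimal sketch; the dictionary layout and variable names are illustrative assumptions, and only the numbers come from the description.

```python
# Per-genre (documents, tokens) as stated in the corpus description.
COUNTS = {
    "legal": (48, 88_170),
    "academic": (274, 85_765),
    "news": (78, 101_055),
}

total_docs = sum(docs for docs, _ in COUNTS.values())
total_tokens = sum(tokens for _, tokens in COUNTS.values())

for genre, (docs, tokens) in COUNTS.items():
    # Average document length differs a lot by genre (abstracts are short).
    print(f"{genre}: {docs} documents, {tokens:,} tokens, "
          f"{tokens / docs:.0f} tokens per document on average")

print(f"total: {total_docs} documents, {total_tokens:,} tokens")
# total: 400 documents, 274,990 tokens
```

The three sub-corpora hold similar token counts even though the number of documents varies widely, which matters if genres are compared by document rather than by token.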

Each document's header contains metadata such as the specific text type, dates, and source. The text itself is annotated to mark the title and the body paragraphs.

All documents are presented as UTF-8 encoded XML with internal DTDs.
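A document of this kind could be read with Python's standard library roughly as follows. This is a sketch under stated assumptions: the element names (`doc`, `header`, `type`, `date`, `source`, `title`, `body`, `p`), the DTD, and the sample text are all hypothetical, since the actual schema is defined in the corpus documentation. Whitespace splitting is only a crude stand-in for whatever tokenization produced the published counts.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the element names and the internal DTD below are
# assumptions for illustration, not the corpus's actual schema.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc [
  <!ELEMENT doc (header, title, body)>
  <!ELEMENT header (type, date, source)>
  <!ELEMENT type (#PCDATA)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT source (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT body (p+)>
  <!ELEMENT p (#PCDATA)>
]>
<doc>
  <header>
    <type>news</type>
    <date>2015</date>
    <source>placeholder outlet</source>
  </header>
  <title>عنوان نمونه</title>
  <body>
    <p>این یک جمله است.</p>
    <p>جمله دوم.</p>
  </body>
</doc>
""".encode("utf-8")  # parse bytes so the UTF-8 declaration is honored


def read_document(xml_bytes):
    """Return header metadata, title, and body paragraphs of one document."""
    # ElementTree reads past the internal DTD but does not validate against it.
    root = ET.fromstring(xml_bytes)
    header = root.find("header")
    metadata = {child.tag: (child.text or "").strip() for child in header}
    title = (root.findtext("title") or "").strip()
    paragraphs = [(p.text or "").strip() for p in root.iter("p")]
    return {"metadata": metadata, "title": title, "paragraphs": paragraphs}


def count_tokens(paragraphs):
    """Crude whitespace token count; the corpus's own tokenization may differ."""
    return sum(len(p.split()) for p in paragraphs)


doc = read_document(SAMPLE)
print(doc["metadata"])
print(doc["title"])
print(count_tokens(doc["paragraphs"]))
```

Validating against the internal DTD would need a validating parser, which the standard library does not provide; the sketch only extracts content.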



Updates: None at this time.
