TS Wikipedia
| Item Name: | TS Wikipedia |
| Author(s): | Taner Sezer, Türker Sezer |
| LDC Catalog No.: | LDC2015T15 |
| ISBN: | 1-58563-723-8 |
| DOI: | https://doi.org/10.35111/mem6-4951 |
| Release Date: | July 15, 2015 |
| Member Year(s): | 2015 |
| DCMI Type(s): | Text |
| Data Source(s): | web collection |
| Application(s): | part of speech tagging, information extraction, morphology |
| Language(s): | Turkish |
| Language ID(s): | tur |
| License(s): |
Creative Commons-Attribution-Share-Alike 3.0 (NFP, Non-Member)
LDC For-Profit Membership Agreement |
| Online Documentation: | LDC2015T15 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Sezer, Taner, and Türker Sezer. TS Wikipedia LDC2015T15. Web Download. Philadelphia: Linguistic Data Consortium, 2015. |
Introduction
TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.
Data
The data is in a word-per-line format with five tab-separated columns: token, part-of-speech tag, morphological analysis, lemma and corrected token spelling if needed. All data is presented in UTF-8 XML files and was selected and filtered to reduce non-Turkish characters, mathematical formulas and non-Turkish entries.
Samples
Please view this sample.
Updates
None at this time.