TS Wikipedia

Item Name: TS Wikipedia
Author(s): Taner Sezer, Türker Sezer
LDC Catalog No.: LDC2015T15
ISBN: 1-58563-723-8
Release Date: July 15, 2015
Member Year(s): 2015
DCMI Type(s): Text
Data Source(s): web collection
Application(s): part of speech tagging, information extraction, morphology
Language(s): Turkish
Language ID(s): tur
License(s): LDC For-Profit Membership Agreement; Creative Commons-Attribution-Share-Alike 3.0 (NFP, Non-Member)
Online Documentation: LDC2015T15 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Sezer, Taner, and Türker Sezer. TS Wikipedia LDC2015T15. Web Download. Philadelphia: Linguistic Data Consortium, 2015.

Introduction

TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.
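For readers unfamiliar with the terms, bi-grams and tri-grams are runs of two and three consecutive tokens. The corpus ships them precomputed, and their file layout is not described here, so the following is only a minimal sketch of how such n-grams can be derived from a token sequence. The function name `ngrams` and the sample sentence are illustrative assumptions, not part of the release.

```python
from typing import Iterator, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when the sequence is shorter than n, so nothing is yielded.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


# Made-up example sentence, not taken from the corpus.
sentence = ["Ankara", "Türkiye'nin", "başkentidir", "."]
bigrams = list(ngrams(sentence, 2))
trigrams = list(ngrams(sentence, 3))
```

A four-token sentence yields three bi-grams and two tri-grams; whether the corpus counts n-grams across sentence or page boundaries is not stated in this description.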

Data

The data is in a word-per-line format with five tab-separated columns: token, part-of-speech tag, morphological analysis, lemma and, where needed, the corrected spelling of the token. All data is presented in UTF-8 XML files and was selected and filtered to reduce non-Turkish characters, mathematical formulas and non-Turkish entries.
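The column layout described above could be read with a short Python sketch like the one below. It rests on several assumptions that should be checked against the LDC2015T15 documentation: that XML markup lines contain no tabs, that the fifth column is absent or empty when no correction applies, and that the tag strings look roughly like the made-up examples shown. The names `TokenRow` and `parse_token_line` are hypothetical.

```python
from typing import NamedTuple, Optional


class TokenRow(NamedTuple):
    token: str
    pos: str
    morphology: str
    lemma: str
    corrected: Optional[str]  # None when no corrected spelling is given


def parse_token_line(line: str) -> Optional[TokenRow]:
    """Parse one word-per-line row; return None for anything else."""
    fields = line.rstrip("\r\n").split("\t")
    # Assumption: XML markup and blank lines have fewer than four tab-separated fields.
    if len(fields) < 4:
        return None
    token, pos, morphology, lemma = fields[:4]
    # Assumption: the fifth column is missing or empty when the token needs no correction.
    corrected = fields[4] if len(fields) > 4 and fields[4] else None
    return TokenRow(token, pos, morphology, lemma, corrected)


# Made-up rows for illustration; the real tag inventory may differ.
sample_lines = [
    '<doc id="1">\n',
    "kitaplar\tNoun\tkitap+Noun+A3pl\tkitap\n",
    "kitab\tNoun\tkitap+Noun+A3sg\tkitap\tkitap\n",
]
rows = [row for row in map(parse_token_line, sample_lines) if row is not None]
```

Reading a file opened with `encoding="utf-8"` line by line and passing each line to `parse_token_line` keeps memory use flat, which matters for a corpus of this size.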

Samples

Please view this sample.

Updates

None at this time.
