README FILE FOR LDC CATALOG ID: LDC2019Txc
TITLE: LORELEI Vietnamese Representative Language Pack
AUTHORS: Jennifer Tracey, Stephanie Strassel, Dave Graff, Jonathan
Wright, Song Chen, Neville Ryant, Seth Kulick, Kira Griffitt,
Dana Delgado, Michael Arrigo
1.0 Introduction
This corpus was developed by the Linguistic Data Consortium for the
DARPA LORELEI Program and consists of over 172 million words of
monolingual Vietnamese text, approximately 325,000 words of which are
translated into English. Another 106,000 Vietnamese words are
translations from English data, and 1.9 million words of found parallel
text are included. Approximately 75,000 words are annotated for named
entities, and up to 25,000 words with several additional types of
annotation (full entity including nominals and pronouns, simple
semantic annotation, situation frame annotation, entity linking, and
noun phrase chunking). Details of data volumes for each type of
annotation are provided in section 3 of this README.
The LORELEI (Low Resource Languages for Emergent Incidents) Program is
concerned with building Human Language Technology for low resource
languages in the context of emergent situations like natural disasters
or disease outbreaks. Linguistic resources for LORELEI include
Representative Language Packs for over two dozen low resource languages,
comprising data, annotations, basic natural language processing tools,
lexicons and grammatical resources. Representative languages are
selected to provide broad typological coverage, while Incident
Languages are selected to evaluate system performance on a language
whose identity is disclosed at the start of the evaluation, and for
which no training data has been provided.
This corpus comprises the complete set of monolingual and parallel
text, lexicon, annotations, and tools from the LORELEI Vietnamese
Representative Language Pack.
For more information about LORELEI language resources, see
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2016-lorelei-language-packs.pdf.
2.0 Corpus organization
2.1 Directory Structure
The directory structure and contents of the package are summarized
below -- paths shown are relative to the base (root) directory of the
package:
./dtds/
./dtds/ltf.v1.5.dtd
./dtds/psm.v1.0.dtd
./dtds/sentence_alignment.v1.0.dtd
./dtds/cstrans_tab.v1.0.dtd
./dtds/laf.v1.2.dtd
./dtds/llf.v1.6.dtd
./docs/ -- various tables and listings (see section 9 below)
./docs/README.txt -- this file
./docs/cstrans_tab/ -- supplemental data regarding crowd-source translations
./docs/annotation_guidelines/ -- guidelines for all annotation tasks included in this corpus
./docs/grammatical_sketch/ -- grammatical sketch of Vietnamese
./docs/vietnamese_namebook.pdf -- a report on Vietnamese personal name usage
./tools/ -- see section 8 below for details about tools provided
./tools/ldclib/
./tools/ltf2txt/
./tools/sent_seg/
./tools/tokenization_parameters.v5.0.yaml
./tools/vie/
./data/monolingual_text/zipped/ -- zip-archive files containing
monolingual "ltf" and "psm" data
./data/translation/
found/{vie,eng,sentence_alignment} -- found parallel text
with sentence alignments between the
Vietnamese and English documents
from_vie/{vie,eng}/ -- translations from Vietnamese to English
from_eng/ -- translations from English to Vietnamese
{elicitation,news,phrasebook}/ for each of three types of English data:
{vie,eng}/ for each language in each directory,
"ltf" and "psm" directories contain
corresponding data files
./data/annotation/ -- see section 5 below for details about annotation
./data/annotation/entity/
./data/annotation/np_chunking/
./data/annotation/sem_annotation/
./data/annotation/situation_frame/
./data/annotation/twitter_tokenization/
./data/lexicon/
2.2 File Name Conventions
There are 93 *.ltf.zip files in the monolingual_text/zipped directory,
together with the same number of *.psm.zip files. Each {ltf,psm}.zip file
pair contains an equal number of corresponding data files. The "file-ID"
portion of each zip file name corresponds to common substrings in the file
names of all the data files contained in that archive. For example:
./data/monolingual_text/zipped/VIE_DF_G00200.ltf.zip contains:
ltf/VIE_DF_001562_20080916_G00200RCR.ltf.xml
ltf/VIE_DF_001562_20080919_G00200RI6.ltf.xml
...
./data/monolingual_text/zipped/VIE_DF_G00200.psm.zip contains:
psm/VIE_DF_001562_20080916_G00200RCR.psm.xml
psm/VIE_DF_001562_20080919_G00200RI6.psm.xml
...
The file names assigned to individual documents within the zip archive
files provide the following information about the document:
Language 3-letter abbrev.
Genre 2-letter abbrev.
Source 6-digit numeric ID assigned to data provider
Date 8-digit numeric (YYYYMMDD: year, month, day)
Global-ID 9-digit alphanumeric assigned to this document
Those five fields are joined by underscore characters, yielding a
32-character file-ID; three portions of the document file-ID are used
to set the name of the zip file that holds the document: the Language
and Genre fields, and the first 6 characters of the Global-ID.
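The naming convention can be sketched in Python. The helper names below are hypothetical, and the example file-ID is taken from the listing in section 2.2:

```python
# Hypothetical helpers for splitting a LORELEI document file-ID into its
# five underscore-joined fields and deriving the name of the zip archive
# that holds the document (Language, Genre, first 6 chars of Global-ID).

def parse_file_id(file_id):
    lang, genre, source, date, global_id = file_id.split("_")
    return {
        "language": lang,        # 3-letter language code, e.g. VIE
        "genre": genre,          # 2-letter genre code, e.g. DF
        "source": source,        # 6-digit source ID
        "date": date,            # YYYYMMDD
        "global_id": global_id,  # 9-character alphanumeric ID
    }

def zip_name_for(file_id, kind="ltf"):
    f = parse_file_id(file_id)
    return "{}_{}_{}.{}.zip".format(
        f["language"], f["genre"], f["global_id"][:6], kind)

# file-ID taken from the example in section 2.2
print(zip_name_for("VIE_DF_001562_20080916_G00200RCR"))  # VIE_DF_G00200.ltf.zip
```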
The 2-letter codes used for genre are as follows:
DF -- discussion forum
NW -- news
RF -- reference (e.g. Wikipedia)
SN -- social network (Twitter)
WL -- web-log
3.0 Content Summary
Vietnamese orthographic usage typically puts spaces between all
syllables, even within multi-syllabic words. In processing the
harvested text data for this release, we used an existing algorithm
for identifying word boundaries in the stream of space-separated
(syllabic) tokens (see section 6.0 below for details). The summary
table below for monolingual text shows both token counts and word
counts; all other summary tables provide word counts.
3.1 Monolingual Text
Genre #Docs #Tokens #Words
DF 199,199 198,593,054 158,509,037
NW 29,070 11,225,575 8,076,756
SN 11,589 241,097 188,242
WL 5,715 8,114,108 5,959,737
Total 245,573 218,173,834 172,733,772
Note that the SN (Twitter) data cannot be distributed directly by LDC, due to
the Twitter Terms of Use. The file "docs/twitter_info.tab" (described in
Section 8.2 below) provides the necessary information for users to fetch the
particular tweets directly from Twitter.
3.2 Parallel Text
Type Genre #Docs #Segs #Words
Found NW 1,386 35,329 682,137
Found WL 1,422 84,738 1,259,131
FromEng EL 2 3,721 21,295
FromEng NW 190 3,911 84,864
ToEng DF 299 27,563 260,393
ToEng NW 115 1,968 36,545
ToEng WL 46 1,764 26,500
Total 3,460 158,994 2,370,865
3.3 Annotation
AnnotType Genre #Docs #Segs #Words
---
SimpleSemantic DF 17 537 6,998
SimpleSemantic NW 33 625 12,094
SimpleSemantic SN 109 109 1,797
SimpleSemantic WL 6 124 2,188
---
SituationFrame DF 14 256 4,460
SituationFrame NW 39 630 12,284
SituationFrame SN 108 108 1,854
SituationFrame WL 12 251 4,492
---
EntityFull DF 8 128 1,946
EntityFull NW 35 671 12,984
EntityFull SN 109 109 1,939
EntityFull WL 6 124 2,188
---
EntitySimp DF 45 1,604 18,227
EntitySimp NW 127 2,372 43,764
EntitySimp SN 344 344 5,908
EntitySimp WL 26 615 9,954
---
EntityLinking DF 8 128 1,946
EntityLinking NW 35 671 12,984
EntityLinking SN 109 109 1,939
EntityLinking WL 6 124 2,188
---
NPChunking DF 9 235 3,069
NPChunking NW 15 259 5,156
NPChunking SN 57 57 994
NPChunking WL 4 70 1,302
---
4.0 Data Collection and Parallel Text Creation
Both monolingual text collection and parallel text creation involve a
combination of manual and automatic methods. These methods are
described in the sections below.
4.1 Monolingual Text Collection
Data is identified for collection by native speaker "data scouts," who
search the web for suitable sources, designating individual documents
that are in the target language and discuss the topics of interest to
the LORELEI program (humanitarian aid and disaster relief). Each
document selected for inclusion in the corpus is then harvested, along
with the entire website when suitable. Thus the monolingual text
collection contains some documents which have been manually selected
and/or reviewed and many others which have been automatically
harvested and were not subject to manual review.
4.2 Parallel Text Creation
Parallel text for LORELEI was created using three different methods,
and each LORELEI language may have parallel text from one or all of
these methods. In addition to translation from each of the LORELEI
languages to English, each language pack contains a "core" set of
English documents that were translated into each of the LORELEI
Representative Languages. These documents consist of news documents, a
phrasebook of conversational sentences, and an elicitation corpus of
sentences designed to elicit a variety of grammatical structures. All
translations are aligned at the sentence level. For professional and
crowdsourced translation, the segments align one-to-one between the
source and target language (i.e. segment 1 in the English aligns with
segment 1 in the source language). For found parallel text, automatic
alignment is performed and a separate alignment file provides
information about how the segments in the source and translation are
aligned.
Professionally translated data has one translation for each source
document, while crowdsourced translations have up to four translations
for each source document, designated by A, B, C, or D appended to the
file name on the multiple translation versions.
5.0 Annotation
Six types of annotation are present in this corpus. Simple Named
Entity tags names of persons, organizations, geopolitical entities,
and locations (including facilities), while Full Entity also tags
nominal and pronominal mentions of entities. Entity Discovery and
Linking provides cross-document coreference of named entities via
linking to an external knowledge base (the knowledge base used for
LORELEI is released separately as LDC2020T10). Simple Semantic
Annotation provides light semantic role labeling, capturing acts and
states along with their arguments. Situation Frame annotation labels
the presence of needs and issues related to emergent incidents such as
natural disasters (e.g. food need, civil unrest), along with
information such as location, urgency, and entities involved in
resolving the needs. Finally, noun phrase chunking marks the maximal
extents of noun phrases in the text. Details about each of these
annotation tasks can be found in docs/annotation_guidelines/.
6.0 Data Processing and Character Normalization for LORELEI
Most of the content has been harvested from various web sources using
an automated system that is driven by manual scouting for relevant
material. Some content may have been harvested manually, or by means
of ad-hoc scripted methods for sources with unusual attributes.
All harvested content was initially converted from its original HTML
form into a relatively uniform XML format; this stage of conversion
eliminated irrelevant content (menus, ads, headers, footers, etc.),
and placed the content of interest into a simplified, consistent
markup structure.
The "homogenized" XML format then served as input for the creation of
a reference "raw source data" (rsd) plain text form of the web page
content; at this stage, the text was also conditioned to normalize
white-space characters, and to apply transliteration and/or other
character normalization, as appropriate to the given language.
For Vietnamese, the conversion from rsd.txt to ltf.xml involved a language-
specific process of tokenization, to identify sequences of two or more
syllabic tokens that comprise grammatical "words". This was done using
"JVNSegmenter: A Java-based Vietnamese Word Segmentation Tool", created by C-T
Nguyen and X-H Phan (http://jvnsegmenter.sourceforge.net/). The tokenization
parameters file has values for :shell_script and :log_file that are
illustrative, but may have to be changed depending on your environment.
7.0 Overview of XML Data Structures
7.1 PSM.xml -- Primary Source Markup Data
The "homogenized" XML format described above preserves the minimum set
of tags needed to represent the structure of the relevant text as seen
by the human web-page reader. When the text content of the XML file
is extracted to create the "rsd" format (which contains no markup at
all), the markup structure is preserved in a separate "primary source
markup" (psm.xml) file, which enumerates the structural tags in a
uniform way, and indicates, by means of character offsets into the
rsd.txt file, the spans of text contained within each structural
markup element.
For example, in a discussion-forum or web-log page, there would be a
division of content into the discrete "posts" that make up the given
thread, along with "quote" regions and paragraph breaks within each
post. After the HTML has been reduced to uniform XML, and the tags
and text of the latter format have been separated, information about
each structural tag is kept in a psm.xml file, preserving the type of
each relevant structural element, along with its essential attributes
("post_author", "date_time", etc.), and the character offsets of the
text span comprising its content in the corresponding rsd.txt file.
7.2 LTF.xml -- Logical Text Format Data
The "ltf.xml" data format is derived from rsd.txt, and contains a
fully segmented and tokenized version of the text content for a given
web page. Segments (sentences) and the tokens (words) are marked off
by XML tags (SEG and TOKEN), with "id" attributes (which are only
unique within a given XML file) and character offset attributes
relative to the corresponding rsd.txt file; TOKEN tags have additional
attributes to describe the nature of the given word token.
The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly
by suitable punctuation in the original source data. To the extent
that sentence boundaries cannot be accurately detected (due to
variability or ambiguity in the source data), the segmentation process
will tend to err more often on the side of missing actual sentence
boundaries, and (we hope) less often on the side of asserting false
sentence breaks.
The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play
particular roles in web-based text (e.g. URLs, email addresses and
hashtags). To the extent that word boundaries are not explicitly
marked in the source text, the LTF tokenization is intended to divide
the raw-text character stream into units that correspond to "words" in
the linguistic sense (i.e. basic units of lexical meaning). In the
case of Vietnamese, this often involves grouping two or more space-
separated tokens in the raw text into a single TOKEN (linguistic word)
element in the LTF markup, owing to the normal Vietnamese orthographic
convention of using spaces at word-internal syllable boundaries.
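As a rough illustration of this structure, the sketch below walks the SEG and TOKEN elements of a toy ltf.xml fragment and checks each token's character offsets against the raw text. The offset attribute names (start_char, end_char, treated here as inclusive) are assumptions; consult dtds/ltf.v1.5.dtd for the authoritative element and attribute list.

```python
import xml.etree.ElementTree as ET

# Toy fragment modeled on the description above; real files carry more
# attributes (token-type flags, reference checksums on DOC, etc.).
ltf_sample = """<DOC id="SAMPLE_001">
<TEXT>
<SEG id="segment-0" start_char="0" end_char="10">
  <TOKEN id="token-0-0" start_char="0" end_char="4">hello</TOKEN>
  <TOKEN id="token-0-1" start_char="6" end_char="10">world</TOKEN>
</SEG>
</TEXT>
</DOC>"""

rsd_text = "hello world"  # the corresponding raw source data
root = ET.fromstring(ltf_sample)
tokens = []
for tok in root.iter("TOKEN"):
    start, end = int(tok.get("start_char")), int(tok.get("end_char"))
    # offsets are treated as inclusive on both ends in this sketch
    assert rsd_text[start:end + 1] == tok.text
    tokens.append(tok.text)
```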
7.3 CSTRANS_TAB.xml -- Crowd-source Translation Tables
The "./docs/cstrans_tab/" directory contains one "*.cstrans_tab.xml"
file for each English source file that was submitted to translation via crowd
sourcing. Each file contains a DOC element (with "id" and "lang" attributes),
which in turn contains a "SEG" element for each "SEG" in the corresponding
English ltf.xml file. Each "SEG" element may either be an empty tag (if no
usable translations were submitted for the given segment), or contain one or
more "TR" elements, each of which is an alternative translation for the given
source segment. In either case, the "SEG" tag has an "id" attribute (unique
within the given xml file, matching the SEG "id" value in ltf.xml), and an
"ntrs" attribute (whose value is the number of "TR" elements present).
The attributes in the "TR" elements are as follows:
- translatorid -- an alphanumeric string unique to each contributor; note
that each translation "version" (_A, _B, etc) is likely to contain segments
from different translators
- avg_gold_ter may be floating-point numeric or "Unk"; it represents the
"term error rate" relative to a "gold-standard" manual translation (lower
value == better match)
- score may be floating-point numeric or "None"
- mt_ter is always floating-point numeric; it represents the "machine
translation error rate" relative to a "google-translate" reference (lower
value == better match)
- nonwhitesp and odd_ch are always integer numerics: the count of
non-whitespace characters in the string, and the count of characters that
are "not in the expected language" (this can include emoticons,
non-printing characters, and characters in foreign scripts).
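A minimal reading sketch follows, using a toy document built from the element and attribute names above; the attribute values and segment text are invented placeholders.

```python
import xml.etree.ElementTree as ET

sample = """<DOC id="SAMPLE_DOC" lang="eng">
  <SEG id="segment-0" ntrs="2">
    <TR translatorid="t01" avg_gold_ter="0.25" score="0.90"
        mt_ter="0.40" nonwhitesp="16" odd_ch="0">first translation</TR>
    <TR translatorid="t02" avg_gold_ter="Unk" score="None"
        mt_ter="0.55" nonwhitesp="17" odd_ch="0">second translation</TR>
  </SEG>
  <SEG id="segment-1" ntrs="0"/>
</DOC>"""

doc = ET.fromstring(sample)
translations = {}
for seg in doc.iter("SEG"):
    trs = seg.findall("TR")
    # the "ntrs" attribute counts the TR children
    assert len(trs) == int(seg.get("ntrs"))
    translations[seg.get("id")] = [tr.text for tr in trs]
```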
7.4 LAF.xml -- Logical Annotation Format Data
The "laf.xml" data format provides a generic structure for presenting
annotations on the text content of a given ltf.xml file; see the associated
DTD file in the "dtds" directory. Note that each type of annotation (simple
named entity, full entity, semantic structure, NP chunking) uses the basic XML
elements of LAF in different ways.
NB: For Twitter data, the LDC interprets the Twitter Terms of Use to
mean that no original text content from tweets may be redistributed as
part of an LDC corpus. Therefore, the EXTENT elements of
*_SN_000370_*.laf.xml files are presented here with underscore
characters ('_') in place of all non-white-space characters in
annotated strings. In order to get the actual text content for these
strings, users must download and process each tweet into plain-text
format (using the software provided in the "tools" directory or
equivalent), and use the character-offset information in the EXTENT
tag to acquire the annotated string. A correct result from this
process can only be assured if the user's plain-text file for the
given tweet has an MD5 signature that matches the one given in the
corresponding ltf.xml file. In order to ensure that users can match
the tokenization present in the annotated version of any Twitter data,
a version of the ltf.xml files for annotated tweets with underscore
characters ('_') in place of all non-white-space characters is
provided in the data/annotation/twitter_tokenization/ directory.
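The retrieval steps above can be sketched as follows. The function name and sample strings are illustrative only, and character offsets are treated as inclusive on both ends:

```python
import hashlib

def annotated_string(rsd_text, reference_md5, start, end):
    # refuse to trust the offsets unless the reconstructed tweet text
    # matches the reference checksum, as described above
    if hashlib.md5(rsd_text.encode("utf-8")).hexdigest() != reference_md5:
        raise ValueError("tweet text does not match LDC reference")
    return rsd_text[start:end + 1]

tweet = "mua lon o Ha Noi"  # stand-in for a user-downloaded, conditioned tweet
ref = hashlib.md5(tweet.encode("utf-8")).hexdigest()
print(annotated_string(tweet, ref, 10, 15))  # Ha Noi
```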
7.5 LLF.xml -- LORELEI Lexicon Format Data
The "llf.xml" data format is a simple structure for presenting citation-form
words (headwords or lemmas) in Vietnamese, together with Part-Of-Speech (POS)
labels and English glosses. Each ENTRY element contains a unique combination
of LEMMA value (citation form in native orthography) and POS value, together
with one or more GLOSS elements. Each ENTRY has a unique ID, which is
included as part of the unique ID assigned to each GLOSS.
For Vietnamese, the data/lexicon directory also contains a tab-delimited
plain-text table file of supplemental lexical data; each row of this table has
four columns, whose names are given in the first line of the file:
1. lemma_id -- numeric portion of the associated ENTRY ID in llf.xml
2. gloss_id -- numeric portion of the associated GLOSS ID in llf.xml
3. tag -- closed set of category labels
4. value -- value assigned to the tag for the given ENTRY/GLOSS
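Since the table carries its column names on the first line, it can be read with the standard csv module; the data row below is a made-up placeholder.

```python
import csv
import io

# stand-in for the tab-delimited supplemental lexicon table
sample = (
    "lemma_id\tgloss_id\ttag\tvalue\n"
    "101\t2\tsome_tag\tsome_value\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
```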
7.6 Situation Frame Annotation Tables
Situation frame annotation consists of three parts, each presented as a
separate tab-delimited file: entities, needs, and issues. The details of each
table are described below.
Entities, mentions, need frames, and issue frames all have IDs that follow a
standard schema consisting of a prefix designating the type of ID ('Ent' for
entities, 'Men' for mentions, and 'Frame' for both need and issue frames), an
alphanumeric string identifying the annotation "kit", and a numeric string
uniquely identifying the specific entity, mention, or frame within the
document.
7.6.1 Mentions
The grouping of entity mentions into "selectable entities" for situation frame
annotation is provided in the mentions/ subdirectory. The table has 8 columns
with the following headers and descriptions:
column 1: doc_id -- doc ID of source file for the annotation
column 2: entity_id -- unique identifier for each grouped entity
column 3: mention_id -- unique identifier for each entity mention
column 4: entity_type -- one of PER, ORG, GPE, LOC
column 5: mention_status -- 'representative' or 'extra';
representative mentions are the ones which have been chosen by the
annotator as the representative name for that entity. Each entity
has exactly one representative mention.
column 6: start_char -- character offset for the start of the mention
column 7: end_char -- character offset for the end of the mention
column 8: mention_text -- mention string
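A sketch of loading such a table and checking the one-representative-per-entity constraint stated above (the sample rows are hypothetical):

```python
import csv
import io
from collections import Counter

sample = (
    "doc_id\tentity_id\tmention_id\tentity_type\tmention_status\t"
    "start_char\tend_char\tmention_text\n"
    "DOC1\tEnt-kit1-1\tMen-kit1-1\tGPE\trepresentative\t0\t5\tHa Noi\n"
    "DOC1\tEnt-kit1-1\tMen-kit1-2\tGPE\textra\t20\t25\tHa Noi\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
reps = Counter(r["entity_id"] for r in rows
               if r["mention_status"] == "representative")
# each grouped entity must have exactly one representative mention
assert all(count == 1 for count in reps.values())
```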
7.6.2 Needs
Annotation of need frames is provided in the needs/ subdirectory. Each row in
the table represents a need frame in the annotated document. The table has 13
columns with the following headers and descriptions:
column 1: user_id -- user ID of the annotator
column 2: doc_id -- doc ID of source file for the annotation
column 3: frame_id -- unique identifier for each frame
column 4: frame_type -- 'need'
column 5: need_type -- exactly one of 'evac' (evacuation), 'food' (food
supply), 'search' (search/rescue), 'utils' (utilities, energy, or
sanitation), 'infra' (infrastructure), 'med' (medical assistance),
'shelter' (shelter), or 'water' (water supply)
column 6: place_id -- entity ID of the LOC or GPE entity identified as the
place associated with the need frame; only one place value per
need frame, must match one of the entity IDs in the corresponding
ent_output.tsv or be 'none' (indicating no place was named)
column 7: proxy_status -- 'True' or 'False'
column 8: need_status -- 'current', 'future' (future only), or 'past' (past only)
column 9: urgency_status -- 'True' (urgent) or 'False' (not urgent)
column 10: resolution_status -- 'sufficient' or 'insufficient' (insufficient /
unknown sufficiency)
column 11: reported_by -- entity ID of one or more entities reporting
the need; multiple values are comma-separated, must match entity IDs
in the corresponding ent_output.tsv or be 'none'
column 12: resolved_by -- entity ID of one or more entities resolving
the need; multiple values are comma-separated, must match entity IDs
in the corresponding ent_output.tsv or be 'none'
column 13: description -- string of text entered by the annotator as
memory aid during annotation, no requirements for content or language,
may be 'none'
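The closed need_type vocabulary can be validated with a sketch like the one below; for brevity the sample row carries only the first five of the thirteen columns, and the row values (including the frame ID format) are invented.

```python
import csv
import io

NEED_TYPES = {"evac", "food", "search", "utils",
              "infra", "med", "shelter", "water"}

sample = (
    "user_id\tdoc_id\tframe_id\tframe_type\tneed_type\n"
    "ann1\tVIE_SAMPLE_DOC\tFrame-kit1-1\tneed\twater\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
for row in rows:
    assert row["frame_type"] == "need"
    assert row["need_type"] in NEED_TYPES
```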
7.6.3 Issues
Annotation of issue frames is provided in the issues/ subdirectory. Each row
in the table represents an issue frame in the annotated document. The table has
9 columns with the following headers and descriptions:
column 1: user_id -- user ID of the annotator
column 2: doc_id -- doc ID of source file for the annotation
column 3: frame_id -- unique identifier for each frame
column 4: frame_type -- 'issue'
column 5: issue_type -- exactly one of 'regimechange' (regime change),
'crimeviolence' (civil unrest or widespread crime), or 'terrorism'
(terrorism or other extreme violence)
column 6: place_id -- entity ID of the LOC or GPE entity identified as
the place associated with the issue frame; only one place value per
issue frame, must match one of the entity IDs in the corresponding
ent_output.tsv or be 'none'
column 7: proxy_status -- 'True' or 'False'
column 8: issue_status -- 'current' or 'not_current'
column 9: description -- string of text entered by the annotator as
memory aid during annotation, no requirements for content or
language, may be 'none'
7.7 EDL Table
The "data/annotation/entity/" directory contains the file "vie_edl.tab", which
has an initial "header" line of column names followed by data rows with 8
columns per row. The following shows the column headings and a sample value
for each column:
column 1: system_run_id LDC
column 2: mention_id Men-NW_AFP_ENG_0012_20030419.vie-47
column 3: mention_text Các Tiểu Vương Quốc Ả Rập Thống Nhất
column 4: extents NW_AFP_ENG_0012_20030419.vie:199-234
column 5: kb_id 290557
column 6: entity_type GPE
column 7: mention_type NAM
column 8: confidence 1.0
When column 5 is fully numeric, it refers to a numbered entity in the
LORELEI Entity Detection and Linking Knowledge Base (distributed separately
as LDC2020T10). Note that a given mention may be ambiguous as to the particular
KB element it represents; in this case, two or more numeric KB_ID values will
appear in column 5, separated by the vertical-bar character (|).
When column 5 consists of "NIL" plus digits, it refers to an entity that is
not present in the Knowledge Base, but this label is used consistently for all
mentions of the particular entity.
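The interpretation rules for column 5 can be sketched as a small helper (the function name is hypothetical, and all ID values other than the sample above are placeholders):

```python
def interpret_kb_id(kb_id):
    # ambiguous mentions carry several IDs joined by "|"; "NIL"+digits
    # marks an entity that is absent from the Knowledge Base
    return [("NIL" if part.startswith("NIL") else "KB", part)
            for part in kb_id.split("|")]

print(interpret_kb_id("290557"))       # [('KB', '290557')]
print(interpret_kb_id("290557|4108"))  # [('KB', '290557'), ('KB', '4108')]
print(interpret_kb_id("NIL0042"))      # [('NIL', 'NIL0042')]
```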
8.0 Software tools included in this release
8.1 "ltf2txt" (source code written in Perl)
A data file in ltf.xml format (as described above) can be conditioned
to recreate exactly the "raw source data" text stream (the rsd.txt
file) from which the LTF was created. The tools described here can be
used to apply that conditioning, either to a directory or to a zip
archive file containing ltf.xml data. In either case, the scripts
validate each output rsd.txt stream by comparing its MD5 checksum
against the reference MD5 checksum of the original rsd.txt file from
which the LTF was created. (This reference checksum is stored as an
attribute of the "DOC" element in the ltf.xml structure; there is also
an attribute that stores the character count of the original rsd.txt
file.)
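The two checks described (checksum and character count) amount to something like the following; the function name is illustrative and not part of the ltf2txt scripts, and whether the stored count refers to characters or bytes is not specified here, so this sketch counts characters.

```python
import hashlib

def verify_rsd(rsd_text, reference_md5, reference_char_count):
    # compare the recreated stream against the reference values stored
    # as attributes of the DOC element in ltf.xml
    data = rsd_text.encode("utf-8")
    md5_ok = hashlib.md5(data).hexdigest() == reference_md5
    count_ok = len(rsd_text) == reference_char_count
    return md5_ok and count_ok

text = "hello world\n"  # stand-in for a recreated rsd.txt stream
ref = hashlib.md5(text.encode("utf-8")).hexdigest()
assert verify_rsd(text, ref, len(text))
```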
Each script contains user documentation as part of the script content;
you can run "perldoc" to view the documentation as a typical unix man
page, or you can simply open the script in a text editor and read the
documentation there. Also, running either script without
any command-line arguments will cause it to display a one-line
synopsis of its usage, and then exit.
ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)
ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives
8.2 ldclib -- general text conditioning, twitter harvesting
The "bin/" subdirectory of this package contains three executable scripts
(written in Ruby):
create_rsd.rb -- convert general xml or plain-text formats to "raw source
data" (rsd.txt), by removing markup tags and applying
sentence segmentation
token_parse.rb -- convert rsd.txt format into ltf.xml
get_tweet_by_id.rb -- download and condition Twitter data
Due to the Twitter Terms of Use, the text content of individual tweets
cannot be redistributed by the LDC. As a result, users must download
the tweet contents directly from Twitter and condition/normalize the
text in a manner equivalent to what was done by the LDC, in order to
reproduce the Vietnamese raw text that was used by LDC for annotation (to
be released separately). The twitter-processing software provided in
the tools/ directory enables users to perform this normalization and
ensure that the user's version of the tweet matches the version used
by LDC, by verifying that the md5sum of the user-downloaded and
processed tweet matches the md5sum provided in the twitter_info.tab
file. Users must have a developer account with Twitter in order to
download tweets, and the tool does not replace or circumvent the
Twitter API for downloading tweets.
The ./docs/twitter_info.tab file provides the twitter download id for each
tweet, along with the LORELEI file name assigned to that tweet and the
md5sum of the processed text from the tweet.
The file "README.md" in this directory provides details on how to install and
use the source code in this directory in order to condition text data that the
user downloads directly from Twitter and produce both the normalized raw text
and the segmented, tokenized LTF.xml output.
All LDC-developed supporting files (models, configuration files, library
modules, etc.) are included, either in the "lib" subdirectory (next to
"bin"), or else in the parent ("tools") directory.
Please refer to the README.md file that accompanies this software package.
8.3 sent_seg -- apply sentence segmentation to raw text
The Python tools in this directory are used as part of the conditioning done
by "create_rsd.rb" in the "ldclib" package. Please refer to the README.rst
file included with the package.
8.4 ne_tagger -- Named-Entity tagger for Vietnamese
Please refer to the tools/vie/ne_tagger/README.rst file for information about
usage and performance.
9.0 Documentation included in this release
The ./docs folder (relative to the root directory of this release)
contains four files documenting various characteristics of the source
data:
char_tally.{lng}.tab - contains tab separated columns: doc uid, number of
non-whitespace characters, number of non-whitespace characters in the
expected script, and number of anomalous (non-printing) characters for
each document in the release
source_codes.txt - contains tab-separated columns: genre, source code,
source name, and base url for each source in the release
twitter_info.tab - contains tab-separated columns: doc uid, tweet id,
normalized md5 of the tweet text, and tweet author id for all tweets in
the release
urls.tab - contains tab-separated columns: doc uid and url. Note that
the url column is empty for documents from older releases for which the
url is not available; those documents are included here so that the uid
column can serve as a document list for the package.
In addition, the grammatical sketch, namebook, annotation guidelines,
and cs_trans contents described in earlier sections of this README are
found in this directory.
10.0 KNOWN ISSUES
10.1 Some details are absent or misrepresented in some psm.xml files
After collection and processing of the monolingual text collection for
Vietnamese, quality control checks were applied prior to release of
the data, and these revealed that about 23% of the psm.xml files did
not pass XML validation. The problem involved the handling of anchor
tags and their "href" attributes as extracted from original HTML data.
The psm.xml files are supposed to identify the string extents that
were used to present the link to readers of the HTML page, and also
preserve the href value in an "attribute" tag. While the majority of
anchor tags have been preserved and presented correctly, many ended up
in badly formed "attribute" tags, with only partial strings for the
"href" values; however, there was never any problem in preserving the
correct character offsets for the strings used to present the links in
the HTML text.
For the present release, the affected psm.xml files were patched to
ensure that every file conforms to the psm DTD and causes no XML parse
errors. The repair involved replacing the faulty "attribute" elements
with well-formed tags, in which the "href" value is set to the string
from the raw text content that served as the extent for the link. In
many cases, this string turns out to be a well-formed URL, but
sometimes it contains a URL with other text or a URL fragment, and
other times it contains just a word, phrase or other non-URL string.
In applying the patch, the last case (a non-URL string) is marked in the
"attribute" tag with the string "url_lost:" as part of the "href" value. Whenever
the raw text extent of the link contained initial "http", "www" or
"mailto:", that string extent was used as the "href" value without
further ado, even though in some cases this string is not a complete
URL, or may contain other text following the URL.
10.2 Some double-escaped characters in ltf.xml data files
There are 15 ltf.xml data files in the "data/translation" directory (mostly in
"found" and "from_vie/eng") in which the text content includes double-escaped
strings like "&amp;apos;", "&amp;gt;", etc. When these are converted to raw text
(rsd.txt), the data will still contain strings like "&apos;", "&gt;", etc.
10.3 Some annotations crossing LTF.xml SEGMENT boundaries
In Entity and Noun-Phrase Chunking annotations, annotators were allowed to
select regions of text that extended from the end of one "segment" unit
(putative "sentence") to the beginning of the next segment. In the rare cases
where this occurred, automatic sentence segmentation had asserted a false
sentence boundary due to unexpected or ambiguous punctuation patterns.
10.4 Quality of found and crowdsourced translations
For found parallel text, an independent review of a sample of the
translations found that about 80% of translation pairs are good. The
remainder contains varying levels of misalignment and/or partial
translation. Some of the misalignments and partial translations are
caused by issues inherent in the raw data we collected, where the
translation is partial or differs slightly from the source text. We
plan to make some adjustments to the alignment tool and expect to see
some improvement in future releases.
For crowd translation, the same review found that about half of the
pairs are good. The remaining pairs include some obvious use of
machine translation, partial translations, or significantly inaccurate
translations.
11.0 Acknowledgements
The authors would like to acknowledge the following contributors to
this corpus: Brian Gainor, Ann Bies, Justin Mott, Neil Kuster,
University of Maryland Applied Research Laboratory for Intelligence
and Security (ARLIS), formerly UMD Center for Advanced Study of
Language (CASL), and our team of Vietnamese annotators.
This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract
No. HR0011-15-C-0123. Any opinions, findings and conclusions or
recommendations expressed in this material are those of the author(s)
and do not necessarily reflect the views of DARPA.
12.0 Copyright
Portions © 2002-2007, 2009-2010 Agence France Presse, © 2015 AloBacsi, © 2000 American Broadcasting Company, © 2015-2016 ASIANET, © 2015 BAN BIÊN TẬP TIN KINH TẾ, TTXVN, © 2015 Báo Bắc Giang, © 2015 Báo Bắc Ninh & Trung tâm Tin học Hành chính, © 2013 Báo Bình Định, © 2014-2016 BaoCalitoday.com, © 2015 Báo Dân Việt, © 2007-2016 Báo Diễn Đàn Doanh Nghiệp điện tử, © 2013 Báo điện tử Dân Việt, © 2014-2015 Báo điện tử Pháp Luật thành phố Hồ Chí Minh, © 2011 BÁO DOANH NHÂN SÀI GÒN ĐIỆN TỬ - DNSG Online, © 2015-2016 Báo Đời sống và Pháp luật, © 2013, 2015 Báo Lao Động - Cơ quan của Tổng Liên đoàn Lao động Việt Nam, © 2015 Báo Mới, © 2014-2016 Báo Nhân Dân thiết kế và giữ bản quyền, © 2016 Báo Phú Yên Online, © 2014 BÁO QUẢNG TRỊ ĐIỆN TỬ, © 2013 Báo SÀI GÒN GIẢI PHÓNG, © 2015 Báo Tài Nguyên và Môi trường, © 2014 Baotinnhanh.vn, © 2013 Báo Tin tức – TTXVN, © 2011, 2016 Báo VietNamNet, © 2013-2016 Báo Thanh Niên, © 2015 Baoxaydung.com.vn, © 2011-2016 BBC, © 2013, 2015-2016 BizLIVE.vn, © 2000 Cable News Network LP, LLLP, © 2008 Central News Agency (Taiwan), © 2013 Cơ quan chủ quản: Ủy ban nhân dân Tỉnh Quảng Ninh, © 2015 Công an TPHCM, © 2010-2016 Công ty Cổ phần VCCorp, © 2013 Công Ty Cổ Phần Kết Nối Y Tế, © 2015 Đại Kỷ Nguyên, © 2014 KHÁM PHÁ, © 2015-2016 Đài Tiếng nói nhân dân Thành phố Hồ Chí Minh, © 2016 Dantri.com.vn, © 1989 Dow Jones & Company, Inc., © 2016 Duong Bo, © 2011 Enternews.vn, © 2015 Gio Bao, © 2014 go.vn, © 2014-2016 Infonet, © 2005-2016 KhoaHoc.tv, © 2012-2015 Kieu.com, © 2008-2016 Lac Viet Computing Corporation, © 2005 Los Angeles Times - Washington Post News Service, Inc., © 2016 Microsoft, © 2000 National Broadcasting Company, Inc., © 1999, 2005, 2006, 2010 New York Times, © 2015 Người Việt Daily News, © 2012-2016 nguoiduatin.vn, © 2011 Nhan Hieu Viet, © 2008 Nongnghiep.vn, © 2014-2016 Nuathegioi.com, © 2015 Phương Đông Times, © 2000 Public Radio International, © 2008 SaigonTimesGroup, © 2015 Saigon Tin, © 2016 SNH, © 2015 SongKhoe.vn, © 2015 Sống Mới,
© 2012 Sputnik, © 2015-2016 SVHUAF, © 2013 Tạp chí Nhịp Cầu Đầu Tư, © 2015 Tạp Chí Thực Phẩm Chức Năng Health+, © 2014 Tây Ninh Online, © 2013-2016 Thanh Nien News, © 2003, 2005-2008, 2010 The Associated Press, © 2011, 2015 The Voice of Vietnam Online, © 2015 THST.vn, © 2016 Thời Báo Inc., © 2013, 2015 ThoitietVietnam.vn, © 2013-2016 Tiền Phong, © 2007, 2015-2016 Tin 247, © 2015 Tin Tuc Viet Nam, © 2015-2016 Trang Tin Tức Người Thanh Hóa, © 2012, 2015 Tin tức VTC News, © 2012-2016 Tinmoi, © 2013 Tri Thức Thời Đại, © 2012-2016 Trương Tấn Sang, © 2005, 2013-2016 TUOITRE.VN, © 2015 VGT Media Co, Ltd., © 2006-2016 VIỄN ĐÔNG DAILY NEWS, © 2013 VIỆN SỐT RÉT - KÝ SINH TRÙNG - CÔN TRÙNG TRUNG ƯƠNG, © 2007, 2013, 2014 Viet Bao, © 2016 Vietbao.com, © 2001 Vietbao.vn, © 2015 VietNamNet, © 2015 VietnamPlus, TTXVN, © 2014 Viet Times, © 2010 Vinanet, © 2001, 2014 VnExpress.net, © 2014 Xaluan.com, © 2003, 2005-2008 Xinhua News Agency, © 2011-2016 Zing.vn, © 2016, 2020 Trustees of the University of Pennsylvania
13.0 CONTACTS
Jennifer Tracey - LORELEI Project Manager
Stephanie Strassel - LORELEI PI