Corpus Title:    AIDA Scenario 1 Practice Topic Source Data
LDC Catalog-ID:  LDC2023T11

Authors: Jennifer Tracey, Stephanie Strassel, Jeremy Getman, Ann Bies, Kira Griffitt, David Graff, Chris Caruso

1.0 Introduction

This corpus was developed by the Linguistic Data Consortium for the DARPA AIDA
Program and consists of 1511 documents including text, image, and video
from English, Russian, and Ukrainian web sources.  Details of data volumes for
each language and media type are provided in section 3 of this README.

The AIDA (Active Interpretations of Disparate Alternatives) Program is
designed to support development of technology that can assist in cultivating
and maintaining understanding of events when there are conflicting accounts of
what happened (e.g. who did what to whom and/or where and when events
occurred).  AIDA systems must extract entities, events, and relations from
individual multimedia documents, aggregate that information across documents
and languages, and produce multiple knowledge graph hypotheses that
characterize the conflicting accounts that are present in the corpus (see
https://www.darpa.mil/program/active-interpretation-of-disparate-alternatives
for more information about the program).

Each phase of the AIDA program focused on a different scenario, or broad topic
area. The scenario for Phase 1 was political relations between Russia and
Ukraine in the 2010s. In addition, each scenario had a set of specific
subtopics within the scenario that were designated as either "practice topics"
(released as for use in system development) or "evaluation topics" (reserved
for use in the AIDA program evaluations for each phase).

Data collection for this program included both topic-focused data (containing
information about specific subtopics of interest within the larger scenario)
as well as background data (a large volume of data in the target languages and
media types with no topic focus or requirements). This corpus comprises the
full set of topic-focused documents for the practice topics within the Phase 1
Russia-Ukraine scenario.

1.1 AIDA Scenario 1 Topics

R103 - Who Started the Shooting at Maidan?
R105 - Ukrainian War Ceasefire Violations in Battle of Debaltseve (Jan-Feb 2015)
R107 - Donetsk and Luhansk Referendum, aka Donbas Status Referendum (May 2014)
T101 - Crash of Malaysian Air Flight MH17
T102 - Flight of Deposed Ukrainian President Viktor Yanukovych
T106 - Humanitarian Crisis in Eastern Ukraine (July-August 2014)


2.0 Directory Structure

The directory structure and contents of the package are summarized below --
paths shown are relative to the base (root) directory of the package:

  ./data/ -- contains zip files subdivided by data type (see below)
  ./docs/ -- contains tab-delimited table files (see descriptions in section 7)
  ./tools/ -- contains software for text data manipulation

The "data" directory has a separate subdirectory for each of the following
data types, and each directory contains one or more zip archives with data
files of the given type; the list shows the archive-internal directory and
file-extension strings used for the data files of each type:

    gif/*.gif.zip -- contains "gif/*.gif.ldcc" files (image data)
    jpg/*.jpg.zip -- contains "jpg/*.jpg.ldcc" files (image data)
    mp3/*.mp3.zip -- contains "mp3/*.mp3.ldcc" files (typically audio)
    mp4/*.mp4.zip -- contains "mp4/*.mp4.ldcc" files (typically video)
    png/*.png.zip -- contains "png/*.png.ldcc" files (image data)

    ltf/*.ltf.zip -- contains "ltf/*.ltf.xml" (segmented/tokenized text data)
    psm/*.psm.zip -- contains "psm/*.psm.xml" files (companion to ltf.xml)

Data types in the first group consist of original source materials presented
in "ldcc wrapper" file format (see section 4.2 below).  The latter group (ltf
and psm) are created by LDC from source HTML data, by way of an intermediate
XML reduction of the original HTML content for "root" web pages (see section
4.1 for a description of the process, and section 5 for details on the LTF and
PSM file formats).

The 6-character file-ID of the zip archive matches the first 6 characters of
the 9-character file-IDs of the data files it contains.  For example:

  zip archive file ./data/png/HC000S.png.zip contains:

    png/HC000SOL3.png.ldcc
    png/HC000SXAF.png.ldcc
    ...
    png/HC000SY42.png.ldcc
    png/HC000SXTY.png.ldcc


(The "ldcc" file format is explained in more detail in section 4.2 below.)
Note that the number of data files per zip archive varies. In the present
release, the largest single zip archive has over 2400 files.


3.0 Content Summary

Throughout the AIDA data sets, the concept of "root" or "parent" documents is
used to denote the original content of a web page, which may include any
combination of "document elements" or "child assets".  The parent or root
document refers to the entire collection of text, image, video, and audio
presented on a single page on the Internet.  The child assets refer to each
individual text, image, video, or audio file collected, processed, and
presented in the corpus as a part of the parent document.

All documents in this corpus were manually identified by annotators as
relevant to one or more of the Phase 1 Scenario topics.

The number of parent documents and text, image, video, and audio child assets
in this corpus are listed below.

Source/Prov. #RootDocs  #Texts   #Images  #Videos  #Audios
----------------------------------------------------------
Twitter/diy    444        444*     470*      6*     0
Other/zipped  1067        850     4029     285      1
Total         1511       1294     4499     291      1

"#RootDocs" refers to the number of root HTML pages or Tweets that were
scouted and harvested.  "#Texts" refers to text content (if any was
successfully harvested) converted to LTF and PSM formats; the discrepancy
relative to "#RootDocs" represents the number of web pages where text content
was either non-existent or not readily extactable from the HTML markup.  The
other columns indicate the total number of data files of the various types
extracted from those root pages.  The child-asset quantities in the
"Twitter/diy" row are marked by asterisks because those data items are not
included in this release -- users will need to use the software provided here
to harvest and process those pieces of data (to the extent that they are still
accessible via the twitter.com service).


4.0 Data Processing and Character Normalization

Most of the content has been harvested from various web sources using an
automated system that is driven by manual scouting for relevant material.
Some content may have been harvested manually, or by means of ad-hoc scripted
methods for sources with unusual attributes.

4.1 Treatment of original HTML text content

All harvested HTML content was initially converted from its original form into
a relatively uniform XML format; this stage of conversion eliminated
irrelevant content (menus, ads, headers, footers, etc.), and placed the
content of interest into a simplified, consistent markup structure.

The "homogenized" XML format then served as input for the creation of
a reference "raw source data" (rsd) plain text form of the web page
content; at this stage, the text was also conditioned to normalize
white-space characters, and to apply transliteration and/or other
character normalization, as appropriate to the given language.

This processing creates the ltf.xml and psm.xml files for each harvested
"root" web page; these file formats are described in more detail in section 5
below.

4.2 Treatment of non-HTML data types: "ldcc" file format

To the fullest extent possible, all discrete resources referenced by a given
"root" HTML page (style sheets, javascript, images, media files, etc.) are
stored as separate files of the given data type, and assigned separate
9-character file-IDs (the same form of ID as is used for the "root" HTML
page).

In order to present these attached resources in a stable and consistent way,
the LDC has developed a "wrapper" or "container" file format, which presents
the original data as-is, together with a specialized header block prepended to
the data.  The header block provides metadata about the file contents,
including the MD5 checksum (for self-validation), the data type and byte count,
url, and citations of source-ID and parent (HTML) file-ID.

The LDCC header block always begins with a 16-byte ASCII signature, as shown
between double-quotes on the following line (where "\n" represents the ASCII
"newline" character 0x0A):

"LDCc   \n1024   \n"

Note that the "1024" on the second line of the signature represents the exact
byte count of the LDCC header block.  (If/when this header design needs to
accommodate larger quantities of metadata, the header byte count can be
expanded as needed in increments of 1024 bytes.  Such expansion does not arise
in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data
structure comprising the file-specific header content, expressed as a set of
"key: value" pairings in UTF-8 encoding.

The YAML string is padded at the end with space characters, such that when the
following 8-byte string is appended, the full header block size is exactly
1024 bytes (or whatever size is stated in the initial signature):

"endLDCc\n"

In order to process the content of an LDCC header:

 - read the initial block of 1024 bytes from the *.ldcc data file
 - check that it begins with "LDCc   \n1024   \n" and ends with "endLDCc\n"
 - strip off those 16- and 8-byte portions
 - pass the remainder of the block to a YAML parser.

In order to access the original content of the data file, simply skip or
remove the initial 1024 bytes.


5.0 Overview of XML Data Structures

5.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags
needed to represent the structure of the relevant text as seen by the human
web-page reader.  When the text content of the XML file is extracted to create
the "rsd" format (which contains no markup at all), the markup structure is
preserved in a separate "primary source markup" (psm.xml) file, which
enumerates the structural tags in a uniform way, and indicates, by means of
character offsets into the rsd.txt file, the spans of text contained within
each structural markup element.

For example, in a discussion-forum or web-log page, there would be a division
of content into the discrete "posts" that make up the given thread, along with
"quote" regions and paragraph breaks within each post.  After the HTML has
been reduced to uniform XML, and the tags and text of the latter format have
been seprated, information about each structural tag is kept in a psm.xml
file, preserving the type of each relevant structural element, along with its
essential attributes ("post_author", "date_time", etc.), and the character
offsets of the text span comprising its content in the corresponding rsd.txt
file.

5.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully
segmented and tokenized version of the text content for a given web page.
Segments (sentences) and the tokens (words) are marked off by XML tags (SEG
and TOKEN), with "id" attributes (which are only unique within a given XML
file) and character offset attributes relative to the corresponding rsd.txt
file; TOKEN tags have additional attributes to describe the nature of the
given word token.

The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly by
suitable punctuation in the original source data.  To the extent that sentence
boundaries cannot be accurately detected (due to variability or ambiguity in
the source data), the segmentation process will tend to err more often on the
side of missing actual sentence boundaries, and (we hope) less often on the
side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play particular
roles in web-based text (e.g. URLs, email addresses and hashtags).  To the
extent that word boundaries are not explicitly marked in the source text, the
LTF tokenization is intended to divide the raw-text character stream into
units that correspond to "words" in the linguistic sense (i.e. basic units of
lexical meaning).


6.0 Software tools included in this release

6.1 ltf2txt

A data file in ltf.xml format (as described above) can be conditioned to
recreate exactly the the "raw source data" text stream (the rsd.txt file) from
which the LTF was created.  The tools described here can be used to apply that
conditioning, either to a directory or to a zip archive file containing
ltf.xml data.  In either case, the scripts validate each output rsd.txt stream
by comparing its MD5 checksum against the reference MD5 checksum of the
original rsd.txt file from which the LTF was created.  (This reference
checksum is stored as an attribute of the "DOC" element in the ltf.xml
structure; there is also an attribute that stores the character count of the
original rsd.txt file.)

Each script contains user documentation as part of the script content; you can
run "perldoc" to view the documentation as a typical unix man page, or you can
simply view the script content directly by whatever means to read the
documentation.  Also, running either script without any command-line arguments
will cause it to display a one-line synopsis of its usage, and then exit.

   ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)

   ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

6.2 twitter-processing

Due to the Twitter Terms of Use, the content of individual tweets
cannot be redistributed by the LDC.  As a result, users must download
the tweet contents directly from Twitter.  The twitter-processing
software provided in the tools/ directory enables users to perform the
same normalization applied by LDC and ensure that the user's version
of the tweet matches the version used by LDC, by verifying that the
md5sum of the user-downloaded and processed tweet matches the md5sum
provided in the twitter_info.tab file.  Users must have a developer
account with Twitter in order to download tweets, and the tool does
not replace or circumvent the Twitter API for downloading tweets.

The ./docs/twitter_info.tab file provides the twitter download id for each
tweet, along with the AIDA file name assigned to that tweet and the md5sum of
the processed text from the tweet.

The file "README.md" in this directory provides details on how to
install and use the source code in this directory in order to
condition text data that the user downloads directly from Twitter and
produce both the normalized raw text and the segmented, tokenized
LTF.xml output.

All LDC-developed supporting files (models, configuration files,
library modules, etc.) are included, either in the "lib" subdirectory
(next to "bin"), or else in the parent ("tools") directory.

The executable get_tweet_by_id.rb is located under tools/bin/ and can
be used to download and condition twitter text to match the version
used by LDC for annotation.

Please refer to the README.md file that accompanies this software
package.


7.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release) contains a
set of tab-delimited table files; each of these is described in a subsection
below.

In the following, the term "asset" refers to any single "primary" data file of
any given type.  Each asset has a distinct 9-character identifier.  If two or
more files appear with the same 9-character file-ID, this means that they
represent different forms or derivations created from the same, single primary
data file (e.g. this is how we mark corresponding LTF.xml and PSM.xml file
pairs).

Data scouting, annotation and related metadata are all managed with regard to
a set of "root" HTML pages (harvested by the LDC for a specified set of
topics); therefore the tables and annotations make reference to the asset-IDs
assigned to those root pages.

However, the present release does not include the original HTML text streams,
or any derived form of data corresponding to the full HTML content.  As a
result, the "root" asset-IDs cited in tables and annotations are not to be
found among the inventory of data files presented in zip archives in the
"./data" directory.

Each root asset is associated with one or more "child" assets (inlcuding
images, media files, style sheets, text data presented as ltf.xml, etc.); each
child asset gets it own distinct 9-character ID.  The root-child relations are
provided in "parent_files.tab" table (7.3), the "structure schema" xml files
(7.5), and as part of the LDCC header content in the various "wrapped" data
file formats (as listed in section 2).

7.1 "parent_children.tab" -- relation of child assets to root HTML pages

Each data file-ID in the set of zip archives is represented by the combination
of child_uid and child_asset (columns 4 and 6), along with its root UID in
column 3.

 Col.#  Content
 1. parent_uid
 2. child_uid
 3. url
 4. child_asset type (e.g. ".jpg.ldcc")
 5. topic
 6. lang_id (automatically detected language)
 7. lang_manual (manually annotated language, if available)
 8. rel_pos (position of this asset relative to other child assets on page)
 9. wrapped_md5 (md5 checksum of .ldcc formatted asset file)
 10. unwrapped_md5 (md5 checksum of original asset data file)
 11. download_date (download date of asset)
 12. content_date (creation date of asset, or n/a)
 13. status_in_corpus ("present", or "diy" for Twitter assets)

Notes:

- Because ltf and psm files have the same "child" uid and differ only in the file
  extension (.ltf.xml or .psm.xml), only the ltf files are listed in the
  parent_children.tab document.

- The URL provided for each .ltf.xml entry in the table is the "full-page" URL
  for root document associated with the "parent_uid" value. (For other types of
  child data -- images and media -- the "url" field contains the specific url
  for that specific piece of content.)

- Because the harvesting of some root URLs yielded no text content (hence no
  ltf/psm data files), the table includes "placeholder" .ltf.xml entries for
  those parent_uids, in order to provide the full-page URL for every root. The
  "status_in_corpus" and "child_uid" fields for these entries is set to "n/a"; in
  the present release, this applies to 218 of the 1512 root URLs in the table.

- Some child_uids (for images or videos) may appear multiple times in the table,
  if they were found to occur identically in multiple root web pages.


7.2 "twitter_info.tab" -- summary of Twitter assets

For each tweet collected, a row listing asset uid, tweet ID, user ID, and
topic UID is included:

 Col.#  Content
 1. uid
 2. tweet_id (Twitter-provided tweet ID)
 3. user_id (Twitter-provided user ID)
 4. topic_uid (one or more topic IDs relatable to this Tweet, comma-separated)

The tweet_id can be used to download the tweet directly from Twitter through
their API.  In future releases, we will provide tools to ensure that the
downloaded tweet contents match those retrieved by LDC so that any annotations
can be correctly aligned with the tweet.

8.0 Acknowledgements

The authors would like to acknowlege the following contributors to
this corpus: Justin Mott, Alex Shelmire, Seth Kulick, MITRE
Corporation, especially Lisa Ferro, and our team of AIDA annotators.

This material is based upon work supported by Air Force Research
Laboratory (AFRL) and the Defense Advanced Research Projects Agency
(DARPA) under Contract No. FA8750-18-C-0013. 

9.0 Copyright

© 2014 0342.ua, © 2014 ABC News Internet Ventures, © 2018 About the
West, © 2014 Agency for Information and Analytics, © 2015 Al Jazeera
Media Network, © 2017-2018 ANO Creative Team Expert, © 2014, 2017 ANO
RID Novaya Gazeta, © 2014 ANTIKOR,© 2017 Apostrophe, © 2014
“ARGUMENT,” © 2014, 2017 Arguments and Facts, © 2014 Associated
Newspapers Ltd, © 2015 Athens News, © 2014-2017 Autonomous Nonprofit
Organization “TV-Novosti,” © 2014-2017 BBC, © 2015, 2017-2018
Bellingcat, © 2014 Bessarabia INFORM, © 2016 Bird In Flight, © 2014
BIZNESGRUPP TOV, © 2014 Boston Globe Media Partners, LLC, © 2018
Business capital, © 2014 BuzzFeed, Inc., © 2014, 2016 Cable News
Network. A Warner Bros. Discovery Company, © 2015 Carnegie Endowment
for International Peace, © 2015 CBS Interactive Inc., © 2014, 2017
Censor.NET, © 2015 Channel 5, © 2017 Charter ’97, © 2014 CJSC
Moskovsky Komsomolets, MK.ru, © 2014 CNBC LLC, © 2014 Colta.com, ©
2018 Conflicts and laws, © 2014, 2016 Consortium News, © 2016 Crime
NO, © 2015 Dawn of Novorossiya, © 2016 Depo.ua, © 2014 Dicasterium pro
Communicationene, © 2016 DutchNews, © 2017 "Echo of the Planet," ©
2017 EN.News Front, © 2014, 2017 ESPRESO.TV, © 2015 Eubulletin.com, ©
2015-2016 Euromaidan Press, © 2017 Express, © 2014 "Facts and
Comments," © 2015 FAN, © 2022 Federal State Budgetary Institution
"Editorial Office of Rossiyskaya Gazeta," © 2014-2015 First Channel, ©
2014 Focus, © 2015 Forbes Media LLC, © 2017 Future Publishing Limited,
Quay House, The Ambury, Bath BA1 1UA, © 2018 Geopoliticalmonitor
Intelligence Corp., © 2014 Haaretz Daily Newspaper Ltd., © 2014
Infowars, © 2014 Interlocutor, © 2014-2015, 2017 Golden Mean LLC, ©
2017 GolosIslam.RU, © 2014 GORDON, © 2014 Gorlovka.ua, © 2015 Graphic
News Ltd, © 2014 Guardian News & Media Limited or its affiliated
companies,© 2014 High Castle Online, ©2014 HotAir.com/Salem Media, ©
2014 Hürriyet Daily News, © 2017 HVILYA, © 2018 IA "InfoResist," ©
2015-2016 IA "Russia Today," © 2014 IBTimes Co., Ltd, © 2017-2018 InA
"Ukrainian News," © 2014 InfoKava.com, © 2015 Information agency
LIGABusinessInform, © 2014 Information and analytical publication "One
Motherland," © 2014 InoSMI.ru, © 2014 INSIDER, © 2014 Insider Inc., ©
2016 Interfax-Ukraine, © 2014 Internet Television "Piter.TV," © 2014
IP Filin M.S., © 2014, 2017-2018 JSC “Gazeta.Ru,” © 2014-2015, 2017
JSC "Kommersant," © 2014, 2017 JSC ROSBUSINESSCONSULTING,© 2014-2015,
2017 JSC TRK AF RF ZVEZDA, © 2017 Korrespondent.net, © 2017 Lenta.Ru
LLC, © 2014 LLC “Kurs,” © 2014 LLC "National Information Systems," ©
2015 LLC "Rusevik," © 2017 LLC “UKRAINIAN PRESS GROUP,” © 2014 Los
Angeles Times, © 2014 M24, © 2014 Mashable, Inc., © 2015 Max Park, ©
2015, MEDIA-DK PUBLISHING HOUSE LLC, © 2014 “MEDIASAPIENS,” ©
2015-2016 Meduza, © 2016 mirnews.su, © 2014 “Mirror of the
Week. Ukraine," © 2014 Moscow Digital Media LLC, © 2014 Naharnet, ©
2018 National Post, a division of Postmedia Network Inc., © 2017
National Bank of News, © 2015 Nationwide News Pty Ltd, © 2014 NBC
Universal, © 2014 NDTV Convergence Limited, © 2017 News24 Today, ©
2015 News of Ukraine on Rivnist.In.Ua, © 2016 NEWSWEEK DIGITAL LLC, ©
2014 NGO "Transcarpathian Free Media," © 2014 Nine Digital Network, ©
2014 npr, © 2014-2015, 2017 Online edition "Vesti.Ru," © 2014
ONLINE.UA, © 2014, 2017 Organization for Security and Cooperation in
Europe, © 2014 OstroV, © 2014 Paris Match, © 2014 PE "Ukraine Young,"
© 2017 Politeka, © 2017 "Politic.Kiev.Ua," © 2017 PolitRussia,© 2017
POWER NET, © 2015-2016 Present Time, © 2015-2016 Public Television, ©
2014, 2016-2017 Publishing House <Komsomolskaya Pravda> JSC, ©
2014-2015, 2017 Radio Liberty,© 2014 Rakurs, © 2016 Rambler, © 2017
Replyua.net, © 2014 Reuters, © 2014-2015 RFE/RL, Inc., © 2015 Russia
Insider, © 2014-2015 segodnya.ua, © 2015 sevascom, © 2017 Spiegel
Group, © 2014, 2016 Sputnik, © 2014 SVIT24.NET, © 2014, 2017 TASS,
Russian news agency, © 2014-2015 Telegraph Media Group Limited, ©
2014-2017 Television and Radio Company Lux, TV Channel 24, © 2014,
2016 Television news service, © 2014 The Atlantic Monthly Group, ©
2014 The Christian Science Monitor, © 2014, 2017 The Daily Beast
Company LLC, © 2014 The Economist Newspaper Limited, © 2017 The
EurAsian Times, © 2017-2018 THE FINANCIAL TIMES LTD, © 2014 The Globe
and Mail Inc., © 2015 THE IRISH TIMES, © 2017 The Moscow Times, ©
2014-2015 The New York Times Company, © 2014 The Slate Group, © 2018
The Times of Israel, © 2014 The Washington Post, © 2014 The World from
PRX, © 2014, 2016 TIME USA, LLC, © 2014 TOPNEWS.RU, © 2014 Toronto
Star Newspapers Ltd., © 2014-2016 TOV "KEPRATE PARTNERS," © 2014
Transcarpathia online Beta, © 2014-2015, 2017 TV Center JSC, ©
2014–2015, 2017 Tyzhden.ua, © 2014 uapress, © 2014-2016 “Ukrainian
media systems,”© 2017 Ukrainian National News Agency, © 2017 Ukrainian
Truth, © 2014 Ukrinform, © 2014-2017 UNIAN.NET, © 2014-2017 USA TODAY,
a division of Gannett Satellite Information Network, LLC, © 2017
Vchasno LLC, © 2017 Vchasno news agency - Donbass news, © 2014 VGOS
INFORMATION AGENCY, © 2014 vinnitsa.info, © 2014 Volyn News Agency, ©
2014 VolynPost, © 2014 Voskanapat.info, © 2014 Vox Media, LLC, © 2014
Vysokyi Zamok Publishing House LLC, © 2014 WINDOWS, © 2017 "Word and
Deed,"© 2014 worldnewsage.com, © 2015 XINHUANET.com, © 2017-2018
YouTube, LLC, © 2014 ZAKHID.NET LLC, © 2015 zbruc.eu, © 2014
Zhitomir-Online, © 2017 Znaj.ua, © 2022 Trustees of the University of
Pennsylvania


10.0 Contacts

Stephanie Strassel <strassel@ldc.upenn.edu> - AIDA PI


------
README created by Chris Caruso on April 5, 2022
       updated by Jennifer Tracey on April 26, 2022