Corpus Title:    AIDA Scenario 2 Practice Topic Source Data
LDC Catalog-ID:  LDC2024T04

Authors: Jennifer Tracey, Stephanie Strassel, Jeremy Getman, Ann Bies,
Kira Griffitt, David Graff, Chris Caruso

1.0 Introduction

This corpus was developed by the Linguistic Data Consortium for the
DARPA AIDA Program and consists of 10352 multimedia data files (text,
image, and video) from English, Spanish, and Russian web sources.
Details of data volumes for each language and media type are provided
in section 3 of this README.

The AIDA (Active Interpretations of Disparate Alternatives) Program is
designed to support development of technology that can assist in
cultivating and maintaining understanding of events when there are
conflicting accounts of what happened (e.g. who did what to whom
and/or where and when events occurred).  AIDA systems must extract
entities, events, and relations from individual multimedia documents,
aggregate that information across documents and languages, and produce
multiple knowledge graph hypotheses that characterize the conflicting
accounts that are present in the corpus (see
https://www.darpa.mil/program/active-interpretation-of-disparate-alternatives
for more information about the program).

Each phase of the AIDA program focused on a different scenario, or
broad topic area. The scenario for Phase 2 was the socioeconomic and
political crisis in Venezuela since 2010. In addition, each scenario
had a set of specific subtopics within the scenario that were
designated as either "practice topics" (released for use in system
development) or "evaluation topics" (reserved for use in the AIDA
program evaluations for each phase).

Data collection for this program included both topic-focused data
(containing information about specific subtopics of interest within
the larger scenario) as well as background data (a large volume of
data in the target languages and media types with no topic focus or
requirements). This corpus comprises the full set of topic-focused
documents for the practice topics within the Phase 2 Venezuela
scenario.

1.1 AIDA Scenario 2 Topics

T201 - 2014 Disease Outbreak in Venezuela
T202 - 2017 Venezuelan Constituent Assembly Election
T203 - Drone Explosions in Caracas


2.0 Directory Structure

The directory structure and contents of the package are summarized below --
paths shown are relative to the base (root) directory of the package:

  ./data/ -- contains zip files subdivided by data type (see below)
  ./docs/ -- contains tab-delimited table files (see descriptions in section 7)
  ./tools/ -- contains software for text data manipulation

The "data" directory has a separate subdirectory for each of the following
data types, and each directory contains one or more zip archives with data
files of the given type; the list shows the archive-internal directory and
file-extension strings used for the data files of each type:

    gif/*.gif.zip -- contains "gif/*.gif.ldcc" files (image data)
    jpg/*.jpg.zip -- contains "jpg/*.jpg.ldcc" files (image data)
    mp3/*.mp3.zip -- contains "mp3/*.mp3.ldcc" files (typically audio)
    mp4/*.mp4.zip -- contains "mp4/*.mp4.ldcc" files (typically video)
    png/*.png.zip -- contains "png/*.png.ldcc" files (image data)
    svg/*.svg.zip -- contains "svg/*.svg.ldcc" files (image data)

    ltf/*.ltf.zip -- contains "ltf/*.ltf.xml" files (segmented/tokenized text data)
    psm/*.psm.zip -- contains "psm/*.psm.xml" files (companion to ltf.xml)

Data types in the first group consist of original source materials presented
in "ldcc wrapper" file format (see section 4.2 below).  The latter group (ltf
and psm) are created by LDC from source HTML data, by way of an intermediate
XML reduction of the original HTML content for "root" web pages (see section
4.1 for a description of the process, and section 5 for details on the LTF and
PSM file formats).

The 6-character file-ID of the zip archive matches the first 6 characters of
the 9-character file-IDs of the data files it contains.  For example:

  zip archive file ./data/png/HC000S.png.zip contains:

    png/HC000SOL3.png.ldcc
    png/HC000SXAF.png.ldcc
    ...
    png/HC000SY42.png.ldcc
    png/HC000SXTY.png.ldcc


(The "ldcc" file format is explained in more detail in section 4.2 below.)
Note that the number of data files per zip archive varies. In the present
release, the largest single zip archive has over 2400 files.
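
The naming convention above can be sketched in Python as follows.  This is
only an illustration, not part of the release: the function name "locate" is
hypothetical, and the paths follow the layout listed earlier in this section
(".xml" for ltf and psm files, ".ldcc" for all other data types).

```python
from pathlib import PurePosixPath

# ltf and psm files are plain XML; every other data type listed in
# section 2 is stored in the "ldcc" wrapper format.
XML_TYPES = {"ltf", "psm"}


def locate(file_id: str, data_type: str) -> tuple[str, str]:
    """Return (zip archive path, archive-internal member name)."""
    if len(file_id) != 9:
        raise ValueError("expected a 9-character file-ID")
    suffix = "xml" if data_type in XML_TYPES else "ldcc"
    # The archive name uses the first 6 characters of the file-ID.
    archive = PurePosixPath("data", data_type, f"{file_id[:6]}.{data_type}.zip")
    member = f"{data_type}/{file_id}.{data_type}.{suffix}"
    return str(archive), member


print(locate("HC000SOL3", "png"))
# ('data/png/HC000S.png.zip', 'png/HC000SOL3.png.ldcc')
```

The returned pair can be passed to the standard "zipfile" module, e.g.
zipfile.ZipFile(archive).read(member), with the archive path taken relative
to the root directory of the package.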


3.0 Content Summary

Throughout the AIDA data sets, the concept of "root" or "parent" documents is
used to denote the original content of a web page, which may include any
combination of "document elements" or "child assets".  The parent or root
document refers to the entire collection of text, image, video, and audio
presented on a single page on the Internet.  The child assets refer to each
individual text, image, video, or audio file collected, processed, and
presented in the corpus as a part of the parent document.

All documents in this corpus were manually identified by annotators as
relevant to one or more of the Phase 2 Scenario topics.

The numbers of parent documents and of text, image, video, and audio child
assets in this corpus are listed below.

#RootDocs  #Texts   #Images  #Videos  #Audios
---------------------------------------------
1500       1327     8619     337      1

"#RootDocs" refers to the number of root HTML pages that were
scouted and harvested.  "#Texts" refers to text content (if any was
successfully harvested) converted to LTF and PSM formats; the discrepancy
relative to "#RootDocs" represents the number of web pages where text content
was either non-existent or not readily extractable from the HTML markup.  The
other columns indicate the total number of data files of the various types
extracted from those root pages.  


4.0 Data Processing and Character Normalization

Most of the content has been harvested from various web sources using an
automated system that is driven by manual scouting for relevant material.
Some content may have been harvested manually, or by means of ad-hoc scripted
methods for sources with unusual attributes.

4.1 Treatment of original HTML text content

All harvested HTML content was initially converted from its original form into
a relatively uniform XML format; this stage of conversion eliminated
irrelevant content (menus, ads, headers, footers, etc.), and placed the
content of interest into a simplified, consistent markup structure.

The "homogenized" XML format then served as input for the creation of
a reference "raw source data" (rsd) plain text form of the web page
content; at this stage, the text was also conditioned to normalize
white-space characters, and to apply transliteration and/or other
character normalization, as appropriate to the given language.

This processing creates the ltf.xml and psm.xml files for each harvested
"root" web page; these file formats are described in more detail in section 5
below.

4.2 Treatment of non-HTML data types: "ldcc" file format

To the fullest extent possible, all discrete resources referenced by a given
"root" HTML page (style sheets, javascript, images, media files, etc.) are
stored as separate files of the given data type, and assigned separate
9-character file-IDs (the same form of ID as is used for the "root" HTML
page).

In order to present these attached resources in a stable and consistent way,
the LDC has developed a "wrapper" or "container" file format, which presents
the original data as-is, together with a specialized header block prepended to
the data.  The header block provides metadata about the file contents,
including the MD5 checksum (for self-validation), the data type and byte count,
url, and citations of source-ID and parent (HTML) file-ID.

The LDCC header block always begins with a 16-byte ASCII signature, as shown
between double-quotes on the following line (where "\n" represents the ASCII
"newline" character 0x0A):

"LDCc   \n1024   \n"

Note that the "1024" on the second line of the signature represents the exact
byte count of the LDCC header block.

Immediately after the 16-byte signature, a YAML string presents a data
structure comprising the file-specific header content, expressed as a set of
"key: value" pairings in UTF-8 encoding.

The YAML string is padded at the end with space characters, such that when the
following 8-byte string is appended, the full header block size is exactly
1024 bytes (or whatever size is stated in the initial signature):

"endLDCc\n"

In order to process the content of an LDCC header:

 - read the initial block of 1024 bytes from the *.ldcc data file
 - check that it begins with "LDCc   \n1024   \n" and ends with "endLDCc\n"
 - strip off those 16- and 8-byte portions
 - pass the remainder of the block to a YAML parser.

In order to access the original content of the data file, simply skip or
remove the initial 1024 bytes.
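
The steps above can be sketched in Python as follows.  This is a minimal
illustration under stated assumptions, not the LDC's own tooling: the function
name "split_ldcc" is hypothetical, and the header size is read from the
signature rather than hard-coded to 1024.  The YAML text is returned as a
string so that it can be handed to any YAML parser.

```python
SIG_LEN = 16               # "LDCc   \n" followed by e.g. "1024   \n"
TRAILER = b"endLDCc\n"     # 8-byte string that closes the header block


def split_ldcc(raw: bytes) -> tuple[str, bytes]:
    """Split the bytes of one *.ldcc file into (yaml_text, original_data)."""
    sig = raw[:SIG_LEN]
    if len(sig) != SIG_LEN or sig[:8] != b"LDCc   \n" or sig[15:16] != b"\n":
        raise ValueError("missing LDCc signature")
    # The second line of the signature states the size of the header block.
    header_size = int(sig[8:15].decode("ascii").strip())
    header = raw[:header_size]
    if len(header) != header_size or not header.endswith(TRAILER):
        raise ValueError("header block truncated or missing endLDCc trailer")
    # Strip the 16- and 8-byte portions, then the space padding.
    yaml_text = header[SIG_LEN:-len(TRAILER)].decode("utf-8").rstrip(" ")
    return yaml_text, raw[header_size:]
```

The returned data can be checked independently of the header: the MD5 digest
of the unwrapped bytes (hashlib.md5(data).hexdigest()) should equal the
"unwrapped_md5" value listed for that asset in parent_children.tab (section
7.1).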


5.0 Overview of XML Data Structures

5.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags
needed to represent the structure of the relevant text as seen by the human
web-page reader.  When the text content of the XML file is extracted to create
the "rsd" format (which contains no markup at all), the markup structure is
preserved in a separate "primary source markup" (psm.xml) file, which
enumerates the structural tags in a uniform way, and indicates, by means of
character offsets into the rsd.txt file, the spans of text contained within
each structural markup element.

For example, in a discussion-forum or web-log page, there would be a division
of content into the discrete "posts" that make up the given thread, along with
"quote" regions and paragraph breaks within each post.  After the HTML has
been reduced to uniform XML, and the tags and text of the latter format have
been separated, information about each structural tag is kept in a psm.xml
file, preserving the type of each relevant structural element, along with its
essential attributes ("post_author", "date_time", etc.), and the character
offsets of the text span comprising its content in the corresponding rsd.txt
file.

5.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully
segmented and tokenized version of the text content for a given web page.
Segments (sentences) and the tokens (words) are marked off by XML tags (SEG
and TOKEN), with "id" attributes (which are only unique within a given XML
file) and character offset attributes relative to the corresponding rsd.txt
file; TOKEN tags have additional attributes to describe the nature of the
given word token.
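
As an illustration, the SEG and TOKEN elements can be read with a standard XML
parser.  The sketch below uses Python's xml.etree.ElementTree; the offset
attribute names "start_char" and "end_char" are assumptions that this README
does not spell out and should be checked against an actual ltf.xml file.  The
function name "iter_tokens" is hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_tokens(ltf_xml):
    """Yield (seg_id, token_id, start, end, text) for every TOKEN element.

    ltf_xml may be the str or bytes content of one ltf.xml file.
    """
    root = ET.fromstring(ltf_xml)
    for seg in root.iter("SEG"):
        for tok in seg.iter("TOKEN"):
            yield (
                seg.get("id"),
                tok.get("id"),
                int(tok.get("start_char")),   # assumed attribute name
                int(tok.get("end_char")),     # assumed attribute name
                tok.text or "",
            )
```

Because the "id" values are unique only within one XML file, a caller that
pools tokens from several files needs to pair each id with its file-ID.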

The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly by
suitable punctuation in the original source data.  To the extent that sentence
boundaries cannot be accurately detected (due to variability or ambiguity in
the source data), the segmentation process will tend to err more often on the
side of missing actual sentence boundaries, and (we hope) less often on the
side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play particular
roles in web-based text (e.g. URLs, email addresses and hashtags).  To the
extent that word boundaries are not explicitly marked in the source text, the
LTF tokenization is intended to divide the raw-text character stream into
units that correspond to "words" in the linguistic sense (i.e. basic units of
lexical meaning).


6.0 Software tools included in this release

6.1 ltf2txt

A data file in ltf.xml format (as described above) can be conditioned to
recreate exactly the "raw source data" text stream (the rsd.txt file) from
which the LTF was created.  The tools described here can be used to apply that
conditioning, either to a directory or to a zip archive file containing
ltf.xml data.  In either case, the scripts validate each output rsd.txt stream
by comparing its MD5 checksum against the reference MD5 checksum of the
original rsd.txt file from which the LTF was created.  (This reference
checksum is stored as an attribute of the "DOC" element in the ltf.xml
structure; there is also an attribute that stores the character count of the
original rsd.txt file.)
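
The validation step can be sketched independently of the Perl scripts.  The
Python example below is an illustration only: the attribute names
"raw_text_md5" and "raw_text_char_length" on the DOC element are assumptions
not stated in this README, as is the choice of UTF-8 when computing the
checksum.  The function name "rsd_matches_ltf" is hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def rsd_matches_ltf(ltf_xml, rsd_text: str) -> bool:
    """Check a rebuilt rsd text stream against the reference values in DOC."""
    doc = next(ET.fromstring(ltf_xml).iter("DOC"), None)
    if doc is None:
        raise ValueError("no DOC element found")
    want_md5 = doc.get("raw_text_md5")            # assumed attribute name
    want_len = doc.get("raw_text_char_length")    # assumed attribute name
    got_md5 = hashlib.md5(rsd_text.encode("utf-8")).hexdigest()
    # Both the checksum and the character count must agree.
    return got_md5 == want_md5 and str(len(rsd_text)) == want_len
```

If either attribute is absent under the assumed name, the function returns
False rather than reporting a match.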

Each script contains user documentation as part of the script content; you can
run "perldoc" to view the documentation as a typical unix man page, or you can
simply open the script in any text viewer to read the documentation.  Also,
running either script without any command-line arguments will cause it to
display a one-line synopsis of its usage, and then exit.

   ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)

   ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives


7.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release) contains a
set of tab-delimited table files; each of these is described in a subsection
below.

In the following, the term "asset" refers to any single "primary" data file of
any given type.  Each asset has a distinct 9-character identifier.  If two or
more files appear with the same 9-character file-ID, this means that they
represent different forms or derivations created from the same, single primary
data file (e.g. this is how we mark corresponding LTF.xml and PSM.xml file
pairs).

Data scouting, annotation and related metadata are all managed with regard to
a set of "root" HTML pages (harvested by the LDC for a specified set of
topics); therefore the tables and annotations make reference to the asset-IDs
assigned to those root pages.

However, the present release does not include the original HTML text streams,
or any derived form of data corresponding to the full HTML content.  As a
result, the "root" asset-IDs cited in tables and annotations are not to be
found among the inventory of data files presented in zip archives in the
"./data" directory.

Each root asset is associated with one or more "child" assets (including
images, media files, style sheets, text data presented as ltf.xml, etc.); each
child asset gets its own distinct 9-character ID.  The root-child relations are
provided in the "parent_children.tab" table (7.1), the "structure schema" xml files
(7.5), and as part of the LDCC header content in the various "wrapped" data
file formats (as listed in section 2).

7.1 "parent_children.tab" -- relation of child assets to root HTML pages

Each data file-ID in the set of zip archives is represented by the combination
of child_uid and child_asset type (columns 2 and 4), along with its root UID in
column 1.

 Col.#  Content
 1. parent_uid
 2. child_uid
 3. url
 4. child_asset type (e.g. ".jpg.ldcc")
 5. topic
 6. lang_id (automatically detected language)
 7. lang_manual (manually annotated language, if available)
 8. rel_pos (position of this asset relative to other child assets on page)
 9. wrapped_md5 (md5 checksum of .ldcc formatted asset file)
 10. unwrapped_md5 (md5 checksum of original asset data file)
 11. download_date (download date of asset)
 12. content_date (creation date of asset, or n/a)
 13. status_in_corpus ("present", or "diy" for Twitter assets)

Notes:
 
- Because ltf and psm files have the same "child" uid and differ only in the file 
  extension (.ltf.xml or .psm.xml), only the ltf files are listed in the 
  parent_children.tab document.
 
- The URL provided for each .ltf.xml entry in the table is the "full-page" URL 
  for the root document associated with the "parent_uid" value. (For other types of
  child data -- images and media -- the "url" field contains the specific url for 
  that specific piece of content.)
 
- Because the harvesting of some root URLs yielded no text content (hence no ltf/psm 
  data files), the table includes "placeholder" .ltf.xml entries for those parent_uids, 
  in order to provide the full-page URL for every root. The "status_in_corpus" and
  "child_uid" fields for these entries are set to "n/a".
 
- Some child_uids (for images or videos) may appear multiple times in the table, if 
  they were found to occur identically in multiple root web pages.
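
As an illustration of how the table can be used, the Python sketch below
builds a map from each root page to its child assets, using the column
positions listed above.  It is a sketch under stated assumptions: the function
name "children_by_parent" is hypothetical, and the code tolerates a header
row without assuming that one is present.

```python
import csv
from collections import defaultdict


def children_by_parent(lines):
    """Map parent_uid -> list of (child_uid, asset_type) present in the corpus."""
    table = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 13 or row[0] == "parent_uid":
            continue  # skip short lines and a header row, if there is one
        parent_uid, child_uid, asset_type, status = row[0], row[1], row[3], row[12]
        # Placeholder .ltf.xml entries have "n/a" here; "diy" assets are
        # not included in the zip archives.
        if child_uid == "n/a" or status != "present":
            continue
        pair = (child_uid, asset_type)
        if pair not in table[parent_uid]:
            table[parent_uid].append(pair)
    return dict(table)
```

The function accepts any iterable of lines, for example an open file handle
on ./docs/parent_children.tab.  A child_uid that occurs on several root pages
appears under each of those parents.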


7.2 "video_data.msb" -- summary of video shot boundary segments

For each video included in the release, a set of segments was
generated with the video shot boundary detector and is listed in this
file.

Col.#  Content
1. random video ID generated by cineast tool (e.g. "v_8iMHEy1DQzRUi7Ts")
2. AIDA video UID + segment number (e.g. "IC001V5WQ_1")
3. Shot start frame
4. Shot end frame
5. Shot start time in seconds
6. Shot end time in seconds
7. Representative frame number
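
As an illustration, one line of this file can be parsed as follows.  The
Python sketch assumes that the file is tab-delimited like the other tables in
./docs and that the frame columns hold integers; the names "Shot" and
"parse_shot" are hypothetical.

```python
from typing import NamedTuple


class Shot(NamedTuple):
    video_uid: str
    segment: int
    start_frame: int
    end_frame: int
    start_sec: float
    end_sec: float
    keyframe: int

    @property
    def duration(self) -> float:
        """Length of the shot in seconds."""
        return self.end_sec - self.start_sec


def parse_shot(line: str) -> Shot:
    cols = line.rstrip("\n").split("\t")
    # Column 2 joins the AIDA video UID and the segment number with "_".
    video_uid, segment = cols[1].rsplit("_", 1)
    return Shot(video_uid, int(segment), int(cols[2]), int(cols[3]),
                float(cols[4]), float(cols[5]), int(cols[6]))
```

The video UID recovered from column 2 can be matched against the child_uid
column of parent_children.tab (section 7.1) to find the root page of a video.
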

8.0 Known Issues

The files JC002YEO3.psm.xml and JC002YEO3.ltf.xml have been removed due to technical issues.


9.0 Acknowledgements

The authors would like to acknowledge the following contributors to
this corpus: Justin Mott, Alex Shelmire, Seth Kulick, MITRE
Corporation, especially Lisa Ferro, and our team of AIDA annotators.

This material is based upon work supported by Air Force Research
Laboratory (AFRL) and the Defense Advanced Research Projects Agency
(DARPA) under Contract No. FA8750-18-C-0013. 


10.0 Copyright

Portions © 2015 21st Century Wire, © 2020 ABC, © 2013 ABC News
Internet Ventures, © 2014, 2017-2018 Alba Ciudad 96.3 FM, © 2017 AL
DÍA NEWS Media, © 2017-2018 Al Jazeera Media Network, © 2018
AméricaEconomía, © 2019 American Association for the Advancement of
Science, © 2019 Americas Society/Council of the Americas, © 2020 AMX
Content SA de CV, © 2014, 2017 Arguments and Facts JSC, © 2014
ARMENPRESS, © 2018 Authorized by the Chief Agent, CPC, © 2014,
2017-2018 Autonomous Nonprofit Organization “TV-Novosti”, © 2013-2014,
2018-2019 BBC, © 2015, 2017-2018 Bellingcat, © 2019 Breitbart, © 2018
Business capital, © 2020 business/media bureau ekonomika, © 2019-2020
C.A. IBERONEWS LIMITED, © 2018-2020 C.A. The Universe, © 2013, 2017
Cable News Network. Turner Broadcasting System, Inc., © 2017 Caracas
Chronicles, © 2018 Caracol SA, © 2018 CARACOL TELEVISIÓN SA, © 2013,
2017 CBC/Radio-Canada, © 2013 CBS Interactive Inc., © 2020 CDN, © 2017
Center for Democracy in the Americas, © 2014-2015 Channel One, © 2017
Chicago Tribune, © 2020 China Daily Information Co, © 2014 CJSC
Editorial office of the newspaper Moskovsky Komsomolets, © 2014 CNBC
LLC, © 2020 COHA, © 2014 Colombia Reports, © 2015, 2012 Comments, ©
2018 COMUNICAN SA, © 2018 Condé Nast, © 2019-2020 CounterPunch, © 2020
Crisis Group, © 2019 Dailymotion, © 2020 Daily News of Vladivostok, ©
2018 DiarioContraste.com, © 2017 Diariocorreo.pe, © 2014 Diario La
Voz, © 2018, 2020 Diario las Americas, © 2018 Dicasterium pro
Communicatione, © 2019 Dixi Media Digital, SL, © 2014 DolarToday.com, ©
2014, 2017 Dow Jones & Company, Inc., © 2020 EADaily, © 2014,
2017-2018 EDICIONES EL PAÍS SL, © 2018 Ediciones Prensa Libre SL, ©
2019 Editions CDR, © 2020 Editorial Ecoprensa, S.A., © 2017-2018
Editorial Office of Rossiyskaya Gazeta, © 2018 Editorial Prensa
Alicantina SAU, © 2018 Efecto Cocuyo CA, © 2020 EL COLOMBIANO S.A.S, ©
2014 Elcomercio.pe, © 2018, 2020 EL HERALDO S.A., © 2019 El Impulso, ©
2018 El Nuevo Herald, © 2019 EL PERIÓDICO DE CATALUNYA, SLU, ©
2019-2020 el Popular, © 2020 EL TERRITORIO, © 2017 EL TIEMPO Casa
Editorial, © 2017 El Tiempo Latino, © 2020 elucabista, © 2018-2019 El
Universal, © 2020 Encyclopedia Britannica, Inc., © 2019 Entravision, ©
2019 Epoch Times Russia, © 2019 euronews, © 2018-2019 Europa Press, ©
2018 Euroradio, © 2020 Excelsior, © 2014 FAN, © 2018 First News Media,
© 2014 Forbes.com LLC, © 2018 France 24, © 2017 Future Publishing
Limited, Quay House, The Ambury, Bath BA1 1UA, © 2020 GardaWorld, ©
2020 GlobalResearch.ca, © 2020 G/O Media Inc., © 2014-2015, 2017
Golden Middle LLC, © 2018-2019 Google LLC, © 2014 GORDON, © 2014
Graham Digital Holding Company, © 2018 Grupo La República
Publicaciones SA, © 2014 Guardian News and Media Limited or its
affiliated companies, © 2014 Haaretz Daily Newspaper Ltd., © 2020
Havana Times, © 2019 HindustanTimes, © 2018, 2020 HispanTV, © 2020
Houston Public Media, A Service of the University of Houston, © 2020
HSB Group, © 2018 ID "Interlocutor", © 2014 Image and Communication, ©
2018 Impremedia Operating Company LLC, © 2017 Independent.co.uk, ©
2018-2020 Infobae, © 2017 Information agency "Ukrainian National
News", © 2014 Informe21.com, © 2018 Innova and Comunica Media SL, ©
2014 InoSMI.ru, © 2017 Interfax-Ukraine, © 2017 iPress.ua, © 2020 IT
Plus, © 2018 Izvestia MIC, © 2017 Journal Media Ltd., © 2018
Journalistic Society El Ciudadano Ltda, © 2015-2018 JSC Business News
Media, © 2014-2015, 2017 JSC Kommersant, © 2014, 2017-2018 JSC
Gazeta.Ru, © 2018 JSC NTV Television Company, © 2019 JSC TRK Armed
Forces "ZVEZDA", © 2013, 2017 JSC TV and Radio Company Petersburg, ©
2014-2015, 2017 Korrespondent.net, © 2014-2019 Latin Post, © 2017 LLC
Business Newspaper "Vzglyad", © 2017 LLC RTVIA Production, © 2014 Los
Angeles Times, © 2018 Media Corporation of Extremadura SA, © 2015-2016
Meduza, © 2015-2016 MIA Russia Today, © 2018 Miami Herald, © 2018 Miami
New Times, LLC, © 2018 Microsoft, © 2018 MintPress News, © 2019
Natural News Network, © 2015, 2017 NBC Universal, © 2017 News24Today,
© 2019 NEWS.am, © 2018 NEWSONE.UA, © 2018-2019 Newspaper First
Edition, © 2017 News up to date, © 2016 Newsweek Digital LLC, © 2018
Nextstar Media Inc., © 2018-2020 Nezavisimaya Gazeta, © 2014 Nine
Digital Network, © 2020 NOTICIAS AL DIA Y A LA HORA, © 2019 Novaya
Gazeta, © 2017-2018 npr, © 2020 OAS, © 2020 Orlando Sentinel, © 2020
Our newspaper, © 2018 PJmedia.com/Salem Media, © 2018 Polit.ru, © 2017
PolitRussia, © 2013-2019 Pravda.Ru LLC, © 2015-2016 Present Time, ©
2013, 2018 Publishing House <Komsomolskaya Pravda> JSC, © 2019 Radio
Havana Cuba, © 2020 Radio Televisión Martí, © 2018 Radio Vesti, © 2018
Relrus.ru, © 2014-2015, 2017-2018 Reuters, © 2020 RFE/RL, © 2020 RFI,
© 2017, 2019 Russian information and analytical agency "SM News", ©
2020 Russian International Affairs Council, © 2018 Rutube, © 2014,
2017 ROSBUSINESSCONSULTING JSC, © 2020 SA THE NATION, © 2018 SA Week
Publications, © 2019 SIA "TVNET GRUPA", © 2018 SIA "TV Rain", © 2017
Sky UK, © 2020 South Mail Newspaper, © 2020 Spanish Radio and
Television Corporation, © 2013, 2017, 2019 Sputnik, © 2014 SVIT24.NET,
© 2017, 2019 TASS, Russian news agency, © 2017 Telegraph Media Group
Limited, © 2018 Television and Radio Company Lux, TV Channel 24, ©
2014, 2016 Television news service, © 2018 The American Conservative,
a publication of The American Ideas Institute, © 2017 The Associated
Press, © 2013, 2017 The Atlantic Monthly Group, © 2014 The Christian
Science Monitor, © 2019 The Cooperator, © 2017-2018 The Daily Beast
Company LLC, © 2020 The Daily Left, © 2019 The Dallas Morning News, ©
2014 The Economist Intelligence Unit Limited, © 2015, 2017 THE
IBEROSPHERE GAZETTE, © 2015 The Irish Times, © 2020 The Jordan News, ©
2020 The Lion of El Español Publications SA, © 2020 The New Journal, ©
2018 The New Republic, © 2014-2015 The New York Times Company, © 2020
The Press, © 2019 The Region Newspaper, © 2018-2019 The San Diego
Union-Tribune, © 2019 The Star of Panama, © 2017-2018 The Stimulus, ©
2018 The Venezuelan News, © 2013, 2017 The Washington Post, © 2020 The
Washington Times, LLC, © 2014, 2017 The World from PRX, © 2018
ThinkProgress, © 2017 TIME USA, LLC, © 2018 Titania Editorial Company
SL, © 2020 Tritón Comunicaciones S.A de C.V., © 2018, 2020 TRT World,
© 2020 Turkuvaz Haberleşme ve Yayıncılık, © 2017 TV Center JSC, © 2017
UA.NEWS, © 2014 uapress, © 2013-2019 UDF.BY, © 2013-2019 Ukrinform, ©
2014-2017 UNIAN.NET, © 2014, 2018 Unidad Editorial Informacion
General, © 2019 United Press International, Inc., © 2017-2018
Univision Communications Inc., © 2017 USA TODAY, a division of Gannett
Satellite Information Network, LLC, © 2017 Verizon Media, © 2014-2015,
2017 Vesti.Ru online edition, © 2019 VK LLC, © 2014, 2018 Vox Media,
LLC, © 2020 Workers World, © 2013 World and Politics, © 2014
worldnewsage.com, © 2017 www.charter97.org, © 2015 XINHUANET.com, ©
2018 Yahoo, © 2018-2020, 2023 Trustees of the University of
Pennsylvania


11.0 Contacts

Stephanie Strassel <strassel@ldc.upenn.edu> - AIDA PI