README FILE FOR THE FEB. 7, 2013 UPDATE FOR:

    "Spanish Gigaword 3rd Edition" (LDC2011T12)

 -- REPLACEMENT DATA FILES FROM THE XINHUA SPANISH COLLECTION --

The data that accompanies this README file is an update to the Spanish Gigaword corpus, 3rd Edition, LDC2011T12. (It could apply as well to the 2nd Edition, LDC2009T21, because it involves data that were first released in that edition.)

This update fixes a problem that affected 26 consecutive monthly data files from Xinhua Spanish: xin_spa_200601 through xin_spa_200802; i.e. all files from 2006 and 2007, plus the first two files from 2008.

The problem was that all letters with diacritic marks had been omitted in the text data for that portion of the collection. For example, the word "año" was presented as "ao" (minus the "n with tilde" character), "aspiración" appeared as "aspiracin", and similarly for all accented characters (UTF-8 letters outside the ASCII range).

The loss of accented letters was due to a processing error in the preparation of this data subset from the original archives provided by Xinhua. To address the problem, the original archival data were reprocessed with corrected procedures, and the "DOC" unit inventory of the previous edition was reassembled from this new output.

The 3rd Edition release contained a total of 186381 "DOC" units in the 26 affected files. This update replicates 186374 of them - i.e. only 7 "DOC" units from the 3rd Edition are omitted. Those DOCs were left out because they contained no Spanish text. The 'id' attributes of the omitted DOCs are listed below, along with the file names containing them.

For all the data replicated here, the 'id' attribute for each DOC unit matches the one that appeared for the same news story in the 3rd Edition release. In other words, DOC 'id' strings from that release still refer to the same stories in this update.

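As an illustration only, the following Python sketch shows one way a user might check whether a local copy of one of the 26 files still shows the symptom described above (no letters outside the ASCII range), and collect its DOC 'id' strings for comparison with the 3rd Edition. It is not part of any LDC tooling. The function names, the sample DOC id, the assumption that each DOC unit opens with a tag of the form <DOC id="..." type="...">, and the assumption that the data files are gzip-compressed UTF-8 text are all illustrative and should be checked against the actual data.

```python
import gzip
import re

# Assumption: each DOC unit opens with a tag like
#   <DOC id="XIN_SPA_20060214.0345" type="story">
# (optional whitespace before the closing ">" is tolerated).
DOC_OPEN = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>')


def summarize(text):
    """Return (doc_ids, non_ascii_count) for the contents of one data file.

    doc_ids is the list of DOC 'id' strings in file order; non_ascii_count
    is the number of characters outside the ASCII range. A count of zero
    in a large Spanish file suggests the accented letters are missing.
    """
    doc_ids = [match.group(1) for match in DOC_OPEN.finditer(text)]
    non_ascii_count = sum(1 for ch in text if ord(ch) > 127)
    return doc_ids, non_ascii_count


def read_data_file(path):
    """Read one data file, assuming gzip-compressed UTF-8 text."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()


# Hypothetical sample, not taken from the corpus.
sample = (
    '<DOC id="XIN_SPA_20060101.0001" type="story">\n'
    "<TEXT>\n"
    "<P>\n"
    "El año pasado la aspiración fue clara.\n"
    "</P>\n"
    "</TEXT>\n"
    "</DOC>\n"
)

ids, count = summarize(sample)
print(ids, count)
```

To apply the same check to a real file, something like summarize(read_data_file("xin_spa_200601.gz")) could be used, with the path adjusted to wherever the data are stored. Comparing the returned id list against the one from the 3rd Edition copy of the same file should show the same ids, apart from the omitted DOCs.
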
For about 1.6% of DOC units (about 3000), the difference between the previous release and this update may involve more than just the addition of missing accented characters: the new filtering procedure preserved or re-organized some content that had been removed or mis-tagged by the previous faulty processing. A handful of DOCs have been relabeled in their 'type' attribute, e.g. from "story" to "other" or "multi" (and as a result, are now presented without "<P>" tags in the "TEXT" portion of the story).

Here is the list of DOC 'id' strings that were present in the 3rd Edition release but are omitted from this update, along with the file names containing them:

    XIN_SPA_20060214.0345   xin_spa_200602
    XIN_SPA_20070131.0124   xin_spa_200701
    XIN_SPA_20070131.0133   xin_spa_200701
    XIN_SPA_20070131.0134   xin_spa_200701
    XIN_SPA_20070420.0259   xin_spa_200704
    XIN_SPA_20070605.0298   xin_spa_200706
    XIN_SPA_20080224.0079   xin_spa_200802

-----------
README file created by David Graff, Feb. 7, 2013