========================================================
1) What is it?
========================================================
Item Name: AttImam: Arabic Attribution
Language(s): Arabic, Standard Arabic
Data Type: Text, Tool
Data Size (KB): 2130 KB (2,235,037 bytes)
Format: Plain Text
Size (words, tokens, etc): 2,334 attribution relations
Character Encoding: UTF-8
Script (ISO 15924 [1]): Arab
Data Source(s): newswire
Application(s): language identification, entity extraction, discourse analysis

Description

AttImam is a corpus of Arabic text files containing annotations of attribution relations for 532 newswire texts from Agence France Presse (AFP). These are exactly the texts used in the Arabic Treebank (ATB) Part 1 - V4.1, a large-scale corpus annotated for morphology and syntax. AttImam is therefore an additional discourse layer of 2,334 attribution relations, released as plain text in UTF-8. Each relation is defined by four constituent classes: the cue, the source, the content, and the general features of the attribution.

Developing the ESNAD tool (Extracting Sentence Attribution in Arabic Discourse) was one of the objectives of this work. ESNAD was developed to provide a precise way of producing a high-quality corpus. The tool offers annotation features and can also compute a range of inter-annotator agreement measures (e.g., exact match, accuracy, F-score, and Kappa, plus Agr for text attributes).

The annotation process and the inter-annotator agreement are described in docs/AttImam-A_Corpus_of_Arabic_Attribution.pdf.

Data Directory Structure

1. data/ directory:
------------
It consists of 532 .att files. Each file is named after the corresponding AFP newswire article, with the .att extension, and consists of attribution relations. Each relation is defined by four elements:
- The cue: the lexical anchor that connects the source with the content.
- The source: the entity or agent that owns the content.
- The content: the basic element expressing the claim or the reported news.
- The general features: four features of the attribution, including the attribution style (direct or indirect), the determinacy (whether the attribution relation is factual or non-factual), and the attribution purpose, which signifies the nature of the relation between an agent and the cue (for example, assertion or expression).

For each Arabic text span, the corresponding Buckwalter (bw) transliteration is provided along with its index in the news article. Supplementary information is also provided, where appropriate, for the cue, source, and content elements.

Note: A file that contains only the file name title indicates that the annotated article has no attribution relations.

2. docs/ directory:
------------
- Annotation_Guidelines_of_Attribution_Relation_and_Implicit_Relations.docx
  - Describes the proposed scheme for annotating attribution relations in Arabic and how to decide in different situations. It also explains in detail how to use the ESNAD (Extracting Sentence Attribution in Arabic Discourse) tool.
- Annotating_Attribution_Relations_in_Arabic.pdf
  - Published paper on a pilot annotation of attribution relations.
- All_Attributions.xlsx
  - A single table combining all annotated data.
- AttImam-A_Corpus_of_Arabic_Attribution.pdf
  - Paper under publication presenting the entire work: the attribution schema, the annotation process, and the inter-annotator agreement results and discussion.
- list_of_cues.txt
  - List of the distinct cues that indicate attribution.

3. tools/ directory:
------------
- Annotation_Tool_v2.jar
  - A Java-based tool named ESNAD, which was used to annotate the attribution relations in AttImam and to conduct the inter-annotator agreement study.
- verbs.txt
  - A list of suggested cues that the tool highlights to facilitate the annotation process. This file is empty; it can be edited directly to add cues as appropriate.

========================================================
2) Who can use it?
========================================================
Areas of use include, but are not limited to:
- Authorship identification.
- Opinion extraction.
- Identification of truthful and reliable information.
- Discourse analysis.

========================================================
3) How can it be used?
========================================================
AttImam consists of plain text in UTF-8, so it is easy to open and process for the required purpose.

To use the ESNAD tool, double-click tools/Annotation_Tool_v2.jar, then:
- To prepare a "raw" file, click the File menu and choose Open File to open an existing file, or choose Create File to create a new raw file.
- After the desired file is opened, every verb from the list of possible verbs is highlighted in pink in the text, and all verbs from the list are displayed in the Suggested Cue list.
- When the annotator clicks a proposed verb in the list, the highlight of that verb changes from pink to red. The annotation begins after the annotator has read and understood the entire text. For each proposed verb, the annotator decides by answering this question: Does the verb represent a cue in the text file?
   • If yes, click the arrow button to move the verb to the Attribution in text list and complete the annotation of the remaining elements. Then click Save Annotation.
   • If no, click the arrow button to move the verb to the list of non-cue verbs.
- At the end of the annotation process, the list of proposed verbs must be empty. The annotator then saves the entire annotation by clicking Save Annotation. The annotator can modify any of the annotated text and re-save the whole annotation.
- To generate a gold standard resource after annotating attribution relations, apply the agreement study function supplied by ESNAD. It takes the annotated files from the first and second annotators, checks that the two sets of files match in number and in names, and then computes the agreement using five measures for labeled elements (observed agreement, precision, recall, F-score, and Kappa) and the Agr measure for text span elements. The results are saved in tools/Att_agreement_measures.txt.

========================================================
4) Acknowledgment
========================================================
The project is partially funded by the Research Deanship at Al-Imam Mohammad Ibn Saud Islamic University and by KACST in Saudi Arabia.

========================================================
5) Contact
========================================================
Questions or feedback?
Email: abeer.q.cs@gmail.com
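
Note on the agreement measures in section 3: for readers who want to check agreement figures outside ESNAD, the following is a minimal sketch of observed agreement and Kappa for one labeled element, such as the attribution style. It assumes that the Kappa in question is Cohen's kappa for two annotators, which this README does not state. The function names and the sample labels are hypothetical and are not taken from the corpus or from the ESNAD source code.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same number of items")
    if not labels_a:
        raise ValueError("at least one labeled item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    p_o = observed_agreement(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labels independently according to their own label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical attribution-style labels from two annotators (not corpus data).
annotator_1 = ["direct", "indirect", "direct", "direct", "indirect", "direct"]
annotator_2 = ["direct", "indirect", "indirect", "direct", "indirect", "direct"]

print(f"observed agreement: {observed_agreement(annotator_1, annotator_2):.3f}")
print(f"kappa: {cohen_kappa(annotator_1, annotator_2):.3f}")
```

The sketch covers labeled elements only. The Agr measure for text spans depends on how span overlap is defined, so it is left to the description in docs/AttImam-A_Corpus_of_Arabic_Attribution.pdf.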