Description: |
We present DiscoGeM, a crowdsourced corpus of 6,505 implicit discourse relations from three genres: political speech, literature, and encyclopedic texts. Each instance was annotated by 10 crowd workers. Various label aggregation methods were explored to evaluate how to obtain a label that best captures the meaning inferred by the crowd annotators. The results show that a significant proportion of discourse relations in DiscoGeM are ambiguous and can express multiple relation senses. Probability distribution labels better capture these interpretations than single labels. Further, the results emphasize that text genre crucially affects the distribution of discourse relations, suggesting that genre should be included as a factor in automatic relation classification. We make available the newly created DiscoGeM corpus, as well as the dataset with all annotator-level labels. Both the corpus and the dataset can facilitate a multitude of applications and research purposes, for example serving as training data to improve the performance of automatic discourse relation parsers and supporting research into non-connective signals of discourse relations. |