Corpus

README.md

training data, test data, test solutions, level4 context table

The DETESTS-Dis dataset consists of two text genres: comments on news articles (DETESTS) and posts on Twitter reacting to hoaxes (StereoHoax-ES) about the integration of immigrants.

DETESTS

The DETESTS dataset was released for the DETESTS competition at IberLEF 2022 (Ariza-Casabona et al., 2022). It is made up of two parts: one part from the NewsCom-TOX corpus (Taulé et al., 2021) and another part from the StereoCom corpus, which was specially created for DETESTS following the same methodology as NewsCom-TOX. Both corpora consist of comments published in response to different articles extracted from Spanish online newspapers (ABC, elDiario.es, El Mundo, NIUS, etc.) and discussion forums (such as Menéame). In the case of NewsCom-TOX, the comments were extracted from news articles published from August 2017 to August 2020, while in the case of StereoCom they date from June 2020 to November 2021.

NewsCom-TOX articles were manually selected considering their controversial subjects, their potential toxicity and the number of published comments (minimum 50 comments). A keyword-based approach was used to search for articles mainly related to racism and xenophobia. Since the NewsCom-TOX corpus was designed primarily to study toxicity and not stereotypes, we used only the part of the corpus with the highest percentage of stereotypes. In order to obtain a sufficient and balanced data volume in terms of the presence or absence of stereotypes, the StereoCom corpus was also collected with the same content (i.e., comments in response to immigration-related news items in Spanish digital media), selected by subject on the basis of a keyword search.

The comments were collected in the chronological order in which they appear in the web thread, keeping their position in the conversational thread. The author (pseudo-anonymized) and the timestamp of each comment were also retrieved. Each comment was segmented by punctuation into sentences; for every sentence, the comment it belongs to and its position within that comment are indicated.
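The segmentation step described above can be illustrated with a minimal sketch. The exact segmentation rules are not given here, so the regular expression (split after ".", "!", "?" or "…" followed by whitespace), the function name `segment_comment` and the `position` field are assumptions made for illustration only.

```python
import re

# ASSUMPTION: a sentence ends at ., !, ? or … followed by whitespace.
# The rules actually used to build the corpus may differ.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")

def segment_comment(comment_id, text):
    """Split one comment into sentences, keeping the comment id and a 1-based position."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    return [
        {"comment_id": comment_id, "position": i, "text": s}
        for i, s in enumerate(sentences, start=1)
    ]

for row in segment_comment("c1", "No es verdad. ¿Quién lo dice? ¡Nadie!"):
    print(row)
```

A rule this simple will split incorrectly on abbreviations and decimal numbers, so it should be read as a description of the output shape rather than of the original procedure.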

For each sentence, two features were annotated: the presence or absence of stereotypes, and the implicitness of the message, i.e., whether the stereotype is expressed explicitly or implicitly. The features “Stereotype” and “Implicitness” have binary values (0=absence of the feature and 1=presence of the feature). Each comment was annotated in parallel by three annotators, with a moderate inter-annotator agreement of 0.57 on the presence of stereotypes and of 0.41 on implicitness. For the aggregated form of this dataset, disagreements were discussed by the annotators and a senior annotator until agreement was reached. This dataset will also be released in its disaggregated form so that participants can run experiments that take the disagreement among annotators into account. The team of annotators involved in the task consisted of two expert linguists and two trained annotators, who were students of linguistics.
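The agreement figures above are reported without naming the coefficient. As an illustration only, and assuming a kappa-style measure with a fixed number of annotators per item, the sketch below computes Fleiss' kappa from per-annotator binary labels. The function name `fleiss_kappa` and the toy ratings are hypothetical, and the result is not expected to reproduce the reported 0.57 or 0.41.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for items that were each labeled by the same number of annotators.

    ratings: list of per-item label lists, e.g. [[1, 1, 0], [0, 0, 0]].
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the overall category proportions.
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one category was ever used
    return (observed - expected) / (1.0 - expected)

# Toy data: three annotators, four sentences (hypothetical values).
print(fleiss_kappa([[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]))
```

With the disaggregated columns, each inner list would hold the three individual annotations of one sentence.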

At present, the corpus consists of 3,306 sentences from NewsCom-TOX and 2,323 sentences from StereoCom, for a total of 5,629 annotated sentences. On average, 40% of the sentences contain a stereotype. This dataset will be increased with 1,100 additional sentences from new comments extracted from online news articles collected in 2023 in order to balance the test set for textual genre.

StereoHoax-ES

The StereoHoax-ES dataset contains tweets retrieved from Twitter in 2021 that react to hoaxes published online with the aim of disseminating false news against immigrants in Spain. The tweets were collected together with their conversational thread. From 449 conversational heads, we retrieved a total of 5,349 tweets.

This corpus was created within the framework of the STERHEOTYPES project (Bourgeade et al., 2023), which brings together international research units based in Italy, France, and Spain. The corpus used for the second edition of DETESTS is the Spanish part of the StereoHoax multilingual dataset.

The collection of these tweets started with the manual identification of 72 anti-immigrant hoaxes on debunking websites such as maldita.es and newtral.es. Using the titles, keywords, and contents of the hoaxes, we searched for them on Twitter through the Twitter API v2 (Academic Research access) and collected the conversations related to them. A conversational thread consists of a conversational head (the tweet starting the conversation), direct replies, and replies to replies.
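The thread structure just described (head, direct replies, replies to replies) can be sketched as a walk over reply links. The child-to-parent mapping `parent_of`, the function name `thread_role` and the tweet IDs are hypothetical; the released data may encode these relations differently.

```python
def thread_role(tweet_id, parent_of):
    """Return (role, head_id) for a tweet, given a child -> parent mapping.

    A head has no parent, a direct reply answers the head, and anything
    deeper is treated as a reply to a reply.
    """
    depth, current = 0, tweet_id
    seen = {current}
    while parent_of.get(current) is not None:
        current = parent_of[current]
        if current in seen:
            raise ValueError("cycle in reply links")
        seen.add(current)
        depth += 1
    role = ("head", "direct reply", "reply to reply")[min(depth, 2)]
    return role, current

# Hypothetical thread: t1 <- t2 <- t3
parent_of = {"t1": None, "t2": "t1", "t3": "t2"}
for tweet in parent_of:
    print(tweet, thread_role(tweet, parent_of))
```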

The annotation of these tweets focuses on identifying the presence of stereotypes in the tweet, taking into account the conversational context (the conversational head and, if they exist, the direct replies). The annotators also labeled whether the stereotype expressed in the text is implicit or explicit. The annotation process involved three annotators (two linguistics students trained for this task and a researcher), with substantial agreement on the presence of stereotypes (0.75) and slight agreement on implicitness (0.15). For the aggregated form of this dataset, the cases of disagreement were discussed; like the DETESTS dataset, this corpus will also be released in its disaggregated form.

Provided data

We will provide participants with 82% of the DETESTS-Dis dataset to train their models, while the remaining 18% will be used to test them. The training set will consist of the following columns:

  • source = {“detests”, “stereohoax”}
  • id = unique identifier
  • comment_id = comment identifier
  • text = sentence or tweet
  • level1 = previous sentence, refers to “id” (only if source=“detests”)
  • level2 = previous tweet or comment, refers to “comment_id”
  • level3 = first tweet or comment, refers to “comment_id”
  • level4 = news text or racial hoax, refers to “id” column in “level4.csv” table.
  • stereotype_a1 = individual annotation
  • stereotype_a2
  • stereotype_a3
  • stereotype = majority voting (hard label)
  • stereotype_soft = softmax normalization (soft label)
  • implicit_a1
  • implicit_a2
  • implicit_a3
  • implicit
  • implicit_soft
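As a minimal sketch of how the hard and soft labels relate to the three individual annotations: the hard label below is a majority vote, and the soft label applies a softmax to the per-class vote counts. Using raw vote counts as the softmax input is an assumption, since the column description only says “softmax normalization”; the released values may be computed differently. The function names and the example row are hypothetical.

```python
import math

def hard_label(votes):
    """Majority vote over binary annotations (assumes an odd number of annotators)."""
    return int(sum(votes) * 2 > len(votes))

def soft_label(votes):
    """Probability of label 1 from a softmax over the vote counts for 0 and 1.

    ASSUMPTION: the softmax input is the raw number of votes per class.
    """
    ones = sum(votes)
    zeros = len(votes) - ones
    e1, e0 = math.exp(ones), math.exp(zeros)
    return e1 / (e1 + e0)

# Hypothetical row using the column names listed above.
row = {"stereotype_a1": 1, "stereotype_a2": 1, "stereotype_a3": 0}
votes = [row[f"stereotype_a{i}"] for i in (1, 2, 3)]
print(hard_label(votes), round(soft_label(votes), 3))
```

The same two functions apply unchanged to the implicit_a1 to implicit_a3 columns.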

The test set will have: source, id, comment_id, text, level1, level2, level3, level4.

The level1 to level4 columns can be used freely to provide the models with different kinds of context.
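One possible way to turn the context columns into model input is sketched below: each reference is resolved to its text and the selected levels are prepended to the target text, broadest context first. Several details are assumptions: that “level4.csv” has a “text” column next to its “id” column, that missing context is stored as an empty value, and that the sentences of a comment appear in file order. The function name `build_inputs` and the separator string are hypothetical.

```python
from collections import defaultdict

def build_inputs(rows, level4_rows,
                 levels=("level1", "level2", "level3", "level4"),
                 sep=" [SEP] "):
    """Map each row id to its text prefixed by the requested context levels."""
    text_by_id = {r["id"]: r["text"] for r in rows}
    # A DETESTS comment spans several sentence rows; join them in file order.
    parts = defaultdict(list)
    for r in rows:
        parts[r["comment_id"]].append(r["text"])
    text_by_comment = {cid: " ".join(texts) for cid, texts in parts.items()}
    # ASSUMPTION: level4.csv exposes the news text or hoax in a "text" column.
    text_by_level4 = {r["id"]: r["text"] for r in level4_rows}
    lookup = {"level1": text_by_id, "level2": text_by_comment,
              "level3": text_by_comment, "level4": text_by_level4}

    inputs = {}
    for r in rows:
        context = []
        for level in sorted(levels, reverse=True):  # level4 first, level1 last
            ref = r.get(level)
            if ref in (None, ""):  # ASSUMPTION: empty value means no context
                continue
            text = lookup[level].get(ref)
            if text is not None:  # the referenced row may be absent
                context.append(text)
        inputs[r["id"]] = sep.join(context + [r["text"]])
    return inputs

# Hypothetical rows following the column layout described above.
rows = [
    {"source": "detests", "id": "s1", "comment_id": "c1", "text": "First sentence.",
     "level1": "", "level2": "", "level3": "", "level4": "n1"},
    {"source": "detests", "id": "s2", "comment_id": "c1", "text": "Second sentence.",
     "level1": "s1", "level2": "", "level3": "", "level4": "n1"},
]
level4_rows = [{"id": "n1", "text": "News body."}]
print(build_inputs(rows, level4_rows, levels=("level1", "level4"))["s2"])
```

Restricting `levels` makes it easy to compare models trained with different amounts of context.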

Given the restrictions posed by the EU GDPR and to avoid any conflict with the sources of the comments regarding their intellectual property rights (IPR), both training and test data are made available for academic purposes only. Participants will access the data with a password after filling in an online form published on the task website and accepting the task’s terms and conditions, including the commitment not to redistribute the dataset. User data is not disclosed: all data will be anonymized by removing personal information such as @user mentions and by generating new IDs for the texts coming from Twitter.

References

Ariza-Casabona, A., Schmeisser-Nieto, W. S., Nofre, M., Taulé, M., Amigó, E., Chulvi, B., & Rosso, P. (2022). Overview of DETESTS at IberLEF 2022: DETEction and classification of racial STereotypes in Spanish. Procesamiento del lenguaje natural, 69, 217-228.

Bourgeade, T., Cignarella, A. T., Frenda, S., Laurent, M., Schmeisser-Nieto, W., Benamara, F., Bosco, C., Moriceau, V., Patti, V., & Taulé, M. (2023). A Multilingual Dataset of Racial Stereotypes in Social Media Conversational Threads. In Findings of the Association for Computational Linguistics: EACL 2023 (pp. 674-684).

Taulé, M., Ariza, A., Nofre, M., Amigó, E., & Rosso, P. (2021). Overview of the DETOXIS Task at IberLEF-2021: DEtection of TOXicity in comments In Spanish. Procesamiento del Lenguaje Natural, 67, 209-221.