Evaluation & Results
Subtask 1 will be evaluated following the LeWiDi shared tasks on learning with disagreements at SemEval 2021 and 2023 (Uma et al., 2021; Leonardelli et al., 2023). First, models that output hard labels will be compared against the gold standard using the standard classification metric F1. Second, models that output soft labels will be evaluated with the cross-entropy between the system's soft label values and the soft labels derived from the annotators' average votes.
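As a minimal illustration of the soft-label evaluation, the sketch below derives a soft gold label from raw annotator votes and scores a system's soft label with cross-entropy; the votes, class names, and probabilities are hypothetical placeholders, not task data.

```python
import math

def soft_label(votes):
    # Soft gold label as the proportion of annotators voting each class,
    # e.g. votes [1, 1, 0] -> {"0": 1/3, "1": 2/3}.
    p = sum(votes) / len(votes)
    return {"0": 1 - p, "1": p}

def cross_entropy(gold_soft, pred_soft, eps=1e-12):
    # H(gold, pred) = -sum_c gold(c) * log(pred(c)); lower is better.
    return -sum(gold_soft[c] * math.log(pred_soft[c] + eps) for c in gold_soft)

gold = soft_label([1, 1, 0])   # two of three annotators saw a stereotype
pred = {"0": 0.30, "1": 0.70}  # hypothetical system soft label
print(round(cross_entropy(gold, pred), 3))  # ~0.639
```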
Subtask 2 is a binary hierarchical classification problem. We will use the ICM metric (Amigó and Delgado, 2022), an information-theoretic metric that takes into account both the hierarchical structure and the specificity of each class, and that is applicable to both hard and soft labels. ICM will be the official ranking metric for both hard labels (ICM) and soft labels (ICM Soft).
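To give an intuition of how ICM rewards specific, correct decisions and penalizes errors across the hierarchy, the sketch below scores a single item following the formula from the paper, ICM(s, g) = 2·IC(s) + 2·IC(g) - 3·IC({s, g}) with IC(c) = -log2 P(c), using the paper's default weights. The two-level hierarchy and the gold-standard class priors are hypothetical placeholders, not the task's actual statistics.

```python
import math

# Per-item ICM for single-label hierarchical classification (Amigó & Delgado, 2022):
#   ICM(s, g) = 2*IC(s) + 2*IC(g) - 3*IC({s, g}),
#   IC({s, g}) = IC(s) + IC(g) - IC(lca(s, g)),  IC(c) = -log2 P(c).
# The hierarchy and the class priors below are hypothetical.
PARENT = {"root": None, "no-stereotype": "root", "stereotype": "root",
          "implicit": "stereotype", "explicit": "stereotype"}
P = {"root": 1.0, "no-stereotype": 0.60, "stereotype": 0.40,
     "implicit": 0.25, "explicit": 0.15}

def ic(c):
    return -math.log2(P[c])

def ancestors(c):
    chain = set()
    while c is not None:
        chain.add(c)
        c = PARENT[c]
    return chain

def lca(a, b):
    # Deepest common ancestor: the shared node with the highest information content.
    return max(ancestors(a) & ancestors(b), key=ic)

def icm(system, gold):
    pair_ic = ic(system) + ic(gold) - ic(lca(system, gold))
    return 2 * ic(system) + 2 * ic(gold) - 3 * pair_ic

print(icm("implicit", "implicit"))       # exact match: positive reward (2.0)
print(icm("explicit", "implicit"))       # sibling confusion: mild penalty (~ -0.77)
print(icm("no-stereotype", "implicit"))  # error at the top split: larger penalty (~ -2.74)
```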
The implementation of the official metrics (F1, cross-entropy, and ICM) will be based on PyEvALL. An example of the evaluation can be found in Examples.ipynb.
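For orientation, a programmatic call to PyEvALL might look like the sketch below; the import path, method signature, and metric identifiers are assumptions on our part, so Examples.ipynb remains the authoritative reference.

```python
# Hedged sketch of a PyEvALL evaluation call; names below are assumptions,
# see Examples.ipynb for the authoritative usage.
from pyevall.evaluation import PyEvALLEvaluation

predictions = "predictions.json"          # hypothetical file paths
gold = "gold.json"
metrics = ["FMeasure", "ICM", "ICMNorm"]  # metric identifiers may differ

evaluator = PyEvALLEvaluation()
report = evaluator.evaluate(predictions, gold, metrics)
report.print_report()
```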
For both subtasks 1 and 2, we will use non-informative baselines that classify all instances as either the majority or the minority class (“BASELINE_all_ones” and “BASELINE_all_zeros”). Furthermore, we also provide a random classifier (“BASELINE_random_classifier”), a Term Frequency–Inverse Document Frequency (TF-IDF) representation with a Support Vector Classifier (SVC) (“BASELINE_tfidf_svc”), fastText embeddings with an SVC (“BASELINE_fast_text_svc”), and a fine-tuned Spanish BERT (BETO) model (Cañete et al., 2020) (“BASELINE_beto”). The baselines and the code to reproduce them can be found at clic-ub/DETESTS-Dis.
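For reference, a pipeline in the spirit of “BASELINE_tfidf_svc” can be re-created in a few lines with scikit-learn; the training data below are placeholders, and the official implementation lives in the repository above.

```python
# Minimal TF-IDF + SVC pipeline in the spirit of BASELINE_tfidf_svc;
# placeholder data, see clic-ub/DETESTS-Dis for the official baselines.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["primer comentario de ejemplo", "segundo comentario de ejemplo"]
labels = [1, 0]  # 1 = stereotype, 0 = no stereotype

baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
baseline.fit(texts, labels)
print(baseline.predict(["un comentario nuevo"]))
```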
Results
Subtask 1 with Hard Labels
Rank | Run | F1
---|---|---
0 | BASELINE_gold_standard | 1.000
1 | Brigada Lenguaje_1 | 0.724
2 | I2C-Huelva_1 | 0.712
3 | I2C-Huelva_2 | 0.701
4 | EUA_2 | 0.691
5 | EUA_3 | 0.685
6 | BASELINE_beto | 0.663
7 | EUA_1 | 0.653
8 | UC3M-SAS_2 | 0.641
9 | TaiDepZai999_UIT_AIC_1 | 0.630
10 | TaiDepZai999_UIT_AIC_3 | 0.624
11 | TaiDepZai999_UIT_AIC_2 | 0.608
12 | UC3M-SAS_1 | 0.594
13 | BASELINE_all_ones | 0.589
14 | VINE Bias Busters_1 | 0.581
15 | VINE Bias Busters_2 | 0.552
16 | VINE Bias Busters_3 | 0.545
17 | I2C-Huelva_3 | 0.375
18 | BASELINE_tfidf_svc | 0.297
19 | BASELINE_random_classifier | 0.297
20 | BASELINE_fast_text_svc | 0.297
21 | BASELINE_all_zeros | 0.000
Subtask 1 with Soft Labels
Rank | Run | Cross-Entropy
---|---|---
0 | BASELINE_gold_standard | 0.255
1 | UC3M-SAS_1 | 0.841
2 | EUA_2 | 0.850
3 | UC3M-SAS_2 | 0.865
4 | BASELINE_beto | 0.893
5 | Brigada Lenguaje_1 | 0.938
6 | Brigada Lenguaje_2 | 0.979
7 | EUA_3 | 1.081
8 | EUA_1 | 1.409
Subtask 2 with Hard Labels
Rank | Run | ICM | ICM Norm
---|---|---|---
0 | BASELINE_gold_standard | 1.380 | 1.000
1 | BASELINE_beto | 0.126 | 0.546
2 | EUA_2 | 0.065 | 0.524
3 | EUA_3 | 0.061 | 0.522
4 | EUA_1 | 0.045 | 0.516
5 | Brigada Lenguaje_1 | -0.240 | 0.413
6 | BASELINE_tfidf_svc | -0.275 | 0.400
7 | I2C-Huelva_1 | -0.328 | 0.381
8 | BASELINE_fast_text_svc | -0.412 | 0.351
9 | BASELINE_all_zeros | -0.797 | 0.211
10 | BASELINE_random_classifier | -1.056 | 0.117
11 | BASELINE_all_ones | -1.210 | 0.061
12 | I2C-Huelva_3 | -1.263 | 0.042
13 | I2C-Huelva_2 | -1.403 | 0.000
14 | UC3M-SAS_2 | -2.103 | 0.000
Subtask 2 with Soft Labels
Rank | Run | ICM Soft | ICM Soft Norm
---|---|---|---
0 | BASELINE_gold_standard | 4.651 | 1.000
1 | EUA_3 | -0.900 | 0.403
2 | EUA_1 | -0.917 | 0.401
3 | EUA_2 | -0.969 | 0.396
4 | BASELINE_beto | -1.124 | 0.379
5 | UC3M-SAS_2 | -1.250 | 0.366
6 | Brigada Lenguaje_1 | -1.684 | 0.319
References
Amigó, E., & Delgado, A. (2022). Evaluating extreme hierarchical multi-label classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5809–5819). Association for Computational Linguistics.
Cañete, J., Chaperon, G., Fuentes, R., Ho, J. H., Kang, H., & Pérez, J. (2020). Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.
Leonardelli, E., Abercrombie, G., Almanea, D., Basile, V., Fornaciari, T., Plank, B., Rieser, V., Uma, A., & Poesio, M. (2023). SemEval-2023 Task 11: Learning with Disagreements (LeWiDi). In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023) (pp. 2304–2318). Association for Computational Linguistics.
Uma, A., Fornaciari, T., Dumitrache, A., Miller, T., Chamberlain, J., Plank, B., Simpson, E., & Poesio, M. (2021). SemEval-2021 Task 12: Learning with Disagreements. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021) (pp. 338–347). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.semeval-1.41