Subtask 1 will be evaluated as in the Le.Wi.Di shared tasks at SemEval 2021 and 2023 about learning with disagreement (Uma et al. 2021 and Leonardelli et al. 2023). First, the models that output hard labels will be compared to the gold standard by using the standard classification metric F1. The second evaluation metric will be the cross-entropy between the system soft label values and the soft labels generated from the annotators average votes.

Subtask 2 is a binary hierarchical classification problem. We will use the ICM metric (Amigó and Delgado, 2022), an information theoretic based metric which considers both the hierarchical structure and the class specificity. It is applicable to both hard and soft labels. The ICM metric will be the official metric considered for the ranking for both hard labels (ICM) and soft labels (ICM Soft).

The implementation of the official metrics (F1, cross-entropy and ICM) will be based on PyEvALL. An example of the evaluation can be found in Examples.ipynb.

For both subtasks 1 and 2, we will use non-informative baselines that classify all instances as either the majority or the minority classes (“BASELINE_all_ones” and “BASELINE_all_zeros”). Furthermore, we also use a random classifier “BASELINE_random_classifier”, a Term Frequency-Inverse Document Frequency (TFIDF) with Support Vector Classifier (SVC) “BASELINE_tfidf_svc”, Fast Text embedding with SVC “BASELINE_fast_text_svc”, and a fine-tuned BERT model (Cañete et al. 2020) “BASELINE_beto”. The baselines and code to produce them can be found at clic-ub/DETESTS-Dis

Results

Task 1 with Hard Labels

Rank	Run	F1
0	BASELINE_gold_standard	1.000
1	Brigada Lenguaje_1	0.724
2	I2C-Huelva_1	0.712
3	I2C-Huelva_2	0.701
4	EUA_2	0.691
5	EUA_3	0.685
6	BASELINE_beto	0.663
7	EUA_1	0.653
8	UC3M-SAS_2	0.641
9	TaiDepZai999_UIT_AIC_1	0.630
10	TaiDepZai999_UIT_AIC_3	0.624
11	TaiDepZai999_UIT_AIC_2	0.608
12	UC3M-SAS_1	0.594
13	BASELINE_all_ones	0.589
14	VINE Bias Busters_1	0.581
15	VINE Bias Busters_2	0.552
16	VINE Bias Busters_3	0.545
17	I2C-Huelva_3	0.375
18	BASELINE_tfidf_svc	0.297
19	BASELINE_random_classifier	0.297
20	BASELINE_fast_text_svc	0.297
21	BASELINE_all_zeros	0.000

Task 1 with Soft Labels

Rank	Run	Cross Entropy
0	BASELINE_gold_standard	0.255
1	UC3M-SAS_1	0.841
2	EUA_2	0.850
3	UC3M-SAS_2	0.865
4	BASELINE_beto	0.893
5	Brigada Lenguaje_1	0.938
6	Brigada Lenguaje_2	0.979
7	EUA_3	1.081
8	EUA_1	1.409

Task 2 with Hard Labels

Rank	Run	ICM	ICM Norm
0	BASELINE_gold_standard	1.380	1.000
1	BASELINE_beto	0.126	0.546
2	EUA_2	0.065	0.524
3	EUA_3	0.061	0.522
4	EUA_1	0.045	0.516
5	Brigada Lenguaje_1	-0.240	0.413
6	BASELINE_tfidf_svc	-0.275	0.400
7	I2C-Huelva_1	-0.328	0.381
8	BASELINE_fast_text_svc	-0.412	0.351
9	BASELINE_all_zeros	-0.797	0.211
10	BASELINE_random_classifier	-1.056	0.117
11	BASELINE_all_ones	-1.210	0.061
12	I2C-Huelva_3	-1.263	0.042
13	I2C-Huelva_2	-1.403	0.000
14	UC3M-SAS_2	-2.103	0.000

Task 2 with Soft Labels

Rank	Run	ICM Soft	ICM Soft Norm
0	BASELINE_gold_standard	4.651	1.000
1	EUA_3	-0.900	0.403
2	EUA_1	-0.917	0.401
3	EUA_2	-0.969	0.396
4	BASELINE_beto	-1.124	0.379
5	UC3M-SAS_2	-1.250	0.366
6	Brigada Lenguaje_1	-1.684	0.319

References

Amigó, E., & Delgado, A. (2022, May). Evaluating extreme hierarchical multi-label classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5809-5819).

Cañete, J., Chaperon, G., Fuentes, R., Ho, J. H., Kang, H., & Pérez, J. (2020). Spanish pre-trained bert model and evaluation data. In PML4DC at ICLR 2020.

Leonardelli, E., Uma, A., Abercrombie, G., Almanea, D., Basile, V., Fornaciari, T., Plank, B., Rieser, V., Uma, A., & Poesio, M. (2023). SemEval-2023 Task 11: Learning With Disagreements (LeWiDi). In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pages 2304–2318. Association for Computational Linguistics.

Uma, A., Fornaciari, T., Dumitrache, A., Miller, T., Chamberlain, J., Plank, B., Simpson, E. & Poesio, M. (2021). ‘SemEval-2021 Task 12: Learning with Disagreements’. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021) (pp. 338-347). Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2021.semeval-1.41