KIPoS@Evalita 2020

The KIPoS shared task on KIParla Part of Speech tagging will be organised within Evalita 2020, the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian, which will be held in Bologna (Italy) co-located with CLiC-it 2020 (November 30th – December 3rd 2020).

Introduction and Motivation

A large part of current research on spoken Italian focuses on the description and analysis of varieties, an area of interest that has grown considerably as the availability of oral corpora has increased (see e.g. Albano Leoni 2013; Crocco 2015; Pistolesi 2016). The KIParla corpus (Mauri et al. 2019) introduces major improvements over previous resources for spoken Italian, including a wide range of metadata (speakers' socio-geographic profile, interaction settings), but the lack of PoS tagging and lemmatization currently limits its application.
Following the experience of PoSTWITA (PoS tagging for Italian Social Media Texts - Evalita 2016) (Bosco et al. 2016), KIPoS offers the opportunity to address the theoretical and methodological challenges related to PoS tagging of KIParla. Carrying out this task means processing spontaneous speech data (as opposed to experimental speech data) and dealing with a great amount of sociolinguistic variation, understood as the alternation between forms with social significance.
The most challenging aspects to be addressed in the unconstrained spoken language of KIParla are:

  • To identify mode-specific phenomena, such as repetitions, reformulations, fillers, incomplete syntactic structures, etc.
  • Considering that the KIParla corpus includes non-standard features at the morphological and syntactic level, to trace a relevant set of non-standard alternatives back to the same linguistic phenomenon (e.g. annà, andà, andare for "to go"), either assigning them to the correct part-of-speech, or working out an ad-hoc solution
  • To deal with different types of interaction (casual conversations, interviews, office hours, etc.) with a variable number of participants (1 to 5), each transcribed on a separate line and corresponding to an autonomous text string.

Target Audience

The task is open to everyone from industry and academia, and we encourage the participation of researchers, industrial teams, and students.

Task description

Given the highly innovative features of KIParla, the task we propose for EVALITA 2020 is the adaptation of a PoS tagger so that it can represent and give access to the specific features of the corpus. We provide data for training (Development Set, henceforth DS) and for testing (Test Set, henceforth TS), organized in two sets that represent the formal register (DS-formal and TS-formal) and the informal register (DS-informal and TS-informal) respectively. This organization allows for the following tracks:

  • Main task - general: training on all given data (both DS-formal and DS-informal) and testing on all test set data (both TS-formal and TS-informal)
  • Subtask A - crossFormal: training on data from DS-formal only and testing separately on data from formal register (TS-formal) and from informal register (TS-informal)
  • Subtask B - crossInformal: training on data from DS-informal only and testing separately on data from formal register (TS-formal) and from informal register (TS-informal)
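
The three tracks can be summarized as a small lookup table. The sketch below is only an illustration of the track definitions above: the names TRACKS and describe and the dictionary layout are assumptions made for this example and are not part of the official task material.

```python
# Illustrative summary of the three tracks. The structure and the names
# TRACKS and describe are hypothetical; only the set labels
# (DS-formal, DS-informal, TS-formal, TS-informal) come from the task description.
TRACKS = {
    "general": {
        "train": ("DS-formal", "DS-informal"),
        # one evaluation on all test data together
        "evaluations": (("TS-formal", "TS-informal"),),
    },
    "crossFormal": {
        "train": ("DS-formal",),
        # two separate evaluations, one per register
        "evaluations": (("TS-formal",), ("TS-informal",)),
    },
    "crossInformal": {
        "train": ("DS-informal",),
        "evaluations": (("TS-formal",), ("TS-informal",)),
    },
}


def describe(track: str) -> str:
    """Return a one-line description of what a track trains and tests on."""
    spec = TRACKS[track]
    train = " + ".join(spec["train"])
    runs = "; ".join(" + ".join(test_sets) for test_sets in spec["evaluations"])
    return track + ": train on " + train + ", test on " + runs


for name in TRACKS:
    print(describe(name))
```

Each entry in "evaluations" stands for one test run, so the main task has a single run over both test sets, while each subtask has two runs, one per register.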

Data

The whole dataset consists of approximately 200K tokens, with an equal proportion of informal and formal speech. For training participant systems, approximately 30K manually annotated tokens are provided as a gold standard (composed of DS-formal and DS-informal), together with further data annotated only by an automatic procedure as a silver standard. A similar proportion of formal and informal data will be released for testing participant systems (TS-formal and TS-informal).
Participants are allowed to use other resources, both for training and for improving final performance, as long as their results use the tagset provided for KIPoS and comply with the format described in the guidelines. See Data for further details.

Evaluation

Each participating team will initially have access to the training data only (i.e. DS-formal and DS-informal). Later, the unlabelled test data (TS-formal and TS-informal) will be released (see the timeframe below). After the assessment, the labelled test data (the so-called gold standard for TS-formal and TS-informal) will be released as well, together with the evaluation script and the score of each participant.

The evaluation follows a black-box approach: only the system's output is evaluated. The evaluation metric is based on a token-by-token comparison, and only a single tag is allowed for each token. The metric considered is tagging accuracy, defined as the number of correct PoS tag assignments divided by the total number of tokens in the Test Set. The submissions will be ranked by F1-score (precision, recall and F-measure).
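
The definition of tagging accuracy given above can be sketched in a few lines of Python. This is not the official evaluation script, which the organizers release after the assessment. The function name tagging_accuracy and the example tags are assumptions made for illustration, and the sketch presumes that gold and system tags are already aligned token by token.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Number of correct tag assignments divided by the total number of tokens.

    Hypothetical helper, not the official scorer. Both sequences are assumed
    to hold exactly one tag per token, in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Illustrative tags only; the actual tagset is defined in the task guidelines.
gold_tags = ["PRON", "VERB", "DET", "NOUN"]
system_tags = ["PRON", "VERB", "DET", "ADJ"]
print(tagging_accuracy(gold_tags, system_tags))  # 3 correct out of 4 -> 0.75
```

Because exactly one tag is compared per token, a mismatch in sequence length is treated as an error rather than silently truncated.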

How to participate

Register your team by using the Evalita2020 registration web form.

Read all the information about data format, submission and evaluation available in the KIPOS2020 data repository. Download the training data (DS-formal and DS-informal) from this repository and use it to train and tune your system. Test your system on the test data (TS-formal and TS-informal) once it is made available in the same repository. You will also be required to provide a technical report for publication in the proceedings of the contest, including a brief description of your approach, an illustration of the experiments, techniques and resources used, and an analysis of the results achieved.

Subscribe to our mailing list to keep up to date with the latest news related to the task. Please share comments and questions with the mailing list. The organizers will assist you with any issues that arise.

Important dates

  • on-line registration is open
  • 29th May 2020: development data (DS-formal and DS-informal) are available (see data repository for download)
  • 4th September 2020: registration closes
  • 25th September 2020: test data (TS-formal and TS-informal) available (see data repository for download)
  • 2nd October 2020: system results due to the organizers
  • 16th October 2020: assessment returned to participants
  • 6th November 2020: technical reports due to organizers (camera-ready)
  • 2nd-3rd December 2020: final workshop online

References

Albano Leoni, F. (2013), Il parlato e la comunicazione parlata, in G. Iannàccaro (a cura di), La linguistica italiana all’alba del terzo millennio (1997-2010), Roma, Bulzoni, pp. 129-148.

Bosco, C., F. Tamburini, A. Bolioli, A. Mazzei (2016), Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task, in P. Basile, F. Cutugno, M. Nissim, V. Patti, R. Sprugnoli (eds.), Proceedings of the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop EVALITA 2016.

Crocco, C. (2015), Corpora e testi di italiano contemporaneo, in M. Iliescu, E. Roegiest (eds.), Manuel des anthologies, corpus et textes romans, Berlin-New York, De Gruyter, pp. 509-534.

Mauri, C., S. Ballarè, E. Goria, M. Cerruti, F. Suriano (2019), KIParla Corpus: A New Resource for Spoken Italian, in R. Bernardi, R. Navigli, G. Semeraro (eds.), CLiC-it 2019 – Italian Conference on Computational Linguistics. Proceedings of the Sixth Italian Conference on Computational Linguistics.

Pistolesi, E. (2016), Aspetti diamesici, in S. Lubello (a cura di), Manuale di linguistica italiana, Berlin-New York, De Gruyter, pp. 442-458.

Organizers

Eugenio Goria – Dipartimento di Studi Umanistici, Università degli Studi di Torino

Massimo Cerruti – Dipartimento di Studi Umanistici, Università degli Studi di Torino

Silvia Ballarè – Dipartimento di Filologia Classica e Italianistica "Alma Mater Studiorum", Università di Bologna

Caterina Mauri – Dipartimento di Lingue, Letterature e Culture Moderne dell’Università di Bologna

Cristina Bosco – Dipartimento di Informatica, Università degli Studi di Torino