KIPoS@Evalita 2020

The datasets provided for the KIPOS task are extracted from the KIParla corpus, a resource for the study of spoken Italian composed of transcribed conversations. All speakers were informed of the aims of the project, agreed to the recording, and signed a consent form.

The whole KIPOS dataset consists of approximately 200K tokens, with equal proportions of informal and formal speech. For training participant systems, approximately 30K manually annotated tokens are provided as a gold standard (i.e. DS-formal and DS-informal). Further data are provided as a silver standard, annotated only automatically using UDPipe and related tools trained on all the treebanks available for Italian in the Universal Dependencies repository.
For testing and evaluating the performance of participating systems, two smaller datasets (TS-formal and TS-informal), again with equal proportions of informal and formal speech, will be released at the scheduled time.

For the purpose of the task, the original orthographic transcriptions are provided in a tab-delimited txt format with three main identifiers, indicating respectively the conversation (alphanumeric), the speaker (alphanumeric) and the position of the turn (numeric) within the conversation. See the example below:

# conversation = BOD2018
# speaker = 1_MP_BO118
# turn = 1
# text = dovresti parlarmi della tua casa
1 dovresti AUX
2-3 parlarmi VERB_PRON
2 parlar VERB
3 mi PRON
4-5 della ADP_A
4 di ADP
5 la DET
6 tua DET
7 casa NOUN

# conversation = BOD2018
# speaker = 2_MP_BO118
# turn = 2
# text = attuale
1 attuale ADJ

# conversation = BOD2018
# speaker = 3_AM_BO140
# turn = 3
# text = mh sì
1 mh PARA
2 sì INTJ

The format and the part-of-speech labels are compliant with those used in the Universal Dependencies Italian treebanks. Data are released in a CoNLL-like format that includes only the first three columns, separated by tabs.
For the purpose of the evaluation, the test set data (TS-formal and TS-informal) will include only a single line for each word: each multi-token line, together with the lines that split it, will be replaced by one line hosting the unsplit token. This makes the format of the test set slightly different from that of the development data (which, for each multi-token word, includes both the multi-token line and the lines in which it is split into its component tokens), but more compliant with the evaluation scripts and procedures. An example of this format follows:
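As a rough illustration of the format described above, the following Python sketch reads KIPOS-style text into a list of turns, each with its metadata and its token lines. The function name `parse_kipos` and the dictionary layout are assumptions made for this example and are not part of the official task material.

```python
from typing import Dict, List


def parse_kipos(text: str) -> List[Dict]:
    """Parse KIPOS-style text into turns (hypothetical helper, not official)."""
    turns: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current turn.
            if current is not None:
                turns.append(current)
                current = None
            continue
        if current is None:
            current = {"meta": {}, "tokens": []}
        if line.startswith("#"):
            # Metadata lines look like "# key = value".
            key, _, value = line[1:].partition("=")
            current["meta"][key.strip()] = value.strip()
        else:
            # Columns are tab-separated in the released files; falling back
            # to any whitespace tolerates copies where tabs became spaces.
            fields = line.split("\t") if "\t" in line else line.split()
            idx, form, tag = fields[:3]
            current["tokens"].append((idx, form, tag))
    if current is not None:
        turns.append(current)
    return turns
```

Each turn keeps the token identifier as a string, so that multi-token ranges such as `2-3` are preserved exactly as they appear in the file.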

# conversation = BOD2018
# speaker = 1_MP_BO118
# turn = 1
# text = dovresti parlarmi della tua casa
1 dovresti AUX
2 parlarmi VERB_PRON
3 della ADP_A
4 tua DET
5 casa NOUN
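The relationship between the development format and the test format can be sketched in Python as follows. This is only a minimal illustration under the assumption that a multi-token line carries a range identifier such as `2-3` and is followed by the lines for its component tokens; the function name `to_test_format` is hypothetical and does not come from the official scripts.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (identifier, form, tag)


def to_test_format(tokens: List[Token]) -> List[Token]:
    """Keep unsplit words only and renumber them from 1 (illustrative sketch)."""
    kept = []
    skip = set()
    for idx, form, tag in tokens:
        if "-" in idx:
            # A range such as "2-3" marks a multi-token word: keep this line
            # and remember which component-token lines must be dropped.
            start, end = (int(part) for part in idx.split("-"))
            skip.update(range(start, end + 1))
            kept.append((form, tag))
        elif int(idx) not in skip:
            kept.append((form, tag))
    # Renumber sequentially, one line per unsplit word.
    return [(str(i), form, tag) for i, (form, tag) in enumerate(kept, start=1)]


development_turn = [
    ("1", "dovresti", "AUX"),
    ("2-3", "parlarmi", "VERB_PRON"),
    ("2", "parlar", "VERB"),
    ("3", "mi", "PRON"),
    ("4-5", "della", "ADP_A"),
    ("4", "di", "ADP"),
    ("5", "la", "DET"),
]
for row in to_test_format(development_turn):
    print("\t".join(row))
```

Applying such a conversion to the development data would let a system be trained and tested on lines of the same shape, although whether to do so is a design choice left to participants.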

System results will be evaluated only in this format; no lines may be added to those in the distributed test set files.
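Before submitting, participants may find it useful to verify that their output keeps exactly the token lines of the distributed test file. The sketch below is one possible check, written in Python; the function name `check_alignment` is hypothetical, and the comparison of only the first two columns (identifier and form) is an assumption, since the behaviour of the official evaluation scripts is not described here.

```python
from typing import List


def check_alignment(test_text: str, system_text: str) -> List[str]:
    """Return a list of problems; an empty list means the files line up."""

    def token_rows(text: str) -> List[List[str]]:
        rows = []
        for line in text.splitlines():
            line = line.strip()
            # Skip blank lines and "# key = value" metadata lines.
            if line and not line.startswith("#"):
                rows.append(line.split()[:2])  # identifier and form only
        return rows

    gold = token_rows(test_text)
    pred = token_rows(system_text)
    problems = []
    if len(gold) != len(pred):
        problems.append(
            f"token line count differs: {len(gold)} expected, {len(pred)} found"
        )
    for n, (g, p) in enumerate(zip(gold, pred), start=1):
        if g != p:
            problems.append(f"token line {n}: expected {g}, found {p}")
    return problems
```

An empty result only indicates that no token lines were added, removed or altered; it says nothing about the quality of the predicted tags.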

According to the timetable, data will be released for download by participants in the KIPOS2020 data repository, together with the Creative Commons license (CC BY-NC-SA 4.0) and a document providing detailed guidelines.
Following the indications published in the KIPOS2020 data repository, participants must fill in a form to accept the license and to be registered as authorized data users. Only after filling in the form will they receive by email the password necessary to unzip the downloaded material.