TUT and TUT-Penn are the parallel treebanks used for training and testing
in the EVALITA 2007 Parsing Task (at the AI*IA'07 Conference).
The parsing task of EVALITA aims to define and extend the current state of the art
in parsing Italian by encouraging the application of existing models to this language.
In order to represent the intrinsic complexity of this task and to account
for different approaches, the task is composed of subtasks in which
quantitative evaluation of different kinds of outputs, annotated with
different sets of features, will be performed.
This, we expect, will result in a clear picture of the problems that
lie ahead for Italian parsing and the kind of work necessary for
adapting existing parsing models to this language.
The availability of several formats for TUT allows the results we expect
from this task to be tested comparatively. The EVALITA Parsing Task is therefore
the first real opportunity to extend to Italian, and to compare, both parsing
models developed for the English Penn Treebank and those developed
according to dependency-based approaches.
See also the EVALITA Parsing Task page.
- DEVELOPMENT SETS (updated 1st March 2007):
- DEVELOPMENT SET for Dependency Parsing Task
- the improved version of the native dependency TUT
- a new version fully consistent with the CoNLL standard (resulting
from the application of the scripts below).
- DEVELOPMENT SET for Constituency Parsing Task
- SCRIPTS for data standardization according to CoNLL
(download) that include
- the script update_noCoNLL_indexes_deletingtraces.pl, which deletes all the traces annotated in TUT and updates all the
positional indexes of words (so as to avoid indexes of the form n.m); it also deletes the unannotated lines which
contain MARKERs, and removes the numbers in the column "features" for the numerals annotated
in the TUT corpus
- the script columnator_and_relationreductor-evalita.pl, which produces a 10-column CoNLL-style version of the treebank, also
reducing the number of relations by deleting the morpho-syntactic and semantic components annotated in the native
TUT relations, thus shrinking the tagset from about 200 to 50 items
- the updated version of the script tut2tutUTF8_accent.pl, which outputs a UTF-8 version of the data.
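The trace-deletion and renumbering step performed by the first script can be sketched as follows. This is an illustrative Python sketch, not the actual Perl script: the tuple layout (index, form, head) and the handling of heads are assumptions for illustration only.

```python
# Illustrative sketch (NOT the actual update_noCoNLL_indexes_deletingtraces.pl):
# drop trace tokens, whose index has the pointed form "n.m", then renumber the
# remaining tokens consecutively and remap head references through the new map.

def renumber_without_traces(tokens):
    """tokens: list of (index, form, head) tuples, indexes as strings."""
    # Keep only tokens with plain integer indexes (traces look like "2.1").
    kept = [t for t in tokens if "." not in t[0]]
    # Map old indexes to new consecutive positions; "0" (the root) is unchanged.
    mapping = {"0": "0"}
    for new_i, (old_i, _, _) in enumerate(kept, start=1):
        mapping[old_i] = str(new_i)
    # Heads pointing at deleted traces would need treebank-specific handling;
    # here they are simply left as they are.
    return [(mapping[i], form, mapping.get(h, h)) for i, form, h in kept]

sent = [("1", "Il", "2"), ("2", "giudice", "3"),
        ("2.1", "t", "3"), ("3", "decide", "0")]
print(renumber_without_traces(sent))
```

After the transformation, no index contains a point and the remaining tokens are numbered 1..N, which is what the CoNLL-compliant files below require.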
- TEST SETS (available 20th May 2007, updated 28th May):
The following lists describe the test sets that participants can use as input to their parsers:
- the test set for constituency parsing subtask
- the files newEVALITAtestset-codciv-4constituencypars.pen and newEVALITAtestset-newspap-4constituencypars.pen, where sentences are PoS-tagged
according to the tagset used by TUT-Penn, in UTF-8 encoding; the files new-nonUTF-EVALITAtestset-codciv-4constituencypars.pen and
new-nonUTF-EVALITAtestset-newspap-4constituencypars.pen, where sentences are PoS-tagged according to the same tagset, in
Roman encoding (like the development set sentences)
- the test set for the dependency parsing subtask (all these sets are encoded in UTF-8),
consisting of the following three pairs of files:
- the files EVALITAtestset-codciv-4dependencypars-native.tut and newEVALITAtestset-newspap-4dependencypars-native.tut, where sentences
are PoS-tagged according to the tagset used by TUT, and lines include the positional indexes of native TUT (including pointed indexes)
- the files EVALITAtestset-codciv-4dependencypars-native-withoutpoints.tut and newEVALITAtestset-newspap-4dependencypars-native-withoutpoints.tut,
where sentences are PoS-tagged according to the tagset used by TUT, but
lines include CoNLL-compliant positional indexes (without points)
- the files EVALITAtestset-codciv-4dependencypars-CoNLL.col and newEVALITAtestset-newspap-4dependencypars-CoNLL.col,
where sentences are PoS-tagged according to the tagset used by TUT and are split into
10 columns according to the CoNLL standard for dependency parsing evaluation (no pointed indexes included)
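For reference, the CoNLL-X format referred to above places each token on its own line with 10 tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPRE L is the official order; "_" marks an empty field), with a blank line between sentences. A minimal reader sketch, with Italian example values chosen purely for illustration:

```python
# Minimal sketch of reading CoNLL-X 10-column dependency data: one token per
# line, 10 tab-separated fields, blank line between sentences, "_" for empty.

FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

def read_conll(text):
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                       # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        assert len(cols) == 10, "CoNLL-X requires exactly 10 columns"
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences

sample = ("1\tIl\til\tART\tART\t_\t2\tdet\t_\t_\n"
          "2\tgiudice\tgiudice\tNOUN\tNOUN\t_\t0\troot\t_\t_\n")
print(read_conll(sample)[0][1]["DEPREL"])  # root
```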
- NEW!! GOLD STANDARD used for the EVALUATION of the contest results (available 25th June):
- the annotated test sets for the dependency parsing task (CoNLL format, i.e. 10 columns, without traces or pointed indexes)
- the annotated test sets for the constituency parsing task (Penn format)
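Dependency evaluation against such a gold standard is done token by token: the labeled attachment score (LAS) is the proportion of tokens with both the correct head and the correct dependency relation, while the unlabeled attachment score (UAS) requires only the correct head. A hedged sketch of the computation (this is not the official CoNLL eval.pl scorer, and the helper name is hypothetical):

```python
# Illustrative sketch of labeled/unlabeled attachment scores for dependency
# parsing evaluation. Assumes gold and predicted analyses share the same
# tokenization, with one (head, deprel) pair per token.

def attachment_scores(gold, pred):
    assert len(gold) == len(pred), "same tokenization required"
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)  # head only
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)        # head + label
    return uas, las

gold = [(2, "det"), (0, "root"), (2, "obj")]
pred = [(2, "det"), (0, "root"), (1, "obj")]   # third head is wrong
print(attachment_scores(gold, pred))           # both scores are 2/3 here
```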