TUT and TUT-Penn are the parallel treebanks used for training and testing
in the EVALITA 2007 Parsing Task (at the AI*IA'07 Conference).
The parsing task of EVALITA aims to define and extend the current state of the art
in parsing Italian by encouraging the application of existing models to this language.
In order to represent the intrinsic complexity of this task and to account
for different approaches, the task is composed of subtasks in which
quantitative evaluation of different kinds of outputs, annotated with
different sets of features, will be performed.
This, we expect, will result in a clear picture of the problems that
lie ahead for Italian parsing and the kind of work necessary for
adapting existing parsing models to this language.
The availability of several formats for TUT allows the results we expect
from this task to be tested comparatively. The EVALITA Parsing Task is therefore
the first real opportunity to extend to Italian, and to compare, both parsing
models developed for the English Penn Treebank and those developed
according to dependency-based approaches.
See also the EVALITA Parsing Task page.
- DEVELOPMENT SETS (updated 1st March 2007):
- DEVELOPMENT SET for Dependency Parsing Task
- the improved version of the native dependency TUT
- a new version fully consistent with the CoNLL standard (resulting
from the application of the scripts below).
- DEVELOPMENT SET for Constituency Parsing Task
- SCRIPTS for data standardization according to CoNLL
(download) that include
- the script update_noCoNLL_indexes_deletingtraces.pl, which deletes all the traces annotated in TUT and updates all the
positional indexes of words (so as to avoid indexes of the form n.m); it also deletes the unannotated lines which
contain MARKERs, and removes the numbers in the column "features" for the numerals annotated
in the TUT corpus
- the script columnator_and_relationreductor-evalita.pl, which produces a 10-column CoNLL-style version of the treebank, also
reducing the number of relations by deleting the morpho-syntactic and semantic components annotated in the native
TUT relations, thus shrinking the tagset from about 200 to 50 items
- the updated version of the script tut2tutUTF8_accent.pl, which outputs a UTF-8 version of the data.
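The trace-deletion and renumbering step performed by the first script can be sketched as follows. This is an illustrative Python sketch, not the actual Perl script: the tuple layout (index, form, head) and the handling of heads are assumptions for illustration only.

```python
# Illustrative sketch (NOT the actual update_noCoNLL_indexes_deletingtraces.pl):
# drop trace tokens, whose index has the pointed form "n.m", then renumber the
# remaining tokens consecutively and remap head references through the new map.

def renumber_without_traces(tokens):
    """tokens: list of (index, form, head) tuples, indexes as strings."""
    # Keep only tokens with plain integer indexes (traces look like "2.1").
    kept = [t for t in tokens if "." not in t[0]]
    # Map old indexes to new consecutive positions; "0" (the root) is unchanged.
    mapping = {"0": "0"}
    for new_i, (old_i, _, _) in enumerate(kept, start=1):
        mapping[old_i] = str(new_i)
    # Heads pointing at deleted traces would need treebank-specific handling;
    # here they are simply left as they are.
    return [(mapping[i], form, mapping.get(h, h)) for i, form, h in kept]

sent = [("1", "Il", "2"), ("2", "giudice", "3"),
        ("2.1", "t", "3"), ("3", "decide", "0")]
print(renumber_without_traces(sent))
```

After the transformation, no index contains a point and the remaining tokens are numbered 1..N, which is what the CoNLL-compliant files below require.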
- TEST SETS (available 20th May 2007, updated 28th May):
The following lists describe the test sets that participants can use as input to their parsers:
- the test set for constituency parsing subtask
- the files newEVALITAtestset-codciv-4constituencypars.pen and newEVALITAtestset-newspap-4constituencypars.pen, where sentences are PoS-tagged
according to the tagset used by TUT-Penn, in UTF-8 encoding; the files new-nonUTF-EVALITAtestset-codciv-4constituencypars.pen and
new-nonUTF-EVALITAtestset-newspap-4constituencypars.pen, where sentences are PoS-tagged according to the same tagset, in
Roman encoding (like the development set sentences)
- the test set for the dependency parsing subtask (all these sets are encoded in UTF-8),
consisting of the following three pairs of files:
- the files EVALITAtestset-codciv-4dependencypars-native.tut and newEVALITAtestset-newspap-4dependencypars-native.tut, where sentences
are PoS-tagged according to the tagset used by TUT, and lines include the positional indexes of native TUT (including pointed indexes)
- the files EVALITAtestset-codciv-4dependencypars-native-withoutpoints.tut and newEVALITAtestset-newspap-4dependencypars-native-withoutpoints.tut,
where sentences are PoS-tagged according to the tagset used by TUT, but
lines include CoNLL-compliant positional indexes (without points)
- the files EVALITAtestset-codciv-4dependencypars-CoNLL.col and newEVALITAtestset-newspap-4dependencypars-CoNLL.col,
where sentences are PoS-tagged according to the tagset used by TUT and are split into
10 columns according to the CoNLL standard for dependency parsing evaluation (no pointed indexes included)
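For reference, the CoNLL-X format referred to above places each token on its own line with 10 tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPRE L is the official order; "_" marks an empty field), with a blank line between sentences. A minimal reader sketch, with Italian example values chosen purely for illustration:

```python
# Minimal sketch of reading CoNLL-X 10-column dependency data: one token per
# line, 10 tab-separated fields, blank line between sentences, "_" for empty.

FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

def read_conll(text):
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                       # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        assert len(cols) == 10, "CoNLL-X requires exactly 10 columns"
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences

sample = ("1\tIl\til\tART\tART\t_\t2\tdet\t_\t_\n"
          "2\tgiudice\tgiudice\tNOUN\tNOUN\t_\t0\troot\t_\t_\n")
print(read_conll(sample)[0][1]["DEPREL"])  # root
```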
- NEW!! GOLD STANDARD used for the EVALUATION of the contest results (available 25th June):
- the annotated test sets for the dependency parsing task (CoNLL format, i.e. 10 columns, without traces or pointed indexes)
- the annotated test sets for the constituency parsing task (Penn format)
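Dependency evaluation against such a gold standard is done token by token: the labeled attachment score (LAS) is the proportion of tokens with both the correct head and the correct dependency relation, while the unlabeled attachment score (UAS) requires only the correct head. A hedged sketch of the computation (this is not the official CoNLL eval.pl scorer, and the helper name is hypothetical):

```python
# Illustrative sketch of labeled/unlabeled attachment scores for dependency
# parsing evaluation. Assumes gold and predicted analyses share the same
# tokenization, with one (head, deprel) pair per token.

def attachment_scores(gold, pred):
    assert len(gold) == len(pred), "same tokenization required"
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)  # head only
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)        # head + label
    return uas, las

gold = [(2, "det"), (0, "root"), (2, "obj")]
pred = [(2, "det"), (0, "root"), (1, "obj")]   # third head is wrong
print(attachment_scores(gold, pred))           # both scores are 2/3 here
```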