The validation of existing NLP models strongly depends on the possibility of generalizing their
results to data and languages other than those on which they have been trained and tested,
i.e. usually English. A valuable contribution to the validation of existing models and data comes from
initiatives that allow for consistent comparisons among approaches and
representation schemes by establishing shared standards, resources, tasks and evaluation practices
with reference to various languages.
From this perspective, the aim of the EVALITA events is to promote the development of language
technologies for the Italian language by providing a shared framework in which different
systems and approaches can be evaluated in a consistent manner.
As in its first edition, EVALITA 2007, held in September 2007, EVALITA 2009 aims at providing
a shared framework where participants' systems are evaluated on different tasks and linguistic
resources for Italian.
In the context of this evaluation campaign, the PARSING TASK of EVALITA aims to define and extend
the current state of the art in parsing Italian by encouraging the application of existing models and approaches
(both statistical and rule-based) and accounting for different annotation paradigms. The task is therefore
articulated into two different tracks, i.e. Dependency and Constituency Parsing.
Dependency Parsing
The track is articulated into two subtasks, which provide the possibility of testing parsers across data
differing in size, composition, granularity and annotation scheme:
- The main dependency subtask, which is obligatory for all participants in the Dependency Parsing track,
uses as its development set the Turin University Treebank (TUT) encoded in CoNLL format
(a minimal sketch for reading CoNLL-format data is given after this list). The treebank, developed by
the University of Torino, was also used as the reference treebank for dependency parsing in
Evalita 2007 and has recently been enlarged and newly released (rel. 2.1).
In this new release, the treebank notably includes a small portion of data shared with
Passage, an evaluation campaign for the parsing of French, extracted from the JRC-Acquis
Multilingual Parallel Corpus.
- The pilot dependency subtask, which is optional for participants in the Dependency
Parsing track, uses as its development set the TANL dependency annotated corpus, jointly
developed by the Istituto di Linguistica Computazionale (ILC-CNR) and the University of Pisa
in the framework of the project
Analisi di Testi per il Semantic Web e il Question Answering. The TANL dependency annotated
corpus originated as a revision of the ISST-CoNLL corpus used in the multilingual track of the
CoNLL-2007 shared task,
which was in turn built starting from the
Italian Syntactic-Semantic Treebank,
in particular its morpho-syntactic and syntactic dependency annotation levels.
All participants are strongly encouraged to perform both the dependency subtasks.
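Since the main subtask distributes its data in CoNLL format, a minimal reading sketch may help participants get started. It assumes the standard 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, plus two projective columns) with blank lines between sentences; the official guidelines remain the authoritative reference for the exact columns and values used in the task, and the file name below is purely hypothetical.

    # Minimal reader for CoNLL-format dependency data (illustrative sketch only).
    # It assumes the standard 10-column, tab-separated CoNLL-X layout with blank
    # lines separating sentences; the official guidelines define the actual format.

    def read_conll(path):
        """Yield each sentence as a list of token dictionaries."""
        sentence = []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:                 # a blank line ends the current sentence
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                cols = line.split("\t")
                sentence.append({
                    "id": int(cols[0]),      # token position within the sentence (1-based)
                    "form": cols[1],         # word form
                    "lemma": cols[2],
                    "cpostag": cols[3],      # coarse-grained part of speech
                    "postag": cols[4],       # fine-grained part of speech
                    "feats": cols[5],        # morphological features
                    "head": int(cols[6]),    # index of the governor (0 for the root)
                    "deprel": cols[7],       # dependency relation label
                })
            if sentence:                     # flush the last sentence if no trailing blank line
                yield sentence

    # Example: count sentences and tokens in a development file (hypothetical file name).
    n_sent = n_tok = 0
    for sent in read_conll("tut-newspaper-dev.conll"):
        n_sent += 1
        n_tok += len(sent)
    print(n_sent, "sentences,", n_tok, "tokens")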
Constituency Parsing
It consists of a single task based on the corpus of the Turin University Treebank annotated in a Penn-like format
(TUT-Penn), developed by the University of Torino through a fully automatic conversion applied to the TUT data, and also
used as the reference treebank for constituency parsing in Evalita 2007.
By showing a larger distance from the state of the art for constituency than for dependency parsing of Italian,
the results of Evalita 2007 confirmed the hypothesis, known in the literature, that dependency structures are
more adequate for the representation of Italian, even though the task is based on the Penn treebank
format, which is the most widespread and most widely parsed format in the world.
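TUT-Penn data are bracketed trees in the Penn style. Purely as an illustration, and using made-up labels rather than the actual TUT-Penn tag sets (documented in the materials listed below), a minimal reader for Penn-style bracketing could be sketched as follows:

    # Minimal parser for Penn-style bracketed trees (illustrative sketch only).
    # TUT-Penn follows a Penn-like bracketing, but its label and tag inventories
    # are specific to the treebank; see the official tag-set description.

    import re

    TOKEN = re.compile(r"\(|\)|[^\s()]+")

    def parse_tree(text):
        """Turn a bracketed tree string into nested [label, child, ...] lists."""
        tokens = TOKEN.findall(text)
        pos = 0

        def parse():
            nonlocal pos
            assert tokens[pos] == "(", "expected an opening bracket"
            pos += 1
            node = [tokens[pos]]             # constituent label
            pos += 1
            while tokens[pos] != ")":
                if tokens[pos] == "(":
                    node.append(parse())     # nested constituent
                else:
                    node.append(tokens[pos]) # terminal word
                    pos += 1
            pos += 1                         # consume the closing bracket
            return node

        return parse()

    # Example with made-up, purely illustrative labels:
    print(parse_tree("(S (NP (ART Il) (NOUN governo)) (VP (VERB approva) (NP (ART la) (NOUN legge))))"))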
In order to foster the development of data and evidence for the comparison of different paradigms and
annotations, the unannotated test data will be the same for the Constituency Parsing track and the
Dependency Parsing main subtask, while the unannotated test data for the Dependency Parsing main
subtask and the Dependency Parsing pilot subtask will partially overlap.
All participants are strongly encouraged to take part in multiple subtasks and tracks.
- Evalita 2009 homepage, where information about the organization and all tasks is available
- Evalita 2009 registration web page (registration closes on September 10th, 2009)
- Official guidelines for all the Evalita Parsing Task tracks and subtasks, with all the details for participants
- For the Dependency Parsing track only:
- Description of the mainDepPar tag sets used in the TUT treebank for morphological and
functional-syntactic annotation
- Description of the pilotDepPar tag sets used in the TANL corpus for morphological and dependency annotation
- For the Constituency Parsing track only:
- Description of the mainDepPar tag sets used in the TUT-Penn treebank for morphological and
functional-syntactic annotation
- Set of examples for the comparison between the test-set format and the results/development-set format
- Scripts for the conversion of data from TUT format into CoNLL format
The data in the datasets are covered by licenses.
The Turin University Treebank (TUT), by Cristina Bosco, Leonardo Lesmo, Vincenzo Lombardo,
Alessandro Mazzei and Livio Robaldo, is licensed under a Creative Commons
Attribution-Noncommercial-Share Alike 2.5 Italy License.
The TANL Dependency annotated corpus, by the University of Pisa (Dipartimento di Informatica,
Dipartimento di Linguistica) and the Istituto di Linguistica Computazionale (ILC-CNR), is licensed
under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.
- Development data
September 10th, 2009: NEW VERSION AVAILABLE for the DEPENDENCY MAIN TASK
(Passage corpus) and for the CONSTITUENCY TASK (newspaper corpus).
The current version of the data for the dependency main task was last updated on August 3rd, 2009
(newspaper and civillaw) and on September 10th (Passage).
The current version of the data for the constituency task was last updated on August 3rd, 2009 (civillaw) and
on September 10th (newspaper).
(Previous versions can be requested from the organizers.)
- For the DEPENDENCY parsing track:
- MAIN subtask:
the development data are the three corpora of release 2.1 of TUT; they can be
downloaded in both the original TUT format and the CoNLL format from the following links:
civillaw,
newspaper,
JRC-Passage-Evalita
- PILOT subtask:
the development data are articulated into two different sets:
- Training Corpus, containing data annotated according to the TANL specifications to
be used for training participating systems
- Development Corpus, a smaller corpus to be used for development
The PILOT subtask data are available at this link
- For the CONSTITUENCY parsing track:
the development data are the two corpora of release 2.1 of TUT in TUT-Penn format; they
can be downloaded
from the following links: civillaw,
newspaper
- Test data:
September 10th, 2009: TEST SETS AVAILABLE for the DEPENDENCY MAIN
TASK and the CONSTITUENCY TASK
- For the DEPENDENCY parsing track, MAIN subtask:
download
- For the CONSTITUENCY parsing track:
download
- October 5th: GOLD STANDARD annotated TEST SETS, as used in the
evaluation, AVAILABLE
- For the DEPENDENCY parsing track, MAIN subtask:
download
- For the DEPENDENCY parsing track, PILOT subtask:
download
- For the CONSTITUENCY parsing track:
download
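With the gold-standard files, participants can score their output offline. The sketch below is only an illustration: it computes the labeled and unlabeled attachment scores commonly used with CoNLL-format dependency data, assumes the standard HEAD and DEPREL column positions, uses hypothetical file names, and does not reproduce the official evaluation scripts or their exact settings (e.g. the treatment of punctuation).

    # Illustrative scoring of a system's CoNLL-format output against the gold
    # standard via labeled (LAS) and unlabeled (UAS) attachment scores. The
    # official Evalita evaluation procedure remains the authoritative reference.

    def read_arcs(path):
        """Return one (head, deprel) pair per token line of a CoNLL file."""
        arcs = []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:                      # skip sentence-separating blank lines
                    continue
                cols = line.split("\t")
                arcs.append((cols[6], cols[7]))   # HEAD and DEPREL columns
        return arcs

    def attachment_scores(gold_path, system_path):
        gold, system = read_arcs(gold_path), read_arcs(system_path)
        assert len(gold) == len(system), "gold and system token counts must match"
        total = len(gold)
        uas = sum(g[0] == s[0] for g, s in zip(gold, system)) / total
        las = sum(g == s for g, s in zip(gold, system)) / total
        return las, uas

    # Hypothetical file names:
    las, uas = attachment_scores("depmain-test-gold.conll", "my-parser-output.conll")
    print(f"LAS = {las:.2%}  UAS = {uas:.2%}")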
Any further updated version of the data, if available, will be announced to participants and published on this site.
Requests for information and feedback about the data are welcome and can be addressed to
bosco[at]di.unito.it.
Updated deadlines are available at the Evalita 2009 deadlines web page.