The validation of existing NLP models strongly depends on the possibility of generalizing their
results to data and languages other than those on which they have been trained and tested,
i.e. usually English. A valuable contribution to the validation of existing models and data comes from
initiatives that allow for consistent comparisons among approaches and
representation schemes by establishing shared standards, resources, tasks and evaluation practices
with reference to various languages.
From this perspective, the aim of the EVALITA events is to promote the development of language
technologies for the Italian language by providing a shared framework in which different
systems and approaches can be evaluated in a consistent manner.
As in its first edition, EVALITA 2007, held in September 2007, EVALITA 2009 aims at providing
a shared framework where participants' systems are evaluated on different tasks and linguistic
resources for Italian.
In the context of this evaluation campaign, the PARSING TASK of EVALITA aims to define and extend
the current state of the art in parsing Italian by encouraging the application of existing models and approaches
(both statistical and rule-based) and accounting for different annotation paradigms. The task is therefore
articulated into two different tracks, i.e. Dependency and Constituency Parsing.
Dependency Parsing
The track is articulated into two subtasks, which provide the possibility of testing parsers across data
differing in size, composition, granularity and annotation scheme:
- The main dependency subtask, which is obligatory for all participants in the Dependency Parsing track,
uses as its development set the Turin University Treebank (TUT) encoded in CoNLL format
(a minimal sketch for reading CoNLL-format data is given after this list). The treebank, developed by
the University of Torino, was also used as the reference treebank for dependency parsing in
Evalita 2007 and has recently been enlarged and newly released (rel. 2.1).
In this new release, the treebank notably includes a small portion of data shared with
Passage, an evaluation campaign for the parsing of French, extracted from the JRC-Acquis
Multilingual Parallel Corpus.
- The pilot dependency subtask, which is optional for participants in the Dependency
Parsing track, uses as its development set the TANL dependency annotated corpus, jointly
developed by the Istituto di Linguistica Computazionale (ILC-CNR) and the University of Pisa
in the framework of the project
Analisi di Testi per il Semantic Web e il Question Answering. The TANL dependency annotated
corpus originated as a revision of the ISST-CoNLL corpus used in the multilingual track of the
CoNLL-2007 shared task,
which was in turn built starting from the
Italian Syntactic-Semantic Treebank,
in particular its morpho-syntactic and syntactic dependency annotation levels.
All participants are strongly encouraged to perform both the dependency subtasks.
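Since the main subtask distributes its data in CoNLL format, a minimal reading sketch may help participants get started. It assumes the standard 10-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, plus two projective columns) with blank lines between sentences; the official guidelines remain the authoritative reference for the exact columns and values used in the task, and the file name below is purely hypothetical.

    # Minimal reader for CoNLL-format dependency data (illustrative sketch only).
    # It assumes the standard 10-column, tab-separated CoNLL-X layout with blank
    # lines separating sentences; the official guidelines define the actual format.

    def read_conll(path):
        """Yield each sentence as a list of token dictionaries."""
        sentence = []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:                 # a blank line ends the current sentence
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                cols = line.split("\t")
                sentence.append({
                    "id": int(cols[0]),      # token position within the sentence (1-based)
                    "form": cols[1],         # word form
                    "lemma": cols[2],
                    "cpostag": cols[3],      # coarse-grained part of speech
                    "postag": cols[4],       # fine-grained part of speech
                    "feats": cols[5],        # morphological features
                    "head": int(cols[6]),    # index of the governor (0 for the root)
                    "deprel": cols[7],       # dependency relation label
                })
            if sentence:                     # flush the last sentence if no trailing blank line
                yield sentence

    # Example: count sentences and tokens in a development file (hypothetical file name).
    n_sent = n_tok = 0
    for sent in read_conll("tut-newspaper-dev.conll"):
        n_sent += 1
        n_tok += len(sent)
    print(n_sent, "sentences,", n_tok, "tokens")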
Constituency Parsing
It consists of a single task based on the corpus of the Turin University Treebank annotated in a Penn-like format
(TUT-Penn), developed by the University of Torino through a fully automatic conversion applied to the TUT data, and also
used as the reference treebank for constituency parsing in Evalita 2007.
By showing a larger distance from the state of the art for constituency than for dependency parsing of Italian,
the results of Evalita 2007 confirmed the hypothesis, known in the literature, that dependency structures are
more adequate for the representation of Italian, even though the task is based on the Penn treebank
format, which is the most widespread and most widely parsed format in the world.
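TUT-Penn data are bracketed trees in the Penn style. Purely as an illustration, and using made-up labels rather than the actual TUT-Penn tag sets (documented in the materials listed below), a minimal reader for Penn-style bracketing could be sketched as follows:

    # Minimal parser for Penn-style bracketed trees (illustrative sketch only).
    # TUT-Penn follows a Penn-like bracketing, but its label and tag inventories
    # are specific to the treebank; see the official tag-set description.

    import re

    TOKEN = re.compile(r"\(|\)|[^\s()]+")

    def parse_tree(text):
        """Turn a bracketed tree string into nested [label, child, ...] lists."""
        tokens = TOKEN.findall(text)
        pos = 0

        def parse():
            nonlocal pos
            assert tokens[pos] == "(", "expected an opening bracket"
            pos += 1
            node = [tokens[pos]]             # constituent label
            pos += 1
            while tokens[pos] != ")":
                if tokens[pos] == "(":
                    node.append(parse())     # nested constituent
                else:
                    node.append(tokens[pos]) # terminal word
                    pos += 1
            pos += 1                         # consume the closing bracket
            return node

        return parse()

    # Example with made-up, purely illustrative labels:
    print(parse_tree("(S (NP (ART Il) (NOUN governo)) (VP (VERB approva) (NP (ART la) (NOUN legge))))"))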
In order to foster the development of data and evidence for the comparison of different paradigms and
annotations, the unannotated test data will be the same for the Constituency Parsing track and the
Dependency Parsing main subtask, while the unannotated test data for the Dependency Parsing main
subtask and the Dependency Parsing pilot subtask will partially overlap.
All participants are strongly encouraged to take part in multiple subtasks and tracks.
- Evalita 2009 homepage, where information about the organization and all tasks is available
- Evalita 2009 registration web page (registration closes on September 10th, 2009)
- Official guidelines for all the Evalita Parsing Task tracks and subtasks, with all the details for participants
- For the Dependency Parsing track only:
- Description of the mainDepPar tag sets used in the TUT treebank for morphological and
functional-syntactic annotation
- Description of the pilotDepPar tag sets used in the TANL corpus for morphological and dependency annotation
- For the Constituency Parsing track only:
- Description of the mainDepPar tag sets used in the TUT-Penn treebank for morphological and
functional-syntactic annotation
- Set of examples for the comparison between the test-set format and the results/development-set format
- Scripts for the conversion of data from TUT format into CoNLL format
The data in the datasets are covered by licenses.
The Turin University Treebank (TUT), by Cristina Bosco, Leonardo Lesmo, Vincenzo Lombardo,
Alessandro Mazzei and Livio Robaldo, is licensed under a Creative Commons
Attribution-Noncommercial-Share Alike 2.5 Italy License.
The TANL Dependency annotated corpus, by the University of Pisa (Dipartimento di Informatica,
Dipartimento di Linguistica) and the Istituto di Linguistica Computazionale (ILC-CNR), is licensed
under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.
- Development data
September 10th, 2009: NEW VERSION AVAILABLE for the DEPENDENCY MAIN TASK
(Passage corpus) and for the CONSTITUENCY TASK (newspaper corpus).
The current version of the data for the dependency main task was last updated on August 3rd, 2009
(newspaper and civillaw) and on September 10th (Passage).
The current version of the data for the constituency task was last updated on August 3rd, 2009 (civillaw) and
on September 10th (newspaper).
(Previous versions can be requested from the organizers.)
- For the DEPENDENCY parsing track:
- MAIN subtask:
the development data are the three corpora of release 2.1 of TUT; they can be
downloaded in both the original TUT format and the CoNLL format from the following links:
civillaw,
newspaper,
JRC-Passage-Evalita
- PILOT subtask:
the development data are articulated into two different sets:
- Training Corpus, containing data annotated according to the TANL specifications to
be used for training participating systems
- Development Corpus, a smaller corpus to be used for development
The PILOT subtask data are available at this link
- For the CONSTITUENCY parsing track:
the development data are the two corpora of release 2.1 of TUT in TUT-Penn format; they
can be downloaded
from the following links: civillaw,
newspaper
- Test data:
September 10th, 2009: TEST SETS AVAILABLE for the DEPENDENCY MAIN
TASK and the CONSTITUENCY TASK
- For the DEPENDENCY parsing track, MAIN subtask:
download
- For the CONSTITUENCY parsing track:
download
- October 5th: GOLD STANDARD annotated TEST SETS, as used in the
evaluation, AVAILABLE
- For the DEPENDENCY parsing track, MAIN subtask:
download
- For the DEPENDENCY parsing track, PILOT subtask:
download
- For the CONSTITUENCY parsing track:
download
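With the gold-standard files, participants can score their output offline. The sketch below is only an illustration: it computes the labeled and unlabeled attachment scores commonly used with CoNLL-format dependency data, assumes the standard HEAD and DEPREL column positions, uses hypothetical file names, and does not reproduce the official evaluation scripts or their exact settings (e.g. the treatment of punctuation).

    # Illustrative scoring of a system's CoNLL-format output against the gold
    # standard via labeled (LAS) and unlabeled (UAS) attachment scores. The
    # official Evalita evaluation procedure remains the authoritative reference.

    def read_arcs(path):
        """Return one (head, deprel) pair per token line of a CoNLL file."""
        arcs = []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:                      # skip sentence-separating blank lines
                    continue
                cols = line.split("\t")
                arcs.append((cols[6], cols[7]))   # HEAD and DEPREL columns
        return arcs

    def attachment_scores(gold_path, system_path):
        gold, system = read_arcs(gold_path), read_arcs(system_path)
        assert len(gold) == len(system), "gold and system token counts must match"
        total = len(gold)
        uas = sum(g[0] == s[0] for g, s in zip(gold, system)) / total
        las = sum(g == s for g, s in zip(gold, system)) / total
        return las, uas

    # Hypothetical file names:
    las, uas = attachment_scores("depmain-test-gold.conll", "my-parser-output.conll")
    print(f"LAS = {las:.2%}  UAS = {uas:.2%}")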
Any further updated version of the data, if available, will be announced to participants and published on this site.
Requests for information and feedback about the data are welcome and can be addressed to
bosco[at]di.unito.it.
Updated deadlines are available at the Evalita 2009 deadlines web page.