
The validation of existing NLP models strongly depends on whether their results generalize to data and languages other than those on which they were trained and tested, i.e. usually English. A valuable contribution to the validation of existing models and data comes from initiatives that allow consistent comparisons among approaches and representation schemes, establishing shared standards, resources, tasks and evaluation practices for various languages.
From this perspective, the aim of the EVALITA events is to promote the development of language technologies for the Italian language by providing a shared framework in which different systems and approaches can be evaluated in a consistent manner. As in its first edition, held in September 2007, EVALITA 2009 aims to provide a shared framework in which participants' systems are evaluated on different tasks and linguistic resources for Italian.

In the context of this evaluation campaign, the PARSING TASK of EVALITA aims to define and extend the current state of the art in parsing Italian by encouraging the application of existing models and approaches (statistical and rule-based alike) and by accounting for different annotation paradigms. The task is therefore articulated into two tracks: Dependency Parsing and Constituency Parsing.

To foster the development of data and evidence for comparing different paradigms and annotations, the unannotated test data will be the same for the Constituency and Dependency Parsing main subtasks, and the unannotated test data for the Dependency Parsing main subtask will overlap with that of the Dependency Parsing pilot subtask.
All participants are strongly encouraged to take part in multiple subtasks and tracks.



  • For the Constituency Parsing track only: description of the main tag sets used by the TUT-Penn treebank for morphological and functional-syntactic annotation
  • Set of examples for comparing the test-set format with the results/development-set format
  • Scripts for converting data from TUT format into CoNLL format
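The CoNLL format mentioned above is the standard tab-separated dependency format used in the CoNLL-X shared tasks: one token per line with ten fields, and sentences separated by blank lines. The following is a minimal reader sketch (not one of the official conversion scripts); the example sentence is invented for illustration.

```python
# Minimal sketch of reading CoNLL-X formatted dependency data,
# as used for the Dependency Parsing track. This is NOT the official
# TUT-to-CoNLL conversion script, only an illustration of the format.

# The ten CoNLL-X fields, in order.
CONLL_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]

def read_conll(lines):
    """Yield sentences as lists of {field: value} token dicts."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        sentence.append(dict(zip(CONLL_FIELDS, values)))
    if sentence:  # flush a final sentence with no trailing blank line
        yield sentence

# Hypothetical two-token example ("Il gatto") in CoNLL-X layout.
sample = [
    "1\tIl\til\tR\tRD\tnum=s\t2\tdet\t_\t_",
    "2\tgatto\tgatto\tS\tS\tnum=s\t0\tROOT\t_\t_",
    "",
]
sentences = list(read_conll(sample))
```

Here `sentences[0]` holds the single parsed sentence; each token dict exposes the `head` and `deprel` fields that the dependency evaluation compares against the gold annotation.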



    The data in the datasets are distributed under Creative Commons licenses.
    Turin University Treebank (TUT) by Cristina Bosco, Leonardo Lesmo, Vincenzo Lombardo, Alessandro Mazzei, Livio Robaldo
    is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.

    The TANL Dependency annotated corpus by the University of Pisa (Dipartimento di Informatica, Dipartimento di Linguistica) and the Istituto di Linguistica Computazionale (ILC-CNR) is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.

    Any further updated versions of the data, if available, will be announced to participants and published on this site.
    Requests for information and feedback about the data are welcome and can be addressed to bosco[at]di.unito.it.



    Updated deadlines are available at the Evalita 2009 deadlines web page.
