
TUT is a project for the development of a collection of morphologically, syntactically and semantically annotated Italian sentences; it includes:


The native TUT format

TUT adopts a representation format based on the dependency paradigm and centred upon the notion of predicate-argument structure, as described with reference to major Italian linguistic phenomena in the Linguistic notes. The choice of this paradigm, which describes syntactic structures using dependency relations between pairs of words, is motivated by the partial configurationality of the reference language, i.e. Italian is a free word order language.
In TUT the dependency relations are annotated following the Augmented Relational Structure (ARS), where each relation is implemented as a feature structure that can include values for a morpho-syntactic, a functional-syntactic and a syntactic-semantic component.

The need for a description of grammatical relations that is more detailed and closer to semantics has led to the development of a rich and flexible grammatical relation system for TUT, i.e. around 250 relations that can be annotated at a variable degree of specificity according to a hierarchical organization. When the annotator cannot select a specific relation to label the dependency edge linking two words, he/she can select a more generic relation from the higher levels of this taxonomy.
To represent some phenomena involving discontinuity and deletions, as well as pro-drop subjects, TUT has been enriched with a trace-filler notation (see section 1. Traces and co-indexing in the Linguistic notes for examples and further details).
See the following example:

Each line contains all the information concerning a single node-word X:
the position of X within the linear order of the sentence,
the morphological features of X (in round brackets),
the position of the head-word Y from which X depends, and the name of the relation linking X to Y (both in square brackets).
According to the ARS, the name of each relation may include three components separated by hyphens:
MORPHOSYNTACTIC - FUNCTIONALSYNTACTIC - SEMANTIC (the symbol + is used as a separator between 2 parts of a single component).
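
The line layout and the relation naming scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact layout used here (position, word, features in round brackets, then the head position and relation separated by a semicolon in square brackets), the sample line, and the names TutNode, parse_relation and parse_line are hypothetical and are not taken from the treebank files. The relation name is split on hyphens into components and each component on + into parts, without guessing which of the three ARS components is present when fewer than three are given.

```python
import re
from typing import List, NamedTuple


class TutNode(NamedTuple):
    position: str              # linear position of the word (kept as text)
    form: str                  # the word itself
    morphology: List[str]      # features found inside the round brackets
    head: str                  # position of the head word
    relation: List[List[str]]  # hyphen-separated components, each split on "+"


# ASSUMED layout: position, word, (features), [head;RELATION].
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\((.*)\)\s+\[([^;\]]+);([^\]]+)\]$")


def parse_relation(name: str) -> List[List[str]]:
    """Split a relation name into components ("-") and their parts ("+")."""
    return [component.split("+") for component in name.split("-")]


def parse_line(line: str) -> TutNode:
    """Parse one node line that follows the assumed layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    position, form, features, head, relation = match.groups()
    return TutNode(position, form, features.split(), head, parse_relation(relation))


# Illustrative line only, not copied from the treebank.
node = parse_line("2 ragazzo (RAGAZZO NOUN COMMON M SING) [1;DET+DEF-ARG]")
print(node.head, node.relation)
```

Positions are kept as strings rather than integers so that the sketch does not presuppose how traces are numbered.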


The converted formats

Parallel treebanks may serve as a suitable infrastructure for comparing parsers from different linguistic frameworks, thus contributing to the investigation of why state-of-the-art results are hard to reproduce on annotations other than the Penn Wall Street Journal and on languages other than English.
A conversion tool, called the TUTtoPENN converter, has been applied to the native TUT in order to generate a Penn-like annotation (called TUT-Penn). As a side effect, two other formats have been developed that show intermediate layers of variation/similarity with respect to TUT and Penn in terms of both richness of functional-syntactic information (i.e. amount and specificity of grammatical relations) and type of linguistic framework (i.e. constituency versus dependency, or minimal versus maximal projection). For more information and to download the conversion tool, see the TUTtoPENN converter web page.
The following image shows the cascade of formats (a rich selection of examples in parallel formats is available in Parallel annotations in TUT formats):

Moreover, a treebank for Italian based on Combinatory Categorial Grammar, called CCG-TUT, has been derived from Constituency-TUT (see the CCG-TUT web page for more details). In the conversion process that generates CCG-TUT, TUT dependency trees are mapped onto constituency trees (i.e. binarized ConsTUT trees enriched with labels that distinguish heads, arguments and modifiers), which are then mapped onto CCG derivations.

The procedures for the treebank development

The TUT development exploits the TULE dependency parser.
Formats other than TUT are obtained by automatic conversion. See in particular the CCG-TUT web page for the conversion to CCG format.



The treebank currently consists of 2,860 Italian sentences, organized in five sections, plus a section composed of 200 English sentences.


TUT corpora and data

Currently the treebank includes five sections:

Note that the sections NEWS and VEDCH include isolated sentences, while the other three sections involve full paragraphs (Articles of the Code, European Directives and Wikipedia articles respectively).
Finally, it is possible to download 200 English sentences annotated in the TUT standard format, i.e. the ENGLISH section. This could enable people not acquainted with Italian to get a feel for how the annotation schema is organized. For peculiar features (and extensions) of the TUT scheme for representing English structures, see section 16. Applying TUT on English in the Linguistic notes.

The following table shows the size of the sections in terms of sentences and tokens in TUT native and CoNLL (where available) formats:

section       sentences  tokens in TUT format  tokens in CoNLL format
CODICECIVILE  1,100      30,669                28,048
NEWS          700        19,134                18,044
VEDCH         400        14,390                12,508
EUDIR         201        7,955                 7,426
WIKI          459        15,766                14,746
ENGLISH       200        5,940                 -
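
As a quick cross-check, the per-section figures can be summed and compared with the total of 2,860 Italian sentences stated earlier. The short Python sketch below only re-enters the numbers from the table; the variable names are illustrative.

```python
# Figures copied from the table: (sentences, TUT tokens, CoNLL tokens or None).
sizes = {
    "CODICECIVILE": (1100, 30669, 28048),
    "NEWS": (700, 19134, 18044),
    "VEDCH": (400, 14390, 12508),
    "EUDIR": (201, 7955, 7426),
    "WIKI": (459, 15766, 14746),
    "ENGLISH": (200, 5940, None),  # no CoNLL figure is given for this section
}

# The five Italian sections are everything except ENGLISH.
italian = [row for name, row in sizes.items() if name != "ENGLISH"]

italian_sentences = sum(row[0] for row in italian)
italian_tut_tokens = sum(row[1] for row in italian)
italian_conll_tokens = sum(row[2] for row in italian)

print(italian_sentences)     # 2860
print(italian_tut_tokens)    # 87914
print(italian_conll_tokens)  # 80772
```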

Download the treebank

Creative Commons License
Turin University Treebank (TUT) by Cristina Bosco, Leonardo Lesmo, Vincenzo Lombardo, Alessandro Mazzei, Livio Robaldo
is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.

All the data currently included in the treebank are in UTF-8 encoding and are delivered in various formats (see project and publications for a description of the formats):

The available data can be downloaded from the following table (in .zip):

section       raw text          native TUT        TUT in CoNLL       TUT-Penn           CCG-TUT
NEWS          NEWS.raw          NEWS.tut          NEWS.conl          NEWS.penn          CCG-TUT
VEDCH         VEDCH.raw         VEDCH.tut         VEDCH.conl         VEDCH.penn
CODICECIVILE  CODICECIVILE.raw  CODICECIVILE.tut  CODICECIVILE.conl  CODICECIVILE.penn
EUDIR         EUDIR.raw         EUDIR.tut         EUDIR.conl         EUDIR.penn
WIKI          WIKI.raw          WIKI.tut          WIKI.conl          WIKI.penn          *
ENGLISH       ENGLISH.raw       ENGLISH.tut       **                 **                 **
Note:
*The CCG format is currently not available for the Wiki section of TUT.
**For the English section only the TUT version is currently available.

All the data that can be downloaded from the table were updated on November 22nd, 2010, with the exception of the CCG-TUT, which will soon be released anew, and the English section.

Older versions of the treebank, as well as the one used in the Italian contest Evalita 2009, are also available:

TUT old versions

The following table shows the links for downloading the treebank in the native and in constituency-based formats. See the CCG-TUT web page for downloading the treebank in CCG format and for more information about it.

Raw text*  Release  TUT format  TUT in CoNLL format  Cons-TUT format  TUT-Penn format


civil law
newspaper
JRC-Passage-Evalita
Evalita**
civil law
and newspaper
civil law
and newspaper
1.1 civillaw
newspaper
2.1** civillaw
newspaper
JRC-Passage-Evalita
civillaw
newspaper
JRC-Passage-Evalita
civillaw
newspaper
wikipedia
2.2 wikipedia
wikipedia
English
E0.1 English

*The texts originally collected include accented characters written using different strategies (e.g. "à" and "a'") and different types of encoding (from ISO-Latin to UTF-8). As regards the characters, some words can therefore occur in the treebank in various written forms (e.g. "città" and "citta'"), not all of which correspond to those of the Italian dictionary (unlike the lemmatized form that TUT includes for each word, which is e.g. "CITTÀ" for both "città" and "citta'"). As regards the encoding, since 2007 all released TUT annotations have been in standard UTF-8.
**The releases Evalita (development set for Evalita 2007) and 2.1 (development set for Evalita 2009) include the data in both the original TUT and CoNLL formats.

