TUT is a project for the development of a collection of morphologically, syntactically and semantically annotated
Italian sentences; it includes:
TUT adopts a representation format based on the dependency paradigm centred upon the notion of
predicate-argument structure, as described with reference to major Italian linguistic phenomena in
the Linguistic notes.
The choice of this paradigm, which describes syntactic structures using dependency relations between pairs of
words, is motivated by the partial configurationality of the reference language, i.e. Italian is a free word order
language.
In TUT the dependency relations are annotated by following the Augmented Relational Structure (ARS) where
each relation is implemented as a feature structure that can include values for a morpho-syntactic, a
functional-syntactic and a syntactic-semantic component.

The need for a description of grammatical relations that is more detailed and closer to semantics has
led to the development of a rich and flexible grammatical relation system for TUT: around 250 relations,
annotated at a variable degree of specification according to a hierarchical organization. When the annotator
cannot select a specific relation to label the dependency edge linking two words, a more generic
relation can be selected from the higher levels of this taxonomy.
To represent phenomena involving discontinuity and deletion, as well as pro-drop subjects, TUT has been
enriched with a trace-filler notation (see section 1. Traces and co-indexing in
Linguistic notes for
examples and further details).
See the following example:
Each line contains all the information concerning a single node-word X:
the position of X within the linear
order of the sentence,
the morphological features of X (in round brackets),
the position of the
head-word Y from which X depends, and the name of the relation linking X to Y (both in square brackets).
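The line structure described above can be read with a short script. The following is only a minimal sketch in Python: the exact field layout assumed by the regular expression (position, word form, features in round brackets, then head position and relation separated by a semicolon in square brackets), the names `LINE_RE`, `Node` and `parse_line`, and the sample line are all assumptions made for illustration, not part of the TUT distribution.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed layout (illustrative only):
#   POSITION WORD (MORPHOLOGICAL FEATURES) [HEAD;RELATION]
LINE_RE = re.compile(
    r"^(?P<position>\S+)\s+"                          # position of X in the sentence
    r"(?P<word>\S+)\s+"                               # word form of X
    r"\((?P<morph>[^)]*)\)\s+"                        # morphological features, round brackets
    r"\[(?P<head>[^;\]]+);(?P<relation>[^\]]+)\]$"    # head position and relation name
)


class Node(NamedTuple):
    position: str      # kept as a string, since positions need not be plain integers
    word: str
    morph: List[str]
    head: str
    relation: str


def parse_line(line: str) -> Optional[Node]:
    """Return a Node for a line in the assumed layout, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return Node(m["position"], m["word"], m["morph"].split(), m["head"], m["relation"])


# Hypothetical sample line, written only to exercise the parser.
node = parse_line("2 mangia (MANGIARE VERB MAIN IND PRES TRANS 3 SING) [0;TOP-VERB]")
print(node)
```

Positions are kept as strings so that the sketch makes no assumption about how the annotation numbers its nodes.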
According to the ARS, the name of each relation may include three components separated by hyphens:
MORPHOSYNTACTIC - FUNCTIONALSYNTACTIC - SEMANTIC (the symbol + is used as a separator between 2
parts of a single component).
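As a rough illustration of this naming scheme, a relation name can be decomposed by splitting first on hyphens and then on `+`. The sketch below is in Python; the function names and the sample relation names are hypothetical, and the mapping to the three component labels is applied only when exactly three components are present, since the text above says a name may (not must) include all three.

```python
from typing import Dict, List, Optional

COMPONENT_NAMES = ("morphosyntactic", "functional_syntactic", "semantic")


def split_relation(name: str) -> List[List[str]]:
    """Split a relation name into hyphen-separated components,
    and each component into its '+'-separated parts."""
    return [component.split("+") for component in name.split("-")]


def label_components(name: str) -> Optional[Dict[str, List[str]]]:
    """Attach component labels, but only for fully specified (three-component) names."""
    parts = split_relation(name)
    if len(parts) != 3:
        return None  # underspecified name: no guess about which component is which
    return dict(zip(COMPONENT_NAMES, parts))


# Hypothetical relation names, used only for illustration.
print(split_relation("PREP-RMOD-TIME"))
print(label_components("VERB-INDCOMPL-LOC+FROM"))
print(label_components("RMOD"))
```

Returning `None` for names with fewer than three components is a deliberate choice in this sketch: it avoids guessing which component an underspecified relation carries.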
Parallel treebanks may serve as a suitable infrastructure for comparing parsers from different
linguistic frameworks, thus contributing to the investigation of why state-of-the-art results are
hard to reproduce on annotations other than the Penn Wall Street Journal and on languages other than
English.
A conversion tool, called the TUTtoPENN converter, has been applied to the native TUT in order to generate
a Penn-like annotation (called TUT-Penn).
As a side effect, two other formats have been developed that show intermediate degrees of variation/similarity with
respect to TUT and Penn, in terms of both richness of functional-syntactic information (i.e. amount and specificity of
grammatical relations) and type of linguistic framework (i.e. constituency versus dependency, or minimal versus
maximal projection). For more information and to download the conversion tool, see the
TUTtoPENN converter web page.
The following image shows the cascade of formats (a rich selection of examples in parallel
formats is available in Parallel annotations
in TUT formats):
Moreover, a treebank for Italian based on Combinatory Categorial Grammar, called CCG-TUT, has been derived from the Constituency-TUT (see the CCG-TUT web page for more details). In the conversion process that generates CCG-TUT, TUT dependency trees are mapped onto constituency trees (i.e. binarized ConsTUT trees enriched with labels that distinguish heads, arguments and modifiers), which are then mapped onto CCG derivations.
The TUT development exploits the TULE
dependency parser.
Formats other than TUT are obtained by automatic conversion. See in particular
the CCG-TUT web page for the conversion to CCG format.
The treebank currently consists of 2,860 Italian sentences, organized in five sections, plus a section of 200 English sentences.
Currently the treebank includes five sections:
Note that the sections NEWS and VEDCH include isolated
sentences, while the other three sections contain full paragraphs
(Articles of the Code, European Directives and Wikipedia articles respectively).
Finally, it is possible to download 200 English sentences annotated in
the TUT standard format, i.e. the ENGLISH section. This may help people not acquainted with
Italian to get a feel for how the annotation scheme is organized. For the peculiar features
(and extensions) of the TUT scheme for representing English structures, see section
16. Applying TUT on English in the
Linguistic notes.
The following table shows the size of the sections in terms of sentences and tokens in TUT native and CoNLL (where available) formats:
| section | sentences | tokens in TUT format | tokens in CoNLL format |
| CODICECIVILE | 1,100 | 30,669 | 28,048 |
| NEWS | 700 | 19,134 | 18,044 |
| VEDCH | 400 | 14,390 | 12,508 |
| EUDIR | 201 | 7,955 | 7,426 |
| WIKI | 459 | 15,766 | 14,746 |
| ENGLISH | 200 | 5,940 | |
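The figures in the table can be cross-checked against the overall count of 2,860 Italian sentences with a few lines of Python. This is only a sketch: the dictionary simply transcribes the table, and the variable names are illustrative.

```python
# (sentences, tokens in TUT format, tokens in CoNLL format); None where not available
SECTIONS = {
    "CODICECIVILE": (1100, 30669, 28048),
    "NEWS": (700, 19134, 18044),
    "VEDCH": (400, 14390, 12508),
    "EUDIR": (201, 7955, 7426),
    "WIKI": (459, 15766, 14746),
    "ENGLISH": (200, 5940, None),
}

# The five Italian sections are all sections except ENGLISH.
italian = [row for name, row in SECTIONS.items() if name != "ENGLISH"]

italian_sentences = sum(row[0] for row in italian)
italian_tut_tokens = sum(row[1] for row in italian)
italian_conll_tokens = sum(row[2] for row in italian)

print(italian_sentences)     # 2860
print(italian_tut_tokens)    # 87914
print(italian_conll_tokens)  # 80772
```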
Turin University Treebank (TUT)
by
Cristina Bosco, Leonardo Lesmo, Vincenzo Lombardo,
Alessandro Mazzei, Livio Robaldo
is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.
All the data currently included in the treebank are in UTF-8 encoding and are delivered in various formats (see project and publications for a description of the formats):
The available data can be downloaded from the following table (in .zip):
| section | raw text | native TUT | TUT in CoNLL | TUT-Penn | CCG-TUT |
| NEWS | NEWS.raw | NEWS.tut | NEWS.conl | NEWS.penn | CCG-TUT |
| VEDCH | VEDCH.raw | VEDCH.tut | VEDCH.conl | VEDCH.penn | |
| CODICECIVILE | CODICECIVILE.raw | CODICECIVILE.tut | CODICECIVILE.conl | CODICECIVILE.penn | |
| EUDIR | EUDIR.raw | EUDIR.tut | EUDIR.conl | EUDIR.penn | |
| WIKI | WIKI.raw | WIKI.tut | WIKI.conl | WIKI.penn | * |
| ENGLISH | ENGLISH.raw | ENGLISH.tut | ** | ** | ** |
All the data that can be downloaded from the table were last updated on November 22nd, 2010, with the exception of the CCG-TUT, which will soon be re-released, and the English section.
Older versions of the treebank, including the one used in the Italian contest Evalita 2009, are
also available:
The following table shows the links for downloading the treebank in the native and in constituency-based formats. See the CCG-TUT web page for downloading the treebank in CCG format and for more information about it.
| Raw text* | Release | TUT | TUT in CoNLL | Cons-TUT | TUT-PENN |
| civil law newspaper JRC-Passage-Evalita | Evalita** | civil law and newspaper | civil law and newspaper | | |
| | 1.1 | civillaw newspaper | | | |
| | 2.1** | civillaw newspaper JRC-Passage-Evalita | civillaw newspaper JRC-Passage-Evalita | civillaw newspaper | |
| wikipedia | 2.2 | wikipedia | wikipedia | | |
| English | E0.1 | English | | | |
*The texts originally collected include accented characters written using
different strategies (e.g. "à" and "a'") and different types of encoding (from ISO-Latin to UTF-8).
As regards the characters, some words can therefore occur in the treebank in various written forms
(e.g. "città" and "citta'"), not all of which correspond to the Italian dictionary form (such as the lemmatized
form that TUT includes for each word, which is e.g. "CITTÀ" for both
"città" and "citta'"). As regards the encoding, all TUT annotations released since 2007
are in standard UTF-8.
**The releases Evalita (development set for Evalita 2007) and 2.1 (development set for Evalita 2009) include
the data in both the original TUT and CoNLL formats.