
ParTUT is a project for the development of a multilingual parallel treebank for Italian, English and French. Its aim is twofold: to build an aligned parallel treebank for these three languages by extending and applying a single treebank schema to the other languages, and to study how that schema can be used to address issues typical of parallel corpora.
The annotation scheme and tools used to develop this resource are those of the Turin University Treebank (TUT), a collection of Italian sentences annotated at the morpho-syntactic, syntactic and (to a lesser extent) semantic levels in a dependency-oriented representation format.

Treebank development

Automatic analysis of data was carried out using the Turin University Linguistic Environment (TULE), a rule-based system which includes all the linguistic tools needed for producing an annotated text.
Formats other than TUT are obtained by automatic conversion. See in particular the TUTtoPENN web page for the conversion to Penn format.
The corpus is aligned at the sentence level with the Microsoft Bilingual Sentence Aligner (see Moore, 2002) and the LF Aligner, an automatic tool based on the Gale and Church algorithm that stores sentence pairs as translation units in TMX files and lets the output be reviewed in formatted XLS spreadsheets (see Section Aligned Sentences).
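As an illustration of how sentence pairs stored as TMX translation units can be read back, here is a minimal Python sketch. It assumes the usual TMX layout (`tu` elements containing one `tuv` per language, each with a `seg`); the function name `read_translation_units` and the sample sentences are hypothetical and are not taken from the ParTUT files, whose exact layout may differ.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_translation_units(tmx_text):
    """Return one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


# Illustrative sample only (assumption): not an excerpt from the treebank.
SAMPLE_TMX = """<?xml version="1.0"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>All human beings are born free.</seg></tuv>
      <tuv xml:lang="it"><seg>Tutti gli esseri umani nascono liberi.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for unit in read_translation_units(SAMPLE_TMX):
        print(unit)
```

Each translation unit becomes a dictionary keyed by language code, so an aligned pair can be looked up as `unit["en"]` and `unit["it"]`.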



The treebank currently consists of 3194 sentences, organized in six sections.


TUT corpora and data

Currently the treebank includes six sections: JRCAcquis, UDHR, CC, FB, Europarl and WIT3.

The following table shows the size of the sections in terms of sentences and tokens in the TUT native format:

corpus      sentences   tokens
JRCAcquis   540         19,268
UDHR        230         6,791
CC          291         9,241
FB          341         5,697
Europarl    1,505       43,479
WIT3        287         4,715
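
As a quick consistency check, the per-section sizes can be summed and compared with the overall sentence count reported for the treebank. The short Python sketch below copies the figures from the table; the variable names are illustrative only.

```python
# (corpus, sentences, tokens), copied from the size table for the TUT native format.
SECTIONS = [
    ("JRCAcquis", 540, 19268),
    ("UDHR", 230, 6791),
    ("CC", 291, 9241),
    ("FB", 341, 5697),
    ("Europarl", 1505, 43479),
    ("WIT3", 287, 4715),
]

total_sentences = sum(sentences for _, sentences, _ in SECTIONS)
total_tokens = sum(tokens for _, _, tokens in SECTIONS)

print(total_sentences)  # 3194, matching the total reported for the treebank
print(total_tokens)     # 89191
```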

Download the treebank

Creative Commons License
ParTUT by Manuela Sanguinetti, Cristina Bosco, Leonardo Lesmo is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.

All the data currently included in the treebank are encoded in UTF-8 and are delivered in several formats: raw text, native TUT, TUT-CoNLL, TUT-Penn and TUT-Tiger.

The available data can be downloaded from the following table:

corpus    raw texts    native TUT    TUT-CoNLL    TUT-Penn    TUT-Tiger

JRCAcquis It It It It It
En En En
Fr Fr Fr Fr Fr
UDHR It It It It It
En En En En En
Fr Fr Fr Fr Fr
CC It It It It It
En En En En
Fr Fr Fr Fr Fr
FB It It It It
En En En En
Fr Fr Fr Fr
Europarl It It It
En En En
Fr Fr Fr Fr
WIT3 It It It It It
En En En
Fr Fr Fr

Aligned Sentences (in raw and CoNLL format)

All data presented here were last updated in October 2013.



The annotation guidelines are the same as those used for TUT:

For language-specific annotation criteria for English and French, see the document below:





Parallel Treebank projects:

Workshops and Conferences:
