ParTUT is a project for the development of a multilingual parallel treebank for Italian, English and French. The aim of this work is twofold: building an aligned parallel treebank for Italian, English and French, by extending and applying a single treebank schema to other languages, and studying how the schema can be used to address issues typically related to parallel corpora.
The annotation and tools used for the development of this resource are those of the Turin University Treebank (TUT), a collection of Italian sentences annotated at the morpho-syntactic, syntactic and (to a lesser extent) semantic levels, in a dependency-oriented representation format.

Treebank development

Automatic analysis of data was carried out using the Turin University Linguistic Environment (TULE), a rule-based system which includes all the linguistic tools needed for producing an annotated text.
Formats other than TUT are produced by automatic conversion; see in particular the TUTtoPENN web page for the conversion to Penn format.
The corpus is aligned at the sentence level with the Microsoft Bilingual Sentence Aligner (see Moore, 2002) and the LF Aligner, an automatic tool based on the Gale and Church algorithm that stores sentence pairs as translation units in TMX files and lets the output be reviewed in formatted xls spreadsheets (see Section Aligned Sentences).
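
As an illustration of the TMX storage described above, the following is a minimal sketch of how sentence pairs could be read back from a TMX file using only the Python standard library. The sample document, its sentences and the function name read_translation_units are assumptions made for this example; they are not taken from the ParTUT distribution or from actual LF Aligner output.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute belongs to the reserved XML namespace, so
# ElementTree exposes it under this fully qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical TMX document with two translation units (illustrative text only).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="it" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="it"><seg>Il parlamento approva la proposta.</seg></tuv>
      <tuv xml:lang="en"><seg>The parliament approves the proposal.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="it"><seg>La seduta termina.</seg></tuv>
      <tuv xml:lang="en"><seg>The sitting ends.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one dict per <tu>, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


if __name__ == "__main__":
    for unit in read_translation_units(SAMPLE_TMX):
        print(unit["it"], "|||", unit["en"])
```

Keeping each translation unit as a dictionary keyed by language code makes it straightforward to add a third language (here French) without changing the reader.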

The treebank currently consists of 3194 sentences, organized in six sections.

TUT corpora and data

Currently the treebank includes six sections:

JRCAcquis: 540 sentences from the JRC-Acquis multilingual parallel corpus, the total body of EU law.
UDHR: 230 sentences from the Universal Declaration of Human Rights.
CC: 291 sentences from the Italian, English and French versions of the Creative Commons licence.
FB: 341 sentences from publicly available pages retrieved from the Facebook website.
Europarl: 1505 sentences from the Europarl multilingual parallel corpus of Proceedings of the European Parliament.
WIT3: 287 sentences from the Web Inventory of Transcribed and Translated Talks.

The following table shows the size of the sections in terms of sentences and tokens in the TUT native format:

corpus	sentences	tokens
JRCAcquis	540	19,268
UDHR	230	6,791
CC	291	9,241
FB	341	5,697
Europarl	1505	43,479
WIT3	287	4,715
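
The figures in the table can be cross-checked against the stated total of 3194 sentences. The short sketch below does this and also derives the average sentence length per section; the dictionary name SECTIONS is an assumption made for this example, while the numbers are copied from the table above.

```python
# (sentences, tokens) per section, copied from the table above.
SECTIONS = {
    "JRCAcquis": (540, 19268),
    "UDHR": (230, 6791),
    "CC": (291, 9241),
    "FB": (341, 5697),
    "Europarl": (1505, 43479),
    "WIT3": (287, 4715),
}

total_sentences = sum(sentences for sentences, _ in SECTIONS.values())
total_tokens = sum(tokens for _, tokens in SECTIONS.values())

# The per-section sentence counts should add up to the stated treebank size.
assert total_sentences == 3194

if __name__ == "__main__":
    print("sentences:", total_sentences, "tokens:", total_tokens)
    for name, (sentences, tokens) in SECTIONS.items():
        print(f"{name}: {tokens / sentences:.1f} tokens per sentence")
```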

Download the treebank

ParTUT by Manuela Sanguinetti, Cristina Bosco, Leonardo Lesmo
is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.

All the data currently included in the treebank are in UTF-8 encoding and are delivered in various formats:

The raw input texts
The parsed sentences in the standard native dependency TUT annotation
The parsed sentences with the TUT structure, but with the annotation expressed in the CoNLL format
The parsed sentences in the Penn Treebank phrase structure format. The conversion has been obtained by means of a script that is described at: http://www.di.unito.it/~tutreeb/TUTtoPENNconverter/
The parsed sentences in the Tiger-Xml format
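
For readers who want to process the CoNLL files, here is a minimal sketch of a reader. It assumes the ten-column, tab-separated CoNLL-X layout with a blank line between sentences; the page does not say which CoNLL variant is used, so check the column order against the downloaded files. The sample sentences, the tags and the relation labels are placeholders, not the TUT tagset.

```python
from collections import namedtuple

# Assumed CoNLL-X column order (an assumption; verify against the real files).
Token = namedtuple(
    "Token", "id form lemma cpostag postag feats head deprel phead pdeprel"
)

# Hypothetical two-sentence sample; tags and relations are placeholders.
SAMPLE = (
    "1\tEveryone\teveryone\tPRON\tPRON\t_\t2\tsubj\t_\t_\n"
    "2\thas\thave\tVERB\tVERB\t_\t0\troot\t_\t_\n"
    "3\trights\tright\tNOUN\tNOUN\t_\t2\tobj\t_\t_\n"
    "\n"
    "1\tTalks\ttalk\tNOUN\tNOUN\t_\t2\tsubj\t_\t_\n"
    "2\tend\tend\tVERB\tVERB\t_\t0\troot\t_\t_\n"
)


def read_conll(text):
    """Split CoNLL text into sentences, each a list of Token tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(Token._fields):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        current.append(Token(*fields))
    if current:
        sentences.append(current)
    return sentences


def dependency_edges(sentence):
    """Return (dependent form, relation, head form) triples; head 0 is the root."""
    forms = {tok.id: tok.form for tok in sentence}
    return [(tok.form, tok.deprel, forms.get(tok.head, "ROOT")) for tok in sentence]


if __name__ == "__main__":
    for sentence in read_conll(SAMPLE):
        print(dependency_edges(sentence))
```

All fields are kept as strings, so that values such as the underscore placeholder are preserved exactly as they appear in the file.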

The available data can be downloaded from the following table:

corpus	raw texts	native TUT	TUT-CoNLL	TUT-Penn	TUT-Tiger
JRCAcquis	Ita	It	Ita	Ita	Ita
	En	En	En
	Fr	Fr	Fr	Fr	Fr
UDHR	It	It	It	It	It
	En	En	En	En	En
	Fr	Fr	Fr	Fr	Fr
CC	It	It	It	It	It
	En	En	En		En
	Fr	Fr	Fr	Fr	Fr
FB	It	It	It	It
	En	En	En		En
	Fr	Fr	Fr		Fr
Europarl	It	It	It
	En	En	En
	Fr	Fr	Fr		Fr
WIT3	It	It	It	It	It
	En	En	En
	Fr	Fr	Fr

Aligned Sentences (in raw and CoNLL format)

All data presented here were last updated in October 2013.

The annotation guidelines are the same as those used for TUT:

Syntactic categories: the Part of Speech tagset of the TUT corpus
Labels of the edges: the list of the grammatical relations labelling the dependency edges of the second release of the TUT corpus

For language-specific annotation criteria for English and French, see the document below:

Linguistic notes

M. Sanguinetti, C. Bosco. Building the Multilingual TUT Parallel Treebank. In Proceedings of the 2nd Workshop on Annotation and Exploitation of Parallel Corpora (AEPC 2), 2011.
C. Bosco, M. Sanguinetti, L. Lesmo. The Parallel-TUT: a Multilingual and Multiformat Treebank. In Proceedings of LREC '12, 2012.
M. Sanguinetti, C. Bosco. Translational Divergences and their Alignment in a Parallel Multilingual Treebank. In Proceedings of the 11th Workshop on Treebanks and Linguistic Theories (TLT11), 2012.
M. Sanguinetti, C. Bosco, L. Lesmo. Dependency and Constituency in Translation Shift Analysis. In Proceedings of the 2nd Conference on Dependency Linguistics (DepLing), 2013.

Parallel Treebank projects:

Workshops and Conferences:

Last updated: October, 2013 by msanguin[at]di.unito.it
