ParTUT is a project for the development of a multilingual parallel treebank
for Italian, English and French. The aim of this work is twofold: to build an aligned parallel treebank
by extending and applying a single treebank schema to other languages, and to study how that schema can be used to address issues typically
related to parallel corpora.
The annotation and tools used for the development of this resource are those of the Turin University Treebank (TUT),
a collection of Italian sentences annotated at the morpho-syntactic, syntactic and (to a lesser extent) semantic levels,
in a dependency-oriented representation format.
Automatic analysis of data was carried out using the Turin University Linguistic Environment (TULE), a rule-based system which includes all
the linguistic tools needed for producing an annotated text.
Formats other than TUT are obtained by automatic conversion. See in particular
the TUTtoPENN web page for the conversion to Penn format.
The corpus is aligned at the sentence level with the Microsoft Bilingual Sentence Aligner (see Moore, 2002) and the LF Aligner, an automatic tool based on the Gale and
Church algorithm that stores sentence pairs as translation units in TMX files and allows the output to be reviewed in formatted xls spreadsheets (see Section Aligned Sentences).
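As an illustration of how sentence pairs stored as TMX translation units can be read back, here is a minimal Python sketch. It assumes the standard TMX layout (`tu` elements containing one `tuv` per language, each with a `seg`); the sample document, its sentences and the function name `read_translation_units` are hypothetical and are not taken from the ParTUT files.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the XML namespace, so ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit TMX document (illustrative sentences, not ParTUT data).
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="it" datatype="plaintext" segtype="sentence"
          creationtool="example" creationtoolversion="1"
          adminlang="en" o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="it"><seg>Il gatto dorme.</seg></tuv>
      <tuv xml:lang="en"><seg>The cat sleeps.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="it"><seg>Piove.</seg></tuv>
      <tuv xml:lang="en"><seg>It is raining.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one dict per translation unit, mapping language code to sentence."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup, if any.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        units.append(unit)
    return units


if __name__ == "__main__":
    for unit in read_translation_units(SAMPLE_TMX):
        print(unit["it"], "<->", unit["en"])
```

Each translation unit becomes a dictionary keyed by language code, so a trilingual unit would simply carry a third key.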
The treebank currently consists of 3194 sentences, organized in six sections.
The six sections currently in the treebank are listed in the following table, with their size in sentences and tokens in the TUT native format:
corpus | sentences | tokens |
JRCAcquis | 540 | 19,268 |
UDHR | 230 | 6,791 |
CC | 291 | 9,241 |
FB | 341 | 5,697 |
Europarl | 1505 | 43,479 |
WIT3 | 287 | 4,715 |
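The figures in the table can be cross-checked with a few lines of Python. The sketch below copies the counts from the table (the variable names are illustrative): the sentence counts sum to 3194, matching the total stated above, and the token counts sum to 89,191. The average sentence length per section is derived from the same figures.

```python
# Sentence and token counts per section, copied from the table above.
SECTIONS = {
    "JRCAcquis": (540, 19268),
    "UDHR": (230, 6791),
    "CC": (291, 9241),
    "FB": (341, 5697),
    "Europarl": (1505, 43479),
    "WIT3": (287, 4715),
}

total_sentences = sum(sentences for sentences, _ in SECTIONS.values())
total_tokens = sum(tokens for _, tokens in SECTIONS.values())

if __name__ == "__main__":
    for name, (sentences, tokens) in SECTIONS.items():
        # Average number of tokens per sentence in this section.
        print(f"{name:10} {sentences:5d} {tokens:7d} {tokens / sentences:6.1f}")
    print(f"{'total':10} {total_sentences:5d} {total_tokens:7d}")
```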
ParTUT by Manuela Sanguinetti, Cristina Bosco, Leonardo Lesmo is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.
All the data currently included in the treebank are in UTF-8 encoding and are delivered in various formats:
The available data can be downloaded from the following table:
corpus | raw texts | native TUT | TUT-CoNLL | TUT-Penn | TUT-Tiger |
JRCAcquis | It | It | It | It | It |
JRCAcquis | En | En | En | | |
JRCAcquis | Fr | Fr | Fr | Fr | Fr |
UDHR | It | It | It | It | It |
UDHR | En | En | En | En | En |
UDHR | Fr | Fr | Fr | Fr | Fr |
CC | It | It | It | It | It |
CC | En | En | En | En | |
CC | Fr | Fr | Fr | Fr | Fr |
FB | It | It | It | It | |
FB | En | En | En | En | |
FB | Fr | Fr | Fr | Fr | |
Europarl | It | It | It | | |
Europarl | En | En | En | | |
Europarl | Fr | Fr | Fr | Fr | |
WIT3 | It | It | It | It | It |
WIT3 | En | En | En | | |
WIT3 | Fr | Fr | Fr | | |
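For readers who want to load the CoNLL files programmatically, the following Python sketch shows one possible reader. It rests on an assumption that should be checked against the downloaded files: that the TUT-CoNLL format follows the usual CoNLL-X layout, with one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. The sample text, its tags and relation labels, and the function name `read_conll` are hypothetical.

```python
# Hypothetical sample in CoNLL-X layout; tags and relations are illustrative,
# not the actual TUT label set.
SAMPLE_CONLL = (
    "1\tThe\tthe\tDET\tDET\t_\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\tNOUN\t_\t3\tsubj\t_\t_\n"
    "3\tsleeps\tsleep\tVERB\tVERB\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tIt\tit\tPRON\tPRON\t_\t2\tsubj\t_\t_\n"
    "2\trains\train\tVERB\tVERB\t_\t0\troot\t_\t_\n"
)


def read_conll(text):
    """Split CoNLL-X style text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # IDs and heads are kept as strings, in case the files use
        # non-integer identifiers (an assumption, not verified here).
        current.append({
            "id": cols[0],
            "form": cols[1],
            "lemma": cols[2],
            "head": cols[6],
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in read_conll(SAMPLE_CONLL):
        print(" ".join(token["form"] for token in sentence))
```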
All data presented here were last updated in October 2013.
The annotation guidelines are the same as those used for TUT:
For language-specific annotation criteria for English and French, see the document below:
Parallel Treebank projects:
Workshops and Conferences: