CCG-TUT is a project for the development of a treebank for Italian based on Combinatory Categorial Grammar (CCG).
Other treebanks based on categorial grammar exist for English, in particular the CCGbank derived from the Penn Treebank. The English CCGbank has proven to be a useful resource for training robust parsers, and we follow its design as closely as possible. Like the English CCGbank, CCG-TUT is derived from an existing treebank, namely TUT. Unlike the English CCGbank, we lexicalise punctuation symbols and refrain from introducing special combinatory rules for punctuation. Across all three corpora, 1,152 distinct lexical categories are generated, of which 627 occur more than once. An example CCG derivation from the treebank is sentence A-6 from TUT, shown below in CCG-TUT format (In questa vicenda tira un'aria tutta balcanica).
CCG-TUT is generated from TUT by a conversion process. The input is a set of sentences in the TUT dependency format. These are (1) mapped onto constituency trees (the ConsTUT format), which are (2) restructured into binary trees and (3) then mapped into CCG derivations. ConsTUT is a TUT-oriented constituency-based annotation with TUT relations annotated on constituents. In ConsTUT trees, each terminal category X corresponds to a node (i.e. a word) of a TUT tree and projects into non-terminal nodes that represent the intermediate (Xbar) and maximal (XP) projections of X, according to Xbar theory (for more details and to download TUT in ConsTUT format, see TUT).
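Step (2) of the conversion can be illustrated with a minimal sketch of tree binarisation. This assumes a plain right-branching scheme, which is not necessarily the strategy used to build CCG-TUT; the Node class, the binarize function and the primed intermediate labels are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A constituency tree node; a node without children is a leaf (word)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def binarize(node: Node) -> Node:
    """Return a copy of the tree in which no node has more than two children.

    Assumption: simple right-branching binarisation, so A -> B C D becomes
    A -> B A' with A' -> C D. Intermediate nodes get the parent label plus "'".
    """
    kids = [binarize(child) for child in node.children]
    if len(kids) <= 2:
        return Node(node.label, kids)
    # Group the last two children, then fold the remaining ones in from the right.
    right = Node(node.label + "'", kids[-2:])
    for kid in reversed(kids[1:-2]):
        right = Node(node.label + "'", [kid, right])
    return Node(node.label, [kids[0], right])


def max_arity(node: Node) -> int:
    """Largest number of children found on any node of the tree."""
    return max([len(node.children)] + [max_arity(c) for c in node.children])


def leaves(node: Node) -> List[str]:
    """Leaf labels in left-to-right order."""
    if not node.children:
        return [node.label]
    return [leaf for child in node.children for leaf in leaves(child)]


# Toy tree with a four-way branching root (illustrative, not treebank data).
flat = Node("S", [Node("a"), Node("b"), Node("c"), Node("d")])
binary = binarize(flat)
```

Binarisation matters here because CCG combinatory rules combine at most two adjacent constituents, so every internal node of the tree must have at most two children before it can be mapped to a derivation step. The word order of the leaves is left unchanged.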
The Italian CCGbank comes in three different formats: (1) derivations (Prolog terms), (2) derivations (pretty printed) and (3) tuples of words, POS and CCG category.
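As a rough illustration of how the tuple format (3) could be used, the sketch below counts distinct lexical categories and those occurring more than once, the kind of statistic reported above for the whole treebank. It assumes the tuples have already been read into memory as (word, POS, category) triples; the on-disk layout of the release is not assumed here, and the toy tokens and their categories are invented for illustration and are not taken from CCG-TUT.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed in-memory representation: (word, POS tag, CCG category).
Token = Tuple[str, str, str]


def category_stats(tokens: Iterable[Token]) -> Tuple[int, int]:
    """Return (number of distinct categories, number occurring more than once)."""
    counts = Counter(category for _word, _pos, category in tokens)
    distinct = len(counts)
    repeated = sum(1 for n in counts.values() if n > 1)
    return distinct, repeated


# Illustrative tokens only. The final entry gives a punctuation mark an
# ordinary lexical category (a hypothetical choice) to mirror the idea of
# lexicalised punctuation.
toy_tokens = [
    ("the", "DET", "NP/N"),
    ("dog", "NOUN", "N"),
    ("chased", "VERB", "(S\\NP)/NP"),
    ("the", "DET", "NP/N"),
    ("cat", "NOUN", "N"),
    (".", "PUNCT", "S\\S"),
]

distinct, repeated = category_stats(toy_tokens)
```

On the toy data this yields four distinct categories, two of which occur more than once.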
Version 1.0 consists of 1,837 Italian sentences organized in three corpora.
CCG-TUT by Johan Bos, Cristina Bosco, Alessandro Mazzei
is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Italy License.
Download the CCG-TUT treebank (version 1.0, 1.3 MB)