Senti-TUT is a project for the development of a novel Italian corpus for sentiment analysis, consisting of a collection of texts from Twitter with sentiment annotations that account for irony. The resource is annotated for both sentiment and morpho-syntax, which opens up several possibilities for further exploitation in sentiment analysis. At the sentiment level we focus on irony, and we therefore selected texts on politics from a social medium, namely Twitter, where irony is frequently used. Our aim is to add a new sentiment dimension, which explicitly accounts for irony, to a sentiment analysis classification framework based on polarity annotation.
The data set is organized in two subcorpora, TWNEWS and TWSPINO. The former currently comprises around three thousand tweets published in the weeks after the new Italian prime minister Mario Monti announced his Cabinet (from October 16, 2011 to February 3, 2012). The latter comprises more than one thousand tweets extracted from the Twitter section of Spinoza, published from July 2009 to February 2012.
Spinoza is a very popular collective Italian blog that includes a high percentage of posts with sharp satire on politics and has been published on Twitter since 2009. This subcorpus was therefore added to enlarge our data set with texts involving various forms of irony. All the data were collected using a collaborative annotation tool, which is part of the Blogmeter social media monitoring platform.
The TWNEWS corpus will soon be available for research purposes.
Tweets from Spinoza can be accessed here.
The development of Senti-TUT involves annotating the linguistic data at two distinct levels. The first includes morphological and syntactic tags, as is usual in treebanks, for example; the second refers instead to concepts typical of sentiment analysis.
For syntax and morphology, the annotation guidelines are the same as those used for TUT.
The data are currently annotated at tweet level: one sentiment tag is applied to each tweet, even when the tweet is composed of more than one sentence. The table below describes the sentiment tags used for the annotation of Senti-TUT.
|MIXED||POS and NEG both|
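As an illustration of tweet-level annotation, the following minimal Python sketch models one sentiment tag per tweet. It is not the project's actual data format: the class and field names (`SentimentTag`, `AnnotatedTweet`, `tweet_id`, `tag_distribution`) are hypothetical, the example tweets are invented, and only the tag labels named in this text (POS, NEG, MIXED) are listed, so any other tag in the scheme would have to be added.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class SentimentTag(Enum):
    """Tweet-level tags; only the labels named in the text are listed here."""
    POS = "POS"      # positive polarity
    NEG = "NEG"      # negative polarity
    MIXED = "MIXED"  # both POS and NEG


@dataclass(frozen=True)
class AnnotatedTweet:
    """One tweet with exactly one sentiment tag (hypothetical record layout)."""
    tweet_id: str      # hypothetical identifier field
    subcorpus: str     # "TWNEWS" or "TWSPINO"
    text: str
    tag: SentimentTag


def tag_distribution(tweets):
    """Count how many tweets carry each sentiment tag."""
    return Counter(t.tag for t in tweets)


# Invented placeholder tweets, for illustration only.
sample = [
    AnnotatedTweet("1", "TWNEWS", "example text a", SentimentTag.POS),
    AnnotatedTweet("2", "TWNEWS", "example text b", SentimentTag.MIXED),
    AnnotatedTweet("3", "TWSPINO", "example text c", SentimentTag.NEG),
    AnnotatedTweet("4", "TWSPINO", "example text d", SentimentTag.NEG),
]
counts = tag_distribution(sample)
```

Keeping the tag as a closed enumeration rather than a free string makes an unknown label fail at load time, which suits a scheme where each tweet receives exactly one tag.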
Although the current focus of Senti-TUT is mainly annotation at tweet level, the resource we are developing should be seen in the wider framework of a project for sentiment analysis and opinion mining. In this context, the availability of morpho-syntactic annotation on the same data should also be considered, since it will allow other, more fine-grained annotations and analyses related to sentiment analysis to be applied in the future.
The sentiment tags were annotated manually at tweet level using a collaborative annotation tool, which is part of the Blogmeter social media monitoring platform. Among the utilities made available by Blogmeter, we applied in particular those for filtering out non-relevant data.
Workshops and Conferences: