*******************************************
* Sentipolc 2014 @ EVALITA Development Data
*
* http://di.unito.it/sentipolc14
*
* Task co-organizers
* Valerio Basile (v.basile[at]rug.nl), University of Groningen, The Netherlands
* Andrea Bolioli (abolioli[at]celi.it), CELI, Torino, Italy
* Malvina Nissim (malvina.nissim[at]unibo.it), FICLIT, University of Bologna, Italy
* Viviana Patti (patti[at]di.unito.it), Dipartimento di Informatica, University of Torino, Italy
* Paolo Rosso (prosso[at]dsic.upv.es), Universitat Politècnica de València, Spain
*
* Task Guidelines: http://www.di.unito.it/~tutreeb/sentipolc-evalita14/guidelines.pdf
*
*******************************************

A single development set is provided. The distribution consists of a set of
4,513 Twitter status IDs, with annotations for all three Sentipolc subtasks:
subjectivity classification, polarity classification and irony detection.

In compliance with Twitter's terms, we do not release the original text of the
tweets. Instead, we provide a web interface, based on RESTful Web API
technology, to download the text:

http://www.di.unito.it/~tutreeb/sentipolc-evalita14/tweet.html

The data format is as follows:

"idtwitter","subj","pos","neg","iro","top","text"

where the field "text" is to be filled in using the step-by-step procedure
available on the website.

The interface works as follows:

1. Click on the button labelled "Step 1. Get Corpus Items".
   The 4,513 items of the Sentipolc development dataset will be loaded. A
   pop-up window will notify you when loading is complete. At this stage the
   actual text of the tweets has not yet been retrieved.

2. Click on the button labelled "Step 2. Get Tweets".
   The content of the tweets will be downloaded on the fly by querying Twitter
   through its API. This process can take some time. All processing is done
   client-side, by JavaScript code running in the browser. A red warning box
   labelled "Working... please wait!" will be visible on the web page until all
   the messages have been downloaded. As each message is retrieved, it is
   displayed in the corresponding cell of the "text" column. If a tweet is no
   longer available, its cell is filled with the string "Tweet Not Available"
   instead of the text of the tweet.

   *IMPORTANT: Note that, as specified in the guidelines (Sec. 5, Final
   remarks), Twitter users can delete their own posts at any time, and their
   accounts can be temporarily suspended or deactivated. For this reason not
   all the released tweets will be retrievable at download time: between
   collection/annotation time and release time, many posts may become
   unavailable. Participants in other shared tasks using Twitter data (e.g.
   SemEval 2013, Task 2) reported that up to 10% of tweets may not be
   available, with a possible loss of about 10% of the training data. This is
   less than optimal, but distributing the tweet IDs (which others can use to
   fetch the contents through Twitter's API) turns out to be the only way to
   redistribute Twitter messages to others under Twitter's current policies.*

   An alert message ("Get Tweets terminated") will notify you when the
   download is complete.

3. Click on the button labelled "Step 3. Export the corpus".
   Once all tweets have been downloaded, the dataset can be saved in different
   formats, e.g. comma-separated values, Excel, PDF. Select your preferred
   format by clicking on one of the buttons available on the right.
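
Once the corpus has been exported as comma-separated values, the rows whose
text could not be retrieved can be separated from the usable ones. The
following is a minimal Python sketch, not part of the official distribution.
It assumes that the export keeps the column order and quoting shown in the
data format above and that it may start with a header row. The function name
and the sample rows are purely illustrative.

```python
import csv
import io

# Column order taken from the data format described in this README.
FIELDS = ["idtwitter", "subj", "pos", "neg", "iro", "top", "text"]
# Placeholder written by the web interface for tweets that are gone.
NOT_AVAILABLE = "Tweet Not Available"


def load_sentipolc(lines):
    """Parse exported rows; return (available_rows, n_unavailable).

    `lines` is any iterable of CSV lines, e.g. an open file object.
    """
    available, n_unavailable = [], 0
    for row in csv.reader(lines):
        # Skip blank lines and a header row, if the export contains one
        # (assumption: the header repeats the field names).
        if not row or row[0] == "idtwitter":
            continue
        record = dict(zip(FIELDS, row))
        if record.get("text", "").strip() == NOT_AVAILABLE:
            n_unavailable += 1
        else:
            available.append(record)
    return available, n_unavailable


# Made-up sample rows, for illustration only (not real corpus items).
sample = io.StringIO(
    '"idtwitter","subj","pos","neg","iro","top","text"\n'
    '"1","1","1","0","0","0","Che bella giornata!"\n'
    '"2","0","0","0","0","1","Tweet Not Available"\n'
)
rows, missing = load_sentipolc(sample)
print(len(rows), missing)  # prints: 1 1
```

For a real export, pass an open file instead of the sample, for example
open("sentipolc_dev.csv", newline="", encoding="utf-8"), where the file name
and the UTF-8 encoding are assumptions. Comparing the count of unavailable
rows with the 4,513 released IDs shows how much of the development data was
lost at download time.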