Developing High-Quality Linguistic Corpora
for Analysing Highly Subjective Phenomena

Tutorial co-located with the

5th International Conference on Computational Social Science

Short Description

This tutorial covers methods for the creation of high-quality datasets, in particular from natural language data. It focuses on a survey of agreement metrics, annotation techniques such as crowdsourcing, and ways to leverage the controversiality and polarization of opinions. An exercise will be included to provide hands-on experience with annotation techniques.


We propose a tutorial on methods for the creation of high-quality datasets, focused on the annotation of linguistic data, which are becoming increasingly central to empirical social science.
More specifically, we plan to start with a discussion of the traditional expert annotation approach, with its strengths and weaknesses. We will then move to the measurement and meaning of inter-rater agreement metrics and the current state of the art in this field.
Since crowdsourcing platforms have come to play a significant role in annotation experiments in recent years, we will provide an overview of the currently available services, presenting the use case of Figure Eight.
We will then present the benefits and pitfalls of gamification in the context of linguistic annotation, along with a few recent experiments and best practices for producing gold standards of acceptable quality.
Finally, we will dedicate a section of this tutorial to the topic of annotating highly subjective linguistic phenomena (such as hate speech), presenting new approaches to model and leverage controversiality and polarization of opinions.
During the tutorial, we aim to provide hands-on experience for the attendees, with a short crowdsourced annotation task followed by an analysis of the results and an open discussion about how to interpret the resulting agreement measures.
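
As a small illustration of what an inter-rater agreement metric measures, the sketch below computes Cohen's kappa for two annotators. The choice of Cohen's kappa is our assumption (the description above does not name specific metrics), and the function name, label names and toy annotations are hypothetical, made up for this example only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used the same single label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (not real data).
annotator_1 = ["hate", "none", "none", "hate", "none", "none", "hate", "none"]
annotator_2 = ["hate", "none", "hate", "hate", "none", "none", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

In this toy case the annotators agree on 6 of 8 items (0.75), yet kappa is lower because part of that agreement is expected by chance. That gap between raw and chance-corrected agreement is the kind of interpretation issue the open discussion is meant to address.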

Tutorial outline

Part I: The annotation process

Organizers' Bios

Valerio Basile

Valerio Basile is a postdoctoral research fellow at the Department of Computer Science of the University of Torino. He received his PhD in 2015 from the University of Groningen with a thesis on Natural Language Generation.
He contributed to the Groningen Meaning Bank, a corpus of English text annotated with formal meaning representations using heterogeneous methods of annotation, and to Wordrobe, a collection of online games for collecting linguistic annotations.
He held a two-year position at Inria Sophia Antipolis, where he created the resource DeKO (Default Knowledge about Objects), a repository of commonsense knowledge about objects, using several approaches to automatic and semi-automatic knowledge extraction, in the context of the European project ALOOF (Autonomous Learning of the Meaning of Objects).
In his current position, he is working on the detection of hate speech in social media and on modelling highly subjective and controversial phenomena. He has contributed to the organization of several shared tasks for the Italian community (EVALITA) and at the international level (SemEval).

Komal Florio

Komal Florio is a PhD candidate in her second year at the Department of Computer Science of the University of Torino (Italy), working under the supervision of Prof. Viviana Patti in the Content-Centered Computing Group, and a visiting scientist at Universität Hamburg, Dept. of Computer Science, Ethics in Information Technology Group.
Her current research lies at the intersection of traditional NLP for sentiment analysis, social media mining and statistical analysis for demography. More specifically, she aims to use NLP techniques, combined with statistical analysis, to detect and describe hate speech phenomena on social media in Italy, with a focus on immigration and immigrant integration issues. In the first year of her PhD she worked on the manual annotation of linguistic data, and on the design, execution and evaluation of crowdsourced annotation tasks for hate speech detection on social media.
Her previous research experience includes a collaboration as a research assistant at Yahoo! Labs Barcelona, where she worked on financial network mining, and a period as a visiting student at ISI Foundation Torino, working on ecological agent-based models in the research group led by Prof. Sorin Solomon (Hebrew University of Jerusalem).
She graduated in Theoretical Physics in 2008 from the University of Torino with a thesis on financial applications of statistics.