Developing High-Quality Linguistic Corpora
for Analysing Highly Subjective Phenomena
Tutorial co-located with the
5th International Conference on Computational Social Science
Short Description
Methods for the creation of high-quality datasets, in particular from natural language data. The tutorial focuses on a survey of agreement metrics, annotation techniques such as crowdsourcing, and leveraging of controversiality and polarization of opinions. An exercise will be included, to provide hands-on experience with annotation techniques.
Topic
We propose a tutorial on methods for creating high-quality data sets, which are becoming increasingly central to empirical social science, with a focus on the annotation of linguistic data.
More specifically, we plan to start with a discussion of the traditional expert annotation approach, with its strengths and weaknesses. We will then move to the measurement and meaning of inter-rater agreement metrics and the current state of the art in this field.
Since crowdsourcing platforms have played a significant role in annotation experiments in recent years, we will provide an overview of the currently available services, presenting the use case of Figure Eight.
We will then present the benefits and pitfalls of gamification in the context of linguistic annotation, along with a few recent experiments and best practices for producing gold standards of acceptable quality.
Finally, we will dedicate a section of this tutorial to the topic of annotating highly subjective linguistic phenomena (such as hate speech), presenting new approaches to model and leverage controversiality and polarization of opinions.
During the tutorial, we aim to provide hands-on experience for the attendees, with a short crowdsourced annotation task followed by an analysis of the results and an open discussion on how to interpret the resulting agreement measures.
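To make the agreement measures mentioned above more concrete, here is a minimal sketch (not taken from the tutorial material) of the raw percent agreement and Cohen's kappa for two annotators. The label names and the toy annotations are hypothetical, chosen only to show how a skewed label distribution can yield high raw agreement but a low kappa.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels at random following their own label frequencies.
    p_e = sum(freq_a[label] * freq_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: 10 messages, a rare "hate" label, two annotators.
ann_a = ["none"] * 8 + ["hate", "none"]
ann_b = ["none"] * 8 + ["none", "hate"]

print("percent agreement:", percent_agreement(ann_a, ann_b))
print("Cohen's kappa:", cohen_kappa(ann_a, ann_b))
```

On this toy data the annotators agree on 8 items out of 10, yet kappa is slightly negative, because both annotators almost always choose the majority label and so chance agreement is already very high.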
Tutorial Handout Here!
Tutorial Slides Here!
Tutorial outline
Part I: The annotation process
- Introduction
- Annotation: a definition
- Gold standard: what, why, when… and how
- Know your data
- The issue with the keywords
- Size matters
- GDPR and the problem of privacy
- Task design
- The expert vs. crowd dilemma
- Guidelines -- for whom?
- Expert annotation scheme
- Crowdsourcing platforms (aka the more the merrier)
- Amazon Mechanical Turk overview
- Figure Eight platform exploration
- Lighttag (general hints)
- Weichaishi (the Chinese case)
- All that glitters is not gold: the agreement measures
- The trivial percent agreement
- Scott’s pi metric
- Cohen’s κ
- Fleiss’ κ
- Krippendorff’s α
- Figure Eight Confidence Score
- Recent developments
- The 7 Myths of Crowdsourcing
- CrowdTruth metrics
- Best-Worst Scaling
- Issues with agreement measures
- The problem with Kappa
- Paradoxes and abnormalities
- Annotator reliability: MACE
- Gamification of annotation
- Phrase Detectives
- Wordrobe
- Annotation of subjective phenomena
- Annotators’ background: a study on Brexit
- Polarization of annotation
- Figure Eight live demo
- Tips & tricks from real case text

Part II: Advanced topics and Subjective phenomena
Part III: Annotation process: a live demo
Organizers Bio
Valerio Basile
Valerio Basile is a postdoc research fellow at the Department of Computer Science of the University of Torino. He received his PhD in 2015 at the University of Groningen with a thesis on Natural Language Generation.
He contributed to the Groningen Meaning Bank, a corpus of English text annotated with formal meaning representations using heterogeneous methods of annotation, and Wordrobe, a collection of online games to collect linguistic annotations.
He held a two-year position at Inria Sophia Antipolis, where he created the resource DeKO (Default Knowledge about Objects), a repository of commonsense knowledge about objects, using several approaches to automatic and semi-automatic knowledge extraction, in the context of the European project ALOOF (Autonomous Learning of the Meaning of Objects).
In his current position, he is working on the detection of hate speech in social media and on modelling highly subjective and controversial phenomena. He has contributed to the organization of several shared tasks, both for the Italian community (EVALITA) and at the international level (SemEval).
Komal Florio
Komal Florio is a PhD candidate in her second year at the Department of Computer Science of the University of Torino (Italy), working under the supervision of Prof. Viviana Patti in the Content-Centered Computing Group, and a visiting scientist at Universität Hamburg, Dept. of Computer Science, Ethics in Information Technology Group.
Her current research lies at the intersection of traditional NLP for sentiment analysis, social media mining and statistical analysis for demography.
More specifically, she aims to use NLP techniques combined with statistical analysis to detect and describe hate speech phenomena on social media in Italy, with a focus on immigration and immigrant integration issues.
In the first year of her PhD she worked on the manual annotation of linguistic data, and on the design, running, and evaluation of crowdsourced annotation tasks for hate speech detection on social media.
Her previous research experience includes a collaboration as a research assistant at Yahoo! Labs Barcelona, where she worked on financial network mining, and a visiting student position at ISI Foundation Torino, working on ecological agent-based models in the research group led by Prof. Sorin Solomon (Hebrew University of Jerusalem).
She graduated in Theoretical Physics in 2008 from the University of Torino with a thesis on financial applications of statistics.