Developing High-Quality Linguistic Corpora
for Analysing Highly Subjective Phenomena

Tutorial co-located at

5th International Conference on Computational Social Science

Short Description

Methods for the creation of high-quality datasets, in particular from natural language data. The tutorial focuses on a survey of agreement metrics, annotation techniques such as crowdsourcing, and leveraging of controversiality and polarization of opinions. An exercise will be included, to provide hands-on experience with annotation techniques.

Topic

We propose a tutorial on methods for creating high-quality datasets, focused on the annotation of linguistic data, which are becoming increasingly central to empirical social science.
More specifically, we plan to start with a discussion of the traditional expert annotation approach, with its strengths and weaknesses. We will then move to the measurement and meaning of inter-rater agreement metrics and the current state of the art in this field.
Because crowdsourcing platforms have played a significant role in annotation experiments in recent years, we will provide an overview of the currently available services, presenting the use case of Figure Eight.
We will then present the benefits and pitfalls of gamification in the context of linguistic annotation, along with a few recent experiments and best practices for producing gold standards of acceptable quality.
Finally, we will dedicate a section of this tutorial to the topic of annotating highly subjective linguistic phenomena (such as hate speech), presenting new approaches to model and leverage controversiality and polarization of opinions.
During the tutorial, we aim to provide hands-on experience for the attendees, with a short crowdsourced annotation task followed by an analysis of the results and an open discussion about how to interpret the resulting agreement measures.

Tutorial Handouts Here!

Tutorial Slides Here!

Tutorial outline

Part I: The annotation process

Introduction
- Annotation: a definition
- Gold standard: what, why, when… and how
Know your data
- The issue with the keywords
- Size matters
- GDPR and the problem of privacy
Task design
- The expert vs. crowd dilemma
- Guidelines -- for whom?
- Expert annotation scheme
Crowdsourcing platforms (aka the more the merrier)
- Amazon Mechanical Turk overview
- Figure Eight platform exploration
- LightTag (general hints)
- Weichaishi (the Chinese case)
All that glitters is not gold: the agreement measures
- The trivial percent agreement
- Scott’s pi metric
- Cohen’s K
- Fleiss’ K
- Krippendorff’s α
- Figure Eight Confidence Score
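
To make the first and third measures in this list concrete, here is a minimal sketch of percent agreement and Cohen's K for two annotators. The function names, the label set and the eight example labels are hypothetical and are not taken from the tutorial material; the formulas are the standard ones (observed agreement, corrected for the agreement expected by chance from each annotator's own label distribution).

```python
import math
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_e is the chance
    agreement estimated from each annotator's own label frequencies."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label: kappa is undefined.
        return math.nan
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two annotators on eight messages
ann_1 = ["hate", "ok", "ok", "hate", "ok", "ok", "hate", "ok"]
ann_2 = ["hate", "ok", "hate", "hate", "ok", "ok", "ok", "ok"]

print(percent_agreement(ann_1, ann_2))       # 0.75
print(round(cohen_kappa(ann_1, ann_2), 3))   # 0.467
```

The two annotators agree on six of eight items, yet K is noticeably lower than the raw 0.75 because a good share of that agreement would be expected by chance alone.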
Recent developments
- The 7 Myths of Crowdsourcing
- CrowdTruth metrics
- Best-Worst Scaling

Part II: Advanced topics and Subjective phenomena

Issues with agreement measures
- The problem with Kappa
- Paradoxes and abnormalities
- Annotator reliability: MACE
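
One well-known instance of the Kappa paradoxes listed above is that the same percent agreement can yield very different K values depending on how skewed the labels are. The sketch below illustrates this with two made-up 2x2 contingency tables; the function name and all counts are hypothetical and chosen only for illustration.

```python
def kappa_from_table(both_yes, yes_no, no_yes, both_no):
    """Return (percent agreement, Cohen's kappa) for two annotators,
    given the four cells of their 2x2 contingency table.
    Assumes the chance agreement is below 1 (no division-by-zero guard)."""
    n = both_yes + yes_no + no_yes + both_no
    p_o = (both_yes + both_no) / n
    # Marginal "yes" rates of annotator A (rows) and annotator B (columns)
    a_yes = (both_yes + yes_no) / n
    b_yes = (both_yes + no_yes) / n
    p_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return p_o, (p_o - p_e) / (1 - p_e)

# Balanced labels: each annotator says "yes" on half of the 100 items
balanced = kappa_from_table(40, 10, 10, 40)
# Skewed labels: each annotator says "yes" on only 15 of the 100 items
skewed = kappa_from_table(5, 10, 10, 75)

print(balanced)  # percent agreement 0.8, kappa about 0.60
print(skewed)    # percent agreement 0.8, kappa about 0.22
```

Both tables have 80 agreements out of 100 items, but with rare positive labels most of that agreement is attributed to chance, so K drops sharply. This matters for phenomena such as hate speech, where the positive class is typically infrequent.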
Gamification of annotation
- Phrase Detectives
- Wordrobe
Annotation of subjective phenomena
- Annotators background: study on Brexit
- Polarization of annotation

Part III: Annotation process: a live demo

Figure Eight live demo
Tips & tricks from real case text

Organizers' Bios

Valerio Basile

Valerio Basile is a postdoctoral research fellow at the Department of Computer Science of the University of Torino. He received his PhD in 2015 from the University of Groningen with a thesis on Natural Language Generation.
He contributed to the Groningen Meaning Bank, a corpus of English text annotated with formal meaning representations using heterogeneous methods of annotation, and to Wordrobe, a collection of online games for collecting linguistic annotations.
He held a two-year position at Inria Sophia Antipolis, where he created the resource DeKO (Default Knowledge about Objects), a repository of commonsense knowledge about objects, using several approaches to automatic and semi-automatic knowledge extraction, in the context of the European project ALOOF (Autonomous Learning of the Meaning of Objects).
In his current position, he is working on the detection of hate speech in social media and on modelling highly subjective and controversial phenomena. He has contributed to the organization of several shared tasks, both for the Italian community (EVALITA) and at the international level (SemEval).

Komal Florio

Komal Florio is a PhD candidate in her second year at the Department of Computer Science of the University of Torino (Italy), working under the supervision of Prof. Viviana Patti in the Content-Centered Computing Group, and a visiting scientist at Universität Hamburg, Dept. of Computer Science, Ethics in Information Technology Group.
Her current research lies at the intersection of traditional NLP for sentiment analysis, social media mining and statistical analysis for demography. More specifically, she aims to use NLP techniques combined with statistical analysis to detect and describe hate speech phenomena on social media in Italy, with a focus on immigration and immigrant integration issues. In the first year of her PhD she worked on the manual annotation of linguistic data, and on the design, running and evaluation of crowdsourced annotation tasks for hate speech detection on social media.
Her previous research experience includes a collaboration as a research assistant at Yahoo! Labs Barcelona, where she worked on financial network mining, and a visiting student position at ISI Foundation Torino, working on ecological agent-based models in the research group led by Prof. Sorin Solomon (Hebrew University of Jerusalem).
She graduated in Theoretical Physics in 2008 from the University of Torino with a thesis on financial applications of Statistics.
