The workshop will be organised around themes. Each theme will feature plenary talks and themed parallel breakout sessions, and the topics discussed will also reflect participants' interests.
9:30 - 10:00 | Arrival (coffee) |
10:00 - 10:30 | Introduction talk, Maria Liakata (University of Warwick/Turing) and Dong Nguyen (University of Edinburgh/Turing) |
Operationalising social and cultural concepts (chair: Diana Maynard) | |
10:30 - 11:15 | Ground truth? The uses and abuses of human annotation in text analysis, Ken Benoit (London School of Economics and Political Science) |
11:15 - 12:00 | Operationalizing creativity: the data science of innovation from elites to vox pop, Simon DeDeo (Carnegie Mellon University) |
12:00 - 12:45 | Themed breakout sessions |
12:45 - 13:30 | Lunch |
Ethics (chair: Rob Procter) | |
13:30 - 14:15 | There is no AI ethics: The human origins of machine prejudice, Joanna Bryson (University of Bath/Princeton) |
14:15 - 14:35 | Short talk: Algorithmic accountability: Mapping the ethical landscape, Brent Mittelstadt (University of Oxford/Turing) |
14:35 - 15:20 | Themed breakout sessions |
15:20 - 15:45 | Coffee break |
Data bias (chair: Taha Yasseri) | |
15:45 - 16:30 | Reading between the lines - the hidden bias of NLP, Dirk Hovy (University of Copenhagen) |
16:30 - 17:15 | API bias: how digital data collection methods impact scientific inference, Rebekah Tromble (Leiden University) |
17:15 - 18:00 | Themed breakout sessions |
18:00 - 20:00 | Reception |
Small versus big data (chair: Scott Hale) | |
9:30 - 10:15 | Small data and big data: web archives and research in the humanities, Jane Winters (University of London) |
10:15 - 11:00 | Finding more needles by building bigger haystacks: Size and specificity in big data sociolinguistics, Jacob Eisenstein (Georgia Tech) |
11:00 - 11:20 | Short talk: Small stories & big data: challenges and opportunities, Alexandra Georgakopoulou-Nunes (King's College London) |
11:20 - 12:05 | Themed breakout sessions |
12:05 - 13:00 | Lunch |
Computational methods for theory building and explanation (chair: Sabrina Sauer) | |
13:00 - 13:45 | Comparing grounded theory and topic modeling: extreme divergence or unlikely convergence? David Mimno (Cornell) |
13:45 - 14:30 | Modeling space: quantitative parsimony in computational literary analysis, Dennis Tenen (Columbia) |
14:30 - 15:15 | Themed breakout sessions |
15:15 - 15:45 | Coffee break |
Interdisciplinary research | |
15:45 - 16:45 | Panel discussion, host: Maria Liakata (University of Warwick/Turing) |
16:45 - 17:15 | Summary + closing |
The rise of big data, and with it new types of data sources and research directions, has raised numerous complex ethical questions spanning the entire research process. These range from data collection (e.g., private versus public data, linking data across sources), to data annotation and the operationalization of variables (e.g., assigning social variables such as ethnicity and gender), to the development of NLP representations and methods that may encode human bias (e.g., word embeddings and machine translation) and thereby spread harmful and erroneous information (e.g., through the reinforcement of stereotypes).
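As a concrete illustration of the kind of encoded bias mentioned above, the sketch below probes gender associations in pretrained word embeddings by comparing cosine similarities between occupation words and gendered pronouns. This is a minimal sketch, not the method used by any of the speakers: it assumes the gensim library is installed, and the vector file path and word lists are placeholders.

```python
# Minimal sketch: probing gender associations in pretrained word embeddings.
# Assumes gensim is installed and a word2vec-format vector file is available;
# the path and word lists below are illustrative placeholders.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

occupations = ["nurse", "engineer", "librarian", "programmer"]
for word in occupations:
    # A positive gap means the occupation sits closer to "she" than to "he".
    gap = vectors.similarity(word, "she") - vectors.similarity(word, "he")
    print(f"{word:>12}: she-vs-he similarity gap = {gap:+.3f}")
```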
Qualitative approaches seek to preserve the complexity of the phenomena of interest, whereas quantitative approaches need to discretize (and simplify) them to enable measurement. The operationalization process therefore has a major influence, both on developing meaningful and reusable annotation schemes and on interpreting results. The increasing availability of large and complex datasets (e.g., from social media) makes the use of computational methods to process and analyse the data a necessity, forcing researchers to rethink concepts from the social sciences and the humanities that were developed for smaller datasets (Manovich 2016, Tangherlini 2016). Conversely, insights from the social sciences and the humanities can serve as a source of inspiration and reflection for the operationalization of concepts and the definition of research problems in natural language processing.
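Because the quality of an operationalization is often assessed through annotation, one common sanity check is inter-annotator agreement. The sketch below computes Cohen's kappa for two hypothetical annotators labelling the same items; the label sequences are invented for illustration and scikit-learn is assumed to be available.

```python
# Minimal sketch: checking how reliably a concept has been operationalized
# by measuring agreement between two annotators on the same items.
# The label sequences are invented for illustration; scikit-learn is assumed.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu"]
annotator_b = ["pos", "neg", "pos", "pos", "neu", "neg", "neg", "neu"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1 indicate high agreement
```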
In the social sciences and the humanities, researchers seek to understand a social or cultural phenomenon; statistical models are mostly used to support that understanding and for theory building and testing. In natural language processing, by contrast, the dominant focus is on building models that perform well on prediction tasks. These underlying objectives drive the selection and validation of models. As machine learning models become increasingly complex and are often treated as black boxes, we may be moving even further away from using NLP/ML for theory building and explanation.
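The tension between prediction and explanation can be made concrete by looking at how a fitted model is used. The sketch below fits a bag-of-words logistic regression with scikit-learn and, besides reporting accuracy, inspects the learned coefficients as a (limited) window onto which features drive the predictions. The toy corpus and labels are invented purely for illustration.

```python
# Minimal sketch: the same fitted model used for prediction (accuracy) and as
# a first step towards explanation (inspecting which words carry weight).
# The toy corpus is invented for illustration; scikit-learn is assumed.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great talk, very clear", "confusing and dull session",
         "inspiring and clear keynote", "dull slides, unclear message"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

# Prediction-oriented use: how well does the model classify?
print("training accuracy:", model.score(X, labels))

# Explanation-oriented use: which words push predictions towards each class?
weights = model.coef_[0]
order = np.argsort(weights)
vocab = vectorizer.get_feature_names_out()
print("most negative words:", [vocab[i] for i in order[:3]])
print("most positive words:", [vocab[i] for i in order[-3:]])
```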
While big social and cultural datasets offer many opportunities to study social and cultural phenomena, they are often less controlled than more traditional datasets. They frequently contain unknown or poorly understood biases, and as a result models and findings may be confounded, leading to low interpretability and limited justification for generalization to other domains.
First, the datasets may contain content created by people who differ considerably from the general population. The uncontrolled nature of the data makes it challenging to obtain information about the social backgrounds of content creators and therefore to account for these biases. Second, the focus on different data sources is not well balanced: social media studies often concentrate on individual platforms, in particular Twitter, which Tufekci (2014) describes as the 'model organism' of social media studies. This is problematic, however, because the design of social media platforms affects user behavior at every level, leading, for example, to differences in language use and network structure. Third, even when focusing on a single data source, biases may be introduced by the sampling method (e.g., restricting collection to geotagged tweets (Pavalanathan and Eisenstein 2015, Sloan and Morgan 2015)).
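The effect of such a biased sampling method can be illustrated with a small simulation: if the probability of being sampled correlates with the quantity being measured, an estimate computed on the subsample drifts away from the population value. The sketch below is a toy simulation in this spirit; all proportions are invented and purely illustrative, not estimates from any of the cited studies.

```python
# Minimal sketch: a synthetic simulation of sampling bias.
# Suppose younger users both geotag more often and use a linguistic feature
# more often; estimating feature use from geotagged posts alone then
# overstates its overall frequency. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
young = rng.random(n) < 0.4                          # 40% of users are "young"
uses_feature = rng.random(n) < np.where(young, 0.30, 0.10)
geotagged = rng.random(n) < np.where(young, 0.20, 0.05)

print("population rate:        ", uses_feature.mean())
print("geotagged-only estimate:", uses_feature[geotagged].mean())
```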
While much research in this area has focused on big data, small data remains equally relevant. Big data is often messy and uncontrolled, whereas data collected using more traditional methods are typically more controlled but, as a consequence, smaller in scale. Many big data studies focus on large-scale trends, but big data also offers the opportunity to study minorities who have historically been understudied, thus making big data small again (Foucault Welles 2014). Moreover, in some cases the data collected are small by design, primarily because of annotation or experiment costs (e.g., fine-grained annotations of social media, annotations relating to health). However, applying natural language processing to small data introduces various technical challenges. Many languages are under-resourced, and a key question is how models built on large datasets of mainstream language can be made to port well to under-resourced dialects or languages. Furthermore, deep analyses often require in-depth, time-consuming annotations, and many technical challenges arise in training machine learning models on a small number of annotations.
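One way to make the porting problem concrete is to train on plentiful "mainstream" text and evaluate on a small sample from the target variety, which typically exposes a performance drop and motivates transfer and adaptation methods. The sketch below does this with a toy scikit-learn pipeline; the texts, labels, and the mainstream/dialect split are all invented for illustration and stand in for real corpora.

```python
# Minimal sketch: a classifier trained on "mainstream" text and evaluated on a
# small sample written in a different variety, to expose the drop that
# motivates transfer and adaptation methods. All texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

mainstream = ["that film was very good", "what a terrible film",
              "really enjoyed the match", "the match was awful to watch"]
mainstream_y = [1, 0, 1, 0]

# Small target sample in a non-standard spelling/variety.
dialect = ["dat film wis pure braw", "yon match wis mince"]
dialect_y = [1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(mainstream, mainstream_y)

print("in-domain accuracy:         ", model.score(mainstream, mainstream_y))
print("small-sample target accuracy:", model.score(dialect, dialect_y))
```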
Research in this area is inherently interdisciplinary. As argued by Nissani (1997), interdisciplinary research has the potential to lead to creative breakthroughs and to prevent cross-disciplinary oversights and disciplinary cracks (e.g., neglecting important research problems that do not fall within disciplinary boundaries). But doing interdisciplinary research comes with its own challenges, such as finding suitable publication venues, acquiring funding, and navigating research evaluation.