Program

The workshop will be organised around themes. Each theme will have plenary talks and themed parallel breakout sessions. The discussed topics will also be based on participants' interests.

Day 1: Thursday 21st of September

9.30 - 10.00	Arrival (coffee)
10.00 - 10.30	Introduction talk, Maria Liakata (University of Warwick/Turing) and Dong Nguyen (Edinburgh University/Turing)
	Operationalising social and cultural concepts (chair: Diana Maynard)
10.30 - 11.15	Ground truth? The uses and abuses of human annotation in text analysis, Ken Benoit (London School of Economics and Political Science)
11.15 - 12.00	Operationalizing creativity: the data science of innovation from elites to vox pop, Simon DeDeo (Carnegie Mellon University)
12.00 - 12.45	Themed breakout sessions
12.45 - 13.30	Lunch
	Ethics (chair: Rob Procter)
13.30 - 14.15	There is no AI ethics: The human origins of machine prejudice, Joanna Bryson (University of Bath/Princeton)
14:15-14:35	Short talk: Algorithmic accountability: Mapping the ethical landscape, Brent Mittelstadt (University of Oxford/Turing)
14.35 - 15:20	Themed breakout sessions
15:20 - 15:45	Coffee break
	Data bias (chair: Taha Yasseri)
15:45 - 16:30	Reading between the lines - the hidden bias of NLP, Dirk Hovy (University of Copenhagen)
16:30 - 17.15	API bias: how digital data collection methods impact scientific inference, Rebekah Tromble (Leiden University)
17.15 - 18:00	Themed breakout sessions
18:00 - 20.00	Reception

Day 2: Friday 22nd of September

	Small versus big data (chair: Scott Hale)
9.30 - 10.15	Small data and big data: web archives and research in the humanities, Jane Winters (University of London)
10.15 - 11.00	Finding more needles by building bigger haystacks: Size and specificity in big data sociolinguistics, Jacob Eisenstein (Georgia Tech)
11:00 - 11:20	Short talk: Small stories & big data: challenges and opportunities, Alexandra Georgakopoulou-Nunes (King's College London)
11.20 - 12:05	Themed breakout sessions
12.05 - 13.00	Lunch
	Computational methods for theory building and explanation (chair: Sabrina Sauer)
13:00 - 13:45	Comparing grounded theory and topic modeling: extreme divergence or unlikely convergence? David Mimno (Cornell)
13.45 - 14:30	Modeling space: quantitative parsimony in computational literary analysis, Dennis Tenen (Columbia)
14:30 - 15:15	Themed breakout sessions
15:15 - 15:45	Coffee break
	Interdisciplinary research
15:45 - 16:45	Panel discussion, Host: Maria Liakata (Warwick University/Turing)
16.45 - 17.15	Summary + closing

Themes

Ethics

The rise of big data, and consequently the introduction of new types of data sources and research directions, has raised numerous complex ethical questions spanning the entire research process, ranging from data collection (e.g., private vs public data, linking data across various sources), to data annotation and operationalization of variables (e.g., assigning social variables such as ethnicity and gender), to the development of NLP representations and methods that might encode human bias (e.g., word embeddings and machine translation), leading to the potential dissemination of harmful and erroneous information (e.g. through the reinforcement of stereotypes).

Tentative topics for break out sessions

Stereotyping, e.g., as caused by the treatment of social variables like gender in NLP (Bamman et al. 2014, Nguyen et al. 2014, Larson et al., 2017, Koolen and van Cranenburgh, 2017)
Perpetuation of human bias in models (Caliskan et al., 2017, Bolukbasi et al., 2016)
Effectiveness of NLP tools across social/demographic groups (Hovy and Søgaard, 2015, Tatman 2017)
Privacy, e.g., public vs. hidden data, data linking, inferring 'hidden' information (Williams et al. 2017)
Incorporating ethics more concretely in the research process.

Background material

Operationalising social and cultural concepts

Qualitative approaches seek to preserve the complexity of the phenomena of interest, but quantitative approaches need to discretize (and simplify) the phenomena to enable measurements. The operationalization process has a huge influence, both in developing meaningful and re-useable annotation schemes and in interpreting results. The increasing availability of large and complex datasets, (e.g. on social media), makes the use of computational methods to process and analyse the data a necessity, forcing researchers to rethink concepts from the social sciences and the humanities that were used for smaller datasets (Manovich 2016, Tangherlini 2016). Conversely, insights from the social sciences and the humanities could serve as a source of inspiration and reflection for the operationalization of concepts and definition of research problems in natural language processing.

Tentative topics for break out sessions

How we create annotation schemas in NLP vs social sciences
- Interoperability for linguistic annotations (LAW workshops, Ide and Suderman 2014)
- The science of corpus annotation, Hovy and Lavid 2010
Operationalisation of social variables (e.g. age, gender, social class, ethnicity)
Validation
- In the political sciences (Grimmer and Stewart 2013, Lowe and Benoit 2013)
- In natural language processing (evaluation of NLP systems (Resnik and Lin 2010), strategies in evaluating NLP metrics (Reiter and Belz 2009, Mani 2001))
- Issues of construct validity and reliability in massive, passive data collections (Lazer 2015)
Data driven vs theory driven (Kitchin 2014)
How can large-scale computational approaches support non-reductionist, non-essentialist approaches?

Background material

Computational methods for theory building and explanation

In the social sciences and the humanities, researchers seek to understand a social or cultural phenomenon. Statistical models are mostly used to support their understanding and for theory building and testing. In natural language processing, however, the dominant focus is on building models that are effective on prediction tasks. Underlying objectives drive the selection and validation of models. As machine learning models are becoming increasingly complex and are often seen as black boxes, we are possibly moving even further away from using NLP/ML for theory building and explanation.

Tentative topics for break out sessions

The role of linguistic and social science theories in data annotation and building NLP models.
The tension between prediction and explanation (including its impact on evaluating NLP research) (Shmueli 2010, Hofman et al. 2017, Breiman 2001, Shmueli and Koppius 2011)
Approaches to improve transparency/interpretability of NLP models (Li et al. 2017, Arras et al. 2016, Aubakirova and Bansal 2016)
Abductive vs. inductive vs. deductive approaches in the humanities and the social sciences (Kitchin 2014), and the role of NLP
Causality (Fong and Grimmer 2016, Grimmer 2015)
Comparison of methods, e.g., comparing grounded theory and topic modeling (Baumer et al. 2017)

Data bias

While big social and cultural datasets offer many opportunities to study social and cultural phenomena, the datasets are often less controlled compared to more traditional datasets. They often contain unknown or not well understood biases and as a result, models and findings could be confounded, leading to low interpretability and limited justification for generalization to other domains.

Firstly, the datasets may contain content created by people who are very different from the general population. The uncontrolled nature of the data makes it challenging to obtain information about the social backgrounds of the content creators and therefore to account for these biases. Second, the focus on different data sources is not well balanced. For example, social media studies often focus on individual data sources, and in particular Twitter. The dominance of Twitter is described by Tufekci (2014) as Twitter being the ‘model organism’ for social media studies. This is problematic however, since the design of social media platforms affects user behavior at all levels, e.g., leading to differences in language use and network structures. Third, even when focusing on a single data source, biases might be introduced due to the sampling method (e.g. geotagged tweets (Pavalanathan and Eisenstein, 2015, Sloan and Morgan 2015)).

Tentative topics for break out sessions

Sampling methods (Herring 2004, Pavalanathan and Eisenstein 2015, Cohen and Ruths 2013, Morstatter et al. 2014)
Source selection (e.g., overrepresentation of Twitter in research) and comparison (Baldwin et al. 2013, Hu et al. 2013).
Domain adaptation, such as adapting tools to dialects on social media (Jørgensen et al. 2015), adapting POS taggers for Twitter (Wulff and Søgaard 2015), and tools for under-resourced languages (Agić et al. 2015, Agić et al. 2016)

Background material

Small vs big data

While much research in this area has focused on big data, small data stays equally relevant. Big data is often messy and uncontrolled, while data collected using more traditional methods are often more controlled, but as a consequence also smaller in scale. Many big data studies focus on big trends, but big data also provides the opportunity to study minorities who have historically been understudied, thus making big data small again (Foucault Welles 2014). Moreover in some cases any data collected is small by design, primarily because of annotation or experiment costs (e.g. fine-grained annotations of social media, annotations relating to health). However, using natural language processing for small data introduces various technical challenges. Many languages are under-resourced and there are key questions stemming from how one can make use of models built on a large dataset of mainstream language that can port well to under-resourced dialects or languages. Furthermore, deep analyses often require in-depth, time-consuming annotations and many technical challenges arise in training machine learning models based on a small number of annotations.

Tentative topics for break out sessions

Qualitative vs. quantitative research. The impact of computational social science and digital humanities on the traditional divide between qualitative and quantitative research methods (Manovich 2011), computational methods to support qualitative research
Natural language processing for small data (Bamman 2017)
The importance of small data in the era of big data (Foucault Welles 2014, Kitchin 2017, Kitchin and Lauriault 2015)
Linking small data to create big data

Background material

Interdisciplinary research

Research in this area is inherently interdisciplinary. As argued by Nissani (1997), interdisciplinary research has the potential to lead to creative breakthroughs and to prevent cross-disciplinary oversights and disciplinary cracks (e.g., neglecting important research problems that do not fall within disciplinary boundaries). But doing interdisciplinary research is accompanied by various challenges, such as finding suitable publication venues, acquiring funding, and the impact on research evaluation.

Tentative discussion topics

Publishing interdisciplinary research (e.g., publication venues)
Impact of interdisciplinary research on career paths and evaluation of research
Building a community and connecting disciplines (e.g., non-archival submission options in computer science conferences)
Funding

Background material

Crossing Paths Interdisciplinarity report (British Academy, 2016)