The Domesticated Aboutness of Big Data Types

Ana Gross (University of Warwick)

What is Big Data about? In the following article I will briefly respond to this question by exploring some of the techniques that are being deployed for attributing particular meanings to data and for configuring what new data types are ultimately about. I suggest that, as with any other techno-scientific object, Big Data needs to be domesticated, to be introduced and reconfigured as part of society in particular ways, and that the domestication of Big Data is being carried out by a range of different actors using a range of different techniques and methods within society at large. I am interested in thinking about and describing some of the social and public techniques through which data has been rendered personal and about persons.

My claim is that data that today counts as ‘personal’ or ‘biographical’ was not understood as such before the development of a range of digital technologies and devices which now facilitate the recording of behaviours, movements, traits and opinions as part of everyday routines. The gathering of what is currently taken for granted as ‘personal data’ is becoming more extensive as interaction with new digital technologies and devices becomes ubiquitously embedded in daily life. However, our understanding of what types of data are personal is also indebted to the different techniques that have been deployed in society at large to domesticate certain data types and qualify these as such.

Think for example of search keywords. It could be argued that these are a relatively new data type that has emerged thanks to the introduction of digital interfaces of search as part of software, web and other digital devices. Whilst one could think of search keywords as the expression of a particular cognitive search interest or need, one indeed capable of being translatable into a text, search keywords have also recently acquired the status of ‘personal data’. In what follows I would like to briefly reflect on two techniques through which society has configured the characteristics and meaning of search keywords, not as mere information needs but as expressions of personal interiority and intimacy instead. These techniques are the reidentification demonstration undertaken by New York Times journalists and a data art piece entitled I Love Alaska, based on real search keywords typed in by a real web-based search user.

Let me first introduce what reidentification demonstrations are: these consist of the deployment of different techniques – including the design of algorithms and the amalgamation of different databases – in order to publicly trace, identify and name persons out of databases containing purportedly anonymised data. Arguably, the phenomenon originated with, and is indebted to, the activities undertaken by Professor Latanya Sweeney, who is now director of the Data Privacy Lab at Harvard University. In the mid 1990s, when Professor Sweeney was an MIT graduate student, the Massachusetts Group Insurance Commission (GIC), responsible for buying health insurance for state employees, collected and released patient-specific data containing nearly one hundred attributes per hospital encounter for approximately 135,000 state employees and their families. Under the assumption that the data had been made anonymous, GIC released it for research purposes and also sold copies to industry-related organisations (Sweeney 2002). Reassurance came from Massachusetts’s governor Bill Weld, who guaranteed employees that, due to the anonymisation procedures applied to the data, the risk of personal public reidentification through the data was non-existent.

Unfortunately, Weld later collapsed at a local public event and was admitted to hospital, an occasion that Professor Sweeney took to demonstrate how Weld’s hospital entry could be publicised and reframed further. For twenty dollars she purchased ‘the voter registration list for Cambridge Massachusetts and received the information on two diskettes’ (Sweeney 2002). The data on the diskettes contained the name, address, ZIP code, birth date and gender of every registered voter in Cambridge. Sweeney combined this data with the GIC records and made Governor Weld identifiable. Apparently, only six people in the city of Cambridge shared Weld’s birth date, only three of them were men and only one (Weld himself) lived in his ZIP code. To set the scene further and with a rather ‘theatrical flourish’ (Ohm 2009, 24) Sweeney posted the records – which included diagnoses and prescriptions – to Weld personally.
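The linkage at the heart of this demonstration can be sketched in a few lines of code. What follows is only a minimal illustration, not Sweeney's own procedure or data: the records, the field names (`zip`, `birth_date`, `sex`) and the function `link` are all invented for the example. It shows how a named list and a nameless list might be joined on the attributes they share, and how a person becomes identifiable when exactly one name fits a given combination.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
# A named list (standing in for a voter roll) ...
voters = [
    {"name": "Voter A", "zip": "00001", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Voter B", "zip": "00001", "birth_date": "1950-01-01", "sex": "F"},
    {"name": "Voter C", "zip": "00002", "birth_date": "1960-02-02", "sex": "M"},
    {"name": "Voter D", "zip": "00002", "birth_date": "1960-02-02", "sex": "M"},
]

# ... and a nameless list (standing in for 'anonymised' hospital records).
medical = [
    {"zip": "00001", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "condition X"},
    {"zip": "00002", "birth_date": "1960-02-02", "sex": "M", "diagnosis": "condition Y"},
    {"zip": "00003", "birth_date": "1970-03-03", "sex": "F", "diagnosis": "condition Z"},
]

# Attributes present in both lists (an assumption of this sketch).
SHARED_ATTRIBUTES = ("zip", "birth_date", "sex")


def shared_key(record):
    """Return the combination of shared attribute values for one record."""
    return tuple(record[field] for field in SHARED_ATTRIBUTES)


def link(voters, medical):
    """Pair each nameless record with a name when exactly one voter matches."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[shared_key(voter)].append(voter["name"])

    matches = []
    for record in medical:
        names = names_by_key.get(shared_key(record), [])
        if len(names) == 1:  # a unique combination singles out one person
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(voters, medical)
print(reidentified)
```

In this toy example only the first hospital record is tied to a name: the second fits two voters and the third fits none. The point of the sketch is that no field in the nameless list is identifying on its own; it is the combination, set against a second database, that is made to point at a person.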

Since then a range of different reidentification demonstrations have taken place, undertaken by academics, citizens and a variety of other actors, including most prominently the case of the Netflix Prize data release and the so-called AOL data leak, but also Sweeney’s most recent engagement with the Personal Genome Project database, to name a few.

Let me here briefly describe AOL’s data leak further. On the 3rd of August 2006, America Online (AOL) Research made public 20 million search queries typed in by nearly 500,000 AOL users over three months. The data – which was initially circulated to be used for research purposes only – leaked and was reconfigured in alternative ways by a number of diverse actors – including journalists, artists, bloggers and others. What was defined, and became known, by its public and participants as the AOL Data Leak consisted of the proliferation of experimental and alternative data arrangements which came to contest the status of search keywords configured by search engines as simple, effortless and a priori expressions of naturally occurring subjective informational needs.

Despite the fact that the AOL data had been anonymised, Michael Barbaro and Tom Zeller, two journalists at the New York Times, used AOL’s files to trace different clues to identity in search keywords, this time by combining the database with phonebook records instead. They searched for and invoked a person when looking at user number 4417749’s queries. The journalists claimed that:

Number 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men” to “dog that urinates on everything”.

And search-by-search, click-by-click, the identity of AOL’s user number 4417749 became easier to discern. There are queries for “landscapers in Lilburn, Ga”, several people with the last name Arnold and “homes sold in shadow lake subdivision Gwinnett county Georgia.”

It did not take much investigating to follow that data trail to Thelma Arnold, a 62-year old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. “Those are my searches”, she said, after a reporter read part of the list to her.

It has been suggested that as soon as AOL’s data was made public both ‘AOL and the rest of the world soon learned that search engines queries are windows to the soul’ (Ohm 2009, 16). I suggest however that search queries became articulated as personal and were not inherently personal to start with. Search keywords are not an unmediated expression of subjectivity; rather, there are framings and techniques that make certain pieces of data personal and about persons. Indeed, reidentification demonstrations can be conceived as one such technique. It was deployed in the case of AOL for publicly augmenting what is deemed personal, and for triggering search keywords – no more than a string and concatenation of relatively meaningful but sometimes inaccurately typed-in words in an interface – to become an expression of a particular way of being a person.

In such a retrofitted and orchestrated encounter between Thelma Arnold’s person and the search queries she allegedly produced, search keywords emerged configured as an expression of interiority and of personal dimensionality. As philosopher Ian Hacking (2001) has argued, possible ways of being a person can come into being and disappear; there are degrees of possibility for being a person in particular spaces and times. Reidentification demonstrations like the one described here opened up the possibility of becoming a person with particular characteristics (interiority and dimensionality) through search.

One could therefore argue that there is no essential distinction in principle between personal and impersonal data, but rather societal techniques for making and rendering certain pieces of data as about persons. As part of AOL’s reidentification demonstration, search keywords were reclaimed and re-described as parts and expressions of personhood. When search queries were reinserted and reconfigured as part of Thelma Arnold and her context, a part like a search keyword, which can belong to several different wholes at once – a database, a person, etc. – was demonstrably orchestrated to occur.

Let us now engage with a second example of what could be defined as a technique for rendering data personal. Also using AOL’s leaked database, artists Lernert Engelberts and Sander Plug unfolded the persona of AOL’s user number 711391 in a series of thirteen short documentary pieces entitled I Love Alaska – The Heartbreaking Search History of AOL’s User 711391. The documentaries followed a specific format: thirteen episodes of no longer than seven minutes each, put together with the intention of developing and shaping a self-contained story. The I Love Alaska documentaries deployed a barren, isolated landscape whilst a female narrator read, in chronological order, a range of selected search queries made by an anonymous AOL user uniquely identified by the number 711391 in the database.

In I Love Alaska, database and cinematic sequencing were merged into a new form. The method deployed in the documentaries to portray persons through the use of data was to traverse and navigate AOL’s database and reorganise its records in order to accomplish a seemingly biographical narrative. Such a narrative, established through the sequential arrangement of search keywords, not only opened the possibility of a peculiar nameless biographical account, but also, I suggest, contributed to reframing the ‘aboutness’ of search keywords in a similar way to what was accomplished in the reidentification demonstration of Thelma Arnold. It was claimed, for example, that, although it would be unfair to presume that we have a coherent picture of user 711391 through I Love Alaska’s biographical rendering, ‘the bottomless solitude of being trapped in one’s own skin’ is conjured up in each of the search keywords made public. The documentaries reframe a search history as an ‘incidental archive of the self’ where keywords now provide access into the ‘fabled perpetual process of interior life’ – that of searching.

Techniques like reidentification demonstrations or data art pieces, such as the ones I have described, are a means of social experimentation with new techno-scientific objects, in this case, search keywords belonging to Big Data sets. In other words, techniques like these organise and render the meaning and experience of Big Data units. As Bell (2015) has recently suggested, ‘Big Data now bubbles up to the surface in all sorts of places. It is already finding its space in a cartography of social imagination’ (10). More precisely, the imagination and aboutness of Big Data are becoming configured and negotiated in the techniques being deployed for its social domestication.

References:
Bell G. (2015), The Secret Life of Big Data, in Boellstorff T. and Maurer B., Data – Now Bigger and Better!, Prickly Paradigm Press: Chicago.
Hacking I. (2001), Making Up People: On Some Looping Effects of the Human Kind – Institutional Reflexivity or Social Control?, European Journal of Social Theory, 3 (1).
Ohm P. (2009), Broken Promises of Privacy: Responding to the Surprising Failure of Anonymisation, UCLA Law Review, Vol. 57.
Sweeney L. (2002), k-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5).

Ana Gross is a PhD Student at the Centre for Interdisciplinary Methodologies, University of Warwick and Visiting PhD Student at the Centre for the Study of Invention and Social Process, Goldsmiths College, University of London. Her research explores different data forms and their qualities: personal data, provisional data, perturbed data, anonymised data.

Image: flickr creative commons – Ars Electronica