The Coming Epistemic Transition: From Small to Big Data

Hamish Robertson, Joanne Travaglia and Nick Nicholas

Current attitudes towards ‘big data’ are quite mixed, including positive, negative and various kinds of ‘so what?’ reactions. The big data enthusiasts dominate the media and literature but other perspectives are emerging. The rapid development of AI and some quite specific kinds of algorithmic applications are seen as a threat to specific groups and individuals, and not without reason. The rise of datafication means that increasingly sophisticated digital technologies not only produce, but also monitor and analyse, big data outputs, often in ‘real-time’. These, in turn, add an element of panopticism to the comparatively slow and cumbersome small data technologies with which we are all familiar. The results imply a coming transition in which this situation effectively crosses over from the conventional ‘small data’ paradigm we have lived with for most of the 20th century into one in which big data is the norm. As a growing number of commentators have pointed out, this ‘transition’ has a variety of epistemological implications.

In this context, big data is not simply a case of more is better (or worse, depending on your position) but an indicator that the rigidities of the small data paradigm are gradually losing ground in relation to new data technologies and their associated epistemologies. By this we mean that the rise of big data perspectives, methods and strategies is contributing to new ways of understanding and thinking about ‘data’ and the kind of knowledge the small-to-big data transition is leading us towards. In effect, as the technologies of big data expand, we can expect to see an epistemological impact. The growing intersectionalities of technology, data characteristics and methodological tools are currently in flux, meaning that new knowledge is being produced via novel approaches. We are no longer working at the height of the small data period but confronting an epistemic transition.

Many conventional data environments relied on a mix of traditional manual record-keeping (e.g. patient records in healthcare), electromechanical data systems (e.g. weather data collection by observers with analogue instruments) or some combination of the two. These technologies still supported individual interpretation and expertise as well as a growing shift to calculative data practices. This was perfectly reasonable in the natural sciences but it has had a variety of problematic consequences in the social sciences and their application to ‘real world’ problems, as conceived of and acted on by social elites and established power structures. By the end of World War Two the idea that quantification and small data justified the use of various levels of force against civil populations was well-established even in the Cold War democracies. Better documented examples include British Bomber Command’s use of Operations Research to support its campaigns against German cities (see Freeman Dyson’s autobiography for a far more personal view), the US application of systems theory and computers in the Vietnam War and, more generally, the calculus of nuclear missile strategies throughout the Cold War. Our contention is that an orthodox small data hegemony was finally consolidated around the same time as digitisation began its rise and the potential of a transition to a big data environment began to emerge.

So while there are clear risks associated with many of the developments in big data, the positioning of big data as somehow less virtuous than small data has to be directly confronted historically, sociologically and socio-politically. Small data has had its own hegemonic influences and tendencies – orthodox economics or psychology include plenty of examples of the power that mathematics can add to bad ideas. Criminologists as well as feminist and critical race theorists can, for their part, point to a variety of successfully repressive social policies and practices implemented under and supported by small data practices. The positioning of big data as, and in, an emerging risk environment must, of necessity, acknowledge the problematical nature of the hegemonic small data paradigm that emerged over the previous two centuries of quantitative and calculative technologies, and their deep political connections.

In acknowledging the epistemic implications of a coming knowledge transition, it is equally important to evaluate not only what we are transitioning to, important as this is, but also what we are transitioning from. Here we take issue with the framing of the coming transition as a simplistic moral transition from virtue to vice, positive to negative, privacy to panopticon. And, while not accepting the industry hype associated with big data, we look more closely at what it is we are transitioning from and inquire into small data’s moral and intellectual culpabilities over nearly two centuries of its gradual development and increasingly hegemonic tendencies.

Frické, for example, has suggested that “big data is both fallible and tainted by theory”. There are two immediate issues with this statement. The first is the idea that theory naturally ‘taints’ data, or indeed that this is of necessity the way to conceive of the relationship between data and theory. To be fair, Frické does not argue this himself and references Popperian falsificationism in his discussion. But this is a common, even familiar, supposition amongst those who believe that small data, once collected, cleaned and entered into an orderly information system of some sort, are both objective and immutable. Secondly, if ‘big data’ is fallible (of course), then so too is small data, and small data has therefore been potentially fallible for much longer than big data. By implication, any knowledge base built and maintained utilising small data methods must also be seen as being at least potentially fallible. Frequentists and Bayesians alike generally accept that probability theory does not support truth claims of the unconditional, affirmative and belief-oriented type. And yet, we can observe a sustained history of combining positivism, probability theory and small data to produce some highly dubious outcomes for vulnerable groups in our communities.

This situation would be fine in the sciences so long as the type of truths we are seeking to identify are conditionally scientific and not some form of quasi-theological or philosophical (yes, we know this is short-cutting) truth. It is the elision between differing truth positions that leads to a great deal of misinformation in the small-to-big data space. Small data users have mostly emerged in and through the positivist paradigm, and the authority that positivism acquired was far in excess of its capacity to deliver truthful outcomes in any absolute sense. Indeed, we would argue that one of the major failings of this ‘positivism light’ has been a tendency to affirm types of truth that probabilistic data methods were never designed to support. The levels of determinism associated with the small data paradigm instead reflect their deep history, close association with the authoritarian state and its administrative structures, and a capacity to enforce their legitimacy even in areas where their capabilities were slight. The insidious, persistent and over-wrought dichotomy of quantitative versus qualitative data is a case in point.

Small data knowledge production has always been embedded in reductionist practices. Reduction summarises an incredibly complex reality and it does so of necessity. And we have done this through quantification, with all that the resulting metrics entail, both positive and negative. We and our small data technologies weren’t that good at representing nature in all its complexity; we needed reduction to function. Even now, the typical big data dashboard is a clear recognition and, more importantly, acceptance of this inescapable fact. As human beings, we struggle with high-volume, high-context, multi-source data and so did most of our small data tools. One of the reasons speed is emphasised in so much big data hype is not just because it is true but because it emphasises the anxiety associated with the mismatch between traditional small data methods and the consequences of major technological innovations.

Many research questions under this paradigm assume we already know the important factors, because otherwise we would not (in many cases, could not) collect data about everything that might be relevant to the problem under inquiry. Consider how quickly we have transitioned from early DNA-based genetics to junk DNA and onwards to epigenetics. Even under the small data paradigm, we have had to accept that our knowledge progresses and improves over time. Certainty is not only overrated in scientific knowledge production but it is also a potential barrier to improved understanding of complex problems. Health and medicine provide some classic cases in this regard because informational and practice-based orthodoxies emerged quite quickly in these fields, often making them resistant to the implications of new data, concepts and theories. Consider the initial responses to the Helicobacter pylori research conducted in Australia by Warren and Marshall in the early 1980s.

Perhaps what we are seeing with the ‘toing and froing’ in the small-to-big data transition is a brief period of epistemic anxiety. Epistemic orthodoxies are comforting in many contexts and, heuristically speaking, they make our agreed knowledge base much easier to accept and communicate, while any inconsistencies become far less confronting because we all ‘know’ what is accepted as true. Small data has been a prevailing orthodoxy for several generations now and this has had its own epistemic and ontological effects. One of these is a highly problematic version of what ‘truth through data’ actually is and a second is a hegemonic resistance to pluralism in scientific knowledge production. Big data clearly threatens that, producing a wide variety of new uncertainties and anxieties.

Putting big data pundits aside, a big ask in itself, these anxieties need to be worked through. The rapid expansion in ethical big data critiques indicates that responses are coming at a rate similar to changes in the technology and its applications. The kinds of questions that need to be addressed include: how do we know what is true or false under big data; how do we identify deliberate deceit in big data applications; and how do we address human biases within the big data paradigm? Generally, the answers to these types of question were well-established under small data but big data destabilises these assumptions while adding limited certainty to future analytic methods and outcomes. This is all exciting on the one hand but not very reassuring on the other.

Consequently, we propose that part of the answer to our current big data anxiety is a much closer and more critical inquiry into how we got to our present situation. That means unpacking historical small data orthodoxies, assumptions and prejudices that have emerged over time and, in many cases, attached themselves to other prevailing assumptions and unstated beliefs operant in our societies. This emerging period of transition demands of us that we look both backwards and forwards in order to situate ourselves, to reconcile those anxieties and also to accept much greater pluralism in scientific knowledge production. Small data methods produced, in some cases, a miserly approach to knowledge and a skewed understanding of scientific and social truth. The sociology of big data requires, therefore, a process of unpacking the past while concurrently adapting to our emerging present.

Hamish Robertson is a Research Fellow in the Centre for Health Services Management at the University of Technology Sydney (UTS) @HL_Robertson. Joanne Travaglia is Professor of Health Services Management at the University of Technology Sydney (UTS) @j_travaglia. Nick Nicholas is Managing Director of The Demographer’s Workshop in Sydney, Australia.