At the fourth event of Season 4 of the Feminist and Accessible Publishing, Communications, and Technologies Speaker and Workshop Series, founder Dr. Alex Ketchum and event co-sponsor Dr. Fenwick McKelvey (of the Machine Agencies working group) welcomed Dr. Alex Hanna and her presentation "Beyond Bias: Algorithmic Unfairness, Infrastructure, and Genealogies of Data." Hanna's presentation chronicled the growing concerns about the bias, discrimination, and unfairness that algorithm-based systems are built upon and continue to perpetuate. Her most recent research traces the history of algorithmic bias by constructing a genealogy of machine learning datasets. Her presentation wove into a cohesive narrative her research over the past few years, some of the origins of machine learning datasets, and the researchers who first exposed these discriminatory datasets. Hanna cut through big data rhetoric and focused on datasets at their most structural level, explaining how they are a form of infrastructure and, as such, must be treated with the same care and accountability as other societal structures.
She began by highlighting the researchers and thinkers who first documented the inherent bias in datasets—or algorithmic unfairness, in Hanna's terms—such as Dr. Joy Buolamwini's Gender Shades, a landmark study that revealed the racial and gender bias of the commercial facial recognition systems from Microsoft, IBM, and Face++. Hanna elaborated on how this is not only an issue with facial data but with object recognition technology in general. Citing Terrance DeVries et al.'s "Does Object Recognition Work for Everyone?", she showed that object recognition is not as objective as it seems. Testing recognition systems on images of household items (like soap) from the Dollar Street image dataset, which spans a wide geographical and income range, the study found that the systems were better at recognizing household items from the global North and from middle- to higher-income communities. Yet even with this evidence of bias and disparity in datasets, there remains limited intervention from large companies, and the steps that are taken often raise further concerns about ethics and consent. Rather than focusing on how to diversify an already fractured machine, Hanna articulated the necessity of understanding the history of datasets and "the data practices, values, and assumptions embedded in datasets themselves." Her socio-historical approach was welcoming to those who (like myself) may not be fully versed in the world of data. What a genealogy of machine learning datasets allows for is an investigation that centres on "how and why these datasets have been created, what and whose values influence the choices of data to collect, the contextual and contingent conditions of their creation." This line of inquiry led Hanna to rethink datasets as a form of infrastructure.
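For readers curious what such an audit looks like in practice, here is a minimal sketch in Python, entirely my own illustration rather than code from Gender Shades, of the disaggregated evaluation at the heart of this kind of study: computing accuracy per demographic subgroup instead of one aggregate score, which is what surfaces the disparities an overall number hides.

```python
from collections import defaultdict

def disaggregated_accuracy(predictions, labels, groups):
    """Return per-group accuracy for parallel lists of model predictions,
    ground-truth labels, and subgroup tags (e.g. 'darker female').
    A single overall accuracy would average these disparities away."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        if pred == label:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

# Hypothetical toy data: a classifier that is right far more often for some
# subgroups than others, the pattern Gender Shades documented at scale.
preds  = ["m", "f", "m", "m", "f", "m"]
truth  = ["m", "f", "f", "m", "f", "f"]
groups = ["lighter male", "lighter female", "darker female",
          "lighter male", "lighter female", "darker female"]
print(disaggregated_accuracy(preds, truth, groups))
# {'lighter male': 1.0, 'lighter female': 1.0, 'darker female': 0.0}
```

The same disaggregation applies to the Dollar Street finding: swap the demographic tags for income brackets or regions, and the per-group accuracies expose whose households an object recognizer actually works for.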
As someone whose understanding of datasets is hazy at best, I found that Hanna's breakdown of the specific features of datasets as infrastructure offered a sense of clarity and accessibility. Recounting her research from 2020, "Bringing the People Back In: Contesting Benchmark Machine Learning Datasets," she elaborated on four distinct roles datasets play: datasets determine what a model learns; datasets benchmark algorithms; datasets serve as model organisms; and datasets provide methodological grounding for model development in industry contexts. Hanna used ImageNet as an example to map out this framework. ImageNet is the dataset credited with sparking the emergence of deep learning (a type of AI and machine learning that teaches computers to learn and perform classification tasks on images, text, and sound, as in voice control), and its influence extends across much of the machine learning in operation today. With over 14 million images spanning more than 20,000 categories, ImageNet is a key benchmark—a reference dataset against which models and other datasets are measured—for machine learning and image datasets. And yet, Hanna referred to its process of categorization as "the weird metaphysics of ImageNet," whereby "arcane, racist, sexist, homophobic, transphobic and bizarre categories are mapped to real images on the web." As Hanna ran us through some examples of these categories, I learned how blatantly obvious these discriminatory categorizations become once you look for them. What was less surprising was how easily they are reproduced. Since ImageNet is regarded as the key benchmark dataset, models trained on it are used as a pretraining layer for other, larger systems, perpetuating its categorical polemics across datasets at large (Hanna elaborated further on the polemics of datasets later on, finding more evidence that raises questions of diversity and equity in benchmarking tactics). As such, the narrative of ImageNet exemplifies how datasets grow into infrastructures that monitor and dictate machine learning algorithms and, in turn, influence our own understanding of images.
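Since the pretraining layer is where ImageNet's categories quietly travel into other systems, a short sketch may make the mechanism concrete. This assumes the torchvision library and a ResNet-50 backbone, my own illustrative choices rather than anything specified in the talk: the weights learned on ImageNet are kept frozen, and only a new final layer is trained for the downstream task, so whatever ImageNet taught the backbone rides along.

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 whose weights were learned on ImageNet's 14M+ images.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the ImageNet-trained layers: their learned features (and any biases
# baked in by ImageNet's categories) are reused unchanged downstream.
for param in backbone.parameters():
    param.requires_grad = False

# Replace only the final classification layer for a new, hypothetical task
# with ten classes; only this layer's weights would be trained from scratch.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)
```

The convenience of this reuse is precisely Hanna's point: the downstream developer rarely revisits how the upstream dataset was categorized.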
I find big data rhetoric intimidating to navigate. Terms such as benchmark, dataset, algorithm, and machine learning are regurgitated and recycled in such a way that they seem commonplace yet remain ambiguous to those on the sidelines. The irony is that no one is removed from datasets, as most of the technologies around us (our speaker systems, computers, phones, tablets, etc.) possess some form of machine learning. Hanna not only provided definitions for this data-talk but also tracked for us how it became acclimatised so efficiently over the past decade (as further discussed in her recent work "Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development"). Four questions guide Hanna's research program on the genealogy of data, all of them interested in the sociological nature of datasets: how do dataset developers describe and motivate the decisions that go into their creation?; what are the histories and contingent conditions of creation of benchmark datasets?; how have benchmark datasets become hegemonic or paradigmatic?; and what are the current work practices, norms, and routines that structure data collection, curation, and annotation? Hanna explained the implications at stake when datasets are naturalised (as they already are) from both a quantitative and a qualitative perspective, an approach I found extremely helpful for understanding the extent to which data and human experiences are intrinsically intertwined.
By sampling different datasets used in computer vision (CV)—a field of AI that uses deep learning models and image datasets to identify and classify objects—she found that half or fewer of the datasets include descriptions of the data collected. This leads to issues of biased categorization, as found in the earlier studies of data discrimination Hanna referenced. These datasets also have an accessibility problem, as they are built for neither longevity nor reproducibility: CV datasets are often difficult to find, in turn difficult to download, and rarely housed in an institutional repository. There are also concerns regarding ethics and privacy for the human subjects of the image sets, Hanna cautioned. The root of these quantitative issues can be traced back to the qualitative values of data collecting, or the lack thereof. Hanna structures this as a set of tradeoffs: care is sacrificed for efficiency; contextuality for universality; positionality for impartiality; and data work for model work. I found these tradeoffs a generative way to pinpoint the nuance of discrimination in datasets while also broadening our notion of what is and has been at stake in ensuring the steady and inconspicuous existence of datasets for decades.
Hanna then turned to her project "On the Genealogy of Machine Learning Datasets: A Critical History of ImageNet," a research endeavour where her work on machine learning datasets as a form of infrastructure and her mapping of the genealogy of datasets converge. She dove into how ImageNet set the tone for how machine learning research is approached as a whole, even though it began as a training tool for computer vision practitioners. Although the discourse surrounding ImageNet is inherently multidisciplinary (media studies, cultural studies, data science, art history, and linguistics come to mind), Hanna explained that the interdisciplinary nature of the dataset is rarely taken into consideration. Of the three striking findings key to this discovery, I was drawn to the second (under the subheading Computational Construction of Meaning and Understanding) as a way to better understand the implications of ignoring an integrative approach. Hanna outlined how the underlying construction of computer vision stems from an epistemology that assumes the existence of a universal organization of the visual world. ImageNet makes no effort to parse the complicated relationship between image and meaning, and in failing to do so it perpetuates a discourse that reflects a white, male, Western perspective, one that, in turn, becomes naturalized and invisible.
When asked to elaborate on remedies for benchmarking, Hanna advocated for a collective approach. "Find collectivities," she urged. She encouraged us to read research on alternative practices to benchmarking with fellow peers (a great starting point is Hanna's own collaborative work). I found that this final piece of guidance encapsulated the overall sentiment of Dr. Hanna's presentation: to reconsider the universal narrative of datasets and benchmarking through collective learning.