In this presentation, we will look into historical examples of dirty data and several open (or reproduced, and thus open) extraction-based, human-not-in-the-loop datasets, and gather their heuristic-based methods for filtering out dirty data. A heuristic is a practical method that, while not guaranteed to be optimal, perfect, or rational, is “good enough” for solving the problem at hand. This publication questions whether a narrative of “cleaning” can emerge from technical papers, reflecting on these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and asking for whom these estimations are good enough, as they will soon be part of our technological infrastructures.
filter heuristics – a collection of dirty, naughty, obscene and otherwise bad holes in datasets
Dirty data refers to data that is in some way faulty and needs to be removed during data preprocessing. In the 1980s, non-white women’s body size data was categorized as dirty data. Now, in the age of GPT, when “scale-up first” is the golden rule and researchers/engineers scrape the internet to get as much free data as possible, what is considered dirty data, and how is it removed from massive training materials?
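To make the kind of heuristic the presentation gathers more concrete, here is a minimal, hypothetical sketch of a blocklist-style filter of the sort alluded to in the title: a document is dropped from the corpus if it contains any word from a curated list of “dirty, naughty, obscene and otherwise bad” words. The blocklist contents and the names is_clean and filter_corpus are illustrative assumptions, not taken from any specific dataset pipeline.

```python
import re

# Illustrative blocklist; real pipelines rely on curated lists with hundreds of entries.
BAD_WORDS = {"badword1", "badword2", "obscenity"}

# Pre-compile a word-boundary pattern so partial matches inside longer words are ignored.
BAD_WORD_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in BAD_WORDS) + r")\b",
    re.IGNORECASE,
)

def is_clean(document: str) -> bool:
    """Return True if the document contains none of the blocklisted words."""
    return BAD_WORD_PATTERN.search(document) is None

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only documents that pass the blocklist heuristic; everything else is treated as 'dirty'."""
    return [doc for doc in documents if is_clean(doc)]

if __name__ == "__main__":
    corpus = ["a harmless sentence", "a sentence containing badword1"]
    print(filter_corpus(corpus))  # -> ['a harmless sentence']
```

A filter like this is cheap and scalable, but it also silently discards any document that merely mentions a listed word, which is exactly the kind of good-enough, not-guaranteed-rational estimation the presentation reflects on.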
Venue
Padelhalle Klybeck,
4057 Basel, Switzerland
Public transport:
what3words: verfügbar.gelesen.abhalten