FARVEL BIG TECH

The “Data” Narrative eats itself

blog@tante.cc
#1

    “The first model fakes the data, then the second model trains on the fake data. Any problems in the synthetic data set are amplified further. Then the second model — based on fake data — is used to treat real patients. This is, of course, all fine.”

Today I want to take David Gerard’s recent post on synthetic data in medical research as an occasion to talk about synthetic data (which is a fancy term for just making shit up) in general.

    As David points out: The idea of just generating data itself is – for most cases – ridiculous.

Often synthetic data is supposed to fill in the holes that too little actual data has left. But you are filling those holes with … a mirage. For medical cases that is patently absurd, but even for other cases it’s just … not exactly a rational thing to do? Since when does “we don’t have data so we just fabricated it” pass scientific rigor? It doesn’t. So why does it if I claim “the AI did it”?
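The feedback loop the opening quote describes can be sketched as a toy simulation. Everything here is an illustrative assumption (a trivial Gaussian “model”, small sample sizes, a handful of generations), not anything from David Gerard’s post: each generation fits a model only to the previous generation’s synthetic output, so whatever the fit gets wrong is baked in and compounded rather than corrected by reality.

```python
import random
import statistics

random.seed(42)

def fit(sample):
    """'Train' a trivial generative model: estimate mean and std of the sample."""
    return statistics.mean(sample), statistics.stdev(sample)

# A small real dataset drawn from the true distribution N(0, 1).
real = [random.gauss(0.0, 1.0) for _ in range(50)]

mu, sigma = fit(real)
print(f"gen 0: mean={mu:+.2f} std={sigma:.2f}")

# Each later generation trains only on synthetic data sampled
# from the previous generation's fitted model -- no new real data.
for gen in range(1, 6):
    synthetic = [random.gauss(mu, sigma) for _ in range(50)]
    mu, sigma = fit(synthetic)
    print(f"gen {gen}: mean={mu:+.2f} std={sigma:.2f}")
```

Run it a few times with different seeds: the estimates drift away from the true mean 0 and std 1, because each refit inherits and adds sampling error instead of being anchored to new measurements of the world.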

I’ve seen this being suggested as a help for social studies (“You can just ask our models instead of actual people”) and other humanities. But the point of studies is not just generating some numbers to put in a paper; the numbers are supposed to be an abstraction, a foundation for an understanding of the actual world and the people in it. By definition you cannot do that with synthetic data. Sure, maybe you can scam a few people in the ad industry with it: “Here we can tell you how people will like the campaign by asking AI” is a product that means nothing, but you might find a few clients to keep the scam going.

But the whole – problematic – narrative of the power of data (“the truth is in the data” and all that data-sciency stuff) rests on the data actually representing something real, coming from actual sensors in the real world (talking to people being a sensor in that regard as well). Even if the data represents something from the real world it is of course still biased and subjective, but it can at least be meaningfully contextualized: you can look at the way the sensor, the question, the methodology etc. work, and analyze their problems and issues. We’ve been doing that for a while and know how to understand data that way.

When you rip away that foundation, those roots in “reality”, you also rip away the narrative of data being the supreme, pseudo-objective way of understanding the world.

    Which might be one actually useful thing coming from this whole AI bubble (which is just a continuation of the old Big Data bubble): The understanding that the total supremacy of the “data” discourse was always a problematic, neoliberal way of seeing and structuring the world, of legitimizing violence according to the needs of those in power. Data always served those who had the power to give it its structure, who could form the way the data flowed from the world and therefore could form the pathways that that data opened up or closed off.

    Data was never your friend or ally.

Liked it? Take a second to support tante on Patreon!
CC BY-SA 4.0: This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

djgummikuh@mastodon.social
#2

      @blog this is intergenerational AI-incest

guidostevens@kolektiva.social
#3

        @blog This whole idea of "reality doesn't really matter" is baffling to me. But we in the "reality based community" have been sidelined since at least Bush/Cheney if not far longer.

Powered by NodeBB | Graciously hosted by data.coop