This entire report from the Ontario government on genAI systems is worth a read, but the review of healthcare scribe accuracy is pretty devastating, imo.

reedmideke@mastodon.social

@mttaggart The minister responsible for that AI Scribe project's explanation of why it was all OK was incredibly bad https://mastodon.social/@reedmideke/116570464172955876

wcbdata@vis.social

@mttaggart It's already too late. This garbage is not only already in our medical records, but it's also being used to train the next generation of models, which will compound (and hide) the issue!

z3r0fox@mastodon.social

@mttaggart Medicine should probably stick to machine learning pattern recognition in diagnostics. That seems useful, from what I've read?

shaulaevans@zirk.us

@z3r0fox @mttaggart If you dig into it deeper, it is also problematic.

keithzg@fediverse.keithzg.ca

@mttaggart@infosec.exchange I was cynically thinking to myself, “what are the chances that an industry-loving institution like the Ontario government simply concluded ‘well, we’ll just choose to use Good AI and that will be fine’? Probably 100%,” and then I jumped to the report’s conclusions:

establish KPI targets to measure and track Microsoft Copilot Chat’s adoption

take actions to increase use of Microsoft Copilot Chat to the targeted rates and usage in the OPS

educate OPS staff through AI training about the dangers of using non-Microsoft browsers when accessing AI websites

So, yeah, they did an audit showing LLMs are wildly unreliable and . . . concluded they should encourage use of Microsoft LLM products.

Their audit criteria also included “having due regard for economy”.

kkarhan@c.im

@mttaggart Seriously, any medical professional would get banned for malpractice if they had such a huge error rate.

avuko@infosec.exchange

@mttaggart we all want the autodocs we were promised in sci-fi, but genAI is not that.

PS: medicine is already a field suffering heavily from biases. Adding automated bias at scale is gonna literally kill millions more of us.

tehstu@hachyderm.io

@mttaggart And the results noted in figure 7 came despite the audio being made available to the vendors, who then provided the results themselves, if I'm reading the report correctly. Not even a live demonstration.

flyingpenguin@infosec.exchange

@mttaggart their audit reads like a Wirken requirements document written by someone who did not know Wirken existed. I do five or six calls on this a week now, which is why I open-sourced and started giving Wirken away for free. I've updated the marketing copy here, but I'll soon release a line-by-line response to the Ontario audit: https://gebruder.ottenheimer.app/wirken

mttaggart@infosec.exchange

I get to see this in action. Doctors want transcription and summarization services because of the challenges they face getting quickly familiar with a patient in a crazy short amount of time. They also want to automate notetaking for rounds, which can be chaotic. Problem is, these tools suck in chaotic situations, and even in relatively normal ones, hallucination abounds.

There will always be a claim of human review, but I know all too well that it's working against the current to have a human reviewer not assume the model got it right. What's more, those safeguards will eventually be seen as cost centers and redundancies—well, at least until the lawsuits.

One other thing. As noted above, these model-generated fields in charts are a) being used as training material for other models, and b) being used as input for other generative tools without human review. The potential for compound errors and model collapse is immense.

delta_vee@cosocial.ca

@mttaggart Family full of medical people, and can confirm they're all desperate for reliable transcription. That's what they've been sold, and it's not like they typically have time to spare to go over their own recordings to check WER or validate the summaries.

Some recognize the problem, though:

https://cosocial.ca/@delta_vee/116581810079302048
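
For anyone unfamiliar with the WER (word error rate) mentioned above: it is the word-level edit distance between a trusted reference transcript and the tool's output, divided by the number of words in the reference. A minimal Python sketch follows, assuming simple lowercase whitespace tokenization. The function name and the example sentences are made up for illustration and do not come from the Ontario report or any vendor's tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Assumption: words are compared after lowercasing and splitting on
    whitespace; real evaluations usually normalize punctuation as well.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
print(word_error_rate("patient denies chest pain", "patient reports chest pain"))  # 0.25
```

Note what the made-up example shows: a single swapped word gives a WER of only 0.25 while reversing the clinical meaning, so a low WER by itself says little about whether a summary is safe to put in a chart.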