FARVEL BIG TECH


We'll see how I feel in the morning, but for now I seem to have convinced myself to actually read that fuckin Anthropic paper

Uncategorized

92 Posts 29 Posters 13 Views

J jenniferplusplus@hachyderm.io

Found it! n=52. wtf. I reiterate: 20 billion dollars, just for this current funding round, and they only managed to do this study with 52 people.
But anyway, let's return to the methods themselves. They start with the design of the evaluation component, so I will too. It's organized around 4 evaluative practices they say are common in CS education. That seems fine, but their explanation for why these things are relevant is weird.
1. Debugging. According to them, "this skill is crucial for detecting when AI-generated code is incorrect and understanding why it fails."
Maybe their definition is more expansive than it seems here? But it's been my experience, professionally, that this is just not the case. The only even sort-of reliable mechanism for detecting and understanding the shit behavior of slop code is extensive validation suites.
sci_photos@troet.cafe wrote:

#75

@jenniferplusplus
J jenniferplusplus@hachyderm.io

So anyway, all of this is, apparently, in service to the "original motivation of developing and retaining the skills required for supervising automation."
Which would be cool, I'd like to read that study, because it isn't this one. This study is about whether the tools used to rapidly spit out meaningless code will impact one's ability to answer questions about the code that was spat. And even then, I'm not sure the design of the study can answer that question.
jenniferplusplus@hachyderm.io wrote:

#76

I guess this brings me to the study design. I'm struggling a little to figure out how to talk about this. The short version is that I don't think they're testing any of the effects they think they're testing.
So, they start with a warmup coding round, which seems to be mostly to let people become familiar with the tool. That's important, because the tool is commercial software for conducting coding interviews in a browser. They don't say which one, that I've seen.
Then they have two separate toy projects that the subjects should complete. 1 is a non-blocking ticker, using a specific async library. 2 is some async I/O record retrieval with basic error handling, using the same async library.
And then they take a quiz about that async library.
But there are some very important details. The coding portion and quiz are both timed. The subjects were instructed to complete them as fast as possible. And the testing platform did not seem to have code completion or, presumably, any other modern development affordance.
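
For anyone who hasn't seen this kind of exercise, here is a rough sketch of what the two toy tasks might look like. The thread doesn't name the async library, so this uses Python's standard `asyncio` (3.9 or later) as a stand-in. The function names, file names, tick count and interval are all made up for illustration, and reading "record retrieval" as local files is a guess. This is not the study's actual code.

```python
import asyncio
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


async def ticker(interval: float, ticks: int, log: List[str]) -> None:
    """Record one line per tick; the sleep yields to the event loop."""
    for i in range(ticks):
        log.append(f"tick {i}")
        await asyncio.sleep(interval)


async def fetch_record(path: Path) -> Optional[str]:
    """Read a record in a worker thread; return None if the file is missing."""
    try:
        return await asyncio.to_thread(path.read_text)
    except FileNotFoundError:
        return None


async def demo() -> Tuple[List[str], List[Optional[str]]]:
    log: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        present = Path(tmp) / "record_a.txt"  # hypothetical file names
        present.write_text("record A")
        missing = Path(tmp) / "record_b.txt"  # never created, on purpose
        # The ticker keeps running while the records are being fetched.
        tick_task = asyncio.create_task(ticker(0.01, 3, log))
        records = await asyncio.gather(fetch_record(present), fetch_record(missing))
        await tick_task
    return log, list(records)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that even this toy version leans on `try`/`except` and f-string formatting, which are the kind of syntax details that are easy to blank on under a timer with no editor support.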
J jenniferplusplus@hachyderm.io

2. Code Reading. "This skill enables humans to understand and verify AI-written code before deployment."
Again, not in my professional experience. It's just too voluminous and bland. And no one has time for that shit, even if they can make themselves do it. Plus, I haven't found anyone who can properly review slop code, because we can't operate without the assumptions of comprehension, intention, and good faith that simply do not hold in that case.
sci_photos@troet.cafe wrote:

#77

@jenniferplusplus I agree; LLM-generated code (above a certain threshold of complexity) is like compiled C code with -O2 turned on. Hard to read, very hard to understand.
Code can get “compressed” quite a lot.
J jenniferplusplus@hachyderm.io

Chapter 4. Methods.
Let's go
First, the task. It's uh. It's basically a shitty whiteboard coding interview. The assignment is to build a couple of demo projects for an async Python library. One is a non-blocking ticker. The other is some I/O ("record retrieval", not clear if this is the local filesystem or what, but probably the local fs) with handling for missing files.
Both are implemented in a literal whiteboard coding interview tool. The test group gets an AI chatbot button, and encouragement to use it. The control group doesn't.
/sigh
I just. Come on. If you were serious about this, it would be pocket change to do an actual study.
sci_photos@troet.cafe wrote:

#78

@jenniferplusplus Oh.
I was more thinking of a two week hackathon setting with multiple teams, lots of , and an evaluation of all different phases like
* planning (choosing right library, based on LLM-“discussions”),
* tests + implementations,
* searching bugs,
* adapting to spontaneous “changes” by the customer,
* readability / maintainability by other teams.
But … this …
jenniferplusplus@hachyderm.io wrote:

#79

Given all of that, I don't actually think they measured the impact of the code-extruding chatbots at all. On anything. What they measured was stress. This is a stress test.
And, to return to their notion of what "code writing" consists of: the control subjects didn't have code completion, and the test subjects did. I know this, because they said so. It came up in their pilot studies. The control group kept running out of time because they struggled with syntax for try/catch, and for string formatting. They only stopped running out of time after the researchers added specific reminders for those 2 things to the project's instructions.
J jenniferplusplus@hachyderm.io

I just
I'm not actually in the habit of reading academic research papers like this. Is it normal to begin these things by confidently asserting your priors as fact, unsupported by anything in the study?
I suppose I should do the same, because there's no way it's not going to inform my read on this
lispi314@udongein.xyz wrote:

#80

@jenniferplusplus@hachyderm.io
> Is it normal to begin these things by confidently asserting your priors as fact, unsupported by anything in the study?

Not to my knowledge, no.

Summary of the document and hypothesis goes there.

Confident assertion is a maybe in the conclusion (some fields do lend themselves to unambiguous provable assertions) and generally it’s more of a recap of prior analysis.
D dalias@hachyderm.io

@jenniferplusplus The purpose of a paper is the assumptions it makes.
lispi314@udongein.xyz wrote:

#81

@dalias @jenniferplusplus Only if it's a bad paper.

Especially if it then goes on to debunk those very same assumptions while refusing to remark on it.

This is distinct from presenting a premise as a hypothetical to verify.
jenniferplusplus@hachyderm.io wrote:

#82

So. The test conditions were weirdly high stress, for no particular reason the study makes clear. Or even acknowledges. The stress was *higher* on the control group. And the control group had to use inferior tooling.
I don't see how this data can be used to support any quantitative conclusion at all.
Qualitatively, I suspect there is some value in the clusters of AI usage patterns they observed. But that's not what anyone is talking about when they talk about this study.
I inthehands@hachyderm.io

@jenniferplusplus
…and good struggles, which are what good instructors help create
sci_photos@troet.cafe wrote:

#83

Yes, that's one important aspect during teaching/learning. @inthehands @jenniferplusplus
sci_photos@troet.cafe wrote:

#84

@jenniferplusplus
jenniferplusplus@hachyderm.io wrote:

#85

And then there's one more detail. I'm not sure how I should be thinking about this, but it feels very relevant. All of the study subjects were recruited through a crowd working platform. That adds a whole extra concern about the subject's standing on the platform. It means that in some sense undertaking this study was their job, and the instruction given in the project brief was not just instruction to a participant in a study, but requirements given to a worker.
I know this kind of thing is not unusual in studies like this. But it feels like a complicating factor that I can't see the edges of.
R realn2s@infosec.exchange

@jenniferplusplus
I'm a bit confused
Aren't lower grades worse?
And it even took longer because of "AI distractions"?
jenniferplusplus@hachyderm.io wrote:

#86

@realn2s Lower grades are, indeed, worse.
The AI did seem to speed things up, but not enough to achieve statistical significance. And as I describe further down the thread (just now, not suggesting you didn't read far enough), the AI chatbot seems to have been the only supportive tooling that was available. So it's not so much the difference between AI or not, as the difference between support tools or not.
J jsbarretto@social.coop

@jenniferplusplus Kind of a funny statement given that the whole point of abstraction, encapsulation, high level languages, etc. is to provide a formal basis for much of a program to be designed in terms of high level concepts
jenniferplusplus@hachyderm.io wrote:

#87

@jsbarretto That's not what people mean when they say system design.
They mean which way do dependencies flow. What is the scope of responsibility for this thing. How will it communicate with other things. How does the collection of things remain in a consistent state.
For example.
tartley@fosstodon.org wrote:

#88

@jenniferplusplus Holy carp this is a fabulous (slash shocking) thread. Thanks for taking the time.
H hrefna@hachyderm.io

@jenniferplusplus oh gods I need to read this.
jenniferplusplus@hachyderm.io wrote:

#89

@hrefna I'm finding it frustrating, mainly
jenniferplusplus@hachyderm.io wrote:

#90

But now it's 1am. I may pick this up tomorrow, I'm not sure. If I do, the next chapter is their analysis. Seems like there would be things in there that merit comment
jenniferplusplus@hachyderm.io wrote:

#91

Actually, hang on. One more thing occurred to me. Does this exacerbate the difficulty of replication, given that the simple passage of time will render this library no longer new?
And now I'm done for the night, for real
https://hachyderm.io/@jenniferplusplus/115991499531084541
realn2s@infosec.exchange wrote:

#92

@jenniferplusplus

I indeed asked the question before I had finished the thread.
I was very confused and in some ways still am.
How can the authors of the paper think all this is an argument for AI (which I believe they do)?
J jwcph@helvede.net shared this topic
