5 Comments
Dominic Bates

Some interesting topics! I think most online models (e.g. ChatGPT) are fairly detectable because we specifically train them not to respond like normal humans, e.g. in the fine-tuning / RLHF steps, and we also almost never sample completely randomly from the unmodified probability distribution (i.e. temperature = 1, top_p = 1).

I did quite a bit of research in this area a couple of years ago (admittedly a bit out of date now), but I think if you took a large base model without the fine-tuning steps, and sampled from the softmax completely randomly, a lot of these measured stylistic differences might disappear (especially around lower variation and word choice). Although I guess most people are probably just using the default ChatGPT model and settings, so perhaps we don't need to worry too much!
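
For anyone who wants to see what those two settings do, here is a minimal sketch in plain Python. It assumes a toy vector of four logits rather than a real model, and the helper names (`softmax`, `top_p_filter`, `sample_token`, `entropy`) are made up for illustration. With temperature = 1 and top_p = 1 the distribution is left untouched; lowering either one concentrates probability on the likeliest tokens, which is one plausible source of the "lower variation" mentioned above.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index from the (possibly sharpened and truncated) distribution."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def entropy(probs):
    """Shannon entropy in nats: a rough measure of how much variation sampling allows."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for a four-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]

unmodified = top_p_filter(softmax(logits, temperature=1.0), top_p=1.0)
sharpened = top_p_filter(softmax(logits, temperature=0.7), top_p=0.9)

print("entropy, unmodified:", entropy(unmodified))
print("entropy, sharpened: ", entropy(sharpened))
print("sampled token index:", sample_token(logits, temperature=0.7, top_p=0.9))
```

In this toy case the sharpened setting drops the two least likely tokens entirely, so they can never be sampled however long you generate for.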

Simon Raper

Thanks @dominicbates. That's very interesting. Did your research compare models before and after fine-tuning? And I wonder how much of a trade-off there is between stylistic variation and making sense (where stylistic richness comes at the cost of intelligibility), and to what degree that trade-off is addressed in the fine-tuning / RLHF steps.

I've got another question that is bugging me. If I put the first part of the Campaign article I'm discussing in the dunghill into GPTZero, the model says "We are highly confident this text was AI generated". If I put Chris Duncan's article (which I rate) into GPTZero, it says "We are highly confident this text was human generated". I think the Campaign article is bad, but it's not (I hope) written by genAI. So how do we interpret that result? Does it imply that the Campaign article is very human, in the sense of very average/bland and of a style highly represented in a training data set that contains millions of similar articles, and therefore quite typical of what a model trained on that data would spit out? Or, to take your point about the fine-tuning steps producing less human responses, is the article very un-human?
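
One way to make "typical of what a model would spit out" concrete is perplexity: how surprised a language model is by a text, on average per word. The sketch below is only an illustration of that idea, not a description of how GPTZero works (I don't know what it does internally). It assumes a toy unigram model with add-one smoothing, and the tiny corpus and both test sentences are invented for the example.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Count word frequencies in a training corpus (a stand-in for a real language model)."""
    counts = Counter(corpus_tokens)
    return counts, sum(counts.values())


def perplexity(tokens, counts, total):
    """Per-word perplexity under the unigram model; lower means more 'typical' text."""
    vocab = len(counts) + 1  # one extra slot so unseen words get non-zero probability
    nll = 0.0
    for tok in tokens:
        p = (counts[tok] + 1) / (total + vocab)  # add-one (Laplace) smoothing
        nll -= math.log(p)
    return math.exp(nll / len(tokens))


# Invented training corpus, standing in for "millions of similar articles".
corpus = (
    "brands must engage consumers with authentic content "
    "brands must deliver value to consumers"
).split()
counts, total = train_unigram(corpus)

bland = "brands must engage consumers".split()    # built from words the model has seen often
quirky = "dunghill roosters crow loudly".split()  # built from words the model has never seen

print("perplexity, bland: ", perplexity(bland, counts, total))
print("perplexity, quirky:", perplexity(quirky, counts, total))
```

On this reading, a human-written but very formulaic article could score as highly predictable simply because its style is heavily represented in whatever the scoring model was trained on.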

Dominic Bates

It was primarily a literature review, so I only did a small amount of playing around with building detectors myself, but I did read almost all the literature that was around then. Most studies were using ChatGPT output or similar, so will have included the fine-tuning / RLHF bits, but most didn't go into much detail on their dataset creation and didn't compare models, so there were no particularly robust results.

Yeah, I also suspect we prefer some kind of simplicity, which is reinforced by the RLHF and fine-tuning steps. But I have no evidence to support this.

Yeah, good question. I would suspect it's just a slightly dodgy black-box detector that was trained on a limited dataset? So it works fairly well generally, but occasionally the probabilities are wildly wrong for models / topics / writing styles that are not part of the training data (at least that was the case 2 years ago). It must be almost impossible to get a complete like-for-like human / AI dataset with all possible styles, models, sampling parameters etc.

Simon Raper

That's very useful (as I may be doing something more on this next month). Thanks!

Dominic Bates

Ah fun! Let me know how it goes!