Glasseye
Issue 21: January 2026
In this month’s issue:
Semi-supervised asks that you show some compassion towards your predictors.
Owning up to uncertainty in the white stuff.
The dunghill looks at why old-school model scepticism is back in style.
The dunghill
There’s an old wisdom from the deeply unfashionable world of frequentist statistics which says that a model is just a view from a particular, and necessarily limited, perspective, that it is one view among many, and that it should be consulted, but never entirely trusted. This way of looking at things was a matter of course for old-school statisticians from Box to Cox;[1] it is at the heart of Karl Popper’s account of scientific progress through conjecture and refutation; and it is built into the idea of a statistical significance test, where the null hypothesis is the incumbent and the alternative hypothesis no more than a hopeful challenger. At its most influential, this view was part of a post-war scientific culture that was understandably suspicious of excessive self-belief, and too much faith in world-shattering ideas. As a method it is based upon an unspoken assumption: the default position, the one challenged by the model and which we fall back on if unconvinced, is a set of assumptions that have worked well so far, and which are based in part on the kind of thing that can’t easily be captured in a model - gut-feels, condensed real-world experience, the wisdom of centuries. It’s all very small-c conservative, but not necessarily wrong.
And yet the strange thing is that this view of modelling, so downbeat, and so contrary to the current mood of tech optimism, is making a comeback. Even more surprising is who is pushing it forward. I previously quoted the co-founder of Anthropic, Jack Clark, interviewed on the Newsagents. I’m going to quote him again as I find it so striking:
The way that I use these systems [LLMs], and many do, is I read a research paper, I write out what I understand that paper to mean and when I upload the paper and my understanding of it to the system and say, do I have this right? And if I don’t have it right, explain to me. That’s useful learning because the system reads the paper, reads my explanation and tells me whether I got it right or wrong, just like a colleague. If we use these things in the right way, they can help us be a lot more capable and a lot smarter.
One way of reading this is as an ingenious pivot by an LLM provider who has realised that hallucinations are here to stay. If you set human judgement as the default and position the LLM as a brilliant but overly imaginative critic, whose insights must be filtered by the user, then you are off the hook for hallucinations. But another, more generous, interpretation is that this is a thoughtful correction to LLM overreach, one which places the model back in its rightful position - an advisor, no more and no less, and never to be entirely trusted.
Of course this is absolutely right. Notwithstanding their utter brilliance in many areas (Anthropic’s Claude Opus just found an embarrassing number of errors in some dense maths I was working on), LLMs, as much as twentieth-century regression models, are outrageous simplifications. That might seem like an absurd thing to say about something so intricate and opaque. Nevertheless they are simplifications on account of the fact that a) they reduce everything to the problem of predicting text and b) they leave out so much, e.g. the external world, its physics, its logical and mathematical laws, its complex systems, etc. What is more, they are worse than the old models in at least one respect - they are built on undoubtedly biased data, and we have no idea in which direction.
This is why, in my view, almost all applications in which LLMs are given autonomy are doomed. An LLM has a perspective. It is often a stunningly insightful perspective, which makes it an invaluable contributor in a human-computer partnership. But by itself it is a limited, flawed, irresponsible, dangerously one-sided perspective, missing huge chunks of reality. It is the perspective of someone who has spent too much time on the web, who is capable of extraordinary insights but needs to get out more. And indeed, if what it lacks, as many are now saying, is a world model, then getting out might well be the answer.
If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.
The white stuff
Model Uncertainty, Data Mining and Statistical Inference by Chris Chatfield was published in 1995. I wish I'd read it then - or at any time that decade. It would have saved me years of confusion. I was confused because the things that were puzzling me, and which seemed like pretty serious problems for statistical inference, were not being discussed anywhere. Chatfield's paper is full of helpful phrases like “in practice no one believes this” and “unfortunately this is never done” and “this is very strange given that …”.
The bomb he drops, and the main theme of the paper, is that the entire statistical field is obsessed with the calculation of probabilities that are only valid if it is assumed that the statistician has selected the correct model. In practice that’s an enormous “if”. The reality is that the uncertainty surrounding model selection often dwarfs the uncertainty surrounding the values of model parameters.
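To make the point concrete, here is a minimal Python sketch. It is not taken from Chatfield's paper: the toy data, the two candidate models (a straight line in x and a straight line in log x) and the leave-one-out spread used as a crude stand-in for parameter uncertainty are all assumptions chosen for illustration. Both models track the observed range closely, yet once you extrapolate, the gap between their predictions is far larger than the wobble within either one.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x. Returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def predict(xs, ys, x_new, transform):
    """Fit y = a + b * transform(x), then predict at x_new."""
    a, b = fit_line([transform(x) for x in xs], ys)
    return a + b * transform(x_new)


def leave_one_out_spread(xs, ys, x_new, transform):
    """Range of predictions at x_new as each observation is dropped in turn.

    A crude stand-in for the uncertainty in one model's fitted parameters.
    """
    preds = [
        predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], x_new, transform)
        for i in range(len(xs))
    ]
    return max(preds) - min(preds)


# Hypothetical toy data: a log curve plus a small alternating wobble.
xs = list(range(5, 16))
ys = [math.log(x) + 0.02 * (-1) ** i for i, x in enumerate(xs)]
x_new = 60  # well outside the observed range

models = {"linear": lambda x: x, "log": math.log}
point = {name: predict(xs, ys, x_new, f) for name, f in models.items()}
within = {name: leave_one_out_spread(xs, ys, x_new, f) for name, f in models.items()}
between = abs(point["linear"] - point["log"])

for name in models:
    print(f"{name}: prediction {point[name]:.2f}, leave-one-out spread {within[name]:.2f}")
print(f"gap between the two models' predictions: {between:.2f}")
```

A standard error or confidence interval reported for either model alone would say nothing about that gap, because it is computed as if the choice of model were already settled.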
Chatfield was not the first to point this out - Box and Tukey both had a good go at it - but he pulls together his arguments at a crucial point in the history of statistical modelling. In 1995, Vapnik was publishing on support vector machines, and the machine learning revolution was imminent. Machine learning treated model selection as a problem to be solved rather than ignored, and in doing so drew attention to the vast number of models that will fit a data set almost equally well. Chatfield’s worry that the automation of model selection was leading to overfitting, his call for sensitivity analysis to guard against false confidence in model selection, and his favouring of predictions over estimates of population parameters - the problem with the latter is that “the analyst will never know whether the inferences are good since the estimates cannot be compared directly with the truth” - were all anticipations of this revolution, far ahead of their time, especially in the world of statistical modelling.
What's most shocking though is that the paper is still relevant. The problem of measuring the effect of model uncertainty in statistical inference still lacks a satisfactory solution, and I would still recommend the paper to half the graduates I work with, many of whom are just as confused as I was about the role of model selection, and would be relieved to find out that “in practice no one believes this”.
(Note that the paper is behind the JSTOR paywall. However you can read it online for free simply by signing up for a JSTOR account.)
Semi-supervised
From my Glasseye inbox:
I was hoping you could help me sanity-check how I’m framing a modelling problem. I can’t tell if I’m overthinking this, or if there’s a genuinely tricky aspect that I’m getting stuck on. I’ve been tasked with predicting the probability that a user will make a purchase within 24 hours of a reference time t. My initial instinct was that this is a fairly straightforward classification problem, but I've been struggling with how to construct the dataset in a way that avoids bias and data leakage.
No, definitely not overthinking it - I would say this is probably one of the most underthought issues in model design. In fact it’s astonishing how little is written about it, given how many have fallen into this particular pit. So congratulations to the asker, and double congratulations for asking when you thought the answer might be obvious. The truth is it is genuinely tricky.
In a nutshell, we need to avoid the classic trap of training a predictive model on training data that does not match the input that it will face at the time of deployment. For argument’s sake let’s say that the reference point t is the moment that a user registers for a service, and that the data scientist has access to a range of demographic and behavioural features with which to make the prediction. Let’s also assume that they have constructed a target variable that is equal to one if the user made a purchase within 24 hours of registration, and zero otherwise. Now the stupidest thing to do is to train the model to predict this variable using the demographic and behavioural variables as they stand. I say stupid, but this is done, and it is done a lot! Why is this such a bad idea? Because by the time you build the model, behavioural data will have accrued that records events subsequent to or concurrent with the purchase, and which are then of course highly correlated with the purchase. It is not uncommon to find models where the user’s total spend has been used to predict the probability of first purchase.
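A small sketch of that trap, using only the Python standard library. The function names, the toy purchase records and the choice to count a purchase made at exactly t or exactly t + 24 hours as inside the window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def purchased_within_window(registered_at, purchase_times, window=WINDOW):
    """Target: 1 if any purchase falls in [registered_at, registered_at + window]."""
    return int(any(registered_at <= p <= registered_at + window for p in purchase_times))


def total_spend(purchases, as_of=None):
    """Sum of purchase amounts; if as_of is given, count only purchases before it."""
    return sum(amount for ts, amount in purchases if as_of is None or ts < as_of)


# Hypothetical user: registers at 09:00 and buys twice later that day.
registered_at = datetime(2025, 3, 14, 9, 0)
purchases = [
    (datetime(2025, 3, 14, 11, 0), 20.0),
    (datetime(2025, 3, 14, 18, 30), 35.0),
]

target = purchased_within_window(registered_at, [ts for ts, _ in purchases])
leaky_spend = total_spend(purchases)                      # as the table looks today
spend_at_t = total_spend(purchases, as_of=registered_at)  # as the model would see it

print(target, leaky_spend, spend_at_t)
```

The spend figure read from today's table already contains the very purchases the target is built from, so a model trained on it is in effect being shown the answer. The figure as of t is what the deployed model would actually receive.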
So you are not going to do that. But what are you going to do? You have two options. I’ll deal with the harder one first as it is more interesting. This requires you to recreate the data for each customer as it would have been at t - the only training data that makes sense since it is the only data that the classifier will see in the wild. But recreating this data is no easy matter. For a start, there is no single t. It is different for every user. If you are lucky enough to be using a system with very thorough database auditing then you may be able to recreate this data from snapshots. Otherwise you’ll need to rewind it yourself. For this your features usually fall into five categories:
Those which you can reasonably assume won’t have changed since t. Many demographic variables fall into this category, as do variables that record events prior to t.
Those which you can adjust using some time-based calculation. For example, a customer’s age at t can obviously be recreated. So can features like day of the week, hour of the day, etc.
Those which you can somehow infer based on other values. For example, a customer might have been flagged for a particular offer based on some item of data that has since been overwritten.
Those which you should exclude because they are irrelevant at t. User spend is one such feature in our example, since it will always be zero at registration.
Those which cannot possibly be recreated.
The features in the final category will, I’m afraid, have to be junked. This is fine as you are building a predictive model rather than a model for understanding relative contributions to an event.
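Rewinding the first two categories might look something like the sketch below. The event format (a list of timestamp and kind pairs), the feature names and the example user are hypothetical; the point is only that every feature is computed from t and from events strictly before t.

```python
from datetime import date, datetime


def age_at(date_of_birth, t):
    """Age in whole years at time t."""
    had_birthday = (t.month, t.day) >= (date_of_birth.month, date_of_birth.day)
    return t.year - date_of_birth.year - (0 if had_birthday else 1)


def features_as_of(t, date_of_birth, events):
    """Rebuild one user's feature row as it would have looked at time t.

    events is a list of (timestamp, kind) tuples; anything at or after t is ignored.
    """
    prior_kinds = [kind for ts, kind in events if ts < t]
    return {
        "age_at_t": age_at(date_of_birth, t),  # adjusted by a time-based calculation
        "day_of_week": t.weekday(),            # Monday is 0
        "hour_of_day": t.hour,
        "page_views_before_t": prior_kinds.count("page_view"),  # events prior to t
    }


# Hypothetical user whose reference time t is their registration.
t = datetime(2025, 3, 14, 9, 30)
events = [
    (datetime(2025, 3, 14, 9, 0), "page_view"),
    (datetime(2025, 3, 14, 9, 45), "page_view"),  # after t, so it must not count
    (datetime(2025, 3, 14, 10, 0), "purchase"),   # after t, so it must not count
]

row = features_as_of(t, date(1990, 6, 1), events)
print(row)
```

Because t is passed in per user, the same function can be applied to every row of the training set, which deals with the fact that there is no single t.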
The other option I mentioned is simple, but rarely available as it requires a friendly, patient client. This is to simply start taking snapshots of the data at all future ts until you have enough data for a model. The nice thing about this approach is that you can think in advance about what might be predictive and therefore worth capturing.
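In its simplest form the snapshot approach amounts to writing a frozen copy of each user's features to an append-only store at the moment t occurs. The sketch below is one way to picture it; the in-memory list, the `take_snapshot` helper and the feature names are all stand-ins invented for illustration, not a description of any particular system.

```python
import json
from datetime import datetime


def take_snapshot(store, user_id, t, features):
    """Append a frozen copy of a user's features as they stand at time t."""
    record = {"user_id": user_id, "t": t.isoformat(), "features": dict(features)}
    store.append(json.dumps(record, sort_keys=True))


snapshots = []  # stand-in for an append-only table or log file
live_features = {"page_views": 3, "total_spend": 0.0}

take_snapshot(snapshots, "u123", datetime(2026, 1, 5, 9, 0), live_features)

# Later activity overwrites the live record but not the snapshot.
live_features["total_spend"] = 55.0
frozen = json.loads(snapshots[0])
print(frozen["features"]["total_spend"])
```

Serialising the record at write time is a deliberate choice here: it means nothing that happens to the live data afterwards can leak back into the training row.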
Another difficulty, for both approaches, is in deciding which data is going to be representative of t. Are you going to use all your customer registrations to train the model, or some subset - those within a certain recency window for example - or a sample, or some kind of weighted sample? Again it helps to think of the deployment of your model. I try to visualise it sitting patiently on the data pipeline, and then springing to life with each registration. What does the data look like from its perspective? How can you make the data that you train it on as much as possible like the data it will face? This mental exercise should help answer the question about representative data. For example, if your business is in a rapidly evolving sector where behaviours often change their significance, or if the market has recently undergone a major shock that has radically altered behaviours (e.g. a pandemic) then you may need to shorten the time window on your training data. As another example, if you anticipate a change in the demographic profile of new customers (perhaps new legislation has been passed, perhaps product adoption is increasing among older people) then you should upweight the appropriate groups in your sample.
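Both adjustments can be sketched in a few lines. The one-year window, the age bands and the 50/50 target mix below are made-up numbers, and weighting each row by target share divided by observed share is just one common choice, not the only one.

```python
from collections import Counter
from datetime import datetime, timedelta


def within_recency_window(rows, now, window):
    """Keep only registrations that fall inside the recency window."""
    return [r for r in rows if now - r["registered_at"] <= window]


def group_weights(rows, key, target_shares):
    """Weight each row by target share / observed share of its group."""
    counts = Counter(r[key] for r in rows)
    n = len(rows)
    return [target_shares[r[key]] * n / counts[r[key]] for r in rows]


# Hypothetical registrations; the first is too old to be representative.
rows = [
    {"registered_at": datetime(2023, 1, 10), "age_band": "under_40"},
    {"registered_at": datetime(2025, 6, 1), "age_band": "under_40"},
    {"registered_at": datetime(2025, 8, 15), "age_band": "under_40"},
    {"registered_at": datetime(2025, 10, 2), "age_band": "under_40"},
    {"registered_at": datetime(2025, 11, 20), "age_band": "40_plus"},
]

recent = within_recency_window(rows, now=datetime(2026, 1, 1), window=timedelta(days=365))

# Assumed future mix: older customers expected to make up half of new registrations.
weights = group_weights(recent, "age_band", {"under_40": 0.5, "40_plus": 0.5})
print(len(recent), weights)
```

The weights sum to the number of rows, so they can be passed to most classifiers as sample weights without changing the effective size of the training set.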
And this is just one scenario. Every time a model is dropped into the real world it faces a stream of data that is unique to its particular situation. It requires imagination to think through what it will be up against. Being a good data scientist requires empathy!
Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia
From Coppelia
It’s interesting to think about the future of plugins in the age of coding agents. Instead of spending hours searching for exactly what I need, I find myself just asking Claude Code to build it for me. And then I’m too lazy to share the result. LLMs seem to be particularly good at writing code for backend tasks. Will we all end up with completely customised working environments?
On the topic of Claude Code, thank you to my Melt colleagues for pushing me in this direction. I’m moving away from Cursor, which has been driving me nuts with its overly bullish agents. I need to feel more in control, and there’s nothing better for this than the command line - it’s in the name. I think of Claude Code like a thoughtful, boundary-respecting butler: “I’ve prepared this data set for you sir! Would you like a moment to digest it?”
Prepare yourselves for the imminent AI course correction. Put LLMs to one side. Coppelia is now doing training courses in symbolic and neurosymbolic AI (relevant Python packages proglog, SymPy, pyKEEN). Let me know if you are interested.
Lastly, as far as AI correctives go, I have to point you to the second season of Shell Game, in which the journalist Evan Ratliff (pitch-perfect deadpan) takes the tech bros at their word and sets up a company staffed entirely by LLMs. Thanks to Niazy for the recommendation. Rise and grind Ryan, rise and grind!
If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.
[1] Most statisticians know George Box’s famous “All models are wrong but some are useful”. David Cox agreed: “The very word model implies simplification and idealization. The idea that complex physical, biological or sociological systems can be exactly described by a few formulae is patently absurd.” (See the commentary in Chatfield, Chris. “Model Uncertainty, Data Mining and Statistical Inference.” Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 158, no. 3, 1995, pp. 419–66. JSTOR)



