Glasseye

Wed, 29 Jul 2026 07:46:13 GMT

In this month’s issue:

The white stuff talks to UCL’s Sam Livingstone about using LLMs for mathematics and about the future of computational statistics.
The confusion behind data fusion in this month’s dunghill.
Semi-supervised takes you on a tour of my ultra-minimalist data science environment.

The white stuff

From papers to professors - I’ve decided to extend the brief of our monthly check-in on academia to include interviews with academics whose work is relevant to the daily grind of data science. This month I’m talking to Sam Livingstone, Associate Professor at UCL’s Department of Statistical Science, about the integration of machine learning and statistics and about how to use LLMs without losing control of your thinking.

Simon: So Sam, thank you for talking to me. I read on your blog that you see yourself as working at the intersection of probability, computational statistics, and machine learning. These are areas which have been separated out in academia for various historical reasons, but we, as data scientists, are always working within this intersection. For that reason, your field of study is fascinating to us. It would be great to understand how you got into it.

Sam: Well, my background was more mathematics. I followed an honours degree in mathematics with a master’s in the same subject, so I very much thought of myself as a mathematician. Then I went into industry for a couple of years, as a data scientist in an operations research team. I was doing demand and revenue forecasting for transport schemes, and I wanted to do things right; I wanted to understand what I was doing, but I found the needs of the industry insisted that I didn’t go too deep, that I didn’t dwell too much on the more abstract questions (and it made a lot of sense not to do that because we were on a budget). So I thought that this wasn’t the best place to do this kind of exploration. I went back and did another master’s in statistics, where it does make sense to think more deeply and try to get to the right answer, even if it takes quite a lot longer.

Since then I’ve lived in a world where you’ve got statisticians - and increasingly now people from machine learning and all sorts of other applied fields, epidemiology, econometrics - designing models and trying to calibrate them to data, to learn something. And they’re using increasingly varied and complex algorithms to fit those models. I study those algorithms. How they work. How they perform. Do some algorithms perform better than others? Can we design new algorithms that will work better?

Simon: Leo Breiman wrote a famous paper in the early 2000s on the two cultures of data modelling and machine learning, how they were divided, and how they could learn from each other. Do you think that is still the case?

Sam: So there was a time, a few decades ago, when there was this nice blending between the cultures, because people had all the right sets of skills to help each other out. And it was exciting, and they were converging on the idea that probabilistic modelling was the right direction. That was when I started doing my research and it really felt like there was going to be a complete synergy.

But then deep learning came along and it was very much not probabilistic at the outset, very computer science-y in the way it was explained. Geoff Hinton was a probabilistic thinker, but then the engineers took hold of it and focused mainly on the architecture and similar things. Having got great empirical results, they put the probabilistic questions aside for a while. This fragmented the fields yet again.

And I would say there’s once again more of a blending, in which statisticians study deep learning algorithms and try to appreciate the statistical properties that they do have. But it took a good five to ten years to decipher the literature because it was written in such a different language: flow diagrams and computer architectures, very few equations. We just were speaking a different language. Statisticians are a bit purist when it comes to insisting that things are written up properly. And that’s on us. But similarly, the machine learning people from the more engineering side can be a bit relaxed and sloppy.

Simon: I came across your work in your blog post, Will AI ruin mathematics?, about the role of large language models and agents in mathematics. It caught my interest because you discuss many of the issues that data scientists are currently wrestling with - Is my role redundant? Should I use LLMs at all and if so how? Is there a limit on what is achievable using LLMs? What is hype and what is real? - and it is fascinating to see these same concerns occupying someone who is much more focused on the mathematics. Could you take us through your thoughts on each of these issues?

Sam: Speaking as a university academic, I don’t think my role is going to become redundant. There are one or two people in my department who panic now and again. But this is not the first time in higher education that people have felt, “Oh no, I’m not the keeper of all this knowledge any more. What am I going to do? Maybe people don’t want me?” But if that were the case then universities as a place to teach should have been redundant from the moment the printing press was invented.

LLMs are just the latest in a long line of innovations that threaten this “secret knowledge” model. But I don’t think that model was correct in the first place. A big part of why people commit to coming to a university is for someone to be there and take them through it, and hold them accountable, and inspire them and set the right assessments, to challenge them. You have to go through a lot of pain to learn things, to really get them into your head, to really grow and evolve. We ask people to do these things because we think it will make them have a happier and healthier life, be able to make more intelligent life decisions, think more clearly, in all aspects of their life. It doesn’t really matter what they’re learning to some extent.

Simon: This seems like a good point at which to bring up one of the biggest concerns about the use of LLMs in both mathematics and data science. Will using them degrade our problem-solving abilities in any way?

Sam: I’m not someone who believes you shouldn’t use LLMs at all. I think you should use them, environmental concerns aside (I think that’s a separate thing). But I think there’s a fine balance between using them to enhance your understanding and using them to overextend to the point that you actually limit your own understanding in the long term. And I think if you overuse them in that way, it’s a bit of a false economy. Eventually you don’t know what you’re doing anymore.

Simon: So where for you would that line be, when striking a balance?

Sam: Well that’s the magic question, right? But put it like this: everyone in academia agrees that it’s still important to teach students the full linear algebra course, where we make them do matrix operations by hand, let’s say twenty times, so that they really get the vibe of how long the different algorithms take, what they’re all doing, what the inner workings are. Everyone also then agrees that once you’ve got that basic understanding you can leave these calculations behind. If we were to scrap that part of the course and just say: inverting a matrix is a thing, and you do it by typing “solve”, then I think we’d lose a lot; people would stop developing new and better algorithms for doing similar tasks. And I think this same idea can be applied more generally.

Simon: So the basic principle is, at the very least, delegate tasks to LLMs that you understand and could have done yourself, but which are tedious, time-consuming and might have got you lost in the weeds.

Sam: Exactly. If you’ve got a good high-level picture and it’s just some tedious technical things that you need to do that are sort of orthogonal. Usually I’ve tailored these things to the point where I can ask a very concrete question. In fact, a big part of the role of a mathematician is not at the problem-solving end. It’s about problem-defining - starting at a high level and packaging things up into a series of concrete questions that we can show to be true or false.

Simon: Returning to Breiman’s two cultures, do you think perhaps this focus on higher-level problem solving, brought about by coding agents, might do something to bridge the gap still further?

Sam: That’s an interesting question. I think it’s been the generations that have changed thinking. In statistics, in the generation before me, you were either a frequentist or a Bayesian, and if you were in one camp you’d hate the other. My generation is much more pragmatic. We learned about both and saw that both had been used successfully in different domains. And the generation after me are more comfortable blending machine learning approaches into their statistics, while I still think of these things as a bit separate.

Whether AI tools can help that integration is another thing. It’s definitely true that I can do a lot more coding now than I could before. Maybe in the same way, people on the more computational side might feel that the mathematical side is more accessible to them. So there may be a kind of blending of ideas from that. I haven’t seen that much evidence of it so far, but I think it’s early days for things like that.

Simon: It is true that there is a lot of uncertainty, and I think that is what makes people anxious. What do you think your field is going to look like in five to ten years’ time?

Sam: It’s really hard to say, and I think anyone who tries to make these kinds of predictions is going to look very silly. That includes the heads of all the AI firms who are confident in predicting that all white-collar jobs will be automated. I think that it is foolish to make a prediction like that.

Simon: Thank you for your time, Sam. It’s been an incredibly interesting conversation!

Sam: A pleasure, thank you.

Semi-supervised

Following March’s semi-supervised on my post-LLM tool choices, I’ve been asked to say a little more about my set-up for data science work. Honestly I could bore you into a stupor on this one…

Home is a Ghostty terminal emulator, usually divided into three panes: top left, bash prompt; bottom left, Claude Code; right, Neovim. I’m not a command-line zealot, so of course I visit the browser, and occasionally sin with Finder, but the terminal is where I want to be - a peaceful, distraction-free environment that is completely under my control.

There will be peace in the valley… my Ghostty terminal with panes for bash, Claude Code and Neovim.

I’ve been able to retreat more often to my terminal paradise because I have ditched GUI IDEs and Jupyter notebooks and replaced them with Neovim. This is a relatively extreme position for a data scientist, although common for developers. Why would I do such a thing? Well for one, I was sick of my IDE (PyCharm then VS Code then Cursor) changing configuration behind the scenes and being overly needy about acknowledging its latest feature. There is also the visual claustrophobia of too many graphical objects demanding attention. By contrast, Vim is a modal editor. It’s mouseless, which limits the possibilities for screen objects, so it’s visually sparse. There are no tabs (unless you absolutely want them) and you navigate with your brain and keystrokes.

Another reason that this environment has come into its own is that configuration, painful by hand, can now be effortlessly and neatly dealt with using Claude Code. The Neovim environment is customisable to a fault, but with Claude Code you can just ask for what you want in prose and you have it. Mine is set up for writing (extensions: Pencil, Limelight, Vale) when I open markdown and coding when I open everything else. In fact I have a Claude Code skill that looks after all my configurations across Ghostty, Neovim and zshrc - cleaning up, syncing, and logging my key mappings in a reference database. It can even sync a visual theme across Neovim, Ghostty and Obsidian.

My replacement for notebooks is a combination of Quarto and Claude Code. Quarto has a number of advantages over Jupyter notebooks. It is structurally simpler (just markdown and fenced code blocks rather than JSON), which means it integrates better with other tools, and it is geared towards publishing pretty things (pdfs, html, slides) which is usually my deliverable to the client. Quarto is editable within Neovim, and the Molten Neovim plugin makes the code blocks runnable, giving me all the functionality of a Jupyter notebook. Furthermore I can use Claude Code to direct the layout of my Quarto output without getting too caught up in the formatting details.
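To make the structural point concrete, here is a minimal sketch using only Python's standard library. It writes the same tiny analysis twice: once as a Jupyter notebook (a JSON document, following the nbformat 4 layout) and once as a Quarto document (YAML front matter, markdown and a fenced code block). The cell contents and the report title are made up for illustration.

```python
import json

# A single code cell shared by both formats (hypothetical content).
code = "import statistics\nprint(statistics.mean([1, 2, 3]))"

# Jupyter: the notebook is a JSON document (nbformat 4 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["A short note about the analysis."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": code.splitlines(keepends=True),
        },
    ],
}
ipynb_text = json.dumps(notebook, indent=1)

# Quarto: YAML front matter, markdown, and a fenced code block.
fence = "`" * 3  # built programmatically so this example can itself sit inside a fence
qmd_text = "\n".join([
    "---",
    'title: "Example report"',
    "format: html",
    "---",
    "",
    "A short note about the analysis.",
    "",
    fence + "{python}",
    code,
    fence,
    "",
])

print(len(ipynb_text.splitlines()), "lines of JSON")
print(len(qmd_text.splitlines()), "lines of plain text")
```

The Quarto version is the one you can comfortably read, diff and edit in a plain text editor, which is the property the paragraph above relies on.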

On Claude Code itself - yes, it lives in its bottom left pane, and yes, this means no LLM-based prediction in the coding editor. But I am happy with this. It clearly separates our roles, and makes me think about what I should do versus what it should do. I give it a job, usually on a different worktree, and then I review it. Or I ask it how to do something, and then I do it. Its separate pane is also convenient when I use it to handle administrative and versioning tasks.

Finally, I have a growing library of Claude Code skills, mostly around repository set-up for Python, Rust and TypeScript[1] projects (during which they set up a CLAUDE.md file from a template, which in turn does a lot of standardisation work) and general administration, including a nice one for copy-editing (which makes criticisms of my prose using grammar references and various uptight conventions from The Economist style guide. But never about actual prose style - I don’t want to sound like Claude Code.)

And that’s it. For now.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia

The dunghill

I missed an obvious entry in April’s dictionary of bullshit for statistics, AI and data science. For as long as there have been frustrated clients and accommodating agencies, there has been something called “data fusion”. To be fair there are some things that might legitimately be given this name. There’s plain old data integration; there’s a problem in robotics about how to integrate information from different sensors; there’s the Google Cloud Data Fusion product. There’s also the problem of finding the best match for an individual within a dataset. And the fascinating problem of how you might construct a pseudo-control group for observational data. But none of these correspond to “data fusion” as I have heard it used ad nauseam. In fact, nothing can correspond to it - it’s meaningless.

Here’s what usually happens. An agency pitches some product or service to a client that involves using its business data. The client likes the idea but notes that it can only work if the agency can connect up two or more datasets that the client has been struggling to join together, since there is no unique identifier common to both. “Hmm,” says the agency, “We’ll see what we can do.” Fortunately someone on the agency team recalls a previous project in which that same problem was solved using something called “data fusion”. Of course! That sounds perfect! “Fusion” means joining things but it’s also quite clearly science.

With the help of data fusion, the agency wins the job. The client data is then passed to the internal data team, who are told to fuse it. Which means what? You can of course find some variables that are common to both datasets, say age band, gender and location, and then aggregate both datasets up to this level, and then join. But that gives you an aggregate-level dataset, not the “fused” individual-level dataset that was promised to the client. The fact is that you can’t magic your way down to a lower level of integration. No amount of pseudo-scientific probabilistic matching and modelling is going to make this happen.
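The aggregate-then-join route can be sketched in a few lines of Python. Everything here is hypothetical: the two toy datasets, the field names (`crm_id`, `resp_id`, `spend`, `satisfied`) and the choice of a simple mean as the aggregate.

```python
from collections import Counter

# Hypothetical client datasets with no shared unique identifier.
crm = [
    {"crm_id": "c1", "age_band": "25-34", "gender": "F", "location": "Leeds", "spend": 120.0},
    {"crm_id": "c2", "age_band": "25-34", "gender": "F", "location": "Leeds", "spend": 80.0},
    {"crm_id": "c3", "age_band": "35-44", "gender": "M", "location": "York", "spend": 200.0},
]
survey = [
    {"resp_id": "s1", "age_band": "25-34", "gender": "F", "location": "Leeds", "satisfied": 1},
    {"resp_id": "s2", "age_band": "25-34", "gender": "F", "location": "Leeds", "satisfied": 0},
    {"resp_id": "s3", "age_band": "35-44", "gender": "M", "location": "York", "satisfied": 1},
]

# The only variables the two datasets have in common.
KEYS = ("age_band", "gender", "location")


def aggregate(rows, value):
    """Mean of `value` per combination of the common variables."""
    totals, counts = Counter(), Counter()
    for row in rows:
        key = tuple(row[k] for k in KEYS)
        totals[key] += row[value]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in counts}


mean_spend = aggregate(crm, "spend")
satisfaction_rate = aggregate(survey, "satisfied")

# Inner join on the shared keys: one row per group, not per person.
joined = {
    key: {"mean_spend": mean_spend[key], "satisfaction_rate": satisfaction_rate[key]}
    for key in mean_spend.keys() & satisfaction_rate.keys()
}

for key, values in sorted(joined.items()):
    print(key, values)
```

Three customers and three respondents go in; two group-level rows come out. Nothing in the result says whether customer c1 is respondent s1 or s2, which is exactly the individual-level link the client was promised.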

Unfortunately the agency data science team are nearly all recent graduates, and, knowing no better, quite reasonably assume that a term that has been knocking about for decades means something - if only they can figure out what! In the end they come up with something awful, which they feel bad about and heavily caveat - but what else could they do?

At this point the data fusion is complete. The agency logs the success in its collective memory, to be drawn on in some future pitch, and the cycle continues.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

This month for the first time I felt the need to point out to a client that the report I was handing over was not written by AI. I hope it was obvious. I hope I don’t sound like an LLM, even in the driest of reports. But who knows? This is new territory and new norms will need to be established.

We are very happy that Mark Bulling, the earliest of early adopters, frequently name-checked in Glasseye, has joined the Melt Collective. I can no longer pass his discoveries off as my own.

A reminder that if you are interested in the areas covered by Glasseye and you would like some face-to-face support, then Coppelia offers tailored mentoring packages as well as training workshops on a wide range of topics. Just let me know at simon@coppelia.io.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.

[1] My Rust is terrible. My TypeScript worse. But these are usually for components that are tightly defined and then vibe-coded, or, in the case of Rust, translated from Python for speed.

Glasseye

Mon, 29 Jun 2026 07:25:00 GMT

In this month’s issue:

The dunghill scoops the goop in “marketing science” by comparing it to the wellness industry.
The Royal Society’s Transactions puts some rigour into data science. The white stuff is impressed.
Semi-supervised offers some advice on how to write well as a data scientist.

The dunghill

I doubt I’m the first to point out the similarities between the consultancy world of “marketing science” and the multi-billion-dollar wellness industry, but still I do think they are worth listing out - for this reason: anyone with GCSE science and a modicum of worldliness knows what the wellness industry is up to, and no one is surprised that it continues to make a fortune mining the gullible. But when it comes to the marketing science industry, its very existence, along with the cash it generates, is taken as proof that something serious is going on. It’s a veritable gaslighting extravaganza and can cause you to doubt your own good sense. Hence the value of a point-by-point comparison with the wellness industry. So here are my observations.

First thing to note: A similar, and I suspect overlapping, clientele. Rude - patronising even - but true nonetheless. In both cases the buyer tends to have little patience for hard science. Within the wellness industry this manifests itself as a hostility towards conventional medicine; within marketing, as a barely concealed indifference. A box needs to be ticked on the way to doing what was going to be done anyway. That is all.

Which leads us to the second similarity: both industries thrive by telling their customers exactly what they want to hear, whether that’s that their advertising campaign has delivered a sizeable return on investment or, far more tragically, that a fruit juice diet is more effective than a course of chemotherapy. This means that in both cases, the honest broker (usually the bearer of bad news) is hugely disadvantaged. Even more so when you consider that it’s so much easier to tell a compelling story when unconstrained by the truth. Simple, intuitively graspable narratives cut through: specially formulated teas wash away bad toxins; TikTok influencers rejuvenate boomer brands with Gen Z freshness.

Ironically, given the underlying contempt for actual science, these narratives are all the more appealing when decorated with pseudo-scientific paraphernalia. Fake journals and dodgy data are standard fare in the wellness industry. Unreviewed “white papers” are used as bait on the websites of agencies and consultancies. And now the introduction of made-to-order synthetic data sets, rich enough to accommodate any story or outcome.

Essential to the con, in marketing science as much as wellness, is that the practitioners believe what they are saying. Usually they pride themselves on bringing a little magic to the process. It’s a bit of art, a bit of intuition, some mystical know-how that lifts up the science and somehow delivers the goods.

Given the rampant over-promising you would be forgiven for thinking that the truth will eventually out, but somehow both industries manage to get away with it. How? Partly due to a well-known feature of pseudoscience, namely, that it is carefully framed to be unfalsifiable. Partly due to confirmation bias and group delusion, but also due to a marketplace full of endless novelty and reinvention. If a particular cure or solution fails there’s always another to try.

Have we reached the bottom? Can our cynicism go any deeper? It can, with the observation that for both industries there will always be a market for hokum and it will always be vast. I am regularly asked by research agencies whether they should start offering research based on synthetic respondents. This is like a pharmacy asking whether they should stock homeopathic remedies. No, they absolutely should not. Will they be missing out on a significant, highly lucrative market? Yes, they absolutely will.

To rescue the mood, one final, slightly more upbeat parallel. It’s not as if the entire wellness industry is a complete sham. There are plenty of therapists, physical trainers, registered dietitians, etc., doing their utmost to steer clear of the bullshit and ply an honest trade. Similarly, although massively outnumbered by the charlatans, there are some good data scientists working in marketing, advertising, and market research. I know this because I know some of them.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

The white stuff

Now from the worst practice to the best, with last month’s special edition of the Philosophical Transactions of the Royal Society on the topic of statistical and scientific workflow. That might sound dry, but truly it was all my Christmases come at once. (Thanks once again to Adria Luz for bringing it to my attention.) “We believe”, say the authors in the opening article,

that there are shared aspects of quantitative research that are obscured by the varieties of models, methods and even philosophical frameworks that are successfully employed in statistics and machine learning, and also because it seems that many of the most important aspects of statistical practice, in whatever form, do not make their way into the textbooks.

I’ve been waiting my whole career to see that final clause in print. Yes, much of what they describe is not new - on the industry side we have been bootstrapping and cross-validating our way out of the hole between these two disciplines for well over a decade. But what is new, and what I get a massive kick out of, is seeing our worst bugbears - researcher degrees of freedom, the Rashomon effect, Chatfield’s model selection problem, and all the “questionable research practices” that dog our workday lives - being treated seriously in an academic journal as respected as the Royal Society’s Transactions.

In a nutshell, their proposed workflow uses the best bits of method to come out of machine learning and modern Bayesian methods to wrap the process of data exploration and model selection up into something much more scientifically rigorous. The hundreds of minor data manipulation and model selection decisions, which were never properly acknowledged by traditional statistics as sources of uncertainty and bias, are incorporated into their prediction perturbation intervals. And the absurdity of selecting one out of many equally well fitting models (the Rashomon effect) is addressed as part of predictive model averaging. (Again not itself new but benefiting hugely from being presented as part of a general framework.)
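As a loose illustration of the idea, and not the authors' actual procedure, the sketch below perturbs a toy analysis in two ways: it resamples the data (a bootstrap) and it reruns three defensible analyst choices for "the typical value" on each resample. The pooled spread of results plays the role of a perturbation interval, and the equal-weight average stands in for model averaging. The data, the three estimators and the 5%/95% cut-offs are all assumptions made for the example.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical skewed data, e.g. order values with a few large ones.
data = [rng.lognormvariate(3.0, 0.8) for _ in range(200)]


def trimmed_mean(xs, prop=0.1):
    """Mean after dropping the lowest and highest `prop` share of values."""
    xs = sorted(xs)
    k = int(len(xs) * prop)
    return statistics.fmean(xs[k:len(xs) - k])


# Three defensible analyst choices for "the typical value".
choices = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "trimmed_mean": trimmed_mean,
}


def percentile(xs, q):
    """Simple order-statistic percentile, q in [0, 1]."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(q * len(xs)))]


predictions = {name: [] for name in choices}
for _ in range(500):
    # Data perturbation: a bootstrap resample of the observations.
    sample = rng.choices(data, k=len(data))
    # Judgment-call perturbation: rerun every analyst choice on that sample.
    for name, estimator in choices.items():
        predictions[name].append(estimator(sample))

pooled = [p for values in predictions.values() for p in values]
interval = (percentile(pooled, 0.05), percentile(pooled, 0.95))
averaged = statistics.fmean(pooled)  # equal-weight average across choices

for name, values in predictions.items():
    print(name, round(percentile(values, 0.05), 1), round(percentile(values, 0.95), 1))
print("pooled interval", tuple(round(v, 1) for v in interval))
print("equal-weight average", round(averaged, 1))
```

The per-choice rows show how far apart reasonable analyses can land on the same data. A bootstrap around any single choice would leave that source of uncertainty out; the pooled interval at least puts it on the page.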

Some other highlights:

A nice passage on the sequential assembly of models, pointing out that “complicated models can often be best understood in relation to simpler special cases” and that a “benefit of building up models one step at a time is that sometimes we can reach the pleasant state of having a model that fits well and does the job”, and further that “at this point it can be helpful to try adding a bit more complexity, just to show that this additional step is not necessary. Again, even setting aside questions of model choice, this extra model can help our understanding.” This is the kind of basic know-how that rarely makes it into the journals.
A non-partisan, pragmatic, and grown-up approach to the Bayesian-frequentist debate that accepts that non-Bayesian statistical analysis is not disappearing any time soon. Instead the emphasis is on drawing out the equivalences and ensuring that much of the good stuff from Bayesian methods (the heavy reliance on simulation in particular) is mirrored in classical methods.
The idea of “thinking of any statistical procedures as not just producing a one-time estimate but also as a mapping from data to inferences.” That strikes me as a good way to de-throne any statistical procedures that are getting too big for their boots.

The complete process is explored more fully in a further article, Predictability–Computability-Stability workflow for veridical data science in the age of artificial intelligence (and there are others in this special edition which I have yet to read).

Admittedly the full workflow is a lot of extra work, but the pay-off is surely in the feeling of having done proper science, knowing that what, if anything, you have uncovered is real knowledge, rather than the bilge described in this month’s dunghill.

Semi-supervised

Let’s put to one side the possibility that you might now be using an agent to write your emails, reports, presentations, etc. (a risky move, in my opinion, as we haven’t yet decided the etiquette on this) and consider the problem of how to write well as a data scientist. By this I mean everything from a one-line message on Slack to a formal write-up of your work for publication. And since this is a skill that, far more than any mathematical ability, can hold you back or propel you forward in your career, it is worth more than a few moments of your time.

There are of course countless books on how to write good technical prose, the best of which I covered in the white stuff of Glasseye #17. So rather than regurgitate something that you can find covered comprehensively elsewhere, I’m going to give you some very specific tips, tailored to your role. In any case, this advice quite often runs counter to the more general advice. Unless you are pitching an idea, your goal is not to grab and hold someone’s attention; it is clear, information-rich communication.

Beware the pronoun: It’s so useful in everyday, context-rich conversation that it might seem like it can do no wrong, but remove the context or mishandle the scope and the reader is lost. “It’s fixed.” What’s fixed? “They fixed it.” Who fixed what, goddammit? This problem with pronouns extends to all ambiguous expressions that are narrowed down only in the head of the writer. “I found it in the code we looked at earlier.” But we looked at pages and pages of code earlier, and what counts as earlier? Earlier today? This week? This year?
Examples, examples, examples: They are the single most effective tool you have for clarifying your meaning. A thousand ambiguities are resolved in one stroke. So why are we bad at giving them? Several reasons. First, it’s hard work to turn your fluffy, abstract notions into something concrete. But hard for good reason: when formulating an example you are doing the real thinking, putting flesh onto the bone. Second, abstractions somehow feel more grown-up; examples, in contrast, a little childish - as though by giving an example we are admitting that we aren’t up to using the big words. Third, we underestimate the reader’s ability to generalise, worrying that giving an example might create the impression that this was all we meant. Give them some credit.
Where possible avoid abstractions, what Steven Pinker and others have called “zombie nouns”. Technology excels at producing abstract nouns, designed to be vague to the point of meaninglessness. Think of “platforms”, “frameworks”, “hubs”, “engagement”, “touchpoints”.
Don’t spare us the details: Include all the context you can think of and then some more. I would much rather you told me everything (without repeating yourself) and left it for me to skim the parts I know already. Writing is not talking: if you are boring me I can skip ahead. Which leads nicely to…
Verbosity is not a sin, unintelligibility is: The poor reader can waste more time attempting to decode a single badly phrased sentence than it would ever have taken to skim through a long but lucid paragraph. The notion that everything must be reduced to a headline or a bullet point comes from the newsroom and the advertising agency. But they are doing something entirely different with words.
If you can code well, you can write well: Think about it - so much is the same: you need to declare your terms upfront; break down complex ideas into easily understandable steps; organise a topic into self-contained but interlinked themes. And you wouldn’t dream of putting your first draft of code into production, so why press send without reviewing your prose?
Reread it from the reader’s perspective: Search for ambiguities - could they think I mean something else? - be paranoid; use the bastard check; anticipate objections, questions, the causing of offence.
Make it work for you: Rhetoric is not just for politicians. Language is a weapon. Use it to win arguments without even having them, to protect yourself, to get what you want.

Lastly, there’s no reason why you should tolerate bad writing from others, even your boss. If it’s meaningless blab, send it back, asking for clarification. Politely, of course.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io

From Coppelia

Coppelia is now offering workshops (half or full-day) on the proper (i.e. efficient, effective, and robust) use of coding agents in data science. It feels like we’ve got to that stage now. Let me know if you are interested.
I can report that Claude Code combined with Quarto and Python makes a formidable team. My workflow is a) do the analysis/modelling in Python, using the agent for the grunt work, b) write the report in markdown in the Quarto file, c) pull the whole thing together by using Claude Code to populate the code sections in the Quarto doc based on the previous Python work, instructing it to add various prettifying formatting tweaks.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.

Glasseye

Tue, 26 May 2026 07:08:55 GMT

In this month’s issue:

Writing is thinking, reading is thinking, coding is thinking. Ada knew it. You should too. The white stuff sets things straight.
Semi-supervised explains how to solve the scrappiest of problems by maximising your ignorance.
The dunghill sniffs out the bad practice behind “modelled” variables.

The dunghill

A couple of weeks ago the Guardian ran an article covering the tragic consequences of the Kenyan government's use of machine learning for means testing. The initiative was supposed to determine how much each Kenyan family should pay for healthcare. In the terminology of the World Bank, the model would enable proxy means testing; in other words, it would provide a prediction of household income based on other variables, most significantly household possessions. This prediction would then be used to work out how much the family is able to contribute to a government-subsidised scheme. Putting to one side the ethics of doing this at all (Is it ever acceptable to use a proxy when discrepancies between the proxy and the true value have such dire consequences?), there were, as investigators at Africa Uncensored revealed, serious flaws in the modelling methodology that went beyond just top-level predictive accuracy. As I have pointed out in previous posts, subjective decisions nearly always need to be made during the training and tuning of machine learning models, and these decisions reflect particular interests. In the case of the Kenyan government’s programme, the decision was to tune the algorithm to classify the rich more accurately than the poor, on the grounds that a rich person classified as poor is far less likely to report the error than a poor person classified as rich. While this was better for government finances, it was disastrous for the poorest families.
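How a tuning decision can shift errors from one group to another is easy to show with a toy classifier. The sketch below is not the Kenyan model: the scores, the class sizes and the 10% tolerance are invented, and the only tuning knob is a classification threshold. One threshold is chosen to minimise total misclassifications; the other is chosen so that at most 10% of rich households are classified as poor.

```python
import random

rng = random.Random(1)

# Hypothetical households: a model score for "rich", higher means richer.
# Poor households centre on 0.4, rich ones on 0.6, with plenty of overlap.
N_POOR, N_RICH = 8000, 2000
households = (
    [("poor", rng.gauss(0.4, 0.15)) for _ in range(N_POOR)]
    + [("rich", rng.gauss(0.6, 0.15)) for _ in range(N_RICH)]
)
poor_scores = [s for label, s in households if label == "poor"]
rich_scores = [s for label, s in households if label == "rich"]


def error_rates(threshold):
    """Share of each true class that lands in the wrong class."""
    poor_called_rich = sum(s >= threshold for s in poor_scores) / N_POOR
    rich_called_poor = sum(s < threshold for s in rich_scores) / N_RICH
    return poor_called_rich, rich_called_poor


def total_errors(threshold):
    """Number of misclassified households, whichever class they belong to."""
    poor_err, rich_err = error_rates(threshold)
    return poor_err * N_POOR + rich_err * N_RICH


thresholds = [t / 100 for t in range(20, 81, 5)]

# Choice A: minimise the total number of misclassified households.
balanced = min(thresholds, key=total_errors)

# Choice B: the highest threshold that misclassifies at most 10% of the rich.
rich_first = max(t for t in thresholds if error_rates(t)[1] <= 0.10)

for name, t in [("fewest total errors", balanced), ("keep the rich accurate", rich_first)]:
    poor_err, rich_err = error_rates(t)
    print(f"{name}: threshold={t:.2f}, "
          f"poor called rich={poor_err:.1%}, rich called poor={rich_err:.1%}")
```

Both thresholds count as a "tuned" model. The difference lies in who carries the errors: protecting accuracy for the rich pushes the threshold down, and a much larger share of poor households ends up classified as rich.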

Now bear with me, as this might seem unrelated. After I posted last month’s Dictionary of bullshit for statistics, AI and data science, I asked subscribers what I had left out. Scott Thompson then kindly offered “modelled as an adjective for papering over all kinds of shoddy guesstimation.” Perhaps you are lucky enough never to have heard of modelled variables, but they are a mainstay of the data brokerage industry. A modelled variable is a proxy for a real variable, created in exactly the same way as the Kenyan government’s proxy income value. In fact, had the Kenyan government been looking for a term that made their proxy variable appear more respectable, then they might have gone for “modelled” income.

The consequences of modelled variables are of course usually trivial compared to what has happened in Kenya, but nevertheless the shoddiness is well worth an airing. Let me say first that of course there is nothing wrong with creating a machine learning model that predicts income or anything else. It’s what you do with it next that matters.

For one thing, the producers of modelled variables often forget to mention the fact that the values are not real; and the consumers often forget to ask (granted they might not even suspect it). This, in fact, was my first exposure to “modelled” as a bullshit term - used to bat back some concerns about the reliability of purchased data. No, the variables were not “actual”, but I needn’t worry, I was assured, since they were “modelled”. Knowing no better at this early stage in my career, I failed to come back with the correct response - something along the lines of: “Well in that case I really need to know how accurate they are because this will affect every conclusion I draw using this data.”

The most common fate for a modelled variable is to end up as an input for yet another predictive model. Modelled income, for example, might end up being used to predict customer lifetime value. But does this really work? There are three scenarios here:

The second model was trained using real income values as features; modelled income is used only as an input once the model is deployed. But now of course we don’t know how accurate the second model is, because any training-related performance measures assume that the income value is “actual”. The model could be wildly inaccurate.
The second model was trained using modelled income values; this will mean that the performance measures account for the uncertainty in such values, but the question then arises: why not just take the features that were used to predict income and use them directly in the second model? This will almost certainly yield a better outcome for the second model as it will allow interactions between those features to play a role in prediction.
The second model was trained on a mixture of real and modelled income values, the latter being used when the former are unavailable. Now we have a god-awful mess. The model performance measures are valid only for the particular balance of real and modelled income values present in the training set. Unless we can guarantee this balance in the wild, who knows how the model is going to perform.
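
To make the first two scenarios concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the numbers, the linear relationship between income and customer value, and above all the idea that modelled income is simply real income plus independent noise (a crude stand-in for whatever a data broker actually does).

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(model, xs, ys):
    """Root mean squared error of a fitted line on the given data."""
    a, b = model
    return (sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

# Hypothetical data: income in thousands, customer value driven by real income.
n = 20_000
real = [random.gauss(30, 10) for _ in range(n)]
modelled = [x + random.gauss(0, 8) for x in real]  # assumed proxy error
value = [5 + 2 * x + random.gauss(0, 5) for x in real]

half = n // 2
train, test = slice(0, half), slice(half, n)

# Scenario 1: train and validate on real income, deploy on modelled income.
model_real = fit_line(real[train], value[train])
reported_rmse = rmse(model_real, real[test], value[test])
deployed_rmse = rmse(model_real, modelled[test], value[test])

# Scenario 2: train and validate on modelled income throughout.
model_mod = fit_line(modelled[train], value[train])
honest_rmse = rmse(model_mod, modelled[test], value[test])

print(f"scenario 1 reported RMSE: {reported_rmse:.1f}")
print(f"scenario 1 deployed RMSE: {deployed_rmse:.1f}")
print(f"scenario 2 RMSE:          {honest_rmse:.1f}")
```

Under these assumptions the error reported in scenario 1 sits well below the error actually incurred once modelled income is fed in, while scenario 2 at least reports an error that reflects what happens in deployment.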

Since not a lot of attention is usually paid to features used for machine learning, what we end up with is a two-step process of obfuscation. First, nothing is said and nothing is asked during the handover of modelled variables. Second, these variables are tucked away in machine-learning pipelines, where they are safe from further interrogation, in a way they would absolutely not be if they cropped up in, say, a shareholder report. No one is any the wiser.

Thanks again to Volodymyr Fomichov for pointing me to the Guardian article and Scott Thompson for suggesting “modelled” as a bullshit term of art.

The white stuff

“Writing is thinking.” So says a short editorial in Nature Reviews Bioengineering, pleading the importance of “human-generated scientific writing” in the age of generative AI.

Writing scientific articles is an integral part of the scientific method and common practice to communicate research findings. However, writing is not only about reporting results; it also provides a tool to uncover new thoughts and ideas. Writing compels us to think — not in the chaotic, non-linear way our minds typically wander, but in a structured, intentional manner.

“Reading is thinking too”, says the author of a letter to the Annals of Biomedical Engineering, citing the Nature article. Of course it is.

I’m reliably informed that agencies are losing business because their pitch teams are presenting LLM-generated slides they neither authored nor properly read. Not having had the thoughts themselves, they are tongue-tied as soon as the prospective client asks a question.

As I said in Glasseye no. 23, I suspect we are now seeing proposals being generated from LLM-transcribed meeting notes, which then become contracts by a further act of generation, to be signed by parties who have no idea what they contain.

Coding is thinking too. A form of thinking that should be prized for the way it clearly separates out concepts that are muddled together in everyday thought. Ada Lovelace knew this (right at the very beginning) as you will learn from this month’s recommendation: Lovelace’s notes to her translation of Menabrea’s memoir, On the Mathematical Principles of the Analytical Engine. Here she advises that we introduce into our thinking the distinction between operations, objects operated upon, and the results of the operations performed upon those objects, a distinction she learnt from programming Babbage’s Analytical Engine:

It were much to be desired, that when mathematical processes pass through the human brain instead of through the medium of inanimate mechanism, it were equally a necessity of things that the reasonings connected with operations should hold the same just place as a clear and well-defined branch of the subject of analysis, a fundamental but yet independent ingredient in the science, which they must do in studying the engine. The confusion, the difficulties, the contradictions which, in consequence of a want of accurate distinctions in this particular, have up to even a recent period encumbered mathematics in all those branches involving the consideration of negative and impossible quantities, will at once occur to the reader who is at all versed in this science.

I enjoy watching Claude Code as much as the next person, but take care not to hand over the important stuff!

Semi-supervised

I have a useful tool for you. It was briefly mentioned in the white stuff of Glasseye no. 11, but it has come in handy so often recently that I thought it deserved more of a spotlight. This is the principle of maximum entropy as introduced by E.T. Jaynes in his 1957 paper, Information Theory and Statistical Mechanics. Compared to the concepts of statistical inference, the idea is relatively simple. To quote me:

The principle uses Shannon’s concept of information entropy - at the time only recently formulated - to update Laplace’s principle of insufficient reason (in the absence of any relevant information, we should assign equal probabilities to all possible outcomes). Jaynes redefines this in terms of entropy: in the absence of information, it is logical to assume the probability distribution that is maximally non-committal - in other words, the one that contains the least information, ergo the one that has the highest entropy.

Most usefully, Jaynes shows how we can constrain the optimisation process with whatever information is known about the distribution to obtain a final distribution that reflects the known facts while remaining maximally non-committal.

In order to sell it to you let me first explain why the principle of maximum entropy is useful. After that I will describe how it is done. There was a fashion a while back among consultancies for intimidating potential employees by posing irritating interview questions along the lines of: “How many trees are there in Hyde Park?” or “How many people can you fit in Wembley Stadium?” The point was, no matter how little you know, you should still be able to say something. The use case for the principle of maximum entropy is in a similar vein. It is, if you like, the opposite of a big data problem. It is the almost-no-data problem - certainly nothing at the level of the individual observations. Since this is not an unusual situation, especially when it comes to decision making, it is extremely useful to have a method that arrives at not just any guess, but the most rational one.

Specifically, the principle of maximum entropy comes into play when we are after a best guess at a probability distribution and all we have at our disposal is a few high level facts about that distribution, facts that can be turned into constraints. To give an example we might want a best guess at the distribution of UK citizens over gender, income band, and party voted for at the last general election. The scraps of information we have to hand are: a top level percentage split of individuals by party voted for, a percentage split of men by income band, and a percentage split of women in the highest two income bands by party voted for. The answer, I hope you agree, is not completely obvious.

So how do we do this? Fortunately Jaynes provides a general solution for discrete distributions. If a probability distribution with probability mass function p is subject to m expectation-based constraints:

then the distribution which maximises entropy is given as:

where the λ_k are Lagrange multipliers. The normalisation constant is given as:

To arrive at the solution we need to calculate the value of the λ_k. This can be done by solving the system of equations:
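
For reference, the standard discrete result can be written out as follows. This is a sketch in one common sign convention (Jaynes's own paper puts a minus sign in front of the multipliers), and the symbols x_i, F_k and Z are notation assumed here for the outcomes, the known expectation values and the normalisation constant.

```latex
% Constraints: the expectation of each function f_k is known to equal F_k
\sum_{i} p(x_i)\, f_k(x_i) = F_k, \qquad k = 1, \dots, m

% The distribution that maximises entropy subject to those constraints
p(x_i) = \frac{1}{Z(\lambda_1, \dots, \lambda_m)}
         \exp\!\Big( \sum_{k=1}^{m} \lambda_k f_k(x_i) \Big)

% Normalisation constant
Z(\lambda_1, \dots, \lambda_m) = \sum_{i} \exp\!\Big( \sum_{k=1}^{m} \lambda_k f_k(x_i) \Big)

% System of equations that pins down the multipliers
F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \dots, \lambda_m),
      \qquad k = 1, \dots, m
```

The last line is just the constraints restated: differentiating log Z with respect to λ_k gives the expectation of f_k under p.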

That might feel like a lot of work, but the key step is turning the constraints that we have into expectation-based constraints, and we do this, in each case, by choosing an indicator function f_k that picks out those discrete outcomes whose probabilities need to add up to whatever proportions we do happen to know about. Take the example above: if we know that 60% of the electorate voted Labour we would need an indicator function that picked out all the outcomes in which there was a vote for Labour, for example: male, income of 40 to 50K, votes Labour. After that it comes down to calculus and solving simultaneous equations (or you could just plug the objective function and the constraints into an optimiser).
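
Here is a minimal sketch of the recipe on a toy version of the example. Instead of solving for the multipliers directly, it uses iterative proportional fitting, which for consistent indicator constraints and a uniform starting point converges to the same maximum entropy distribution. The outcome space is hypothetical, and only the 60% Labour figure comes from the text; the other two proportions are made up.

```python
from itertools import product

# Hypothetical outcome space: every combination of gender, income band and vote.
genders = ["male", "female"]
bands = ["low", "high"]
parties = ["Labour", "other"]
outcomes = list(product(genders, bands, parties))

# Each known fact becomes (indicator function, known proportion).
# Only the 60% Labour share comes from the text; the rest are invented.
constraints = [
    (lambda o: o[2] == "Labour", 0.60),
    (lambda o: o[0] == "male", 0.50),
    (lambda o: o[1] == "high" and o[2] == "Labour", 0.10),
]

def max_entropy(outcomes, constraints, sweeps=200):
    """Iterative proportional fitting, starting from the uniform distribution."""
    p = {o: 1.0 / len(outcomes) for o in outcomes}
    for _ in range(sweeps):
        for picks, target in constraints:
            inside = sum(p[o] for o in outcomes if picks(o))
            for o in outcomes:
                # Rescale so the picked outcomes sum to the target and
                # the remaining outcomes sum to one minus the target.
                if picks(o):
                    p[o] *= target / inside
                else:
                    p[o] *= (1.0 - target) / (1.0 - inside)
    return p

p = max_entropy(outcomes, constraints)
for o in outcomes:
    print(o, round(p[o], 4))

# Labour's share among men, for comparison with its overall share of 0.60.
labour_share_of_men = sum(p[o] for o in outcomes if o[0] == "male" and o[2] == "Labour") / 0.50
print("Labour share among men:", round(labour_share_of_men, 4))
```

Because none of the supplied facts link gender to the other two variables, the solution keeps them independent: Labour's share among men comes out the same as its share overall.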

I can almost guarantee that the end result will strike you as breathtakingly obvious in retrospect - the kind of conclusion that, had you thought about it long enough, you’d have reached intuitively. But intuition is one thing, justifying an intuition is something else. In fact the principle of maximum entropy is particularly good at proving things that do strike us as somehow obvious. For example, it provides a justification for the conclusion that, when nothing else is known, the distribution of a group over a variable is the best guess for the distribution of any of its subgroups over that same variable - an intuition that we draw on when we do certain kinds of sample weighting.

One last point: it should be clear that this technique is not a form of statistical inference. We obtain only point estimates for the individual probabilities. We can say nothing about how accurate these estimates are. The principle of maximum entropy does however play a role in Bayesian statistical inference, as a principled method for selecting priors.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia

From Coppelia

After the slapdash, ego-driven unpreparedness of most business meetings, it was a nice change to be an external reviewer for Birkbeck’s School of Computing and Mathematical Sciences. It was expertly run at an orderly pace and felt genuinely meaningful. This could be a case of grass is always greener, but I love the place (did my post grads there) and will always recommend it.
Thank you to everyone who has sent me more examples of deranged propaganda for synthetic respondents (and thank you to the providers themselves who have spammed it straight to my inbox). An update on my favourite bugbear is coming soon, I promise!

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.

Glasseye

Tue, 21 Apr 2026 08:16:51 GMT

To celebrate two years of Glasseye we are going to do something a little different in this issue. No dunghill or white stuff or semi-supervised this month. Instead a resource which should be invaluable in your day-to-day dealings with senior management, with the marketing department, with management consultants and with software vendors. My sincere hope is that the following will help you translate the instructions given by these dedicated professionals into something you can act upon. This should be especially helpful if you are early in your career, and perhaps still inclined to take people at their word.

Enough preamble. Glasseye is proud to present...

A dictionary of bullshit for statistics, AI and data science

accuracy /ˈæk.jə.rə.si/ noun 1. The degree to which predictions match reality within a subset of the data selected to yield the highest possible accuracy. 2. Whichever of the true positive rate and the true negative rate is the most impressive.

actionable insight /ˈæk.ʃən.ə.bəl ˈɪn.saɪt/ noun An insight that would have made a substantial difference had it not been forgotten by the end of the meeting: Our Insight Engine is our live repository of strategic answers, data-driven guides, and actionable insights designed to demystify the future of marketing.[1]

adaptive /əˈdæp.tɪv/ adj As yet non-existent, therefore capable of being anything: Our production-grade models are adaptive, responsive and state-of-the-art.

adstock /ˈæd.stɒk/ noun (archaic) A model of advertising impact that, despite its apparent simplicity, is able to generate surprisingly detailed and specific media plans.

agentic /eɪˈdʒen.tɪk/ adj 1. Possibly useful because able to do something. 2. Having the potential to cause chaos: Agentic AI tools have the potential to speed up diagnosis by prioritising urgent requests. (NHS)

AI /ɑː.i/ noun 1. A blanket term covering all forms of technology from the spade to the quantum computer: We will be using AI to leverage our in-garden harvesting capabilities. 2. A chat bot.

AI-driven /ˌɑː.iˈdrɪv.ən/ adj 1. Initially driven by the misconception that AI might be useful: Accelerate application modernization with AI-driven discovery, analysis, and delivery. 2. Funded by a business’s AI budget though not directly involving AI.

analytics /ˌæn.əlˈɪt.ɪks/ noun Any activity involving numbers: Our service comes with extensive analytics capabilities.

correlation /ˌkɒr.əˈleɪ.ʃən/ noun A relationship between events that is unlikely to be causal but which can be treated as such for the sake of having something interesting to say.

CAIO /ˌsiː.eɪ.aɪˈəʊ/ noun Chief AI Officer. A literature graduate who has been placed in charge of a team of highly qualified engineers and mathematicians.

CHAID /tʃeɪd/ noun (archaic) Primitive classification tool, no longer in use except in isolated research communities still under the guardianship of SPSS.

data-driven /ˈdeɪ.tə ˈdrɪv.ən/ adj Not based on a whim.

data point /ˈdeɪ.tə pɔɪnt/ noun (Silicon Valley) An item of information recently acquired by a tech-bro: That data point is completely out-of-sample. I’m going to need to adjust my priors. (See entries for out-of-sample and prior.)

decisioning /dɪˈsɪʒ.ən.ɪŋ/ noun A form of decision making that is more efficient because it has fewer syllables: Builds and runs AI agents that automate content creation, personalization and decisioning across brands, markets and functions.

deep /diːp/ adjective In existence since the 1970s but dramatically enlarged over the last two decades: We have been using deep learning AI and deep neural networks to bring intelligence to advertising.

digital twin /ˈdɪdʒ.ɪ.təl twɪn/ noun A model of something real.

directional /dɪˈrek.ʃən.əl/ adj 1. An estimate so poor that the best that can be said about it is that it is not negative when it should be positive and vice versa. 2. Wrong, but you get what you pay for: Inform the client that our estimates of their ROI are directional.

engagement /ɪnˈɡeɪdʒ.mənt/ noun An abstract quantity that has the advantage that no-one really knows what it is: Customer engagement has increased this quarter by 1.6%.

enterprise-ready /ˈen.tə.praɪz ˈred.i/ adj Far enough behind the cutting edge that it is capable of being integrated with the Microsoft product suite: Our tool orchestrates enterprise-ready agents with built-in context, controls and observability, so teams move from pilot to secure production in weeks, not months.

fine-tuned /ˌfaɪnˈtjuːnd/ adj Out-of-the-box.

generative AI /dʒen.əˈreɪ. ɑː.i/ noun Probabilistic layer added to deterministic systems to introduce uncertainty.

hypothesis /haɪˈpɒθ.ə.sɪs/ noun A thought that is too technical-sounding to be disagreed with.

key driver /kiː ˈdraɪ.və/ noun A variable that has been included in a model because the data is available.

model fitting /ˈmɒd.əl fɪtɪŋ/ noun The practice of selecting the variables and functional form for a model to produce the greatest amount of happiness in a client.

next-generation /nekst ˌdʒen.əˈreɪ.ʃən/ adj Dating back to the 1970s: Our latest wave of predictive models are truly next-generation.

optimise /ˌɒp.tɪ.maɪˈze/ verb 1. Improve slightly. 2. Use a mathematical process to reach a pre-specified outcome: We used state-of-the-art modelling to optimise our client’s media plan.

out-of-sample /aʊt əv ˈsɑːm.pəl/ adj (Silicon Valley) Surprising.

one-pager /wʌn ˈpeɪ.dʒə/ noun The final form taken by a lengthy and thorough piece of analysis once stripped of subtleties, caveats and technical details: Could you turn that report on the effectiveness of our search algorithm into a one-pager for the CTO?

predictive /prɪˈdɪk.tɪv/ adj Somewhere between infallible and fractionally better than a guess: Our agent-built machine learning solutions were found to be highly predictive.

predictive intelligence /prɪˈdɪk.tɪv ɪnˈtel.ɪ.dʒəns/ noun Intelligence.

prior /ˈpraɪ.ə/ noun Whatever a tech bro happens to believe at any given moment, based on very little evidence.

p-value /ˈpiː.væl.juː/ noun The probability of a robust test given the amount of wishful thinking.

real-world /ˌrɪəlˈwɜːld/ adj Non-fantastical: Our analytics team is focussed on real-world outcomes.

R-squared /ɑːˈskweəd/ noun Final arbiter of whether or not a regression model is fit for purpose. A model with a sufficiently high R-squared needs no further qualification.

signal /ˈsɪɡ.nəl/ noun 1. Energy wave emitted from datasets that, once harnessed, has the power to transform a business: Leverage trillions of signals, advanced AI, and privacy-first collaboration to unlock new levels of growth. 2. A pattern.

significant /sɪɡˈnɪf.ɪ.kənt/ adj 1. Negligible but not due to chance. 2. Substantial but completely random.

simulation /ˌsɪm.jəˈleɪ.ʃən/ noun Method for producing data when there isn’t any.

SPSS /ˌes.piː.esˈes/ noun (archaic) 1. Ancient surveying technology carefully preserved in its original state by the market research community. 2. Device for turning a single data file into two separate but unusable data files.

storytelling /ˈstɔː.riˌtel.ɪŋ/ noun 1. The crafting of an engaging narrative that will bring an otherwise dry piece of analysis to life. 2. Overfitting. 3. Lying.

synthetic /sɪnˈθet.ɪk/ adj Fake.

synthetic respondent /sɪnˈθet.ɪk rɪˈspɒn.dənt/ noun A simulation of an ultra high net worth individual obtained by averaging blog posts, online fan fiction and fake product reviews.

touch point /tʌtʃ pɔɪnt/ noun 1. Any event recorded in a customer database, no matter how trivial. 2. Unwanted attention from a business.

test and learn /test ænd lɜːn/ noun An activity mentioned on the last slide of a presentation on the understanding that no business has the patience or self-discipline to carry it out.

value /ˈvæl.juː/ noun A rare substance obtained through data mining: “Our data team are busy extracting value from the data as we speak.”

weighting /ˈweɪ.tɪŋ/ noun A mathematical technique that transforms the survey actually executed into one that you would like to have executed.

For more on these topics - for example, on CHAID, synthetic respondents, weighting, techbro-speak, and the changing meaning of “AI” - see the full list of posts for the dunghill.

[1] Many of the usage examples are real, although, out of kindness, I have not provided the sources.

Glasseye

Wed, 25 Mar 2026 09:30:52 GMT

In this month’s issue:

Data science has a weight problem. The dunghill offers some advice.
Semi-supervised walks you through my tool choices for a post-LLM working environment.
Boring unravels the myth of the normal law, interestingly. Not a crossword clue, but the subject of this month’s white stuff.

Semi-supervised

I can’t tell you that your job is safe from AI. My strong feeling is that if you feed your curiosity and keep your problem-solving skills sharp, then it almost certainly is (for reasons I go into below). One thing I do know for sure: standing still is not an option. But then that’s nothing new - just ask the SAS guys who thought they could reach retirement writing SAS macros. (If you don’t know what I’m talking about, then that only proves my point!)

You will always be running to catch up. The trick is to choose as wisely as possible who and what you are chasing; to figure out which aspects of the current tech revolution are wasting your time, and which are rich with possibility. With that in mind, I present you with one set of choices - mine - and their rationale.

Terminal over GUI: Not being particularly original here (the devs have been here forever) but now, more than ever, it feels like a good time to go full Matthew Broderick. Why? Three reasons: First, it is a decisive victory in the battle against distraction. There are no ads (yet) in the terminal. No endless attempts to get you to engage with new features. Second, unlike the old days, it is now a tool-rich environment, and if the CLI for what you want does not exist, then you can vibe it into existence. Which brings me to the third reason: it’s the natural home of Claude Code (or your preferred substitute). It’s in the very nature of the prompt to draw our attention away from the two dimensions of a GUI and back towards the single dimension of a dialogue stream. This has always been the way of the terminal. We should add to all this the apparent psychological benefit of a minimalistic, infinitely customisable terminal environment, where you call the shots, rather than run around after every scroll bar and flashing icon. Over the last couple of years, the GUI has started to feel like a side-stepped middleman.
Keyboard over mouse: The wisest decision I made over the last year was to learn to touch type. Not only did it stretch parts of my brain that have never been stretched, but it has also paid off handsomely at the terminal and the prompt. But to really leave the mouse behind takes modal editing - vim or emacs. I’ve gone vim (fond memories of vi). Why am I touting 90s technology at the same time as telling you to keep up with the times? Because technological revolutions can revitalise earlier technologies, and I have a feeling this is a case in point. Vim, of course, has never really been out of style among devs, but its new relevance is driven by the fact that agents are pushing us back to the terminal. Now that we are settled in, Vim allows us to edit code and documents without the jolt of returning to a GUI. (One last point is that the headache of Vim configuration is handled effortlessly by LLM agents.)
Raw markdown over notebooks: When Jupyter notebooks arrived in 2014 they were a wonderfully liberating technology for data scientists. But they also cut us off from the rest of the coding ecosystem. Unlike the raw text files used for most programming languages, they required specialised rendering. With hindsight, a better solution is to be able to execute the code-fenced blocks in markdown files. This is the direction I’m now heading, using Quarto and neovim. It means my markdown notebooks are readable with just about any editor, and they are more elegantly handled by LLM agents. I think this is the future. Sorry Jupyter.
Raw markdown over just about every other text document format: While we are on the topic of markdown, it has for many years been my policy to prefer it over any other document format and to prefer tools that store content as markdown. If others want another format then I write in markdown and convert. The cost in terms of layout restrictions is more than compensated for by the freedom to move my work wherever I want. Often the markdown files can stay where they are, and I will simply switch the tool that sits on top. For example, my vault of markdown files is managed with Obsidian but I can switch to neovim as my preferred editor. I see this as all part of the same trend: once again, away from the GUI and towards a combination of raw data and the terminal.
Agent over autocomplete: I’ve been on a bit of a journey here. Probably the same one as everyone else. A year ago, coding agents seemed bolshy and incompetent, but I was impressed by the efficiency of LLM-driven autocomplete. Now the situation seems to have reversed. I put this down to Claude Code getting the partnership between human and LLM just right. I’ve turned off the autocomplete. It got annoying.
Agent as teacher and critic over agent as author: I’ve noticed something recently that I hadn’t expected, and that goes against the idea that coding agents will turn us all into project managers: using an agent is leading me deep into the detail of things that I previously didn’t have the patience for. The most dramatic example is that I’m enjoying using git at the command line and finding out about all the weird things it can do. It’s a similar experience with shell scripting. Last month’s white stuff reviewed a paper on the impact of LLMs on skill formation that concluded that skill acquisition is retained if the agent is used for instruction rather than code generation. I’m for holding onto such skills since I’m betting on agents being irredeemably flawed and therefore, in all but the most trivial cases, in need of a human to talk to about the details. So my policy at the moment is not to grab at the productivity increases delivered by coding agents, but instead to take it slow and benefit from some efficient, context-based training. I figure my clients will thank me in the long term. The exceptions are menial tasks where I’d learn nothing anyway, or occasions where I’m using an agent to code very precisely defined tools or components, whose inner workings I have no interest in.
CLIs over MCPs: Again it looks like I’m not alone here. There’s a debate going on online about performance, but my preference is rooted in something more basic - the need for control. The CLI is, after all, an interface, and that means well-defined inputs and outputs. If I’m letting a coding agent loose on my GitHub repo or my emails, I want to see exactly what it is doing there, and I want some understanding of the limits.
Roll my own over find someone else’s: Hours used to be lost in search of the plugin, extension, app, repository, or CLI that did exactly what I wanted. Then hours more in replacing those tools that were incompatible with the newcomer. All that has gone since I can use an agent to code the exact tool I want. I usually don’t share the result. Partly because I’m lazy, but partly because there is no need since others can create their own. This feels new.

Some things that are not for me: agents anywhere near what I write (here or even in the simplest of emails); using LLMs to summarise papers (I like skim reading and sometimes it’s the small details that are important); agents replacing the kind of grunt work that gives me a feel for the problem or internalises information. (I would never, for example, ask the agent to draw me a concept map - the value is in the thinking, not the result.)

As I said, these are my personal choices. They might be wrong. I might change my mind.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia

The white stuff

Edwin G Boring - with a name like that, how could you fail to be interesting? And so he was. Boring was a prolific experimental psychologist who began his career working on intelligence testing in the army in World War I, but soon became critical of their methods, drawing attention to a lack of scientific objectivity. His interest to us is in a brilliant paper, published in The American Journal of Psychology in 1920. The Logic of the Normal Law of Error in Mental Measurement ripped to shreds the notion that the normal distribution is all-pervasive in nature. Since that myth is as alive as it ever was, and since it is a cornerstone of IQ testing (which is once again being talked about as though it were a serious metric), the paper is worth our attention. It is also superbly written, as you can see in this excerpt in which Boring defines the problem:

The normal law of error has been both an inspiration and a limitation in statistical measurement… There is a bit of magic in the formula. The law came to play the part of a first principle of nature, of an ideal, given a priori, to which nature seeks to conform. The mathematicians wrought slowly, but they wrought a god. Against such blind faith later statisticians have protested. They call the normal law a “fetish” and its a priori use a “superstition.” Nevertheless the “superstition” still lingers and is mixed up with mental measurement. For this reason we are going to enquire, concerning the law of error, what real value it has for us to-day as a scientific tool.

The paper is also an object lesson in intellectual history, tracing the development of the normal distribution and its gradual scope creep from games of chance, to the measurement of error, to the modelling of variation. The last step was taken in the hope that the normal distribution would turn out to be a fundamental scientific law describing variation in nature, which would allow it to be used to make inferences, including about mental ability. But this turned out to be wishful thinking, the argument easily knocked down by switching the measurement. (If the diameter of a spherical seed is normally distributed, then what about its volume?) The dream of a transcendental law for biology was baseless. As Boring concludes, interestingly:

Here we may leave the question of the a priori nature of the normal law. There is, after all, no magic in it. It gives us back always what we put into it. If we know from experience what nature is up to, as we do with the coin, then we can proceed upon cogent reasons to apply the law and we get results. If we do not know, we must appeal to nature and see.
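
The seed argument is easy to check numerically. The sketch below assumes some hypothetical seeds whose diameters are normally distributed (the mean of 10mm and standard deviation of 1mm are invented) and compares the skewness of the diameters with that of the corresponding volumes. A normal distribution has a skewness of zero.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

def skewness(values):
    """Sample skewness: the third standardised moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical seeds: diameters in mm, normally distributed (mean 10, sd 1).
diameters = [random.gauss(10.0, 1.0) for _ in range(100_000)]

# The volume of a sphere with diameter d is pi * d**3 / 6.
volumes = [math.pi * d ** 3 / 6 for d in diameters]

diameter_skew = skewness(diameters)
volume_skew = skewness(volumes)
print(f"skewness of diameters: {diameter_skew:.3f}")
print(f"skewness of volumes:   {volume_skew:.3f}")
```

The diameters come out with a skewness close to zero, while the volumes are clearly skewed to the right. Both cannot be normal, so which one counts as following the law depends entirely on what we chose to measure.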

The dunghill

Weights are routinely applied to data sets with little understanding of what they are for and what impact they have on analysis. But this is the dunghill, so I’m not going to get into how to do the job properly (that might be a topic for semi-supervised in some later issue) - instead, let’s wallow a little in the malpractice. We all know that’s more entertaining.

The most common and problematic use of weights is in the analysis of survey data. The literature will tell you about design weights, non-respondent weights, and other exotica, but in the overwhelming majority of cases, the weights applied to survey data, in industry and research, are post-stratification weights. We conduct a survey; the sampling is a long way from random but we are not too sure of the biases; we do however have population percentages for certain demographic strata. The weight for any observation falling into a particular stratum (say, female, 18-30) is equal to the proportion in the population falling into that stratum divided by the proportion in the sample falling into the same.
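
The calculation just described can be sketched in a few lines of Python. The strata, population shares and sample counts below are invented for illustration.

```python
from collections import Counter

# Hypothetical population shares for each stratum (they sum to 1).
population_share = {
    ("female", "18-30"): 0.10,
    ("female", "31+"): 0.41,
    ("male", "18-30"): 0.10,
    ("male", "31+"): 0.39,
}

# Hypothetical sample: one stratum label per respondent, skewed towards the young.
sample = (
    [("female", "18-30")] * 300
    + [("female", "31+")] * 200
    + [("male", "18-30")] * 350
    + [("male", "31+")] * 150
)

def post_stratification_weights(sample, population_share):
    """Weight per stratum = population proportion / sample proportion."""
    counts = Counter(sample)
    n = len(sample)
    # Strata with no respondents get no weight here; they cannot be repaired
    # by weighting at all.
    return {s: population_share[s] / (counts[s] / n) for s in counts}

weights = post_stratification_weights(sample, population_share)

# After weighting, each stratum's share of the total weight matches the population.
total_weight = sum(weights[s] for s in sample)
weighted_share = {
    s: sum(weights[r] for r in sample if r == s) / total_weight for s in weights
}

for s in sorted(weights):
    print(s, "weight:", round(weights[s], 3), "weighted share:", round(weighted_share[s], 3))
```

Note that every respondent in a given stratum receives the same weight, and that the weights only ever correct the sample's balance across the chosen strata.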

So with that in mind, here is my list of weighting crimes ranging from the unforgivable to the unbelievable. Knowledgeable as you are, dear reader, some might seem obviously ridiculous. But I can assure you they are happening… a lot.

At the extreme end of negligence are those who believe in the magic power of post-stratification weighting: once applied, it will resolve all problems related to survey representation. Never give it another thought; all analysis can proceed just as it would were the results obtained from simple random sampling. Now I’m classing this as unforgivable because it does not take a post-grad degree in statistics to spot the flaw. If the information we are interested in is unrelated to the demographic strata, then a re-weighting by these strata will achieve very little. Say we are using a survey to estimate the proportion of the UK population who like cheese. I doubt this has much to do with either age or gender but it might well affect whether someone volunteers to be part of an online survey by a pizza chain. So no, it’s not magic. The strata need to be related to the question at hand.
Another easy one, but again missed when we think weighting is magic: for any analysis that is restricted to a subset of the data that falls within one of the strata, weighting does again precisely nothing. Everyone in that stratum has the same weight. Did I need to point that out? Yes I did.
A more understandable error, since spotting it takes more than common sense, is the failure to adjust for weights when it comes to calculating the variance on results obtained from the survey. Any fool knows that in simple random sampling the variance of an estimate is related to sample size. With weighted data, however, the situation becomes more complicated. Oversampling in one stratum, while under-sampling in another, can lead to more uncertainty in the estimator, even if the actual sample size remains the same. In my experience, very few analysts make the necessary adjustments, preferring to once again see weighting as a magic cure-all.
Next, an error that I describe using the slogan: “You can’t weight your way out of complete lack of representation”. This was most glaring during the early days of online surveys when half the population were on the internet and the other half, still in the pub. The half on the internet were more tech-savvy and, yes, they were younger. But if we want the survey to be representative of the whole population, we can’t just upweight the older respondents to the online survey, for they are, of course, precisely the tech-savvy older people. The tech-averse remain stubbornly absent from the survey, which will be a disaster if the survey is related to technology.
If we are doing slogans then the next one is: “You can’t weight your way into another population”. You might wonder if I am making this one up - it’s so out there - but I promise you it happens, and it is getting more common as businesses try to resell their data. Let me spell it out. You cannot take a sample from one population (a country, a customer base, a marketing channel) and then, by re-weighting, transform it into a sample from another population. You cannot, for example, take Saga cruise customers and age-weight them into PlayStation users. At its most extreme, this is done not to surveys but to whole populations - an entire customer base gets re-weighted to “look like” customers of another business. Weird. Wrong.
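
On the variance point in the list above, one widely used rule of thumb is Kish's approximate effective sample size, n_eff = (sum of weights)^2 / (sum of squared weights). I am swapping this in as an illustration; it is an approximation rather than a full design-based variance calculation, and the weights below are invented.

```python
def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Ten respondents with equal weights: nothing is lost.
equal_weights = [1.0] * 10

# Ten respondents with hypothetical post-stratification weights:
# six over-sampled (weight 1/3), two on target, two under-sampled (weight 3).
uneven_weights = [1 / 3] * 6 + [1.0] * 2 + [3.0] * 2

n_eff_equal = kish_effective_sample_size(equal_weights)
n_eff_uneven = kish_effective_sample_size(uneven_weights)

print(f"equal weights:  n_eff = {n_eff_equal:.1f}")
print(f"uneven weights: n_eff = {n_eff_uneven:.1f}")
```

With these numbers the uneven weights cut ten actual respondents to an effective sample of just under five, so any standard error computed as if n were ten would be too optimistic.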

And that’s just survey data. We’ve not even touched the use of weighted data in machine learning. Perhaps you can send me your horror stories.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

I’ve got the skills to pay the bills. I’m fully on board with skills now. My mistake initially was failing to differentiate between well-defined processes (make a tool) and situations where the agent was repeating the same steps over and over on tasks that it excelled at (make a skill). This meant I was building skills that weren’t working as well as old-school processes, and they were costing me money each time they ran! Anyway, I’ve got it now. Thanks, as usual, to my colleagues at Melt and, as usual, to Mark Bulling, for keeping me relevant!
Lemonheads. There needs to be a word for people who pass on LLM-generated content without even looking at it, let alone checking it (“lemons”, maybe?). Here’s a pattern I think I’m seeing:
- Agency has a meeting with a client, and the AI agent transcribes the meeting.
- Agency uses the agent to generate a proposal for work from meeting notes, but does not really read the proposal.
- Client senses an agent-generated proposal, so doesn’t really read it either.
- Both parties sign off on work without either knowing what it entails.
Good luck lawyers!

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


Glasseye

Tue, 24 Feb 2026 08:24:08 GMT

In this month’s issue:

The dunghill on how tech bros use Bayesian statistics to signal their rationality.
Are you being de-skilled by the prompt? And, if so, what can you do about it? The white stuff reviews a recent paper.
Never again be humiliated by your own work. Semi-supervised gives you three essential checks for any piece of analysis.

The dunghill

What’s going on with tech bros and their “priors”? And why 10x something? Why not 3x it? Or 5x it? Or maybe just “turn it up”? And is the last thing someone told you really a “data point”? Was that surprising news worth describing as “out of sample”? In short, why is Silicon Valley suddenly talking and writing like Gelman’s Bayesian Data Analysis is a style guide?

If this were just an irritating trend, then it probably wouldn’t be worth our attention. Unfortunately, it seems to be symptomatic of something else. A nasty bit of hubris.

The truth is this is about rationality signalling (I’m having this if no one else has claimed it) - the attempt to convey the impression that the speaker’s thought processes are identical to the workings of a Bayesian model. Their current beliefs, their “priors”, are laid out before them with perfect clarity; every new fact they learn, every new data point, modifies these beliefs in the most rational way possible, that is through the application of Bayes’ Theorem. In other words, the bro is a relentless, unstoppable calculating machine, with a rational response to anything and anyone.

Admittedly, the philosophy of the website LessWrong (the wellspring of this movement, with posts like Update yourself incrementally) is more subtle than this. The authors there do very much focus on the cognitive biases that get in the way of flawless Bayesian processing. But what subtlety there is in the strange, slightly cultish doctrine of The Sequences (the teachings of their master, Eliezer Yudkowsky) was lost when it was absorbed into the vocabulary of Silicon Valley. Here the point is not to emphasise vulnerability or analyse the causes of failure, but to project superhuman competence.

But look, if there’s one belief that it’s rational to hold, given the overwhelming evidence, then surely it’s this: we are the least Bayesian animal out there. An earthworm, I’d wager, is more Bayesian than any human being, untroubled as it is by wounded pride, daddy-issues, emotional complexes, etc. We are so far off being Bayesian in our everyday thinking that even to suggest it as a goal is misleading. Better to accept that this mode of thought requires an intense effort, that it goes against the grain, and that it is unlikely to ever completely succeed. Fooling yourself into thinking that it’s your default mode is the opposite of insightful. I certainly don’t know the vast majority of my “priors”. Do you know yours?

Talking of fooling yourself, perhaps the least rational of all behaviours is betting the house on something happening (say the emergence of consciousness in a tin shed in Memphis), just because you’d really really like it to happen.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

The white stuff

Someone - some self-effacing, anonymous contributor - took great care in planning out the polars Python library. This relatively new data manipulation package is a masterpiece of minimalism. There is one easily-learnt syntax for constructing expressions, but the expressions themselves are given different meanings in different contexts (for example, in the context of a selection, a filter, an aggregation, etc.). The result is both highly constrained and extremely flexible. Achieving such a balance is an art. Create something too simple and it gets butchered by workarounds, too complex and we never quite get the hang of it (ahem Regex).

But I missed all this when it was first released. Why? Because its rise to prominence coincided with the emergence of LLM-assisted coding. Although I skim read the documentation and convinced myself that it was all going in at some level, I didn’t really feel it. A couple of weeks ago, I decided enough was enough, pulled the plug (temporarily) on my little agent friends and invested a few hours in understanding how polars really works. That was when I started to appreciate the design, and I’m sure my coding benefited.

But I am, after all, a sample of one, and that, as we all know, is never good enough - which is why the paper How AI Impacts Skill Formation (laudably commissioned by the Anthropic foundation) is so timely. In fact, career-wise, it may be one of the most important papers you ever read.

The authors express a concern that may have crossed your mind. If, as seems increasingly likely, coding agents will always require human agents to direct and debug them, and if junior coders are, through their reliance on coding agents, losing the skills needed to carry out these tasks, then, while we might be ok for the time being, there will come a day when we run out of road - that is out of people who know how to code.

There’s good news in this paper, and there’s bad. The bad is that their randomised controlled study confirms our worst fears: “We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average.” The good is that when we dig into results from the AI-assisted group, there are lessons to be learnt about how we can have and eat cake. The authors found that those who scored highest on skill acquisition “only asked AI conceptual questions instead of code generation or asked for explanations to accompany generated code; these usage patterns demonstrate a high level of cognitive engagement.” The lesson then is: know thyself. If you don’t have the self-discipline or patience to follow up code generation with questions (that’s me), then ask first and code later (by yourself). Note, this only applies to work in areas that you do not yet understand. If it’s familiar territory, then other than to keep your tools sharp, there’s no reason not to let those agents whir.

But there’s another reason to keep the tools sharp. If the big AI players are unable to recoup their losses elsewhere, then they might be forced to squeeze as much as they can from products where the use cases are real. Might the long-term strategy be: keep the price low, wait until everyone has become dependent, not just on LLMs (they are a dime a dozen) but on tasty, IP-laden products like Claude Code, and then… jack it up?1

Semi-supervised

First, apologies to anyone I’ve worked with, managed, or mentored over the last twenty-five years. You will have heard this before. Feel free to skip, although there’s some new stuff here too.

I’m talking about my three tests - three simple quality assurance tests that fit almost any piece of analysis work. They are quick and easy to do. Find yourself pressed for time, then do these as a minimum, and you can be almost certain your work will stand up to scrutiny.

The triangulation test: Find another way of arriving at roughly the same result. For example, you might be estimating the size of a potential market for a new product by using the purchasing behaviour of individuals. This is a bottom-up approach. Test that your answer is sane by doing a top-down calculation, starting with the population of the UK and narrowing it down to your potential market. If the two techniques produce roughly the same number, you can be more confident in your estimate. Sometimes it is reasonable to assume that the two techniques will arrive at exactly the same result. If it turns out there’s a difference, then debug.
The spot check: Defined as a test made without warning on a randomly chosen subject. Your analysis work might pass the triangulation test but still, at a more detailed level, there are problems with your work, perhaps errors that get averaged out in the aggregated data. This is a job for the spot check. Pick five to ten records at random and follow them through whatever process or algorithm you have built. If everything happens as you’d expect, then you should be ok at the nuts and bolts level.
The bastard test: You may have been extremely diligent in the last two tests but unfortunately you make a typo on a number in the final presentation and are judged, perhaps unfairly, on that. This would have been picked up by the bastard test. Imagine you are another person, a person who hates you with a passion. (I find it helpful to picture the famous mathematician G. H. Hardy in the pose shown below.) Run through your work and try to catch yourself out. Make quick sense-checking calculations in your head and examine the whole thing for plausibility. The bastard test can now be supplemented (not replaced - see this month’s Dunghill) by the digital bastard test. So long as there are no confidentiality or data protection issues you can run your work past an LLM (Anthropic’s Opus is particularly good for this) instructing it to search aggressively for errors.

“It pains me to point out that you have made a rather embarrassing mistake on slide three.”

One more tip - a technique not a test: I find it useful to pick out what I call guideline numbers (after the lines used to align objects on a digital canvas). These are numbers that thread through all of your calculations and data transformations, while remaining the same. They should be recalculated at every step of the process and if they do change, it is a sure sign that something has gone wrong. Some simple examples are: the number of subjects in a test, or the minimum and maximum date covered by a data set.
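
Guideline numbers are easy to automate. The sketch below is one possible way of doing it, with invented records and hypothetical helper names (guideline_numbers, check_guidelines): compute the numbers once on the raw data, then recompute them after every step and complain if they move.

```python
from datetime import date


def guideline_numbers(rows):
    """Values that should survive every transformation unchanged."""
    return {
        "subjects": len({r["subject_id"] for r in rows}),
        "min_date": min(r["date"] for r in rows),
        "max_date": max(r["date"] for r in rows),
    }


def check_guidelines(step_name, rows, expected):
    """Recalculate the guideline numbers and fail loudly if they have changed."""
    actual = guideline_numbers(rows)
    if actual != expected:
        raise AssertionError(f"{step_name}: expected {expected}, got {actual}")
    return rows


# Hypothetical raw data: three subjects in a test.
raw = [
    {"subject_id": 1, "date": date(2026, 1, 5), "spend": 10.0},
    {"subject_id": 2, "date": date(2026, 1, 9), "spend": 0.0},
    {"subject_id": 3, "date": date(2026, 2, 1), "spend": 7.5},
]
expected = guideline_numbers(raw)

# Step 1: add a derived column. The guideline numbers should not move.
with_flag = [dict(r, spender=r["spend"] > 0) for r in raw]
check_guidelines("add spender flag", with_flag, expected)

# Step 2: a step that was never meant to lose subjects, but silently does.
spenders_only = [r for r in with_flag if r["spender"]]
try:
    check_guidelines("tidy up", spenders_only, expected)
    step_two_ok = True
except AssertionError as err:
    step_two_ok = False
    print(err)
```

The second step quietly drops a subject, and the check catches it at the step where it happened rather than three slides into the presentation.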

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.

From Coppelia

Despite all the griping in this issue’s white stuff, this, for me, has been the month of Claude Code. I now have an army of personal CLI tools written in Rust that do exactly what I want them to do. Whether I will actually use them is another thing. Two that I am particularly happy with are:
- pangraph: a universal graph converter, that allows me to switch between Mermaid, Graphviz, the Obsidian Canvas, SimpleMind Pro, and many others that I haven’t even heard of!
- td: a CLI tool for managing minimalist todo.txt files (which I love). A nice tool already exists, but I wanted one that handles time contexts.
Use both at your own risk - I can’t vouch for my collaborator (who pretty much did all the work!).
After the spectacular wrong turn of synthetic respondents, it is encouraging to see what looks like an original and intelligent use of digital in the usually stagnant world of quantitative research. Take a look at Polis. (Thanks to Mark Potts for pointing me in their direction!)

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


This is worth listening to on this very subject.

Glasseye

Tue, 27 Jan 2026 10:37:45 GMT

In this month’s issue:

Semi-supervised asks that you show some compassion towards your predictors.
Owning up to uncertainty in the white stuff.
The dunghill looks at why old-school model scepticism is back in style.

The dunghill

There’s an old wisdom from the deeply unfashionable world of frequentist statistics which says that a model is just a view from a particular, and necessarily limited, perspective, that it is one view among many, and that it should be consulted, but never entirely trusted. This way of looking at things was a matter of course for old-school statisticians from Box to Cox;1 it is at the heart of Karl Popper’s account of scientific progress through conjecture and refutation; and it is built into the idea of a statistical significance test, where the null hypothesis is the incumbent and the alternative hypothesis no more than a hopeful challenger. At its most influential, this view was part of a post-war scientific culture that was understandably suspicious of excessive self-belief, and too much faith in world-shattering ideas. As a method it is based upon an unspoken assumption: the default position, the one challenged by the model and which we fall back on if unconvinced, is a set of assumptions that have worked well so far, and which are based in part on the kind of thing that can’t easily be captured in a model - gut-feels, condensed real-world experience, the wisdom of centuries. It’s all very small-c conservative, but not necessarily wrong.

And yet the strange thing is that this view of modelling, so downbeat, and so contrary to the current mood of tech optimism, is making a comeback. Even more surprising is who is pushing it forward. I previously quoted the co-founder of Anthropic, Jack Clark, interviewed on the Newsagents. I’m going to quote him again as I find it so striking:

The way that I use these systems [LLMs], and many do, is I read a research paper, I write out what I understand that paper to mean and when I upload the paper and my understanding of it to the system and say, do I have this right? And if I don’t have it right, explain to me. That’s useful learning because the system reads the paper, reads my explanation and tells me whether I got it right or wrong, just like a colleague. If we use these things in the right way, they can help us be a lot more capable and a lot smarter.

One way of reading this is as an ingenious pivot by an LLM provider who has realised that hallucinations are here to stay. If you set human judgement as the default and position the LLM as a brilliant but overly imaginative critic, whose insights must be filtered by the user, then you are off the hook for hallucinations. But another, more generous, interpretation is that this is a thoughtful correction to LLM overreach, one which places the model back in its rightful position - an advisor, no more and no less, and never to be entirely trusted.

Of course this is absolutely right. Notwithstanding their utter brilliance in many areas (Anthropic’s Claude Opus just found an embarrassing number of errors in some dense maths I was working on), LLMs, as much as twentieth-century regression models, are outrageous simplifications. That might seem like an absurd thing to say about something so intricate and opaque. Nevertheless they are simplifications on account of the fact that a) they reduce everything to the problem of predicting text and b) they leave out so much, e.g. the external world, its physics, its logical and mathematical laws, its complex systems, etc. What is more, they are worse than the old models in at least one respect - they are built on undoubtedly biased data, and we have no idea in which direction.

This is why, in my view, almost all applications in which LLMs are given autonomy are doomed. An LLM has a perspective. It is often a stunningly insightful perspective, which makes it an invaluable contributor in a human-computer partnership. But by itself it is a limited, flawed, irresponsible, dangerously one-sided perspective, missing huge chunks of reality. It is the perspective of someone who has spent too much time on the web, who is capable of extraordinary insights but needs to get out more. And indeed, if what it lacks, as many are now saying, is a world model, then getting out might well be the answer.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

The white stuff

Model Uncertainty, Data Mining and Statistical Inference by Chris Chatfield was published in 1995. I wish I’d read it then - or at any time that decade. It would have saved me years of confusion. I was confused because the things that were puzzling me, and which seemed like pretty serious problems for statistical inference, were not being discussed anywhere. Chatfield’s paper is full of helpful phrases like “in practice no one believes this” and “unfortunately this is never done” and “this is very strange given that …”.

The bomb he drops, and the main theme of the paper, is that the entire statistical field is obsessed with the calculation of probabilities that are only valid if it is assumed that the statistician has selected the correct model. In practice that’s an enormous “if”. The reality is that the uncertainty surrounding model selection often dwarfs the uncertainty surrounding the values of model parameters.

Chatfield was not the first to point this out - Box and Tukey both had a good go at it - but he pulls together his arguments at a crucial point in the history of statistical modelling. In 1995, Vapnik was publishing on support vector machines, and the machine learning revolution was imminent. Machine learning treated model selection as a problem to be solved rather than ignored, and in doing so drew attention to the vast number of models that will fit a data set almost equally well. Chatfield’s worry that the automation of model selection was leading to overfitting, his call for sensitivity analysis to proof against false confidence in model selection, and his favouring of predictions over estimates of population parameters - the problem with the latter is that “the analyst will never know whether the inferences are good since the estimates cannot be compared directly with the truth” - were all anticipations of this revolution, far ahead of their time, especially in the world of statistical modelling.

What's most shocking though is that the paper is still relevant. The problem of measuring the effect of model uncertainty in statistical inference still lacks a satisfactory solution, and I would still recommend the paper to half the graduates I work with, many of whom are just as confused as I was about the role of model selection, and would be relieved to find out that “in practice no one believes this”.

(Note that the paper is behind the JSTOR paywall. However you can read it online for free simply by signing up for a JSTOR account.)

Semi-supervised

From my Glasseye inbox:

I was hoping you could help me sanity-check how I’m framing a modelling problem. I can’t tell if I’m overthinking this, or if there’s a genuinely tricky aspect that I’m getting stuck on. I’ve been tasked with predicting the probability that a user will make a purchase within 24 hours of a reference time t. My initial instinct was that this is a fairly straightforward classification problem, but I've been struggling with how to construct the dataset in a way that avoids bias and data leakage.

No, definitely not overthinking it - I would say this is probably one of the most underthought issues in model design. In fact it’s astonishing how little is written about it, given how many have fallen into this particular pit. So congratulations to the asker, and double congratulations for asking when you thought the answer might be obvious. The truth is it is genuinely tricky.

In a nutshell, we need to avoid the classic trap of training a predictive model on training data that does not match the input that it will face at the time of deployment. For argument’s sake let’s say that the reference point t is the moment that a user registers for a service, and that the data scientist has access to a range of demographic and behavioural features with which to make the prediction. Let’s also assume that they have constructed a target variable that is equal to one if the user made a purchase within 24 hours of registration, and zero otherwise. Now the stupidest thing to do is to train the model to predict this variable using the demographic and behavioural variables as they stand. I say stupid, but this is done, and it is done a lot! Why is this such a bad idea? Because by the time you build the model, behavioural data will have accrued that records events subsequent to or concurrent with the purchase, and which are then of course highly correlated with the purchase. It is not uncommon to find models where the user’s total spend has been used to predict the probability of first purchase.

So you are not going to do that. But what are you going to do? You have two options. I’ll deal with the hardest one first as it is more interesting. This requires you to recreate the data for each customer as it would have been at t - the only training data that makes sense since it is the only data that the classifier will see in the wild. But recreating this data is no easy matter. For a start, there is no single t. It is different for every user. If you are lucky enough to be using a system with very thorough database auditing then you may be able to recreate this data from snapshots. Otherwise you’ll need to rewind it yourself. For this your features usually fall into five categories:

Those which you can reasonably assume won’t have changed since t. Many demographic variables fall into this category, as do variables that record events prior to t.
Those which you can adjust using some time-based calculation. For example, a customer’s age at t can obviously be recreated. So can features like day of the week, hour of the day, etc.
Those which you can somehow infer based on other values. For example, a customer might have been flagged for a particular offer based on some item of data that has since been overwritten.
Those which you should exclude because they are irrelevant at t: user spend in our example, since it will always be zero at registration.
Those which cannot possibly be recreated.

The features in the final category will, I’m afraid, have to be junked. This is fine as you are building a predictive model rather than a model for understanding relative contributions to an event.
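
To make the rewinding concrete, here is a minimal sketch. It assumes the happy case where a timestamped event log exists; the field names (registered_at, birth_year, at, type) and the data are all hypothetical. Features are built only from what was known at t, and the target looks only at the 24 hours after t.

```python
from datetime import datetime, timedelta


def build_training_row(user, events, window=timedelta(hours=24)):
    """Recreate one user's features as they stood at t, plus the target after t."""
    t = user["registered_at"]
    own = [e for e in events if e["user_id"] == user["user_id"]]
    before_t = [e for e in own if e["at"] <= t]             # visible at t
    after_t = [e for e in own if t < e["at"] <= t + window]  # target window only
    return {
        "user_id": user["user_id"],
        # Adjusted by a time-based calculation (approximate: ignores birthdays).
        "age_at_t": t.year - user["birth_year"],
        "hour_of_day": t.hour,
        # Behaviour recorded up to t, never after it.
        "page_views_before_t": sum(1 for e in before_t if e["type"] == "page_view"),
        # 1 if a purchase happened within the window after t, else 0.
        "target": int(any(e["type"] == "purchase" for e in after_t)),
    }


users = [
    {"user_id": 1, "birth_year": 1990, "registered_at": datetime(2026, 1, 10, 9, 0)},
    {"user_id": 2, "birth_year": 1975, "registered_at": datetime(2026, 1, 11, 20, 0)},
]
events = [
    {"user_id": 1, "type": "page_view", "at": datetime(2026, 1, 10, 8, 50)},
    {"user_id": 1, "type": "page_view", "at": datetime(2026, 1, 10, 9, 30)},  # after t
    {"user_id": 1, "type": "purchase", "at": datetime(2026, 1, 10, 15, 0)},   # within 24h
    {"user_id": 2, "type": "page_view", "at": datetime(2026, 1, 11, 19, 0)},
    {"user_id": 2, "type": "page_view", "at": datetime(2026, 1, 11, 19, 30)},
    {"user_id": 2, "type": "purchase", "at": datetime(2026, 1, 13, 9, 0)},    # too late
]

rows = [build_training_row(u, events) for u in users]
for row in rows:
    print(row)
```

Total spend never appears as a feature, and the page view that user 1 makes after registering is ignored, because neither would be available to the model at the moment it has to make its prediction.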

The other option I mentioned is simple, but rarely available as it requires a friendly, patient client. This is to simply start taking snapshots of the data at all future ts until you have enough data for a model. The nice thing about this approach is that you can think in advance about what might be predictive and therefore worth capturing.

Another difficulty, for both approaches, is in deciding which data is going to be representative of t. Are you going to use all your customer registrations to train the model, or some subset - those within a certain recency window for example - or a sample, or some kind of weighted sample? Again it helps to think of the deployment of your model. I try to visualise it sitting patiently on the data pipeline, and then springing to life with each registration. What does the data look like from its perspective? How can you make the data that you train it on as much as possible like the data it will face? This mental exercise should help answer the question about representative data. For example, if your business is in a rapidly evolving sector where behaviours often change their significance, or if the market has recently undergone a major shock that has radically altered behaviours (e.g. a pandemic) then you may need to shorten the time window on your training data. Another example, if you anticipate a change in the demographic profile of new customers (perhaps new legislation has been passed, perhaps product adoption is increasing among older people) then you should upweight the appropriate groups in your sample.

And this is just one scenario. Every time a model is dropped into the real world it faces a stream of data that is unique to its particular situation. It requires imagination to think through what it will be up against. Being a good data scientist requires empathy!

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.

From Coppelia

It’s interesting to think about the future of plugins in the age of coding agents. Instead of spending hours searching for exactly what I need, I find myself just asking Claude Code to build it for me. And then I’m too lazy to share the result. LLMs seem to be particularly good at writing code for backend tasks. Will we all end up with completely customised working environments?
On the topic of Claude Code, thank you to my Melt colleagues for pushing me in this direction. I’m moving away from Cursor, which has been driving me nuts with its overly bullish agents. I need to feel more in control, and there’s nothing better for this than the command line - it’s in the name. I think of Claude Code like a thoughtful, boundary-respecting butler: “I’ve prepared this data set for you sir! Would you like a moment to digest it?”
Prepare yourselves for the imminent AI course correction. Put LLMs to one side. Coppelia is now doing training courses in symbolic and neurosymbolic AI (relevant Python packages proglog, SymPy, pyKEEN). Let me know if you are interested.
Lastly, as far as AI correctives go, I have to point you to the second season of Shell Game, in which the journalist Evan Ratliff (pitch-perfect deadpan) takes the tech bros at their word and sets up a company staffed entirely by LLMs. Thanks to Niazy for the recommendation. Rise and grind Ryan, rise and grind!

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


Most statisticians know George Box’s famous “All models are wrong but some are useful”. David Cox agreed: “The very word model implies simplification and idealization. The idea that complex physical, biological or sociological systems can be exactly described by a few formulae is patently absurd.” (See the commentary in Chatfield, Chris. “Model Uncertainty, Data Mining and Statistical Inference.” Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 158, no. 3, 1995, pp. 419–66. JSTOR)

Glasseye

Tue, 16 Dec 2025 08:28:02 GMT

In this month’s issue:

The dunghill is full of AI slop. Will it drown us all or will apathy win the day?
Late Victorian machine learning creaks back into life in the white stuff.
And semi-supervised solves the riddle of the one-tailed test.

Plus quantum computing, semantic leakage, and a cold caller who undermines her own existence.

Does that not sound fun?

The dunghill

The dunghill would not be a proper dunghill if we did not, at least once, address the issue of slop. And ok, cards on the table, as you no doubt are well aware, this newsletter is awash with AI-generated images. (Look up, about an inch above this sentence - you’ll see a pitchfork with too many prongs.) But they are, I would like to argue, the good kind. They cost me effort. I had to think. AI slop, by contrast, is characterised by an obvious lack of effort, and no attempt at deeper meaning.

Whatever you think of my images, it seems likely that slop, lots of slop, is headed our way. It might be prudent, therefore, to size up the threat. What are we facing then, as data scientists and statisticians? To what extent will we become slop consumers and, despite our best intentions, slop producers?

First, it’s worth considering how we might become the victims, indirectly, of slop produced and consumed by others - perhaps those at the more visionary end of the AI adoption curve. I’m thinking in particular of the thousands of articles on the benefits of AI produced in the less-scrutinised industry magazines. Try dropping some of these into ZeroGPT, and you’ll see what I mean. A lot looks like slop and, more worryingly, slop that tells your CEO what you and your team should be doing next.

Next, we should think about ourselves as direct consumers. It is particularly true in statistics and probability that the right answer is not always the most convincing or popular answer. Here, then, the strategy of “averaging the internet” does not seem wise, especially once we throw in hallucinations. Far more concerning, though, than chatbot output (which, after all, only becomes slop once we paste it somewhere) is the gradual undermining of our only real method of filtering out false claims - the peer-reviewed paper. The slop problem in paper writing is well known, and the peer-review process itself has the potential to deliver yet more slop. Yes, I know that the peer review process was already dysfunctional, with academics producing human-slop[1] to chase citations, but automating the problematic part of this process is surely not the answer.

What about us? Do we produce slop? Already, if I leave an agent on in a markdown cell of a Jupyter notebook, I find that it is itching to write me a tedious but lengthy account of what is going on in the data. The temptation is there. Even if I resist, there’s every possibility that my work will be sucked into an LLM-generated presentation, which will be read out without comprehension and listened to without interest.

And there’s the real problem. Once we know that it’s slop, once we realise that the producer has contributed precisely nothing - not even the sweep of a critical eye - to the content, then the incentive to consume vanishes. “If you can’t be bothered to write it, then why should I be bothered to read it?” I read that somewhere this week. It is scary because it implies that some of our most necessary communication channels might grind to a halt.[2] But perhaps it is also the way out: if I can’t be bothered to read it, why should they bother to generate it?

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

The white stuff

Have I got a treat for you. Warm your slippers, clean out your pipe and get ready to enjoy some late Victorian (or is it early Edwardian?) machine learning. By this I mean Pearson’s original 1901 paper on principal component analysis. Not that he called it that - the term ‘principal component’ is due to Hotelling, who rediscovered the process in the 1930s. Pearson’s paper is entitled On lines and planes of closest fit to systems of points in space. It begins:

In many physical, statistical, and biological investigations it is desirable to represent a system of points in plane, three, or higher dimensional space by the “best-fitting” straight line or plane.

And there it is. We would say that the best-fitting straight line is the first principal component, and the best-fitting plane is defined by the first two principal components. But other than that, it is nearly all there, and it is fascinating to see how much of modern statistical terminology and notation is already in place by 1901.

Interesting, but is there a better reason to revisit old papers than mild curiosity? I think so. I find that they are more likely to contain the original intuitions and relate to the original problems than modern textbooks. The latter have filtered out all the tangible fumblings that eventually led to the abstractions.

Pearson’s paper addresses a problem I had never really thought about, and which is ignored now in most explanations of PCA.

In nearly all the cases dealt with in the text-books of least squares, the variables on the right of our equations are treated as the independent, those on the left as the dependent variables. The result of this treatment is that we get one straight line or plane if we treat some one variable as independent, and a quite different one if we treat another variable as the independent variable.

This matters because:

In many cases of physics and biology, however, the “independent” variable is subject to just as much deviation or error as the “dependent” variable. We do not, for example, know x accurately and then proceed to find y, but both x and y are found by experiment or observation. We observe x and y and seek for a unique functional relation between them.

If we are to have a single line of best fit, then it must treat both variables as equally prone to error. And, as you have probably guessed, this line turns out to be the first principal component for points scattered across two dimensions.
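
For anyone who wants to see this in numbers, here is a minimal sketch in Python, using only the standard library. The data points and all the names are made up for illustration; nothing here is taken from Pearson's paper. It computes the two least-squares slopes (y regressed on x, and x regressed on y) and the slope of the first principal component, then checks which line has the smallest sum of squared perpendicular distances.

```python
import math

# Made-up points with scatter in both coordinates (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.6, 3.7, 5.4, 5.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs) / n
syy = sum((y - my) ** 2 for y in ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Least squares with x as the "independent" variable.
slope_y_on_x = sxy / sxx

# Least squares with the roles swapped, re-expressed as a slope dy/dx.
slope_x_on_y = syy / sxy

# First principal component: the major axis of the 2x2 covariance matrix.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
slope_pc1 = math.tan(theta)


def perpendicular_sse(slope):
    """Sum of squared perpendicular distances to a line through the centroid."""
    return sum(
        ((y - my) - slope * (x - mx)) ** 2 / (1 + slope ** 2)
        for x, y in zip(xs, ys)
    )


for name, slope in [
    ("y on x", slope_y_on_x),
    ("x on y", slope_x_on_y),
    ("first principal component", slope_pc1),
]:
    print(f"{name}: slope {slope:.4f}, perpendicular SSE {perpendicular_sse(slope):.4f}")
```

The two regressions give two different lines, and the principal component sits between them, treating the error in x and the error in y even-handedly.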

The best-fitting line, as illustrated in Pearson’s paper

We are so conditioned to think using linear algebra that it never occurred to me to link PCA to measurement error. It is a path that led Pearson to an “ellipsoid of residuals” where the first principal component coincides with its major axis.

Pearson’s diagram showing the ellipsoid of residuals and the fitted lines

I say put the above diagram in your next presentation and inform your boss that your machine learning pipeline is cutting-edge as of 1901.

Semi-supervised

Yes, significance testing is out of fashion; yes, we are all Bayesians now (when we can be), but frequentist statistics has a long shelf life, and while it remains in stock, I know I’m going to be asked the following question: “When and why should I use a one-tailed test?” It’s a topic that has baffled students since statistics entered the mainstream curriculum in the fifties - and for a very good, very interesting reason. So I urge you, even if you have sworn never to touch another p-value, read on, since the confusion around one-tailed tests points to something deep and counterintuitive about probability theory.

Fortunately, to explain the one-tailed test, I have a nice example to hand, or rather a riddle:

Two gamblers are playing a game of chance. The first gambler rolls three dice and gets a triple six. The second gambler says, “Unbelievable! What are the chances of that? One in 216, if I’m not mistaken.” The first gambler says, “What do you mean? It was 50-50 that this was going to happen.” They are both right. How can that be?

To answer the riddle, we need to recognise the difference, in statistical terminology, between an outcome and an event. When we talk about the outcome of an experiment (or in this case a roll of the dice), “outcome” means precisely what we think it means - the thing that actually happens. Outcomes are mutually exclusive - the same experiment cannot have two distinct outcomes. The meaning of “event”, however, is less obvious. In probability theory, an event is a subset of the possible outcomes of an experiment. An event is either elementary, if it corresponds to just one outcome, or compound, if it corresponds to many. If the latter, then we describe it using a disjunction of the outcomes: “Outcome one or outcome two or…” Events, unlike outcomes, are not mutually exclusive.

Thus while rolling three sixes is both an outcome and an event, something like rolling three sixes or two sixes and a five is simply an event. And here we have the answer to our riddle: The second gambler is thinking of the event: all the dice come up as a six, while the first gambler is thinking of the event: the sum of the results adds up to an even number (or some other compound event with a 50% probability).
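
If you would rather count than take this on trust, the riddle can be checked by brute force. The sketch below, in Python, lists all 216 outcomes and treats an event as a rule that picks out a subset of them. The function names are my own, and the first gambler's event is assumed to be "the total is even", which is only one of the possible 50-50 events.

```python
from fractions import Fraction
from itertools import product

# The 216 equally likely outcomes of rolling three dice.
outcomes = list(product(range(1, 7), repeat=3))


def probability(event):
    """Probability of an event, given as a rule that selects a subset of outcomes."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))


def triple_six(outcome):
    # The second gambler's event: an elementary event with a single outcome.
    return outcome == (6, 6, 6)


def even_total(outcome):
    # The first gambler's event (assumed here): a compound event.
    return sum(outcome) % 2 == 0


print(probability(triple_six))
print(probability(even_total))

# The roll that actually happened belongs to both events.
print(triple_six((6, 6, 6)), even_total((6, 6, 6)))
```

One outcome, two events, two probabilities.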

My guess is that we initially struggle with this idea for two reasons: First, we don’t like the disjunctive nature of statistical events. We are hard-wired for concrete, definite, actual and discrete occurrences, so that when anyone mentions an event, we naturally think of an outcome. Thus our minds rebel against the idea that the same outcome can be two different events.

The second reason - the one that really scuppers us, and I think gets to the heart of the problem we have grasping the difference between one and two-tailed tests - is that we appear to be able to change an event merely by thinking something different. In our riddle there is nothing determining the nature of the events other than what is inside the heads of the gamblers. Were the gamblers to think the same thought then they would be in agreement about the probability.

By now I’m sure you’ve got it: a one-tailed test is different from a two-tailed test because it describes a different event (the appearance of the test-statistic in a particular tail of the distribution). The troubling thing is that we can change the results of the experiment by doing nothing more than beginning it with a different thought, i.e. a different hypothesis.
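
Here is a minimal sketch of the same point in Python. It assumes a test statistic that follows a standard normal distribution under the null hypothesis, and the observed value of 1.8 is made up. The observation is identical in every case; only the event whose probability we calculate changes.

```python
import math


def upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z_observed = 1.8  # made-up test statistic

# Event 1: the statistic lands this far out in the upper tail.
p_one_tailed_upper = upper_tail(z_observed)

# Event 2: the statistic lands this far out in either tail.
p_two_tailed = 2 * upper_tail(abs(z_observed))

# Event 3: the statistic lands this far out in the lower tail.
p_one_tailed_lower = 1 - upper_tail(z_observed)

print(f"one-tailed (upper): {p_one_tailed_upper:.4f}")
print(f"two-tailed:         {p_two_tailed:.4f}")
print(f"one-tailed (lower): {p_one_tailed_lower:.4f}")
```

At the conventional 5 percent level the upper-tailed test rejects and the two-tailed test does not, although nothing about the data has changed.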

But thoughts are slippery. Fail to define and record them and they easily slide into something more convenient. A statistician who runs a one-tailed test, but then switches to a two-tailed test when observing an interesting result in the other tail, is doing exactly this. Such behaviour makes a nonsense out of any kind of statistical test. We could, for example, claim that any roll of three dice was amazingly improbable by simply claiming that whatever shows up is exactly what we were testing for.

Now that we have recognised and accepted the weirdness of it all, the answer to the question “When should I use a one-tailed test?” is straightforward. Use a one-tailed test when that is the event that you are interested in, i.e. the one that matters for whatever it is that you are trying to do. If a gambler wins the jackpot on a roll of three sixes then that’s the roll that matters.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io

From Coppelia

Thanks to a forward-looking client, I was able, this month, to investigate quantum computing. We used IBM’s qiskit package and ran programs on their cloud-based quantum computers. A statistical background is helpful for the basics, although the placing of quantum bits into superposition remains head-melting. Coppelia offers a short introductory course should you happen to be curious.
If it wasn’t already a joke I think the problem of semantic leakage would deal the death blow to the idea of synthetic respondents. See Gary Marcus’s recent post for the details. If this is right then simulated ultra high net worth individuals spend their time shopping in Cash Converters, listening to Money, Money, Money and watching Who Wants to be a Millionaire.
LinkedIn is a place for odd conversations. My weirdest to date began with, “hey simon, your article in Significance on the ‘pitfalls of using averages’ really resonated. it’s the same problem in GTM - most scale with headcount instead of a better system. we built a way for experts to capture the demand their own content creates. lmk if you want a one pager.” I like it when people read my articles, but (a) I’ve no idea what GTM is and (b) this seems very much like a sales ploy. Weirder still, the one-pager was about an AI company called Valley, who specialise in identifying “warm signals” and sending “personalized outreach automatically”. This I take to mean using AI to trawl through an individual’s publications and to send them a personal message that impersonates a reader. It’s cynical, but why use it on someone (me) who you are trying to sell the service to? Perhaps to say, “Look at how effective this is. After all we got you!” But then, when I asked just this, she (if actually human) denied it. I’m lost. The upshot was that the very existence of her product made me extremely sceptical about whether she was real - which I think proves why this strategy might in the end be self-defeating.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


[1] It’s worth noting that slop thrives in areas where output is already poor-quality.

[2] I confess to a Luddite fantasy in which certain areas of bureaucracy, where the text itself is relatively pointless, are assimilated by LLMs. The LLMs end up just dealing with each other, doing things that no one really cared about in the first place. For example HR use the LLM to write instructions for a performance review; the employee and line manager use LLMs to complete it; HR use LLMs to summarise the performance review; the result is filed away. Then a product comes along claiming to write a performance review in such a way that the person is more likely to get a pay increase, then HR purchases a product to detect such products… In my fantasy this all ends magically: the internet becomes a giant Rube Goldberg machine, talking to itself but doing nothing of any importance. We run off into grassy meadows.

Glasseye

Tue, 25 Nov 2025 09:15:15 GMT

In this month’s issue:

So long CHAID: the dunghill calls time on an algorithm that has outstayed its welcome.
Unfaithful digital twins and a poetic assault on LLMs in the white stuff.
Semi-supervised promotes unit-testing as the way to fool-proof your data science project.

The dunghill

Have you heard of CHAID? Is it what you instinctively reach for when you want to understand a complex data set? No? Never? That’s because you don’t work in market research or any of the other survey-based industries where SPSS still squats like a malignant toad.

CHAID (Chi-squared Automatic Interaction Detector) is a decision-tree algorithm that has shipped with SPSS for as long as I can remember. Market researchers are often stunned that data scientists outside their field have never heard of CHAID. They are unaware that SPSS has been preserving this museum piece since the early 1980s.

Not that you would know if you researched it online. The only people who write about CHAID are its users, and SPSS has done such a good job of cutting them off from the rest of the world that they have little or no idea how far it has fallen out of favour or what the alternatives are. And of course LLMs are enthusiastically amplifying this one-sided view. The net effect is that a graduate joining a research company will Google “CHAID” and find nothing but endorsements. Unless they are sufficiently curious, they will live out their career none the wiser.

We can’t hope to reverse this powerful process (it has withstood four decades of tumultuous change in statistics and data science), but we might be able to extend a hand to a few lost souls. So, for those who are wondering, here’s why no one else is using CHAID.

Really the answer can be summed up in two words: Leo Breiman. Breiman was, among many other things, the originator of classification and regression trees (CART), and then, in collaboration with Adele Cutler, random forests. Breiman was undoubtedly a class act. Whatever he did, he did brilliantly, and CART, an alternative decision tree algorithm, blew CHAID out of the water.

To understand why, we need some background. I’m assuming you know what a decision tree looks like. You probably also know that they are very unstable - small changes in the data will result in very different models - and that this is a sure sign that they are overfitting the data. But they are also rather mesmerising to look at. What is more, a decision tree is supremely explainable. If you wanted to, you could write out its rules in plain English. By itself this is a great virtue, but when combined with the tendency to overfit, it is a disaster. Visual appeal and verbal explainability lure us towards patterns that are not really there.

So if you must have a decision tree (and why not, since they are so pretty) then it is of paramount importance that you do all you can to prevent overfitting.

One way of doing this is to stop growing the tree when the data left at each node becomes too meagre to justify further splits. This is the approach taken by CHAID, which uses a chi-squared test to look for evidence of further structure. But for decision trees, it is often true that an important split is preceded by several weak splits. The CHAID approach risks halting tree growth before this important split is reached and thus falls into the opposite trap of underfitting the data.
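
A toy case makes the danger concrete. The Python sketch below is not an implementation of CHAID; it only computes the plain Pearson chi-squared statistic on a two-by-two table, using made-up data in which the outcome is 1 exactly when two binary predictors differ. At the root neither predictor shows any association with the outcome, so a rule that stops when the test finds nothing would halt immediately. One "useless" split later, the structure is unmistakable.

```python
from itertools import product

# Made-up data: 25 rows for each combination of a and b, with outcome a XOR b.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2) for _ in range(25)]


def chi_squared(pairs):
    """Pearson chi-squared statistic for a 2x2 table of (predictor, outcome) pairs."""
    n = len(pairs)
    stat = 0.0
    for i, j in product([0, 1], repeat=2):
        observed = sum(1 for p, y in pairs if p == i and y == j)
        row_total = sum(1 for p, _ in pairs if p == i)
        col_total = sum(1 for _, y in pairs if y == j)
        expected = row_total * col_total / n
        stat += (observed - expected) ** 2 / expected
    return stat


# At the root, neither predictor looks worth splitting on.
root_a = chi_squared([(a, y) for a, b, y in rows])
root_b = chi_squared([(b, y) for a, b, y in rows])

# Inside the branch a == 0, predictor b separates the outcome perfectly.
branch_b = chi_squared([(b, y) for a, b, y in rows if a == 0])

print(root_a, root_b, branch_b)
```

Growing the tree first and pruning afterwards would find this structure; stopping early never sees it.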

An alternative is to grow the whole tree and then prune parts of it back. As you prune, you reduce model complexity. This is the approach taken by CART. But there are two problems with this approach. First, there are multiple routes back from the full tree to the starting node. How do you choose the right one? Second, how do you know when to stop pruning? What Breiman and his co-authors did was to find a single pruning path back to the root node that could be justified in terms of the classification performance of each tree on the path. They could then map that single path onto a continuous variable (which also happens to be a penalty on the size of the tree) and treat that variable as a measure of complexity. It was then straightforward to tune that complexity parameter using standard machine learning techniques.
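
For a feel of how this works in practice, here is a minimal sketch using scikit-learn, whose decision trees support minimal cost-complexity pruning through the `ccp_alpha` parameter. The choice of data set, the five-fold cross-validation and the variable names are my own assumptions for illustration, not anything prescribed by Breiman and his co-authors.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow the full tree and recover the sequence of penalties along the pruning path.
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative rounding errors

# Each penalty picks out one pruned subtree; larger penalties give smaller trees.
leaves = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y).get_n_leaves()
    for alpha in alphas
]

# Tune the penalty like any other hyperparameter, here by cross-validation.
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5
    ).mean()
    for alpha in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]

print("leaves along the pruning path:", leaves)
print("best penalty by cross-validation:", best_alpha)
```

The list of leaf counts is the single path back towards the root, and the penalty is the continuous complexity measure that gets tuned.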

So in a nutshell Breiman created an alternative to CHAID that was proofed against the twin perils of overfitting and underfitting. It didn’t hurt that it was perfectly aligned with the emerging discipline of supervised machine learning, or that it was announced in an excellent textbook, explaining its properties and detailing the theory that justified its use. CHAID, by contrast, felt ad hoc and cobbled together. Even its one great strength - the diagrams are easier to read because the nodes can be split in more than two ways - turned out to be a weakness, since the splitting is too aggressive, prematurely reducing the size of the data in each node.

So that’s it. CHAID was state-of-the-art in 1980, but obsolete by 1984. Of course if you want to use it, then that’s up to you - it won’t hurt so long as you validate your findings using some more robust method. But equally, it won’t do much for your credibility outside of research.

Thank you to Wendy Martinez, whose intellectual curiosity inspired this post!

If you have some particularly noxious bullshit that you would like to share, then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

The white stuff

Two entirely unrelated papers this month. The first is Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. (Thank you to Neil Charles for bringing it to my attention.) It’s as strange as it sounds. Requests written in verse are considerably more likely to get around measures that have been put in place by model providers to prevent access to harmful content. The authors found that “20 manually curated adversarial poems (harmful requests reformulated in poetic form) achieved an average attack-success rate (ASR) of 62% across 25 frontier closed- and open-weight models, with some providers exceeding 90%.” Particularly fascinating is the fact that the larger models tended to do worse. The authors speculate that their increased compliance is partly due to their ability to understand the content of the disguised request. In short, they are too educated for their own good (or our good).

The second is A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement. (Thanks to Mat Morrison for this one.) I was especially interested in this paper, since the digital twin setup here is a special case of the practice of surveying synthetic respondents, of which, as you probably know by now, I’m not a fan. It’s a special case, because for this study the synthetic respondents are based on detailed data profiles of real people (hence digital twins). If you like, it’s synthetic survey respondents taken to the nth degree. It has the additional advantage that results from surveying the digital twins can be directly compared with those from their real counterparts. As expected, these results are not great. Commendably, the authors differentiate between successfully predicting an individual’s response and successfully capturing the variation in response within a population. (It’s easy to get a high degree of accuracy in predicting the response to “Do you like pizza?” Just predict “yes” for everyone. Far more difficult - and useful - to be able to predict who likes pizza.) While the score on the first looks good (75%), the correlation for the second is more revealing (0.2). Note this supports the point I made previously, namely that synthetic survey respondents would reproduce trivial findings with a high degree of accuracy, but fail when it comes to reproducing surprising (and therefore valuable) findings.
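
The pizza point is easy to demonstrate. The Python sketch below uses a made-up survey of ten people (these are toy numbers, not data from the paper) and a lazy "twin" that answers yes for everyone.

```python
# Made-up survey: 1 = likes pizza, 0 = does not. Not data from the paper.
truth = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A lazy "twin" that answers yes on behalf of everyone.
predicted = [1] * len(truth)

# Share of individual responses predicted correctly.
accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)

# Of the people who dislike pizza, what share did the twin identify?
twin_answers_for_dislikers = [p for t, p in zip(truth, predicted) if t == 0]
share_dislikers_found = sum(p == 0 for p in twin_answers_for_dislikers) / len(
    twin_answers_for_dislikers
)

print("accuracy:", accuracy)
print("share of dislikers identified:", share_dislikers_found)
```

The accuracy looks respectable, yet the twin tells us nothing about who differs from the majority, which is usually the part worth paying for.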

Even more worrying is another “I told you so”: “Our analysis suggests that the accuracy of digital twins is uneven across demographic groups, with better alignment for participants who are more educated, higher income, and with moderate political views and religious attendance habits.” The authors speculate that these biases are “likely to come from the base LLM powering the digital twins”.

The closing statement of the paper is something I think should be stamped on the marketing material of all agencies proffering synthetic respondents.

Based on our results it may not be realistic to think about them as “clones” of humans, but rather as hyper-rational, quasi-omniscient versions of humans, with implicit values partly imbued by their base LLM.

Semi-supervised

In September I posted about the merits of package building as an approach to modelling, analysis and data science in general. Many of you found that useful, so I thought I’d follow it up by promoting another technique I’ve borrowed from software engineering - the unit test. Once again, half of you will hardly need convincing. If you are an ML engineer or at all involved in the production of software, then this is your bread and butter - you can stop reading now - but if you wandered into data science from the sciences or even further afield, this may be new to you.

To conduct a unit test, we isolate a component within a system, preferably a low-level component, and check that it performs as expected. The check involves submitting example inputs and checking that the component produces the right outputs. Because the component is relatively simple, it is usually straightforward to calculate the expected outputs by some other method. Particular attention is paid to constructing examples of edge cases that might break the component. The hope is that if all the parts are doing their job properly, then the system as a whole will be. (This isn’t always the case - failure can be at a system level - but most of the time it’s a good start.)

The usefulness of such a technique is obvious when it comes to building the kind of products and applications that need to withstand constant and varied use, and which cannot afford to fail. The advantages for one-off pieces of analysis, or offline data science processes, are less obvious. But they are there.

First, as the September post made clear, I think there’s a case for building modular packages for all but the lightest of data science tasks. The clarity it lends to your thinking and the rigour it adds to your work are worth the extra effort - and frankly, in the long run, as the complexity builds, you’ll save time. (I’ve written many a software package that has been used only once.) Once you’ve made this leap, the unit test is the natural step to ensure that your work is robust.

Let me give you an example. Something generic. You are building a process that makes a customer-level recommendation or a prediction, based on demographic and behavioural data. The pipeline involves various steps: validating the data; constructing new features (using some custom-built transformations); reducing the number of dimensions; perhaps some conditional logic that selects the most appropriate model; and then the application of the model. Each of these is a unit, and within each of them, there are potentially subunits.

Now I didn’t mention that unit testing is a very mature area of technology. It has seen much innovation, and most unit testing packages (for example, pytest) are rich in tools and features. One such feature is the fixture. Fixtures are objects that are reusable across tests, providing efficiency and standardisation. Thus, for the example given above, I would create, as a fixture, a test set of just a handful of customer records. I would design them to be as different as possible so as to flush out a wide range of issues. I would then construct unit tests for the various steps, using my fixture as the input data and constructing the expected output using some hand-cranked calculations. But, you object, the input to the later units is not the original customer data but some transformation of it. Fortunately fixtures can be created by applying code to other fixtures, so it is easy enough to create a new fixture for, say, the modelling module by running the original fixture through the preceding modules.
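
To make this concrete, here is a minimal sketch of what such a test file might look like with pytest. Everything in it is hypothetical: the customer records are made up, and `add_spend_per_visit` and `segment` are stand-ins I have invented for a feature step and a modelling step. Only `pytest.fixture` and the `test_` naming convention come from pytest itself.

```python
import pytest


def make_customers():
    """A handful of deliberately varied, made-up customer records."""
    return [
        {"id": 1, "spend": 120.0, "visits": 4},
        {"id": 2, "spend": 0.0, "visits": 0},  # edge case: no activity at all
        {"id": 3, "spend": 15.5, "visits": 1},
    ]


def add_spend_per_visit(records):
    """Hypothetical feature step: average spend per visit, zero when there are no visits."""
    return [
        {**r, "spend_per_visit": r["spend"] / r["visits"] if r["visits"] else 0.0}
        for r in records
    ]


def segment(record):
    """Hypothetical modelling step: a crude rule standing in for a real model."""
    return "high" if record["spend_per_visit"] >= 20 else "low"


@pytest.fixture
def customers():
    return make_customers()


@pytest.fixture
def customers_with_features(customers):
    # A fixture built by running another fixture through the preceding step.
    return add_spend_per_visit(customers)


def test_spend_per_visit(customers):
    result = add_spend_per_visit(customers)
    # Expected values worked out by hand: 120 / 4, the zero-visit case, 15.5 / 1.
    assert [r["spend_per_visit"] for r in result] == [30.0, 0.0, 15.5]


def test_segment(customers_with_features):
    assert [segment(r) for r in customers_with_features] == ["high", "low", "low"]
```

Saved under a name such as test_pipeline.py (a file name I have made up), this runs with `pytest test_pipeline.py`. The zero-visit customer is there on purpose: it is the record most likely to break a careless division.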

The pay-off for all this work is a feeling of near-complete security in an environment where extremely complex things are happening. Over time, you will build up a battery of unit tests, which you then run each time the code is changed. Modifying your pipeline will inevitably break some of your tests. Some of these failures will be expected (you will modify the tests to reflect the changes), but some will be unexpected, signalling an unintended consequence of your modification.

One last point. Sometimes, while developing a solution, I will begin with a prototype that I know will only get me part of the way to the answer (the virtues of this approach I described in an earlier post on toy models). At this point, I build unit tests for the prototype modules. Into these units I feed, as a fixture, some simplified data - so simple that the prototype is able to cope and produce a decent enough answer. Next, I increase the complexity of the data to the point that it will break the tests, motivating a new round of development, the goal of which is to produce code that will cope with the more complex data. This technique has a name - test-driven development. If I’m honest, I often break the rules and do the development before modifying the tests. But the important thing is that the tests co-develop with the code, providing ever-present guardrails and a better understanding of the implications of what you are writing.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia

From Coppelia

Life at the terminal grows richer (and more obsessive) day by day. My latest discovery is the perfect set of plugins for turning Vim into a writing tool: Goyo and Limelight for a nice distraction-free page, Pencil for word-wrapping, and Vale for customised style linting. I also discovered I could access my Obsidian vault in vim using the vimwiki plugin on my markdown files. No further customisation needed.
Coppelia is part of the Melt Collective, a small but very experienced group of mostly independent data science professionals. This month we are very happy to welcome three new members: Sara Gaspar, Adrià Luz, and Martin East. All fantastic at what they do.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


Glasseye

Thu, 30 Oct 2025 09:29:25 GMT

In this month’s issue:

Semi-supervised explains why AI is not coming for your job (as long as you are doing it properly).
A sober assessment of AI agents in the white stuff.
The dunghill deplores the unreasonable demands on human classifiers.

Plus lovely tmux, the purity of todo.txt and another helping of survey slop.

Semi-supervised

I’m going to stick my neck out and say that you are not going to lose your job to an LLM-powered coding agent. As this has been the worry of more than one colleague over the last month, I will back this up with an argument. If I’m wrong, I’m very sorry.

It’s a two-part argument: (1) Coding agents are currently, and for the foreseeable future, awful at doing anything new. (2) If you are good at your job - data scientist or statistician - you are always doing something new.

For the first part of the argument, so that you’re not just taking my word for it, I’m enlisting the help of OpenAI co-founder and Silicon Valley insider Andrej Karpathy. He’s no AI doomster, so, if anything, he should be biased in the other direction.

But in a recent (extremely interesting) interview, he said the words that so many of us were repeating silently to ourselves: “This is a damn fine auto-complete tool, great for boilerplate code and prototyping, but I’m not sure it’s good for much else.” Ok, these aren’t his exact words, but if anything he’s even more blunt. (If you don’t believe me, listen to this segment.) He also expresses frustration at the claims being made on behalf of LLM-powered agents (this is from the person who coined the term ‘vibe-coding’). This time I will quote him verbatim: “They’re just cognitively lacking, and it’s just not working. And I just think that it will take about a decade to work through all those issues.”

He makes three further comments which I can’t help but love him for:

First, “I think it’s kind of annoying to type out what I want in English, because it’s just too much typing.” Oh yes.

Second, “A lot of times, the value that I brought to the company was telling them not to use AI. I was the AI expert, and they described the problem, and I said, don’t use AI.” Oh but they listen to you Andrej!

And third, have a listen to his take on the operating model for Waymo cars, in particular his strong suspicion that sitting behind the apparent autonomy is a control room full of human telemetric operators.

But back to the argument. Karpathy’s observation is that the performance of LLM coding tools falls off when you are writing atypical code. (“They are not very good at code that has not been written before”) This certainly chimes with my experience, and that of many of my colleagues. In fact, it seems the secret to efficient use of such tools is to develop a sense for when you are straying into well-mapped territory (LLM autocomplete on) and when you are on the fringes (LLM autocomplete off). As I’ve said before, the single greatest efficiency hack is to set up a “shut up” shortcut key.

The second part of my argument rests on experience: when it comes to problem-solving in data science and statistics, in nearly three decades of graft, I’ve seen very few repeatable patterns. Problems can be similar, but never the same, and each difference requires a great deal of thought. In other words, each solution is a new solution. To those who refuse to believe me, I point to the shipwrecks of companies that tried to sell data science products, going right back to Autonomy in the early 2000s. I’m not talking about tools that will help a data scientist do their job, but rather off-the-shelf solutions that claim to automate away the data scientist. Whenever one appears, I do a little poking, and pretty soon I find the equivalent of Karpathy’s telemetric control centre - a roomful of STEM graduates, wondering why they are never mentioned in company presentations.

I don’t know exactly why reusable solutions are so rare in our line of work. If I had to guess, then it would be something to do with the fact that we typically solve problems that involve complex systems (businesses, supply chains, customer cycles), rather than components within systems, and these systems are always themselves unique.

So I have faith in the irreducible uniqueness of such problems, and from that I conclude that you and I will be fine. That said, if you spend your days regurgitating other people’s code, ignoring the specifics of your client’s problem, and shoehorning briefs into solutions that don’t fit, then you’d better look for another job. But I would have said that anyway.

Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia

The dunghill

The way to foil every act of terrorism is to incarcerate everyone. The way to prevent every case of domestic child abuse is to place every child in care. The way to catch every potentially fatal disease before it is too late is to monitor everyone, all of the time. Obviously this is stupid. But if we know this, then why, in each of these areas, is there so much moral indignation when anything less than 100 percent is achieved?

In terminology that will be familiar to most of you, the examples above obtained complete recall at the cost of abysmal precision. Recall is calculated as true positives/(true positives + false negatives). If we assume every citizen is a terrorist, then we will have no false negatives and the recall will be equal to 100 percent. Precision is calculated as true positives/(true positives + false positives). Since the vast majority of the population are not terrorists, the number of false positives will be enormous, and precision will be close to zero. I’m not telling you anything you don’t know already.
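
To make the arithmetic concrete, here is a minimal sketch of the "flag everyone" policy. The population size and the number of genuine cases are made-up figures, chosen only for illustration.

```python
def recall(tp: int, fn: int) -> float:
    """True positives / (true positives + false negatives)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """True positives / (true positives + false positives)."""
    return tp / (tp + fp)


# Hypothetical figures: 1,000,000 citizens, 10 of whom are genuine cases.
population = 1_000_000
genuine_cases = 10

# Policy: flag every citizen.
tp = genuine_cases               # every genuine case is flagged
fn = 0                           # nobody is missed
fp = population - genuine_cases  # everyone else is wrongly flagged

print(f"recall    = {recall(tp, fn):.0%}")
print(f"precision = {precision(tp, fp):.4%}")
```

Recall is perfect because the false negatives are zero by construction; precision is simply the base rate of genuine cases in the population.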

But then you are lucky. You have had years of training machine classifiers. You know, in particular, that for any classification problem two kinds of improvement are available:

There’s the expensive kind: you improve the performance of the classifier itself - that is, you make it better at predicting the probability that any given individual is A or not A. This should improve both the recall and precision, or at the very least improve one without penalising the other.
And the cheap kind: doing nothing to improve the performance of the classifier itself, and accepting the predicted probabilities as they are, you raise or lower the threshold probability for a ‘Yes’. Lower it to zero, and you have the situation we opened with: everyone is flagged and recall is 100 percent. Raise it towards one, and only the surest cases are flagged, so precision climbs while recall collapses.
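
The cheap kind of improvement can be sketched in a few lines. The scores and labels below are invented for illustration, not the output of any real classifier; the only thing that changes between rows is the threshold.

```python
def precision_recall(scores, labels, threshold):
    """Flag everyone whose score is >= threshold, then score the result."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, labels))
    fp = sum(f and not y for f, y in zip(flagged, labels))
    fn = sum((not f) and y for f, y in zip(flagged, labels))
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return precision, recall


# Hypothetical classifier output: predicted probabilities and true labels.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [False, False, False, True, False, False, True, True, False, True]

# Same classifier, same scores: only the threshold for a 'Yes' moves.
for threshold in (0.0, 0.3, 0.5, 0.75, 0.95):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With a threshold of zero everyone is flagged, so recall is perfect and precision is poor. As the threshold rises, precision improves and recall falls, without the classifier getting any better at its job.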

So what does this tell you about classification out in the real (much nastier than your notebook) world? First, the unscrupulous will try to sell us the cheap option, as though it were the expensive one. “The solutions are simple: lock everyone up, send everyone home; what’s all the fuss about?” Of course, here the chickens of abysmal precision will eventually come home to roost.

Second, there is in many cases a hard limit on how good the classifier - and now I mean a person or a process - can be. How can you know for sure whether someone will reoffend, or commit a lone wolf terror attack? The best practices, the greatest diligence, will take you up to this limit, but nothing short of supernatural powers will take you over it. Once that limit is reached, the only improvement available is the cheap kind: raise recall or precision, but always one at the expense of the other.

This means there are good, justifiable, thoughtful criticisms of classification failures, which ask why people or processes weren’t operating close to the limit of what is possible. But there are also petty and unreasonable criticisms, which blame people who could not have possibly done more. Once again, the cheap trick here is to pretend that it was always easy, and that perfect recall is achievable without cost to precision.

If you have some particularly noxious bullshit that you would like to share, then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

The white stuff

As you’ve probably guessed by now, I’m not at the hawkish end of the Great AI Debate. Last month I made a passing comment slagging off the buzzword agentic. To make amends, and prove that I’m not a total Luddite, this month I buried my head in AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges. It’s a well-written and level-headed paper, and earned my respect almost immediately by doing some desperately needed room-tidying on the agentic concept. The authors do this by differentiating between AI agents, “defined as autonomous software entities engineered for goal-directed task execution within bounded digital environments”, and agentic AI, an “emerging class of systems [that] extends the capabilities of traditional AI Agents by enabling multiple intelligent entities to collaboratively pursue goals through structured communication, shared memory, and dynamic role assignment.” This is a very useful distinction. Unfortunately no-one else seems to be up for it.

All in all the paper is a pretty sober account of both AI agents and agentic AI, providing a sensible review of the opportunities and an exhaustive, unsparing list of the challenges. Perhaps, like me, you’ve been worn down by the relentless concept creep and need reminding that there might be something in the idea; or perhaps you are an AI evangelist, puzzled by all the frowning going on. Either way you’d do well to read this paper - and maybe shake hands in the middle.

From Coppelia

I’m now into month three of my retreat from GUI-land to the safety of the terminal and I’ve finally figured out why I’m doing it. It’s simple, probably obvious, but it didn’t hit me until now: in a world of pure text those who wish to distract me with their buttons, images and sounds have been stripped of their powers. No wonder it’s such a peaceful place. My happy discoveries this month have been tmux (there’s something beautiful about the minimalist, bar-less panels into which it divides my screen) and todo.txt - the severest, most stripped-down todo list I’ve yet encountered.
I’m feeling partially vindicated this month in my crusade against synthetic respondents (AKA survey slop). Two papers have been published that voice similar concerns: The threat of analytic flexibility in using large language models to simulate human data: A call to attention and The Limits of Synthetic Samples in Survey Research. The second paper makes exactly the point I made here, about the likely failure of synthetic respondents when it comes to non-obvious insights: “While LLM-generated “synthetic samples” can approximate real-world population proportions on frequently asked and highly polarized poll questions, such as Donald Trump’s approval rating (LLM error was around 4 percentage points), LLMs badly predicted the public’s attitudes on less polarized and novel survey questions.” Unfortunately this point was missed in a further paper: LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings. No effort is made to separate out surprising (and therefore information-rich) findings from the bleeding obvious, and if we can’t see that, then we don’t know how well the technique is doing in precisely the area in which it would be useful. Many thanks to those who forwarded me the papers!

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


Glasseye

Wed, 24 Sep 2025 08:25:27 GMT

In this month’s issue:

Some annotated AI sophistry in the dunghill
Semi-supervised asks you to package things up
The secrets of icy-clear technical prose in the white stuff.

Plus cursor on the command line, a plot twist for last month’s dunghill, and at last some industry interest in the synthetic respondents brouhaha.

The dunghill

One of the British newspapers used to run a regular feature in which they would annotate a political speech or interview to explain to those outside the Westminster bubble what was really going on beneath the surface. I thought it would be interesting to do something similar for a sample of Silicon Valley AI speak.

So here’s a short segment from an interview with Anthropic’s CEO, Dario Amodei, that featured on the BBC’s Radical with Amol Rajan. It’s less than three minutes long, but it is nevertheless crammed full of verbal manoeuvring. I’ve numbered the sections I’m going to comment on.

Dario Amodei: And of course, AI models are starting to write code. About 90% of code at Anthropic is written now or at least suggested by AI models.(1) We use our own AI models internally, and I’ve heard the CEOs of large companies say the same. So we’ve gone from barely being able to put together a sentence to writing a lot of the production code at some of the biggest technology companies in the world, at the level of a scientist, a PhD level scientist.
And the fact that there’s this exponential, that, you know, we’ve gotten to this stage… if the progress continues for even a couple of years beyond that, we may get to levels where the models are capable of doing things like making new biomedical discoveries, proposing a new molecular structure of a drug. We’ve already seen some of that with things like AlphaFold that you’ve seen Google did in the UK. (2) And we’re starting to see LLMs participate in things like this. It’s small, but we’re already working with pharmaceutical companies to use LLMs to speed the approval of clinical trials or something called a clinical study report. (3) And usually that takes nine or ten weeks to do. It’s kind of a summary of the results of a clinical trial. We’ve gotten that time down to less than one week with LLMs. So this is now compressed by eight weeks, the amount of time it takes to approve a drug.
Amol Rajan: And there’s a very good practical example I actually covered on the radio this morning, the day we’re speaking, which is that artificial intelligence is being used to identify the causes of a stroke in people who come to a hospital very, very quickly.(4) Was it a burst blood vessel or was it a blocked blood vessel? That often requires very specific knowledge and artificial intelligence can be used to augment what a doctor does.
You know, it’s interesting just listening to you, right? Because if you follow this field closely as I’ve done, you report on it, you read the books, you listen to the podcast (5), different big players, and you are one of the biggest players, are known for different things. And Anthropic, the clue is in the name, is trying to make through Claude a more humane AI, something that has kind of something that’s a bit more the complete human.

Commentary

(1) “At least suggested by AI models” - this might seem like an innocent enough qualification, but in fact it is doing most of the work. It would be nice to know, for example, what percentage of the 90% was suggested as opposed to taken as read. The reason this matters is that the big sell (and, for many, the big fear) with AI is that it will be deployed in situations where it can act autonomously. For this to become a reality, we need to reach a point where certain types of error (hallucinations and the kind of common-sense errors that occur because LLMs do not inhabit our world) are near enough eradicated. Now it looks like many in Silicon Valley are coming round to the idea that this is not going to be possible, not just practically but also in principle, and this has prompted some quiet back-pedalling. A rather brilliant pivot is to place the responsibility for being right on human beings, and limit the role of an LLM to that of an insightful but unreliable critic, who provides interesting suggestions, all of which must be taken with a pinch of salt. That way, if they suggest something that’s clearly nuts, it can be brushed off with no harm done. This seems to be exactly what Jack Clark, another Anthropic key player, is saying in a recent interview on the Newsagents podcast: “You know, today lots of people use these systems [LLMs] to learn, but some of them use these systems to do junk food learning and some of them use these systems to do effective learning. Junk food learning is: upload a research paper to the system and say, tell me what this research paper is about and then read the output. You haven’t actually learned anything there. You’ve just become dependent on the machine in a way that doesn’t help anyone. The way that I use these systems, and many do, is I read a research paper, I write out what I understand that paper to mean and then I upload the paper and my understanding of it to the system and say, do I have this right? And if I don’t have it right, explain to me. That’s useful learning because the system reads the paper, reads my explanation and tells me whether I got it right or wrong, just like a colleague. If we use these things in the right way, they can help us be a lot more capable and a lot smarter.” This is all very true. It is exactly how we use these systems to code, and they are enormously useful. But saying that the LLM is “just like a colleague” is misleading. There is a reason they have been relegated to the passenger seat and are not allowed to touch the steering wheel.
(2) AlphaFold is not an LLM. It does share with LLMs a transformer architecture but one that is specifically designed for the job of predicting protein structures. All credit to Amodei, he doesn’t do the usual thing, which is to pretend that AlphaFold and LLMs are essentially the same thing (see point 4 below) but, as his next sentence shows, he is undoubtedly leaning on its achievements to create the impression that similar successes for LLMs are just around the corner.
(3) A report writing tool then? Something that suggests or summarises text? Well, it had better not be the junk food kind that Jack Clark has just warned us about. But if it is not going to be junk food then, as Clark points out, it needs a human being as the primary authority. This is a very different proposition to AlphaFold, which can operate without supervision. It is a stretch then to say that LLMs are starting to “participate in things like this”, unless you qualify the kind of participation.
(4) This one is on Amol, and just about every journalist who has written or spoken about AI since 2020. The fact is that over the last decade or so the word “AI” has meant:
a. A multidisciplinary project to create autonomous agents that can act intelligently in complex environments (pre-2010).
b. A small but suddenly very successful sub-discipline within that project, i.e. machine learning using deep neural networks (roughly 2010-2015).
c. Just about everything that sat inside a computer (roughly 2015-2020: if I remember rightly, there was a brief interregnum before the arrival of LLMs where no one was quite sure what AI was and anything could be sold under its banner without any mention of a chatbot).
d. An even smaller but phenomenally successful sub-field (LLMs) of the previously mentioned sub-discipline (2020 to now).
So much overuse has left the term very slippery indeed. The most common slip is between what it means to most people now (ChatGPT or a variant) and the many things it has been in the past (and thanks to step c there’s not much it hasn’t been). This is very fortunate for those who wish to make a strong case for the practical usefulness of AI (in sense d) because they can draw on a, b and c for myriad examples, despite the fact that these examples are only distantly related. I think most people would be surprised to learn that the technology at work in every NHS AI success story (scanning, detecting, imaging, etc), including the one mentioned by Rajan, has nothing to do with generative AI and everything to do with slow and steady progress in a field (machine learning) that has been trundling along for twenty or so years.
(5) From a journalistic point of view, I think this gets to the heart of the problem. The podcasts, the books, the speakers, these are all an obfuscating layer between the journalist and the truth about the technology. And who can blame them for not digging deep enough? There are lots of big names, with convincing qualifications, some with Nobel prizes, telling fascinating, newsworthy stories about the end of humanity. Who would want the truth?

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

The white stuff

It is a lazy stereotype about technical people that they are poor communicators. When I get a three-word email with a hundred possible meanings it has invariably come from a “people person” whose charisma has not made it into the text. Nevertheless there is always room for improvement, which is why I’m recommending two books to sharpen your prose. The first is Steven Pinker’s The Sense of Style: The Thinking Person’s Guide to Writing in the 21st Century. Whatever your views on Pinker’s relentless enlightenment positivism, this is an incredibly useful book, for at least two reasons: first, Pinker’s academic background was originally linguistics, which means that he can convincingly give you the ‘why’ as well as the ‘what’ for each linguistic rule (and show pedants the door); second, he is himself a writer of clear and engaging scientific prose (an understatement), and can explain how he has achieved this.

The second book, Clear and Simple as the Truth: Writing Classic Prose, is heavily referenced by Pinker and with good reason - it is an argument for something the authors call the ‘classic style’ - a conversational writing style that emphasises brevity and directness.

It doesn’t matter that you are not personally writing a book, or an article. Both of the above will change the way you write an email or text message; even - under some circumstances - the words that come out of your mouth.

Semi-supervised

Of the data scientists I work with my assumption is that one half started out as software engineers and drifted data-wards, and the other half came from a variety of numerate but not necessarily IT-intensive disciplines: natural sciences (a lot of physics students for some reason), economics, social sciences etc. My unproven, armchair theory is that the latter group gravitates towards Jupyter (or R) notebooks as a way of working since the format here is closest to the more familiar paper or essay format. If this is true then these are the people I’m talking to today. The first group will need no convincing.

Because I’d like to talk up the merits of package building as an approach to modelling, analysis and data science in general. By package building I mean assembling a core of reusable, configurable, documented, tested code with a decent interface. If you are one of the many who entered data science through the notebooks of online data science courses then it might not have even occurred to you that this is a way of working. Or you may have dismissed it as hugely inefficient given your limited, one-off goals. But hear me out.

First, I’m not suggesting you write a python package for every ad-hoc analysis request that comes your way. That would be silly. Nevertheless there are signs that a package is calling out to you from the pages of your notebook: repeated blocks of code with minor variations, sometimes coagulating into functions that you run at the top of the notebook; a growing feeling that you lack control over a complex problem; the multiplication of notebooks with titles that sound like software processes. (Note that once you are in the habit of package building things will never again get this out of hand - the problem type, or its objectives will have demanded a package way in advance.)

Second, let me point out that by building a package you will be availing yourself of problem solving techniques that have been refined over fifty years of software design, using simple but powerful concepts such as object-orientation, encapsulation, separation of concerns, statelessness, chaining. These are invaluable for organising your thoughts and then your code. If you doubt this then consider how they are all used in the analysis and modelling packages you already find so useful.
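
As a rough illustration of what these concepts look like in a small analysis package, here is a minimal sketch. The names `CleanConfig`, `Series`, `clip` and `demean` are all hypothetical, invented for this example; they do not come from any existing library.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class CleanConfig:
    """Configuration lives in one place instead of being scattered across cells."""
    lower: float = 0.0
    upper: float = 100.0


@dataclass(frozen=True)
class Series:
    """An immutable wrapper: every method returns a new Series (statelessness)."""
    values: tuple

    @classmethod
    def from_raw(cls, raw: Sequence[float]) -> "Series":
        return cls(tuple(raw))

    def clip(self, config: CleanConfig) -> "Series":
        """Clamp each value into the configured [lower, upper] range."""
        return Series(tuple(min(max(v, config.lower), config.upper) for v in self.values))

    def demean(self) -> "Series":
        """Subtract the mean so the series is centred on zero."""
        centre = mean(self.values)
        return Series(tuple(v - centre for v in self.values))


# Chaining: each step is small, testable and reusable across projects.
result = Series.from_raw([-5, 10, 20, 250]).clip(CleanConfig()).demean()
print(result.values)
```

The notebook equivalent would be three or four cells mutating the same variable. Here each step has a name, a docstring and no hidden state, which is what makes it possible to test and reuse.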

Third, note that by entering the world of software design you will be helping yourself to a much richer toolset for managing your work - tools for managing virtual environments, tracking changes, testing and debugging your code.

Fourth, you’ll see that software development comes with a well-honed set of processes for managing people and projects. They were invented to prevent well-meaning meddlers from wrecking development projects, so how could they not be useful to you in managing upwards? Log issues, separate them into bugs and enhancements, provide your output in a bundled release of data and model after some thorough unit and system testing. Insist people wait for the next release.

Fifth, are you planning on collaborating with others on your project? Ever tried collaborating on a notebook? I don’t think I need say more.

Finally, delivering a package at the end of a project means delivering a reusable, modifiable, living, breathing thing, hopefully with a nice API that will make it easy to use. The client gets a tool rather than a dead-on-arrival report or presentation; you get some new skills and some job satisfaction.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppe

From Coppelia

The flight to the command line mentioned in last month’s glasseye has been given an extra boost by the discovery of the cursor CLI, a newish CLI tool that comes with the standard cursor subscription. I’m too tight to fork out for Claude Code so was very happy to find this. It’s a little erratic but so far so good.
A plot twist for last month’s dunghill: when I feed the offending Campaign article into ZeroGPT (the tool for detecting LLM written prose mentioned in the white stuff) I get the response: “We are highly confident this text was AI generated”. I think the Campaign article is bad but it’s not (I hope) written by genAI. So how do we interpret that result? Does it imply that the Campaign article is very human, in the sense of being very average/bland and written in a style so heavily represented in the training data (which contains millions of similar articles) that it is quite typical of what a model trained on that data would spit out? Or, to take the point about fine-tuning steps producing less human responses, is the article very un-human? Or is ZeroGPT just not good at its job?
I’m very happy to report that someone, somewhere is pushing back on the synthetic respondents bullshit that we have been railing against for well over a year now. I have been talking to a couple of people in large research agencies about how to gently point out to well meaning peers that the emperor is butt-naked. If anyone else is having such problems, I’m offering my services for free on this one!

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


Glasseye

Wed, 27 Aug 2025 13:06:13 GMT

In this month’s issue:

Semi-supervised offers some advice to the tongue-tied frequentist
Synthetic populations - the latest synthetic bullshit to find its way onto the dunghill
The white stuff asks whether humans still have the edge when it comes to good quality prose.

Plus an abortive retreat into spacemacs, talking to the dead, and some experimental vibes.

Semi-supervised

One of the first things we learn (and perhaps the last we understand) in any introductory statistics course is that a confidence interval is not to be understood as a range that contains the population parameter with a given probability. While it’s nice to know what a confidence interval is not, most people, and more pointedly most employers, would like to know what it is. And, what’s worse, they expect this to be explained in no more than two or three sentences.

This is the dilemma I’d like to help you with today. What is the best wording to explain something you know to be fiendishly counterintuitive, for an audience with zero interest in the philosophy of probability?

(And yes it’s a trap we all escape by going Bayesian. But the most die-hard of Bayesians still has to live in a world in which their boss or client is more likely to be familiar with confidence intervals than credible intervals. At a minimum they would like to see you explain the former before they buy the latter.)

So first a very quick, since it is explained ad nauseam on the web, recap of why interpreting a confidence interval is difficult in the first place. For a frequentist, a probability is a property of a series of experiments. It is the frequency at which a given event occurs during these experiments, or, more precisely, the limit approached by that frequency as the number of experiments tends to infinity. This severely limits the kind of thing that can have a probability. Single experiments cannot have a probability. A finite group of experiments cannot have a probability. Thus, from a frequentist perspective, it is nonsense to talk about the probability of obtaining a six in a single die roll. Only a potentially infinite series of such rolls has a probability.
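
For anyone who prefers to see the frequentist definition in action, here is a small simulation sketch. It assumes a fair six-sided die and uses a fixed random seed so that the run is repeatable; the roll counts are arbitrary.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable


def relative_frequency_of_six(n_rolls: int) -> float:
    """Roll a fair die n_rolls times and return the share of sixes."""
    sixes = sum(rng.randint(1, 6) == 6 for _ in range(n_rolls))
    return sixes / n_rolls


# The frequency wanders for short series and settles as the series grows.
for n in (10, 100, 10_000, 100_000):
    print(f"{n:>7} rolls: frequency of six = {relative_frequency_of_six(n):.4f}")
```

The probability, on this view, is the value the printed frequency settles towards as the series lengthens (1/6 for a fair die). No single roll in the series carries a probability of its own.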

The drawing of the sample used to build a confidence interval counts as a single experiment. Whether or not the resulting confidence interval contains the population parameter is a single event, like getting a six with a single die roll. Therefore, it is not the kind of thing that, according to the frequentist, can have a probability. What can have a probability is the event that the confidence interval contains the population parameter when the sampling and confidence interval calculation are endlessly repeated. Why? Because then it can have a frequency, and that frequency can have a limit.
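
The repeated-sampling idea can be sketched as a simulation. The assumptions are mine and chosen for simplicity: a normal population, a known standard deviation, and made-up values for the true mean and the sample size. Only in a simulation do we get to know the true value and check each interval against it.

```python
import random
from math import sqrt

TRUE_MEAN = 50.0   # the population parameter (known here only because we simulate)
SIGMA = 10.0       # population standard deviation, assumed known for simplicity
N = 30             # sample size per experiment
Z_95 = 1.96        # normal critical value for a 95% interval


def interval_contains_truth(rng: random.Random) -> bool:
    """Draw one sample, build one 95% interval, report whether it covers TRUE_MEAN."""
    sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    centre = sum(sample) / N
    half_width = Z_95 * SIGMA / sqrt(N)
    return centre - half_width <= TRUE_MEAN <= centre + half_width


def coverage(n_experiments: int, seed: int = 0) -> float:
    """Share of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = sum(interval_contains_truth(rng) for _ in range(n_experiments))
    return hits / n_experiments


print(f"coverage over 5,000 repeats: {coverage(5_000):.3f}")
```

Each individual interval either contains the true mean or it does not. The 95% belongs to the procedure: it is the share of intervals that hit the target when the whole exercise is repeated many times.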

Hence to the frequentist, the otherwise very fair-sounding question: “What is the probability that the range you have just given me contains the true value?” is simply nonsense, alongside “What is the probability that it will rain tomorrow?” or “What is the probability that the next card is the king of spades?” The correct response is an eye roll.

How did the founders of frequentist statistics get away with such an extraordinary cop out? How did they avoid the flak we’d now get if we even attempted to pull this off? Part of the explanation is the intellectual climate in Europe in the early part of the twentieth century. This was a period in which people took seriously the idea of creating a grounded, rigorous scientific language that was quite separate from the language of ordinary people. The frequentists were co-opting the term “probability” for this language. Ordinary people could still make primal grunting sounds that vaguely indicated degrees of belief (just as they could make them in appreciation of artworks or ethical positions) but this was not the precisely defined thing that now went by the name of probability.

So this is the core of the dilemma: because it is so widely adopted, we can’t abandon the tool that we have inherited, but neither can we adopt the methodologically correct position with respect to its use - i.e. blink a lot and claim not to understand the question. Are there no words that can get us out of this?

If you look through the textbooks and the literature, what mostly happens is that probability sneaks back into the interpretation of the confidence interval through phrases such as “likely to be” or “plausible range of values”. This is unsurprising: the frequentist might have run off with the word “probability” but we still need to somehow express degrees of belief. Hence we lean on the remaining probabilistic terms. My preferred wording is still guilty of this to some extent, but I think gets across, by its connotations, the proper frequentist position. Here it is:

The population parameter is between x and y. The technique we are using to provide this range is reliable. It is right 95% of the time.

I like this wording because it separates the claim, “the population parameter is between x and y”, about which - keeping the frequentist happy - nothing probabilistic is being said, from the reliability of the technique, which, as a series of repeated experiments, can be given a probability. It thus encourages the reader to make the right kind of inference - along the lines of “Will this car break down? … I don’t know but this make of car is very reliable.”

Many thanks to Adrià Luz, whose deceptively simple question about presenting confidence intervals led to this post.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.

The white stuff

I have a theory, actually more of a hope, that LLM-generated prose is not the end of journalism and other kinds of professional writing, but a revitalising force - faced with an inferior and untrustworthy alternative, consumers will once again see the value of good, well-researched prose, and be willing to pay for it. Naive? Luckily I can check my armchair theories with an industry expert. If you have not already subscribed to Chris Duncan’s substack on the intersection of copyright law, media industry, and AI then I urge you to do so. It’s the good prose I’m talking about.

When I expressed my hopes to Chris, he was less optimistic, pointing out that “it is equally likely that the impact [of the proliferation of AI generated content] is for consumers to mistrust all information equally - same thing happened to politicians during the expenses scandal, where one party sinned and all parties suffered.” But Chris did agree that “the real opportunity for publishers is to spend marketing money to show that they are run by humans who are on your side, and that there is a human touch to original sources that gets diluted by machine abstraction.”

The idea of there being a “human touch to original sources” prompted a dive into the latest research comparing AI-generated and human-generated text. Here are a couple of interesting papers. Differentiating between human-written and AI-generated texts using linguistic features automatically extracted from an online computational tool by Georgios P. Georgiou takes a statistical approach to uncovering the differences. Aside from making me very curious about the nature of approximants, fricatives, laterals, nasals, and plosives, it was helpful for confirming what most of us already feel: AI-generated content is different but in subtle ways. The paper cites previous work showing that there is less variation in sentence structure in AI-generated text:

AI-generated essays exhibited a high degree of structural uniformity, exemplified by identical introductions to concluding sections across all ChatGPT essays. Furthermore, initial sentences in each essay tended to start with a generalized statement using key concepts from the essay topics, reflecting a structured approach typical of argumentative essays.

Georgiou’s study adds to these findings by spelling out the structural differences: “the AI text used more coordinating conjunctions, nouns, and pronouns, while the human text used more adpositions, auxiliaries, and verbs.”

A second paper, AI vs. Human - Differentiation Analysis of Scientific Content Generation by Ma et al., comes to a similar conclusion, but rather than using statistical analysis to examine the differences it (among other things) looks at the features used by the RoBERTa-based OpenAI Detector to differentiate GPT content from human text. The paper is a little old so I wondered about the current state of AI content detectors. This in turn led me to something I can barely get my head around - the use of such detectors to identify passages in AI-generated content that look too obviously like they were generated by AI, so that humans can tweak them by adding a human touch. Grammarly is explicit about this use case for their detection tool: “Grammarly’s AI content detector and writing assistant assess your work for you, so you know exactly where to refine and polish to make sure it’s authentically yours.” To see how this is actually being used, check out this quote from an American high school student reported in The Important Work (which I came across in Cognitive Resonance):

For me, William, and my classmates, there’s neither moral hand-wringing nor curiosity about AI as a novelty or a learning aid. For us, it’s simply a tool that enables us not to have to think for ourselves. We don’t care when our teachers tell us to be ethical or mindful with generative AI like ChatGPT. We don’t think twice about feeding it entire assignments and plugging its AI slop into AI humanizing tools before checking the outcome with myriad AI detectors to make sure teachers can’t catch us. Handwritten assignments? We’ll happily transcribe AI output onto physical paper.
Last year, my science teacher did a “responsible AI use” lecture in preparation for a multiweek take-home paper. We were told to “use it as a tool” and “thinking partner.” As I glanced around the classroom, I saw that many students had already generated entire drafts before our teacher had finished going over the rubric.

A better use of these tools is to reassure yourself that you are, in fact, human. I fed the above into GPTZero (another detector) and I am pleased to report that …

P.S. For a fun activity try submitting the Campaign article mentioned in the Dunghill into the same detector.

The dunghill

It certainly feels like ‘agentic’ and ‘synthetic’ are the bullshit terms of the moment. Bullshit evolves, so perhaps it is significant that they are sly little adjectives and not the hefty nouns we’ve grown wary of (compare ‘big data’, ‘block-chain’, and ‘AI’ itself).

Let’s leave ‘agentic’ for another time. It’s too slippery and frankly too irritating. ‘Synthetic’ we’ve seen several times already: synthetic respondents, the boosting of sample sizes with synthetic data. All good things come in threes, so it’s no surprise that Scott Thompson has dug out another one for me - synthetic populations.

Unfortunately, the article he found, Synthetic populations could be the future of media planning, is behind a paywall. This might be for your own protection. As Scott pointed out to me, there are so many things wrong here that it is difficult to know where to start.

But start we must, so I’ll begin with this: I can see here a move very common to AI-related bullshit. This is to slip between, on the one hand, well-established, sometimes decades-old techniques, and, on the other, highly speculative, unproven or nonsensical ideas, and to do this as though they were the same thing. This works because if you want concrete examples of successful applications of the technology then you fall back on the former; if you want to wax futuristic you glide back to the latter. Sometimes this is done cynically, in full knowledge that the two are in fact different; at other times it is done in ignorance. (A classic example can be seen in the way that successive British governments have promoted their plans for AI within the NHS. When they want to talk up its successes, they refer to the use of AI in screening and scanning - old school machine learning no doubt, but talked about as though it were one and the same with new school generative AI, whose successes have yet to be seen.)

Synthetic populations is just this again. The old school tech is agent-based modelling, the dubious parvenu is of course the whisker-twiddling LLM. Let me talk you through it.

The article begins in full-on prophetic mode:

As media planning moves into an era shaped by privacy, AI, and unpredictability, a radical new concept is poised to redefine how we think about audiences: synthetic populations. These virtual societies, built from real-world data and fuelled by machine learning, could give marketers the ability to simulate campaigns before they ever go live.

Synthetic populations are not new. They have been around for decades as an essential component of agent-based modelling (a simulation technique that involves modelling the interactions between agents whose characteristics are distributed to mirror real-world populations). And the idea of using agent-based modelling to “simulate campaigns before they ever go live” is just as old. So perhaps the author means something different? Let’s look at some of his ambitions for synthetic populations:

Rather than extrapolating from panels or historical data, synthetic populations offer a proactive approach. Planners can model out-of-home exposure, simulate podcast engagement, or test social creative across varied personas and life stages.
Want to understand how a price-sensitive household in the Midlands might respond to a supermarket ad during an economic downturn? Or if a premium beauty product resonates more in urban rentals than suburban family homes? These are scenarios synthetic models could answer – before a campaign goes live.

Now admittedly an agent-based model is going to struggle to do any of this. Their strong suit is in helping us understand the behaviour of systems which, because of feedback loops and complex interactions, are too difficult to understand without simulation. If we want to understand the effect of a premium beauty product using an ABM then we must provide data about these beauty products and specify rules about how people behave towards them, given their demographic profiles. But the implication of the above is that any impromptu query can be fired at the synthetic population.

So what is being sold here under the banner of synthetic populations? Well it’s not explicitly mentioned but I think we can all guess: the plan is for agents to be driven by LLMs, that is they will make decisions based on prompts that specify their individual profiles and backgrounds, as well as the details of the decision.

Everything that is wrong about synthetic respondents in survey panels, as detailed in previous posts, is just as wrong here. We have no idea how the LLM training data relates to the population we are simulating. We might think we can overcome this by comparing the behaviour of our simulated populations with real data, but the broad patterns used in such comparisons are rarely the ones that people are interested in. (As the quote above shows, synthetic populations are being sold as a way of answering very specific questions, and there’s no way of ensuring algorithmic fidelity on these points without conducting your own real world research - at which point why would you need the synthetic population?)

But credit where credit is due - the article does at least present a theory that is falsifiable. In fact, better than that, it contains, to use Popper’s language, bold, novel predictions. Take the one mentioned above: in the near future we will be able to successfully predict how a price-sensitive household in the Midlands will respond to a supermarket ad during an economic downturn. Not just the average price-sensitive Midlands household mind - this will be specific “household level insight”. Now that’s bold.

ABMs were never great at predictions: too many assumptions were needed, and too many models with very different outcomes fit the available data - but at least we knew what the assumptions were. My prediction about the future of campaign planning is considerably less bold. I think such synthetic populations, if they ever come to be, will be awful predictors of campaign performance. I also predict that, given the millions of hidden assumptions, we will have no idea why.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

As I think I mentioned before, in uncertain times I feel an irrational pull back to the terminal and the command line. I don’t think I’m alone in this. It’s probably a need for total control mixed with 80s nostalgia. Anyway, I’ve always been a vim guy and never really looked at emacs. However this month I discovered spacemacs, the Emacs advanced Kit focused on Evil. Which sounds amazing, however you may be disappointed to find that EVIL stands for extensible vi layer for Emacs, which means I can use my vim bindings while taking advantage of the many emacs extensions. It’s a wonderful thing to see your email inbox in a text editor. The moment of fear and paranoia has passed, however, and it might be a while before I revisit it.

Thank you to Mark Bulling for taking me through his pragmatic, realist approach to vibe coding. It was the perfect corrective to all the evangelistic stuff I’m seeing online. Perhaps more on this next month.

A truly weird variant of the synthetic respondent idea appeared in the Newsagents this month. “Reflekta, allows people to create digital versions of deceased relatives by feeding the system memories, text, and voice recordings, enabling ongoing conversations for a monthly subscription fee.” This is no more your grandad than the synthetic panellist is a billionaire tech bro. But it’s a lot sicker (Pet Sematary vibes) and feels more like a kidnapping. Pay or grandpa gets it (although he’s actually already had it.) Please stop.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


Glasseye

Tue, 29 Jul 2025 08:48:14 GMT

In this month’s issue:

As the government announces that it is considering using machine learning to estimate the age of asylum seekers, the dunghill calls for transparency about trade-offs.
Semi-supervised talks up the virtues of a good old-fashioned analogy.
We take a look at some of the beautiful, extravagant, but ultimately doomed attempts to communicate the inner workings of AI in the white stuff.

Plus more synthetic respondent lunacy, my continuing adventures in Rust, and yet another concept diagram.

Semi-supervised

A final, for now, tip for solving complex problems in data science. I don't have a name for this technique, except perhaps “domain-switching”, or “working with an analogy”. And anyway, it would be rather presumptuous to name it, since it's as old as the hills. Nevertheless, it is rarely identified explicitly; and I have yet to see its virtues spelt out in a way that does it justice. I'm talking about the following process: faced with a difficult problem - one with many moving parts, full of unknowns and uncertainties - we identify the equivalent problem in a completely different domain and try to solve it there.

An obvious example is the transfer of a business problem to its equivalent in medicine: customers become patients, marketing campaigns become medical trials, behaviours become symptoms, etc. We are encouraged in this direction because so much of statistics was originally developed to solve medical problems, and so the language of treatment,1 trials and control suggests this possibility. Traditional statistics hints at many other potentially fruitful problem transfers: Games (guessing games, card games, games with dice, strategy games) are an ever-useful domain in which to recast your problem. The probabilist’s urn full of coloured pebbles will help ground many a mind-bending probability scenario. I often shift my problems to idealised laboratories or to fields of crops (a hangover from 50s stats textbooks). I have others that are a bit more unusual: cookery, robots, space probes, village shops, plagues.

But why does it work? What’s in an analogy? I think at least four useful things happen when you switch domain:

First, it forces you to think about which features of the real world are important for solving the problem, since these are the features that will need an equivalent in the new domain. Using the language of a previous post, it forces you to sort out your ontology.
Second, it gets your thinking out of a rut. Your previous attempts to solve the problem in the original domain have been hampered by assumptions that you didn’t know you were making but which show up as optional in the new domain. Plus the analogous objects in the new domain may have ways of interacting which weren’t at all obvious in the old one. Even the differences between the two domains can be enlightening. Perhaps you are thinking of customer subscriptions as though they were human lifespans (to make use of survival models from medicine and insurance), but think carefully: a customer can disappear for a while and then resubscribe. Does your survival model handle resurrections?
Third, it takes you towards an understanding of your problem in the abstract but without leaving behind the concrete. Perhaps your problem is hard because it involves some tricky features: feedback loops, an exploding number of dimensions, counterintuitive concepts from probability, to name just a few. Stepping straight into such difficult terrain, you can easily get lost, but by considering the problem first in a new domain, you can start to understand these abstractions as the things that the two scenarios have in common.
Finally, it is invaluable for explaining ideas and for joint problem-solving. Take the situation I described in last month’s dunghill - in which you are accused of overthinking a problem. The frustrated client cannot see the issue you are raising, probably because one of you is making an undeclared assumption. When the problem is transferred to another domain, not only is it neutral territory (so everyone can calm down a bit!), but both of you are forced to explain what you assumed was obvious, as you spell out what the problem now looks like.

Of course the topic of the hour is problem-solving with large language models, and in particular explaining where we think they will work, and where not, and why. I’m still working on alternative domains for that one - any suggestions are most welcome.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.

The white stuff

Not your conventional white papers this month, but rather two very similar and equally beautiful attempts to go beyond paper and explore the interactive possibilities of the browser. Distill was a short-lived, peer-reviewed online journal “founded as an adapter between traditional and online scientific publishing”. Eventually, the burden of producing these stunning interactive articles proved too much, and they shut down in July 2021. The droplet logo now looks more like a teardrop, and the farewell post is very sad. The articles are still there though and are well worth exploring. I particularly enjoyed A Visual Exploration of Gaussian Processes.

Pretty interesting - an example of Pair’s explorables

Still going, probably because of infinite funding, are the explorables produced by Pair, a collaborative team of developers and designers within Google. These are, I think, the same people who produced my all-time favourite: the neural network playground. (I honestly believe you could replace a day's training on deep learning with a couple of hours alone with this visualisation.) Again, some of the articles are old (by which I mean three or four years!), but this hardly matters when the last five years have been about scale. I like this one in particular.

The dunghill

It was reported in The Guardian last week that the government is considering using an AI (by which it means the kind of algorithm that for decades went by the more modest name of ‘machine learning classifier’) to verify the ages of child asylum seekers. Now, I don't know if this is a good idea, but what I do know - since it is almost inevitable - is that the solution will be mis-sold on the strength of its overall accuracy. How do I know this?

The proposed solution, says the immigration minister, Angela Eagle, will involve an algorithm “trained on millions of images where an individual’s age is verifiable”. The resulting “facial age estimation” tool will then be used in borderline cases to predict whether an asylum seeker is under 18. If this is the case, then there are two ways in which this classifier could get things wrong: it could classify an individual as a child (i.e. as under 18) when they are not (a false positive); or it could fail to classify an individual as a child when they are (a false negative). The two errors have very different consequences (or costs, as we would say in the machine learning world). In the first case, a person might undeservedly receive the treatment reserved for an unaccompanied minor (the expense of providing foster care, education and safe-guarding for an unaccompanied minor is, I understand, many times the cost of housing an adult), contributing, in the long run, to the erosion of trust in the asylum system. In the second case, the state will have failed in its duty to protect a vulnerable child. These two types of error are related. For any given classifier, we can reduce the number of false positives but only at the expense of creating more false negatives. In our case, we can set the “facial age” classifier so that more of those falsely claiming to be children are identified, but only at the cost of rejecting a greater number of genuine claims.

Now comes the part that will surprise most people who do not work in machine learning. The threshold for this tradeoff is almost always set subjectively, during the training of the classifier, as the desired balance between precision (measuring the number of false positives) and recall (measuring the number of false negatives). This means that someone, somewhere, will need to decide how to balance the harm caused to a child misidentified as an adult against the cost incurred when one person abuses the asylum system - an unenviable task for anyone with a conscience2, and one that is both political and likely to be made under pressure to deliver on promises or contracts.
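To make the trade-off concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the scores are random numbers drawn from two overlapping bell curves, not the output of any real age-estimation tool, and the function name confusion_counts is hypothetical. A "positive" means "classified as a child", as above.

```python
import random

def confusion_counts(scores, is_child, threshold):
    """Count outcomes when 'child' is predicted for scores at or above the threshold."""
    tp = sum(1 for s, c in zip(scores, is_child) if s >= threshold and c)
    fp = sum(1 for s, c in zip(scores, is_child) if s >= threshold and not c)
    fn = sum(1 for s, c in zip(scores, is_child) if s < threshold and c)
    return tp, fp, fn

rng = random.Random(0)
# Made-up data: genuine children tend to score higher, but the two groups overlap.
is_child = [True] * 500 + [False] * 500
scores = ([rng.gauss(0.6, 0.15) for _ in range(500)]
          + [rng.gauss(0.4, 0.15) for _ in range(500)])

rows = []
for threshold in (0.3, 0.4, 0.5, 0.6, 0.7):
    tp, fp, fn = confusion_counts(scores, is_child, threshold)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn)
    rows.append((threshold, fp, fn))
    print(f"threshold={threshold:.1f}  FP={fp:3d}  FN={fn:3d}  "
          f"precision={precision:.2f}  recall={recall:.2f}")
```

Raising the threshold can only remove cases from the "child" pile, so the false positives never go up and the false negatives never go down. Which row of that printout counts as acceptable is not something the code can tell you.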

Of course we wouldn’t have to trade off false positives and false negatives at all if our classifier were perfect, and the name of the game is to improve classifiers so that less of a trade-off needs to be made. But in almost all scenarios, a perfect classifier is not going to happen, since the decision-making environment is, in the language of a previous post, stochastic and partially observable.3 Trade-offs cannot be eliminated by just working harder. And this is obviously true in our case: a person’s face is only part of the information needed to work out their age, and presumably quite an unreliable part at that (would extreme trauma not prematurely age a face?) No amount of training and tinkering with the algorithm is going to change this.

And so the question is: will anyone go to the trouble of explaining the above to all those charities, government departments, and ministers who have a stake in the asylum process, let alone to the general public? Or will they do what happens almost always in the business world - simply present the total number of errors (false positives and negatives) as a percentage of all cases and call its complement accuracy? Worst of all, and I hope this would never happen, might they present the classifier’s success at minimising the false positives (in catching the bad guys) as the whole story? In the last two scenarios, the trade-off, as well as its subjective and possibly political nature, has been successfully buried.
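A small worked example shows how much a single accuracy figure can hide. The counts below are invented for illustration (1,000 hypothetical borderline cases, 200 of them genuinely children) and the names lenient and strict are my own labels, not anything from the government proposal.

```python
def summarise(tp, fp, fn, tn):
    """Turn a confusion matrix into accuracy plus the two error rates."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Share of adults wrongly classified as children.
        "false_positive_rate": fp / (fp + tn),
        # Share of children wrongly classified as adults.
        "false_negative_rate": fn / (fn + tp),
    }

# Two hypothetical threshold settings for the same classifier.
lenient = summarise(tp=190, fp=90, fn=10, tn=710)
strict = summarise(tp=110, fp=10, fn=90, tn=790)

for name, result in (("lenient", lenient), ("strict", strict)):
    print(name, {k: round(v, 4) for k, v in result.items()})
```

Both settings get 900 of the 1,000 cases right, so both could be advertised as "90% accurate". Under the lenient setting 10 of the 200 children are misclassified as adults; under the strict one it is 90.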

This is what I meant when I said that I was certain that the “facial age” classifier would be mis-sold on the strength of its overall accuracy. I meant that the false-positive, false-negative trade-off will rarely feature in discussions about the classifier’s efficacy. Instead everyone will talk about its accuracy - a single metric - as though it were the most unproblematic of concepts. This is based on several decades of accumulated pessimism. Once again it would be nice if I were proved wrong.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

I’ve been trying to forget about synthetic respondents and was hoping that last month’s dunghill would release me. But it’s difficult when readers keep sending me such jaw-dropping examples. This one, passed to me by a brilliant data scientist working in the world of advertising, is a beauty, and has to be read to be believed. In response, it’s hard to know how to say this any more clearly: The notoriously difficult-to-find high-net-worth audience are not spending their precious time blogging - i.e. creating training data for LLMs - for the same reasons they are not attending agency focus groups. But I can tell you who is right now filling the internet with talk about difficult-to-find high-net-worth individuals… think about it!

The same person who alerted me to this article also supplied a nice diagram for this process:

As another reader pointed out, this further plot twist - dystopia or utopia, depending on your point of view - was explored in an interesting article in Nature: AI models collapse when trained on recursively generated data.

Meanwhile I’m still exploring Rust as a complementary language to Python for data science. If nothing else, its strict attention to memory management and error handling should make me a better overall programmer. A promising direction is indicated by its interoperability (not a word I've used before), especially when it comes to Python. In other words, it's very easy to produce Python wrappers for Rust packages, which opens up the possibility of using Python for scripting and prototyping, while bracketing off modules that require speed and reliability for implementation in Rust. I'll let you know.

In the meantime, here’s a concept diagram I recently produced for a client, explaining how Rust and Python differ as programming languages.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


Although I think perhaps “treatment” came from agricultural examples.

Incidentally, another argument against the automation of life-changing decisions is that a person making such a decision at least carries the burden of a conscience.

And this is true not just of machine learning classifiers, but also of all the government institutions and processes that are in the business of making what are inevitably flawed classifications. These systems can be improved but never to the point of eliminating a trade-off. Despite this we are always demanding that both types of error be brought down to zero - as though it were possible to catch every plotting terrorist without incarcerating almost the entire population, or prevent every violent attack by a released prisoner without locking the entire prison population up for life, or avoid every horrific example of child abuse without breaking up families on the slightest suspicion.

Glasseye

Wed, 25 Jun 2025 13:26:55 GMT

In this month’s issue:

Semi-supervised takes a casual approach to the causal by proposing some new conventions for causal diagrams.
The simple pleasures of a big fat textbook in the white stuff.
The dunghill champions the over-thinker.

Plus first steps into Rust, the RSS makes a smart move, and how I’m talking more than I’m typing.

Semi-supervised

I'm going to share with you my own take on causality diagrams. These are not the hardcore DAG diagrams used by Judea Pearl and others to plan causal experiments. Although they might lead in that direction. No, these are rough and ready diagrams designed to simplify and clarify problems at the early stages.

They were born partly out of frustration: I found that most textbook examples of causal diagrams were for simple, straightforward cases where the nodes represent discrete events that happen in sequence. The “smoke, fire alarm” example is perfect for explaining causality concepts, but quite unrepresentative of some very typical causal chains. For example, the weather clearly affects many aspects of human behaviour, but - except for in the canonical rain/umbrella example - the effect is usually ongoing rather than sequential. While the sun is shining, I continue my walk in the park.

So this is one thing I’d like to recognise in a causal diagram: whether a causal effect is sequential or ongoing. Another thing is whether the cause is a necessary condition, a sufficient condition, both1 or neither (which in most cases means it is an insufficient, but necessary part of an unnecessary but sufficient condition - INUS for short. A small headache but worth looking up!) For general problem solving, it is useful to consider whether one thing must happen to produce an effect, just as it is useful to consider whether it is sufficient by itself to produce that effect.

Lastly, since these causal diagrams are simply aids to thinking, and not inputs into some further process, there’s no reason why they need to be either directed or acyclic, so I’m happy to have arrows going both ways and edges that form loops.

So the conventions I’ve adopted for my causal graphs are:

Dotted lines for an ongoing effect, full lines for a sequential effect
Tail arrowheads that denote the type of condition:
- Circle = necessary
- Square = sufficient
- Diamond = necessary and sufficient
- None = neither necessary nor sufficient
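For anyone who wants to try these conventions, here is a rough sketch of one way they might be expressed in Graphviz. It is not the code behind the diagram below: the mapping from conventions to Graphviz attributes is my own guess at a reasonable encoding, the helper names edge and to_dot are hypothetical, and the example edges are only illustrative. The script simply prints DOT text, so it needs nothing beyond standard Python.

```python
# Assumed mapping from the conventions above to Graphviz arrow shapes.
TAIL_SHAPES = {
    "necessary": "dot",      # circle
    "sufficient": "box",     # square
    "both": "diamond",       # necessary and sufficient
    "neither": "none",       # no tail arrowhead
}

def edge(cause, effect, condition="neither", ongoing=False):
    """Return one DOT edge statement using the conventions above."""
    style = "dotted" if ongoing else "solid"
    # dir=both asks Graphviz to draw the tail arrowhead as well as the head.
    return (f'  "{cause}" -> "{effect}" '
            f'[dir=both, arrowtail={TAIL_SHAPES[condition]}, style={style}];')

def to_dot(edges):
    """Wrap a list of (cause, effect, condition, ongoing) tuples in a digraph."""
    body = "\n".join(edge(*e) for e in edges)
    return "digraph causes {\n" + body + "\n}"

# Illustrative edges only; the conditions chosen here are assumptions.
example = [
    ("views advert", "awareness of product", "sufficient", False),
    ("awareness of product", "product purchase", "necessary", False),
    ("weather", "walk in the park", "neither", True),
]
dot_text = to_dot(example)
print(dot_text)
```

The printed text can be pasted into any Graphviz renderer. Because arrows going both ways and loops are allowed, nothing in the sketch checks for cycles.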

Here is an example showing the causes of product purchase. It’s somewhat simplified and a bit generic as it’s part of some Coppelia training material (incidentally, if you or your company are interested in training, then do get in touch) but it does the job, that is it brings out the complexity of the problem.

Note, I’ve resorted to Graphviz for this diagram as Mermaid isn’t quite up to the level of customisation needed. Code can be found here. Dump it in something like this to play with it.

Finally, some tips on construction:

Check for causality: Deciding whether or not a causal relationship exists between two nodes can be quite challenging, especially when the causal relationship is ongoing. My trick is to imagine that I’m a time-travelling, omnipotent being who can rewind time, tinker with reality and then re-run things to see how they would have turned out. According to the counterfactual definition of causality, if my change to A makes no difference to B, then A has no causal effect on B. This seems to work too for ongoing causes like demographic attributes: as an omnipotent being, I can swoop in and substitute an older population for a younger one while holding everything else more or less constant.
Make pragmatic simplifications: the goal is to create a general picture of the causal relationships that will help you think through a problem. Not every detail is necessary. In particular you can collapse nodes (I could have added a separate node for gender, age, income etc in the graph above but that would have simply created duplicate causal pathways without adding anything useful to the picture) and collapse chains (I could have spelt out all the ways in which a person’s lifestyle affects the probability that a person is exposed to an advert but the result would have been an unhelpful spaghetti).
Don’t sweat the edge cases: Yes, strictly speaking you can be exposed to an advert and remain mercifully unaware of the product, so yes, strictly speaking, “views advert” is not a sufficient cause for “awareness of product”. But in the vast majority of cases, if the product is directly advertised, it is sufficient. Usually that’s good enough for me.
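The rewind-and-tinker check from the first tip can be played out in a toy simulation. This is only a sketch under invented assumptions: the world, its rules and the thresholds in simulate are made up, and re-using a random seed stands in for "rewinding time" so that everything except the advert is held constant.

```python
import random

def simulate(seed, advert_shown):
    """Toy world: the same seed replays the same background circumstances."""
    rng = random.Random(seed)
    interest = rng.random()                   # background trait, untouched by the advert
    aware = advert_shown or interest > 0.9    # some people find the product anyway
    buys = aware and interest > 0.5
    return {"interested": interest > 0.5, "aware": aware, "buys": buys}

def worlds_changed(outcome, n_worlds=1000):
    """Rewind each world, flip only the advert, and count how often the outcome differs."""
    return sum(
        simulate(seed, True)[outcome] != simulate(seed, False)[outcome]
        for seed in range(n_worlds)
    )

changed = {name: worlds_changed(name) for name in ("interested", "aware", "buys")}
print(changed)
```

By the counterfactual definition, the advert has a causal effect on awareness and on purchase in this toy world, because flipping it changes those outcomes in some replays. It has none on interest, which never changes.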

Anyway, the diagram and the notation are works in progress. If you have any thoughts, criticisms, or suggestions for improvement, I would love to hear them!

Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia.io.

The white stuff

There's something particularly satisfying about reading a general textbook on a topic for which you already have a deep but very patchy knowledge. For example, this month I've started reading Learning the Bash Shell by Cameron Newham. Now if I hadn’t spent twenty-five years hacking my way through shell scripts, this would have been an ordeal. But, as it is, it's full of surprisingly pleasurable “Oh that’s why” moments. And, having sunk so much time into this topic (through no fault of my own), I can’t help but care about the content.

It seems to me that the value of a broad, discipline-level textbook is even greater now that we can go so deep, so quickly online, without ever seeing how the pieces fit together. The limited physical nature of the textbook is also important. You just can’t go down a rabbit hole. It forces you into a breadth-first rather than a depth-first search. And a broad view of a discipline inevitably helps with impostor syndrome. We are naturally less paranoid once we know the terrain.

So with that in mind, here are three of my favourites. I think of them as a sort of polyfilla that, when spread evenly, will fill in the gaps in your knowledge.

Computer Science: An Overview by Glenn Brookshear - Broad and beautiful, it saved me when I landed in a proper IT department.
Economics by David Begg - Stopped me from feeling like a complete fraud in the company of economists.
Artificial Intelligence: A Modern Approach by Russell and Norvig - Mentioned so often before on this substack, but I'll never tire of promoting it as the perfect antidote to a narrow focus on LLMs.

Although it's hardly work-related, I've also got to mention Biology: A Global Approach by Neil Campbell et al, since it is one of the most awe-inspiring textbooks of all time, covering modern biology from atoms to the human mind. It's also excellent for raising your monitor.

The dunghill

Certain phrases still trigger early career PTSD. A particular horror is “You’re overthinking it.” I wince when I hear it, especially delivered in a pompous, strident tone by someone who has not thought about it at all. Let me break down the reasons why this particular phrase bugs me so.

First, there's a hidden premise: All problems and ideas can be reduced with enough worldly brilliance to something simple and pithy. If they're not in this final state, then there's still work to be done. Go away and try again. The irreducibly complex simply does not exist in the practical, no-nonsense world of the executive.

Second, there is an implied value judgement: It relies on the stereotype of the befuddled nerd who can't see the wood for the trees. You’re overthinking it is the slap which wakes them from their academic, that is to say, largely useless, reveries.

Third, it's a power move: The unsaid part is, “There's nothing I don't know that's worth understanding. I have no intention of meeting you in the middle. Play it back to me in my language.”

Fourth, it's absurd: You were hired to overthink it. It's your job to consider all the various ways in which something might not be right, all the little things that might go wrong with catastrophic results. You were recruited on the strength of years of overthinking. Yes, solutions can be simple, but problems are usually complex. That's why they are problems.

A more charitable interpretation of the situation is that "you're overthinking it" is simply shorthand for “You're giving me too many details. I'm only interested in the conclusions, the parts that affect my decision”. But that's not what “overthinking” means. If that was the intent, then I offer this more respectful alternative:

Please continue to think about this in great depth, and then present me with your conclusions—and, if I should so ask, your methods.

That would be nice.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

I’ve started to learn Rust - I can’t say it’s motivated by anything more than curiosity (at the zeal of the Rustaceans), awe at the speed of packages like polars and uv, and a growing dissatisfaction with the messiness of Python. I’ll let you know how I get on.

In excellent substack (subscribe if you haven’t already). I read that the Royal Statistical Society has launched a peer-reviewed data science and AI journal. This feels like the right direction for the RSS. They have a reputation for rigour, and they can bring that to the table.

Meanwhile on my to-do list for GenAI are:

Check out Firebase (thanks to Wendy Martinez for the nudge)
Read up on AlphaEvolve (evolutionary algorithms and LLMs are intuitively a good match).

Finally, last month I mentioned Willow, an AI-based dictation tool. I'm taken aback by just how much this has changed my working life. I’d say more than half my text output is now dictated. Thanks again to !

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


Rare, since if this is the case, then it's often best just to merge the nodes.

Glasseye

Tue, 27 May 2025 10:52:52 GMT

In this (first anniversary!) issue:

The last word (maybe) on synthetic respondents in the now thoroughly deserved dunghill.
Semi-supervised invites you to consider a world where points are spaces and spaces points.
We investigate the old-school magic of directed graph layouts in the white stuff.

Plus, how I broke up with conda over uv, and the brilliant but temperamental Willow.

Semi-supervised

Back in December, I promised to use this space to provide you with tips for solving real - as opposed to textbook - problems. Over the last six months we’ve considered ontologies, room-tidying, concept-mapping, feasibility (via AI environments) and toy-problems. We’re not out of steam yet. This month we are thinking about space, and points in space, and the point of points in space.

Points in space are of course the bread and butter of any statistician/data scientist. We cluster them, we rotate them, we separate them from one another, we fit lines, planes and hyperplanes to them. All the more reason, then, to give them careful consideration. And by that I do not mean delving deeply into the mathematics - that’s what the textbooks are for - I mean something much more basic and probably for that reason more often neglected: I mean thinking about what they represent in any one problem.

I feel all the more justified in this statement of the obvious now that the word of the moment is embeddings. Embeddings are, after all, points in a space, and if we forget, or don’t care, what the space and the points mean, we'll get utterly lost.

But this is not about embeddings. It’s about the absolute basics, merely a reminder to ask yourselves the following: What in your ontology do the points represent—people, organisations, images, concepts, sounds? And how do you interpret the space between them?

Take as an example a scenario where points are customers in a space where the axes are attributes such as annual spend, number of purchases, age, gender, etc. The most immediate and intuitive interpretation of this space is to interpret nearness as similarity. But to make this work, variables must be scaled, and the correlation between variables accounted for.
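To make the point about scaling and correlation concrete, here is a minimal sketch. It uses the Mahalanobis distance, which is one common way (not the only one) of building both adjustments into a notion of nearness. The customer data is simulated and the column names are hypothetical.

```python
import numpy as np

def mahalanobis(a, b, data):
    """Distance between points a and b, scaled by the covariance of `data`."""
    cov = np.cov(data, rowvar=False)
    diff = a - b
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

rng = np.random.default_rng(0)

# Simulated customers: annual spend and a correlated number of purchases.
spend = rng.normal(1000, 300, 500)
purchases = 0.01 * spend + rng.normal(0, 2, 500)
customers = np.column_stack([spend, purchases])

# The same customers with spend recorded in pence rather than pounds.
rescaled = customers * np.array([100.0, 1.0])

d_original = mahalanobis(customers[0], customers[1], customers)
d_rescaled = mahalanobis(rescaled[0], rescaled[1], rescaled)

euclid_original = float(np.linalg.norm(customers[0] - customers[1]))
euclid_rescaled = float(np.linalg.norm(rescaled[0] - rescaled[1]))
```

Changing the units of spend changes the raw Euclidean distance between the two customers but leaves the Mahalanobis distance where it was, which is the sense in which nearness only means similarity once scale and correlation have been dealt with.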

All this is quite basic (although crucial and often forgotten in the rush to do something more exciting with the data). Less familiar, but rich in its problem-solving potential, is the possibility of flipping points and axes: that is, making the columns of your data set the points in a space defined by the rows. In the example given above, points are now customer attributes in a space where each axis is a customer. Nearness in this space, as long as the variables are standardised and mean-centred, means something like correlation. In fact, cosine similarity in this space is equivalent to correlation. Thinking in this more unusual space can lead to surprising insights into, for example, the workings of linear regression and the meaning of degrees of freedom.1
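The claim about cosine similarity can be checked in a few lines. This is a sketch on simulated data (the attributes and their values are made up for illustration): each column is mean-centred, the table is transposed so that attributes become points in customer space, and the cosine of the angle between two of those points is compared with the ordinary Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated customer table: 200 customers (rows) by 3 attributes (columns).
n_customers = 200
spend = rng.normal(1000, 300, n_customers)
purchases = 0.01 * spend + rng.normal(0, 2, n_customers)
age = rng.normal(45, 12, n_customers)
X = np.column_stack([spend, purchases, age])

# Flip the table: mean-centre each attribute, then treat each attribute
# as a single point in a space with one axis per customer.
centred = X - X.mean(axis=0)
points = centred.T  # shape (3, 200)

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cos_spend_purchases = cosine_similarity(points[0], points[1])
corr_spend_purchases = float(np.corrcoef(X[:, 0], X[:, 1])[0, 1])
```

The two numbers agree up to floating-point error. Mean-centring is what does the work here: without it the cosine still measures an angle, but that angle no longer corresponds to correlation.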

But from a practical perspective, this space-point flipping is more useful when the original variables (columns) share enough in common for a more distinct interpretation once they are recast as points (rows). To use the classic example, a dataset of viewers and movies watched can be flipped from viewers in the movie space to movies in the viewer space.

With enough practice you can learn to see every data set as two datasets. For example:

A survey is a set of questions that can be plotted in a respondent space, as well as a set of respondents that can be plotted in a question space.
An image data set is a set of images that can be plotted in a pixel space as well as a set of pixels that can be plotted in an image space.
A document dataset is a set of n-grams that can be plotted in a document space as well as a set of documents that can be plotted in an n-gram space.

And of course these two spaces are really just the starting point. For various purposes each can be rotated, transformed or projected into spaces of lower or higher or even infinite dimensions. But none of this fancy stuff will be at all useful if you haven’t carefully thought through the meaning of point and space in the most basic of cases.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.

The white stuff

I’ll admit to being slightly obsessed with the mermaid diagram layout - feeding it with ever more complex graphs just to witness the beauty of the output. The magic, it turns out, is well documented in this classic paper from 1993: A Technique for Drawing Directed Graphs by Gansner et al, which was the inspiration for both the dagre package (on which the mermaid layout is built) and the graph layout for GraphViz. It’s also an object lesson in problem solving: the problem is clearly defined; several competing aesthetic objectives are laid out; the task is broken down into four stages; and then each subtask is tackled by drawing on an impressive range of mathematical and algorithmic techniques (the network simplex algorithm, barycenter functions, splines, etc.).

Mermaid have just added an alternative to dagre: the ELK layout. It’s very pretty (see below), although I think I like it for trivial reasons (it’s easier to read the labels on the edges). The only relevant paper I could find was The Eclipse Layout Kernel by Domrös et al. It’s a little thin on detail. If anyone finds something more substantial, please let me know!

The dunghill

I (almost) promise that this will be my last word on synthetic respondents. If I’m more strident now than I was ten months ago, then this is partly because in July last year I did not really believe that people would fall for it. But it’s happening. Even the most absurd of use cases is gaining traction. As I pointed out in this thread, “in its most ludicrous form, the claim is that [synthetic respondents] can get you access to ultra high net worth individuals who otherwise would not answer your survey. Bear in mind that LLMs are trained on web data, and there's only one UHNWI I can think of who is weird enough to spend his time plastering the internet with his opinions… But the big worry is that where businesses go, governments follow. Do we really want government policy to be built on LLM mush derived from internet mush? And I worry that the patter about using synthetic respondents to replace hard-to-reach groups will be ultimately applied to hard-to-reach vulnerable groups.”

This comment prompted the accusation that I am a) some kind of market research reactionary who winces at the very thought of change, and b) someone with a vested interest in the status quo. I wrote a long reply to my accuser, which was completely ignored. I think it does a good job of summing up my argument, so here it is in full:

Hi Tony, I am very sympathetic to your point about resistance to change. I admit that there is a great deal of inertia in the market research sector. I'm not sure if by "people" you mean me. Hopefully not, but anyway, I can put your mind at rest here. I don't work in market research, and market research companies only form a very small fraction of my clients (and those that are my clients tend to be pretty forward-looking). So there's no vested interest here, and I think I can say the same for most of the respondents so far on this thread. If there's an interest, then it's the one I stated above. I worry that bad decisions will be made in important areas (government policy, medicine, etc) because decision makers have been blinded by the hype surrounding AI.
I have to push back also on the idea that my objection to synthetic respondents is based on "nothing but gut feel". Here's my argument which you are welcome to pick holes in.
To compare like with like, assume that we are comparing good-quality old-school research with a conscientious use of synthetic respondents. For the first we have some theory (central limit theorem, etc.) that connects the sample to the population, and, provided some assumptions are met, we know how wrong we might be when making estimates based on the sample. We also have a pretty clear idea of where it falls short (biased samples, multiple comparisons etc).
I don't think anyone would claim that there is this kind of theoretical understanding when it comes to synthetic respondents. We don't know how the training data relates to the population (other than the fact that, as web data, the population created it), and we know very little about the incredibly complex rules that have been learnt by the model and which apply when a response is generated.
But potentially this doesn't matter if, as you say, we take an empirical approach to validating the process of synthetic respondents: i.e. we re-run existing human-based surveys on the equivalent synthetic respondents and compare the two. If, after many such comparisons, we find that, say, 80% of the time the synthetic respondents produce near enough the same results then we can say to someone using synthetic respondents that the chances are that their results are also a good match.
Against this I am arguing (here) that what is valuable in research is the information you discover that is surprising; something that a competitor is unaware of since it is not common sense, or something that would force you to change business processes, or treatment or policy because it goes against what everyone has always assumed. So for me a good test of synthetic respondents would be to take some existing research, ask the consumers of that research to flag the findings that they did not expect, and which are significant for them, and then look at whether these patterns are also found among the synthetic respondents. So if there's any "gut feel" going on, it’s just that I don't think LLMs are going to be very good when it comes to this test.
But anyway, I am offering a method for falsifying my gut feel, so if synthetic respondents turn out to be consistently good at replicating valuable findings, I'll be happy to admit I'm wrong! Lastly, I'd say we've got good a priori reasons for asking for this level of validation. We can easily think up situations where synthetic respondents could be formed from extremely inappropriate content. In the trauma scenario, which was discussed in the thread above, the material might be violent fantasies rather than testimonials. And if clients have the self-discipline to treat the results as "directional" and the information is consequential, they will need to validate it using a sample of real people, which gets us back to where we started.

Still waiting to hear from Tony.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

This month I have, like many others, fallen for uv 2, the new Rust-based Python package manager. It’s like a hot knife through butter. So after a relationship of more than ten years, I’m breaking with conda. (If I’m honest it’s been over for some time.)

This prompted a Mermaid concept map (using the new ELK layout discussed above) on the topic of Python package managers, which I'm sharing here in case it's useful.

A concept map for Python package managers

My regular supplier of anything on the cutting edge, , has put me on to Willow, a genAI-powered dictation tool that removes the friction of dictation by revising and punctuating our mumblings. It’s brilliant, but as you'd expect, sometimes thinks it knows better. So you have to watch it.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


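Linear regression on a dataset of N rows and k predictors can be understood as a projection of the N-dimensional observation vector onto a subspace of k+1 dimensions (the predictors plus the intercept); the residuals are left in the remaining N-k-1 dimensions, which are the residual degrees of freedom.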

Thanks to Søren and Jonas for the recommendation!

Glasseye

Tue, 29 Apr 2025 08:51:52 GMT

In this month’s issue:

Semi-supervised makes the case for toy problems in real life.
We question the information value of LLM output in the dunghill.
The white stuff finds out that clinical research is not entirely free from methodological madness.

Plus a new problem-solving workshop for data scientists, we welcome the rigour of DSPy, and how Quarto has taken over my output.

Semi-supervised

Toy problems are not just for textbooks, although you’d be forgiven for thinking so, given how rarely they appear outside academia. This is a shame since they are perfect for breaking down the more complex and intimidating problems that occur in real life.

I will explain, but first we should clarify what is meant by a toy problem. Most real-life data science problems are hard for reasons we explored in last month’s semi-supervised: that is, they are typically set in partially observable, multi-agent, stochastic, sequential, dynamic, continuous environments. A toy problem strips away the complexity by assuming a world that is simplified to the point of being unrealistic, or constrained in such a way that the complexity is kept out. Every exercise at the end of every textbook chapter is a toy problem. If they were real-world problems they simply would not fit on the page.

AI is full of well-known toy problems from chess and tic-tac-toe to stacking blocks and navigating Wumpus World. As the AI examples show, toy problems are designed to highlight particular aspects of the real world - other aspects can be temporarily put to one side. Toy problems are, if you like, the problem-solving equivalent of controlled experiments.

So how, outside academia, might they be useful? I have three distinct uses for toy problems, but they all involve, as you might expect, getting a handle on complexity.

The real-world problem is too hard, but I think I can solve it for a much simpler scenario. So I start with a simple toy world; solve the problem there, and then, one by one, add in the real-world components. With each addition, I re-solve the problem.1 Sometimes I hit an iteration that cannot be solved, but then at least I understand what makes the problem intractable. Of course this process never makes it all the way to the real-world problem. Long before I get anywhere close, I will have either hit upon the right way to go or satisfied myself that the problem is insoluble.
The real-world problem seems mostly solvable but there is some aspect of it that I just can’t get my head around. This calls for the targeted toy model - the one that brings out just a single feature of the problem and simplifies away the rest. To do this I like to imagine I am writing the textbook exercise version of the problem, designed to focus the student on one thing only.
I have a new algorithm or statistical technique that I am struggling to understand, and it doesn’t help that the problem it is being applied to is itself horribly complex. So instead I give the algorithm significantly easier tasks to perform. Here the toy problem usually involves a simulated data set which I can adjust, slowly adding complexity, until I understand better how the algorithm does what it does and what its strengths and weaknesses are. Sometimes - if I have too much time on my hands - I make a toy version of the algorithm (a much simplified version, built from the ground up) to apply to the toy problem. My favourite example in this vein can be found here.
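The third use can be sketched in code. This is only an illustration under stated assumptions: the "algorithm" being studied is plain least squares (standing in for whatever technique you are struggling with), the toy world has one predictor and a known slope, and the only complexity being dialled up is the noise. All names and values are hypothetical.

```python
import numpy as np

TRUE_SLOPE = 2.0

def simulate(n, noise_sd, rng):
    """Toy world: one predictor, a known slope, Gaussian noise."""
    x = rng.uniform(0, 10, n)
    y = TRUE_SLOPE * x + rng.normal(0, noise_sd, n)
    return x, y

def fit_slope(x, y):
    """Least-squares slope from a degree-1 polynomial fit."""
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)

rng = np.random.default_rng(42)

# Add complexity one step at a time and watch how the estimate behaves.
estimated_slopes = {}
for noise_sd in [0.0, 1.0, 5.0, 20.0]:
    x, y = simulate(200, noise_sd, rng)
    estimated_slopes[noise_sd] = fit_slope(x, y)
```

Because the true slope is known, any gap between it and the estimate can be attributed to the one thing that was changed. The same loop extends naturally to other complications (outliers, a second correlated predictor, a non-linear term), added one at a time.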

The obvious next question is: how do we know which features to include in a toy problem and which aspects to simplify? If you’ve been following the last few issues of Glasseye then you might find this less daunting: constructing an ontology will help purge the problem of abstractions and provide a list of entities to select from; a concept-map will clarify how these entities relate to one another and give you some ideas about where to apply simplifications; perhaps best of all, the dichotomies described in Russell and Norvig’s AI task environments will suggest simplifications: assume that the environment is discrete not continuous, episodic not sequential, etc.

It perhaps goes against the grain to step back from a problem, and you might get a raised eyebrow when your boss looks over your shoulder. But there’s nothing embarrassing about playing with toys. I have got them out for such varied problems as optimising supermarket shelving, detecting biomarkers in saliva, and understanding the spread of COVID in hospitals.

Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia.io.

The white stuff

Of all the forms of human error, the one I find most fascinating is mass delusion, possibly because groups are able, through a combination of peer pressure and feedback mechanisms, to talk themselves into far more extravagant beliefs than the average lonesome individual.2 I know that this group madness happens in my world; I’m not surprised that it happens in academia, but one place I thought must be safe, since the stakes are so high and since faddism is so obviously discouraged, is the world of clinical research. That was until I read The Curious Rise of Randomised Non-Comparative Trials by Pavlos Msaouel in this month’s Significance Magazine. (The magazine is behind a paywall, but content usually becomes free one year after publication. My own contributions can be found here.)

Curious is right. Usually the abuse of statistics in science is quite understandable. The subject is counterintuitive, often badly taught, and marred by methodological fudges. But this particular abuse is not due to confusion. It is simple and flagrant. So simple that it can be summarised in a few lines.

The standard approach in clinical research is the randomised controlled trial (RCT). Participants are randomly allocated to groups (or arms), each of which receives a different treatment. Randomisation is crucial since it breaks the causal link between the trial outcome and any factor other than the treatment.

Sometimes, when RCTs are impossible or unethical, a single arm trial is preferred. A single group receives the experimental treatment and the outcome is compared to historical data on that same group. This is an observational study not a controlled experiment. Some attempts can be made to handle confounding factors through modelling or matching but the results are generally less reliable than those obtained by an RCT.

RNCTs, which Msaouel says are on the rise, are a nonsensical hybrid. They involve multiple arms to which participants are randomly allocated, “but instead of comparing the outcomes in these randomised arms, RNCTs compare each arm individually to historical or external data.”3 Why then randomise? Msaouel can find no satisfying rational explanation. In fact, “all the safeguards provided by randomisation have been cast aside – and all that remains is the use of the talismanic word ‘randomisation’.” When he runs a workshop for oncologists, the participants are surprisingly candid about their motivations: “We want the aura of an RCT… but we won’t do a formal comparison. Instead, we’ll compare each arm’s results separately against historical data. This lets us call it a ‘randomised trial’ without the heavy burden of a large sample size or the risk that a formal comparison might yield a ‘negative’ result.”4
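A small simulation shows why dropping the between-arm comparison matters. This is a sketch with entirely hypothetical numbers, not a reconstruction of anything in Msaouel's article: the treatment is assumed to do nothing, but today's trial participants are assumed to be healthier than the historical cohort that supplies the benchmark.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical set-up: the outcome is a continuous score, higher is better.
historical_mean = 50.0        # benchmark taken from older, external data
trial_baseline = 55.0         # today's participants happen to be healthier
true_treatment_effect = 0.0   # the new treatment does nothing
n_per_arm = 500

# Participants are randomised into two arms.
control = rng.normal(trial_baseline, 10, n_per_arm)
treated = rng.normal(trial_baseline + true_treatment_effect, 10, n_per_arm)

# RNCT-style analysis: each arm is compared separately with the benchmark.
treated_vs_historical = float(treated.mean() - historical_mean)
control_vs_historical = float(control.mean() - historical_mean)

# RCT-style analysis: the randomised arms are compared with each other.
treated_vs_control = float(treated.mean() - control.mean())
```

Under these assumptions both arms appear to beat the historical benchmark by a similar margin, while the comparison between the randomised arms sits close to zero. The randomisation only protects the result if the arms are actually compared.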

Conjuring up the aura of a scientific method when the reality is anything but - this is not uncommon in business, and particularly so in media, advertising and marketing. But here at least no one is getting sick or dying. Let’s hope that Msaouel’s article has shamed it out of existence in clinical research.

The dunghill

Karl Popper must be one of the most professionally useful of philosophers. He is famous for his criterion for distinguishing science from pseudoscience (to be scientific a hypothesis must be falsifiable) and for his account of how science works (by putting forward conjectures and then ruthlessly attacking them). Even if both views have been shown to be overly simplistic they nevertheless work as biblical commandments for scientists. And great is the temptation to sin. No one wants to knock down an idea they have spent long hours building up; no business wants to challenge the premise on which it was built. All I can say is that it helps to think of Popper (not a happy-looking man) standing over your shoulder, shaking his head.

Recently though a specific implication of these theories has seemed especially relevant. If we broadly agree with Popper on how science works then it is obvious that not all findings are of equal scientific value.

First, refutations are worth more than confirmations. At any one time we theorise against a background of accepted, orthodox hypotheses. A confirmation of one of these hypotheses is worth very little. (So it turns out people buy more ice-cream in the summer. Hooray.) Refutations by contrast are intrinsically surprising. In fact the informational value of a refutation might be gauged by how many of the background hypotheses it knocks down.

Second, of the conjectures that withstand criticism, we should prefer those that Popper describes as “bold”, leading to “novel predictions”. The most valuable are again those that do the most damage to the orthodox body of scientific opinion. The hypothesis of descent through natural selection is an obvious example.

In both cases the background theories render the result (refutation, conjecture holding its ground) improbable, and as every good student of information theory knows, the confirmation of an improbable event provides the most information.
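The information-theoretic point can be made explicit with Shannon's measure of information content (surprisal), which is the negative log of an event's probability. The sketch below is illustrative only: the two probabilities are hypothetical stand-ins for how likely the background theories make each result.

```python
import math

def surprisal_bits(p):
    """Shannon information content, in bits, of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Hypothetical probabilities implied by the accepted background hypotheses.
p_bland_confirmation = 0.99     # e.g. ice-cream sales rise in summer
p_surprising_refutation = 0.01  # a result the orthodoxy considers very unlikely

info_confirmation = surprisal_bits(p_bland_confirmation)
info_refutation = surprisal_bits(p_surprising_refutation)
```

A coin-flip result (p = 0.5) carries exactly one bit. The bland confirmation carries a small fraction of a bit, while the refutation carries several bits, which is the formal version of saying that refutations are worth more than confirmations.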

So why am I bringing this up now? Because I think it is useful for sifting the good from the bad among the opportunities currently offered by LLMs. Those that are selling these opportunities tend to downplay two important facts about LLM output: (a) its usefulness is less apparent when measured by its information content (as Popper pointed out, bland confirmations are worthless alongside earth-shattering refutations) and (b) the more valuable (information rich) the content, the more certainty matters, since surprising, counterintuitive results usually require changes, and changes cost time and money.

And so when the cheerleaders for LLM-powered synthetic respondents tell us that there is an 88% correlation with answers given by real respondents, I want to know what that correlation looks like on just those findings that were surprising, and therefore of high informational value. And even if synthetic respondents were shown to be capable of delivering surprising, counterintuitive findings, I would want to know how, if not by a real survey, the certainty called for by surprising results would be achieved.

Similarly when it comes to the average LLM prompt response, I can’t help noticing how much hedging is built in. The output seems very low-grade in terms of surprisal, bland by comparison to human output (or at least the best of it).5 But again, even if this could be fixed, we still run into the same old bind - the more interesting the information, the more risk associated with it, and therefore the more certainty we want. And if hallucinations are intrinsic to LLMs (and I’m pretty convinced they are) then that certainty has to come from elsewhere.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

If you’ve enjoyed the recent semi-supervised posts on solving problems that are outside the mould then you might be interested to know that Coppelia now runs a three-day workshop, Real-world problem solving for data scientists. It is particularly aimed at those who have an academic background in data science, or a related discipline, but are puzzled about how to apply this knowledge to their day-job. Do mail/message me if you are interested.
I’ve been spending some time this month getting to know DSPy (thanks to for putting me on to this). Very interesting. It’s an attempt to place LLMs inside the framework of supervised learning. The most magical part is the way it treats the prompt as a kind of tuneable hyperparameter. I particularly welcome the opportunity to pit LLMs against other algorithms, and get a fair assessment of what they are good at.
I’ve noticed that, since it was recommended to me by and Andrie de Vries, Quarto has been gradually taking over my document output. Almost everything I produce that goes to a client (other than code) now goes via this superb package. And as yet no one seems to mind that it’s not Word or PowerPoint, contains full prose, references etc. Raper’s Rule #231: if you treat your client like an idiot they are sure to do the same for you.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


If you do a lot of coding, this will be familiar as a standard debugging procedure.

My all-time favourite book on the subject is the neglected classic When Prophecy Fails (the study which introduced the term cognitive dissonance).

Pavlos Msaouel, Bad stats: The curious rise of randomised non-comparative trials, Significance, Volume 22, Issue 3, May 2025, Pages 40–44.

Msaouel is paraphrasing the participants (see the original article for more details).

My rule of thumb for trusting LLM output is in fact blandness-based. I trust most those facts which I reckon to be fairly well-known and therefore all over the training data.

Glasseye

Wed, 26 Mar 2025 08:45:35 GMT

In this month’s issue:

Maximise your lack of commitment with E T Jaynes in the white stuff.
The dunghill considers the perfect conditions for a market built on pseudoscience.
How the idea of AI task environments can be used for feasibility studies in semi-supervised.
Plus a response from Eppo to last month’s dunghill, and frustrations over badly behaved LLMs.

Semi-supervised

When the LLM bubble goes pop, will we all be rushing back to Russell and Norvig? I like to think so. Beautifully written, terse but clear, effortlessly multi-disciplinary, and even finding space for historical notes at the end of each chapter, it’s my favourite AI textbook by some distance. Since AI proper is about building autonomous agents that need to cope with a wide range of environments, the book is also, indirectly, a compendium of problem-solving techniques. All we need do is substitute ourselves for the autonomous agent.

In this vein, I have repurposed Russell and Norvig’s classification of AI task environments as a method for conducting feasibility studies. The result is a tool that tells me quickly just how difficult a project is going to be, and gives me a language to explain the difficulties.

I will explain: For Russell and Norvig, task environments are the specific worlds in which the autonomous agents must operate to solve problems—in a sense, they are the problem. For the robot vacuum cleaner, the task environment is a floor interrupted by objects; for the AI chess program, it is the board, the pieces, the rules, and so on. Russell and Norvig categorise the possible environments using a series of dichotomies. Is the environment…

1. Fully observable or partially observable?

2. Single agent or multi agent?

3. Competitive or co-operative?

4. Deterministic or stochastic?

5. Episodic or sequential?

6. Static or dynamic?

7. Discrete or continuous?1

Since the option on the right-hand side is nearly always the more difficult one, this has become my feasibility checklist for any problem environment I happen to be given. If everything is on the right, we’re in trouble.

Take the apparently simple example of a sales forecast. Here the environment is a marketplace, filled with buyers and sellers, and impacted by all manner of internal and external forces. It is undoubtedly a partially observable environment (we cannot know every relevant fact, unlike, say, a game of chess where we can see the full board and the location of every piece). It is also a competitive, multi-agent environment: if the business acts as a result of the forecast then a competitor may react, invalidating the forecast, unless, that is, we build in their expected actions. On top of this, it is highly stochastic; it happens in continuous time; it is sequential (what happens next is dependent on actions taken now); and it is dynamic (the situation can change while we are deliberating). This is why forecasting markets (as opposed to, say, some physical systems) is so hard, and why we don’t even attempt perfection.
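The checklist is simple enough to write down as code. What follows is a rough sketch rather than a scoring method proposed by Russell and Norvig: the function name and the idea of counting right-hand answers are assumptions made for illustration, and the answers for the sales forecast are the ones given in the paragraph above.

```python
# Each dichotomy as (left option, right option), in the order listed above.
DICHOTOMIES = [
    ("fully observable", "partially observable"),
    ("single agent", "multi agent"),
    ("competitive", "co-operative"),
    ("deterministic", "stochastic"),
    ("episodic", "sequential"),
    ("static", "dynamic"),
    ("discrete", "continuous"),
]

def right_side_count(environment):
    """Count how many dichotomies are answered with the right-hand option."""
    if len(environment) != len(DICHOTOMIES):
        raise ValueError("expected one answer per dichotomy")
    count = 0
    for (left, right), choice in zip(DICHOTOMIES, environment):
        if choice not in (left, right):
            raise ValueError(f"{choice!r} is not one of {left!r} / {right!r}")
        if choice == right:
            count += 1
    return count

# The sales forecast, as characterised in the text.
sales_forecast = [
    "partially observable", "multi agent", "competitive",
    "stochastic", "sequential", "dynamic", "continuous",
]

score = right_side_count(sales_forecast)  # 6 of the 7 answers are on the right
```

A raw count is crude. The competitive or co-operative pair is where the "nearly always" caveat bites, since the sales forecast lands on the left there and yet competition is part of what makes it hard. The count is best read as a prompt for discussion, not a verdict.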

What to do when faced with right-side-heavy environments? First, massively lower expectations (which is what we do with forecasting). Second, if possible, approach the problem by making some simplifying assumptions (moving from the right side to the left). Can we ignore the competitors for now? Can we treat the environment as static even if, strictly speaking, it is not? Can we approximate the situation using discrete time? But in each case track the implications of the simplification, or it may come back to bite you.

By the way, as human beings, we operate for the most part in partially observable, multi-agent, stochastic, sequential, dynamic, continuous environments, which is why we are so damn great and not about to be replaced.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.

The white stuff

E T Jaynes is a cult figure among probabilists. He wrote clearly and inspiringly about the deepest aspects of probability; he crossed disciplines, invented new ways of looking at the world, and quietly got on with the business of being a genius.

This month I’ve been reading his seminal 1957 paper Information Theory and Statistical Mechanics, where he introduces for the first time his principle of maximum entropy. The principle uses Shannon’s concept of information entropy - at the time only recently formulated - to update Laplace’s principle of insufficient reason (in the absence of any relevant information, we should assign equal probabilities to all possible outcomes). Jaynes redefines this in terms of entropy: in the absence of information, it is logical to assume the probability distribution that is maximally non-committal - in other words, the one that contains the least information, ergo the one that has the highest entropy. Philosophically, this puts this approach to probability on a firmer foundation since it removes “the apparent arbitrariness of the principle of insufficient reason, and, in addition, it shows precisely how this principle is to be modified in case there are reasons for ‘thinking otherwise.’” But it is also a whole new way to build a probability distribution and as such unlocks a whole new line of problem-solving techniques. (I am using it in a project right now.) If you are still not satisfied, it links (as Jaynes shows in the paper) the statistical concept of entropy to the thermodynamic one.
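To see the principle at work, here is a minimal sketch using a six-sided die, a standard illustration chosen here for convenience and not taken from the 1957 paper. With no information beyond the list of faces, maximum entropy gives the uniform distribution. If all we know is the long-run mean, the maximum entropy distribution has the form p_i proportional to exp(lambda * i), and lambda can be found numerically. The function names are hypothetical.

```python
import numpy as np

FACES = np.arange(1, 7)

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

def exponential_family(lam):
    """Distribution with p_i proportional to exp(lam * i) over the six faces."""
    weights = np.exp(lam * FACES)
    return weights / weights.sum()

def maxent_die(target_mean, lo=-10.0, hi=10.0, tol=1e-12):
    """Maximum entropy distribution on faces 1..6 with the given mean.

    The mean of exponential_family(lam) increases with lam, so bisection
    on lam is enough. Targets very close to 1 or 6 fall outside the
    bracket used here.
    """
    if not 1.1 <= target_mean <= 5.9:
        raise ValueError("target_mean must lie between 1.1 and 5.9")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if exponential_family(mid) @ FACES < target_mean:
            lo = mid
        else:
            hi = mid
    return exponential_family((lo + hi) / 2)

no_extra_information = maxent_die(3.5)  # mean of a fair die
known_mean = maxent_die(4.5)            # all we know is that the mean is 4.5
```

With a mean of 3.5 the constraint adds nothing and the uniform distribution comes back, which is Laplace's principle recovered as a special case. With a mean of 4.5 the probabilities tilt towards the higher faces, but only as far as the constraint demands, and the entropy drops accordingly.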

If the paper gets you hooked then Jaynes’ classic, posthumously-published masterpiece - like I said, a cult figure - Probability Theory: The Logic of Science is available here.

The dunghill

When I’m asked by clients for my opinion on synthetic respondents, it’s a bit awkward. On the one hand, I feel strongly that most commercial implementations are going to be a joke. On the other, I can’t rule out the possibility that, despite this, they will drive a burgeoning market in which millions are made. This in turn made me think about the necessary conditions for a market built on pseudoscience. These conditions will have to explain the markets we have seen so far - digital attribution modelling, bad (i.e. most of) market mix modelling, as well as the ones just on the horizon (synthetic respondents, synthetic sample size boosting). Here’s where I’ve got to. As ever, I’m interested in your views.

Necessary conditions for a market built on pseudoscience

1. First, someone needs to benefit. A direct benefit to the business is impossible since the thing doesn’t work. But there are other ways: someone’s reputation can be enhanced by the work - they can catch some of the reflected glory of the latest technological advance. Or a project, a campaign, a controversial business decision can be retrospectively justified.
2. Next, any checking of the results needs to be impossible, or at least highly disincentivised. Under ideal conditions, the claims will be truly unfalsifiable (checking off Karl Popper’s classic definition of a pseudoscience). But it might just be that a check is expensive, and then who wants to spend money on potentially undermining a favourable result (see the next point)? Contrast this with applications of statistics and data science such as prediction or network optimisation where just using the method is a check on its performance.
3. A corollary of (2) is that with enough tweaking and twisting (see QRPs) the given technique can be made to say whatever the client wants it to say. Such a feature will further activate the market since it produces an endless stream of satisfied customers.
4. The technique employed must look like real science, else why would it be taken seriously. The surest way to do this is by borrowing a technique that works in its own limited domain and then aggressively misapplying it.
5. To be convincing the market must be run by true believers. For this to happen the details of the technique must be sufficiently complex to fool the practitioners into thinking that it is the real deal. Ideally there should be debates, conferences, controversies, rival theories. Ultimately it should generate its own history - who could doubt it after that?
6. Finally the average buyer needs to be incapable of calling bullshit. In the best case scenario, they are scientifically and mathematically illiterate, so that, even if they wanted to (and often they do not - see point 3) they could not question the result.

I think synthetic respondents - in all but the most cautious use cases - score a solid six out of six here, which makes me feel better about simultaneously slamming them and predicting huge growth.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

After last month’s dunghill querying the approach to relative-lift AB testing taken by Eppo, I got an email from their head of statistics. They had accepted the main part of my criticism and put in a fix. Now I can't emphasise enough how rare this is. Usually the response from a business to this kind of criticism is silence, or aggressive technobabble. But this was gracious, grown-up behaviour, and deserves respect!

This month I once again attempted to embed an LLM within a process; once again the result was disappointing. To be clear I'm not talking about the use of an LLM as a sidekick - in coding, data manipulation, research, problem-solving. There the results are, I’ll admit, phenomenal. The point of failure for me (and for many others - I don't think I'm being particularly controversial here) has been when the little ones are left to play by themselves. Then all hell breaks loose. Yes, I know about all the clever things you can do with prompts and such like, but I reach the point where saying nice things to them just gets frustrating - perhaps even a touch humiliating - and I return to the old ways.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.

I have left out the known/unknown dichotomy from Russell and Norvig’s list since it is not particularly useful for assessing feasibility.

Glasseye

Mon, 17 Feb 2025 12:39:10 GMT

In this month’s issue:

Map your way towards concept clarity in semi-supervised.
The dunghill wonders what some high-profile experimentation platforms are doing with such wonky distributions.
We are impressed by François Chollet’s intelligent approach to intelligence in the white stuff.

Plus swapping out langchain for llamaindex, keeping a tight chain on LLMs and shushing copilot.

Semi-supervised

In the last instalment we looked at “room-tidying”: the essential clarifying and untangling work that a data scientist should carry out before tackling a difficult problem. This month we are taking a closer look at one of the techniques briefly mentioned - the surprisingly under-used art of concept mapping.

The idea is to construct a graph in which every node is a concept that features in the problem and every edge describes a relation between concepts.

Don’t be fooled by its simplicity; this is a powerful technique. And it’s worth emphasising that a concept map is not a mind map. With the latter almost anything goes, whereas a concept map (as I do it) is tightly regulated by a single simple rule: the node-edge-node sequence of labels should form a sentence that is read in the direction of the arrow. “Tensorflow… implements… neural nets.” Sticking to this rule imposes structure on your thinking and keeps the diagram clean and clear.

To illustrate here’s a diagram I featured in a previous post, untangling the confusing world of deep learning in Python.

Sometimes I find it useful to include concept definitions under node labels. Obviously these definitions are independent of the other concepts in the map (else they would also feature as edges).

Adding labels for some of the more common relationship types (e.g. “X is a Y”) can clutter up the map. To avoid this I borrow the standard UML arrow types for class diagrams and adopt the convention that if an edge is unlabelled then the UML rules apply.1

One last thing worth mentioning: it is usually possible to break down complex definitions into simple relationship pairs. For example, “Pytorch is a python library that was developed by Meta” becomes “Pytorch is a python library” and “Pytorch was developed by Meta”. But sometimes we are stuck with a relationship between three or more elements. This typically happens when the relationship takes the form of a sentence with a direct and an indirect object, such as “X processes Y for Z”. I solve this problem by introducing an extra circular node like this:

All of this is made supremely easy through the use of the diagramming syntax of Mermaid, which can be used in most markdown editors and IDEs. I don’t know which algorithm it uses to lay out the diagrams but it can achieve wonders. (The Mermaid markup for all the above diagrams can be found here - view the raw file to see the code.)
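
As a rough illustration of how little is needed, the sketch below turns a list of (concept, relation, concept) triples into Mermaid flowchart markup. This is not the markup linked above: the function name `to_mermaid` and the generated node ids are hypothetical, and it only handles simple two-concept relationships, not the UML arrow heads or the extra circular node.

```python
def to_mermaid(triples, direction="LR"):
    """Render (source, relation, target) triples as Mermaid flowchart markup.

    Each triple should read as a sentence in the direction of the arrow,
    e.g. ("Tensorflow", "implements", "neural nets").
    """
    ids = {}

    def node_id(label):
        # Give every distinct concept a short, stable id (n0, n1, ...).
        if label not in ids:
            ids[label] = f"n{len(ids)}"
        return ids[label]

    lines = [f"graph {direction}"]
    for source, relation, target in triples:
        s, t = node_id(source), node_id(target)
        lines.append(f'    {s}["{source}"] -->|{relation}| {t}["{target}"]')
    return "\n".join(lines)


if __name__ == "__main__":
    concept_map = [
        ("Tensorflow", "implements", "neural nets"),
        ("Pytorch", "is a", "python library"),
        ("Pytorch", "was developed by", "Meta"),
    ]
    print(to_mermaid(concept_map))
```

Keeping the map as plain triples makes the single rule easy to enforce: if a triple does not read as a sentence, it does not belong in the list.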

Finally, if part of your plan is to signal effortless brilliance, you can get the hand-drawn look by dropping your mermaid markup into Excalidraw.2

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.

The white stuff

If you are looking for a paper that does, for the muddled world of genAI, exactly the kind of concept clarification work described in this month’s and last month’s semi-supervised then you won’t do much better than On the Measure of Intelligence by François Chollet. It provides the rationale for Chollet’s Abstraction and Reasoning Corpus (ARC), which, for now at least, is the most generally accepted benchmark test for AGI. But it is also an object lesson in room tidying since he takes great care to straighten out the concepts before getting started on his own definition of intelligence.

Among his clarifications my favourite is his distinction between “the process of intelligence (such as the intelligence displayed by researchers creating a chess-playing program)” and “the artifact produced by this process (the resulting chess-playing program)”. The human chess player, who learns this skill from scratch, without being specifically engineered to play chess, is both the process and the artefact, with the process being the most remarkable part. Until AI is both process and artefact we are, he implies, fooling ourselves.

The dunghill

The Central Limit Theorem is a wondrous thing. One of its many gifts to us is the power to AB test sample means from two populations no matter how the population data is distributed. If we take a typically skewed metric such as income, then, with a large enough sample size, the distribution of the sample mean over many samples will be approximately normal. Conveniently, the absolute difference between two independent, normally distributed sample means is also normally distributed - and this gives us the starting point of an AB test.

Which is why it is odd that anyone would choose to work with the relative difference - (A-B)/A - instead of the absolute difference A-B. But, as Sara Gaspar recently pointed out to me, this is exactly what some experimentation platforms are doing. One example is Eppo. They are not small (47 million of investment according to this post) so I hope they know what they are doing.

Their rationale for making this move is that they wish “to provide effect estimates consistently across all metric types (counts, rates, percentages, etc.)”. You can’t but sympathise with this: anything that makes communicating results easier is worth having. But what is the cost?

Commendably they give us their workings here on their website. Their justification for using relative difference is that under certain conditions the ratio of two normal distributions is approximately normal, and they cite this paper to justify the claim. Now you might think that if the absolute difference is normally distributed then the relative difference is going to be a bit wonky. And you’d be right - it can indeed be very wonky. Therefore the conditions for approximate normality are worth investigating.

A fairly typical case in A/B testing is one where we are investigating whether a proposed change in a product or a process results in the increased probability of an action. It is very common in my experience to be dealing with very small increments on what are already very low probabilities. The baseline probability might for example be 0.02 and the uplift 0.005. If the sample size is around 300 then the distribution looks like this.

Not too normal. In fact it takes a sample of around 10k or more before it starts to look passably normal (this is in line with the conditions described in the paper). You can find the code for the simulation here.
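
For readers who want to poke at this without following the link, here is a rough stand-in using only the Python standard library. It is not the linked simulation code. It assumes the relative lift is computed as (treatment - control) / control, drops the rare runs in which the control arm has no conversions at all, and uses skewness as a crude yardstick for “not too normal”. All function names are hypothetical.

```python
import random
import statistics


def binomial_pmf(n, p):
    """[P(X = 0), ..., P(X = n)] for X ~ Binomial(n, p), built recursively."""
    pmf = [(1.0 - p) ** n]
    for k in range(n):
        pmf.append(pmf[-1] * (n - k) / (k + 1) * p / (1.0 - p))
    return pmf


def simulate_relative_lift(n, p_control=0.02, uplift=0.005, sims=20_000, seed=1):
    """Simulate the relative lift for `sims` A/B tests with n users per arm."""
    rng = random.Random(seed)
    counts = range(n + 1)
    control = rng.choices(counts, weights=binomial_pmf(n, p_control), k=sims)
    treatment = rng.choices(counts, weights=binomial_pmf(n, p_control + uplift), k=sims)
    # Both arms have the same n, so the ratio of rates equals the ratio of counts.
    # The lift is undefined when the control arm has zero conversions: skip those.
    return [(t - c) / c for c, t in zip(control, treatment) if c > 0]


def skewness(xs):
    """Population skewness; roughly zero for a normal distribution."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


if __name__ == "__main__":
    for n in (300, 10_000):
        lifts = simulate_relative_lift(n)
        print(n, round(statistics.fmean(lifts), 3), round(skewness(lifts), 3))
```

Plotting a histogram of `simulate_relative_lift(300)` against `simulate_relative_lift(10_000)` is the quickest way to see the long right tail at small sample sizes.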

To be fair, the documentation comes with the caveat: “If the estimate for Control is close to zero, that ratio becomes unreliable. We do not compute the relative lift when Control is less than 1.5 standard deviation around zero.” It’s not exactly clear what this means, though. 1.5 standard deviation of which distribution? If they want to ensure that the inference works then wouldn’t it be better to use the bounds given in the paper?3 And if they don’t use relative lift then what do they do? Revert to absolute lift? In which case what was the point of the whole exercise?

I hope the sample size calculations on Eppo take all this into account, although I see that their formula for minimum detectable difference does not.

Either way I’m not convinced the benefits outweigh the costs in this particular case.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

February has been a month of heavy coding with some interesting discoveries. I’ve swapped over from Langchain to Llamaindex, on the promise that it is simpler to use and better at the straightforward data extraction tasks I’m mainly using LLMs for at the moment. So far it is living up to that promise.

I also started exploring the world of LLMOps which seems to be expanding exponentially as businesses bet on being among the first as a new sub-industry takes off. I’m still feeling sceptical about just how many use-cases there really are, but still it is interesting to see all the new monitoring and measurement frameworks out there, most of which involve LLMs monitoring LLMs (the so-called LLM as judge approach). Take a look at DeepEval.

Finally I’ve discovered the secret of working with copilot within an IDE. It’s very simple: a “shut up” shortcut key. Forget every other copilot shortcut. Toggle it on when you need help and off when you need some peace and quiet. It’s life-changing!

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.

Note you can get the UML arrow heads on concept maps in mermaid by fudging a classDiagram as shown in the code here.

The option is under AI tools in the menu bar.

Also given in the code.

Glasseye

Tue, 14 Jan 2025 10:41:02 GMT

In this month’s issue:

Semi-supervised gets strict and demands that you tidy your room!
We worry about the Orwellian implications of generative AI in the white stuff.
The dunghill pursues a hunch that something is not quite right with AI-driven sample size boosting.
Plus it’s out with the old and in with the new as polars, cursor and mkdocs nudge my old tools from the nest!

Semi-supervised

Last month, I started a series of posts on solving the kind of problems in statistics and data science that do not fit a standard template, and the advice was to define your ontology. This month we look at a complementary approach that I call “tidying the room” after a quote from Wittgenstein (no apology for pretentiousness!)

In philosophy we are not, like the scientist, building a house nor are we even laying the foundations of a house. We are merely ‘tidying up a room’.

What Wittgenstein meant was that many of the most difficult problems in philosophy can be solved simply by paying close attention to our use of language. If we straighten out the meaning of certain key concepts then these puzzles will simply disappear. This was quite an extreme view (which has since fallen out of fashion). More recently Daniel Dennett made a similar but less controversial claim when he said: “We philosophers have a taste for working on the questions that need to be straightened out before they can be answered.”

What I’m saying is that our work in data science and statistics requires a significant amount of room tidying before any building gets underway. Why us in particular? Because very often a data science project is the first time a business concept has required a precise definition.1 Before that moment, terms like “subscriber”, “lifetime”, “visit”, “touchpoint”, “lapsed”, “cost-per-acquisition”, etc. have meant what each person has wanted them to mean (the root cause of many a pointless meeting). If you forget to tidy the room before you get to work then you will end up building something on top of this chaos and that will please no one.

And very often the room tidying is the work or at least a substantial part of it. I have worked on more than one project where just clarifying the concepts solved the issue, just as Wittgenstein wanted it to do for philosophy. This means that to be a statistician or data scientist you’ve got to have, as Dennett puts it, a taste for this kind of work. If you like your problems delivered in neat little parcels then it probably isn’t for you.

So what strategies can I suggest for room tidying? Here are a few:

Define your terms, and I mean really define them: i.e. in such a way that they are non-tautologous and have strict boundaries. Is a customer someone who has just visited the site, or do they need to have made a purchase? When do they cease to be a customer? After how many months of inactivity? Is a customer in the real world the same as a customer on the database? Evidently not, if a single person can have more than one account. Most people are unaware of quite how difficult this activity is. (If you think any of the terms I listed above are self-evident then you’ve never tried it!) Clearly this is a task made much easier if you have in place an ontology.
Beware of dates and time windows - they are a particularly potent source of confusion. For some reason, time seems to get forgotten when defining our concepts.2 We treat things as though they are either permanent or happen in an instant. Time windows (e.g. a customer lifetime, the duration of an ad campaign etc) are particularly treacherous because each pair has four possible relationships (A contains B, B contains A, A overlaps the beginning of B, B overlaps the beginning of A). Make sure you think about them all.
Once you have a passable first draft of your key concepts, start to think about the relationships between them. There are many useful tools for doing this. Diagrams are indispensable. I suggest concept maps, causal diagrams, fishbone diagrams, Venn diagrams. Think about the relationships between measurements. If one grows should the other grow proportionally or should it tail off, or grow exponentially? Remember this is all a priori work, concept clarification - you have not even touched the data yet!
Above all bring the business (or client) with you. It is their room you are tidying, so there should be an ongoing conversation in which you test your latest revision of a definition against their knowledge of the reality. Never start from zero. A common sense definition should be your base. I can guarantee that if you approach a business person with “What is an X?” then you will come away empty-handed, but if you ask “Is X a Y or a Z” then you will get somewhere. Use plenty of examples to refine definitions or illustrate incompatible scenarios.
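
The point about time windows can be made mechanical. The sketch below classifies a pair of windows into the four relationships listed above, plus a “disjoint” case that I have added as an assumption for windows that never touch. The function name is hypothetical, windows are treated as closed intervals, and identical windows are reported as “A contains B”.

```python
from datetime import date


def window_relationship(a_start, a_end, b_start, b_end):
    """Classify how time window A relates to time window B (closed intervals)."""
    if a_end < b_start or b_end < a_start:
        return "disjoint"  # assumed extra case, not in the list of four
    if a_start <= b_start and b_end <= a_end:
        return "A contains B"
    if b_start <= a_start and a_end <= b_end:
        return "B contains A"
    if a_start < b_start:
        return "A overlaps the beginning of B"
    return "B overlaps the beginning of A"


if __name__ == "__main__":
    # Hypothetical example: a customer lifetime (A) against an ad campaign (B).
    lifetime = (date(2024, 1, 1), date(2024, 2, 15))
    campaign = (date(2024, 2, 1), date(2024, 3, 31))
    print(window_relationship(*lifetime, *campaign))
```

Writing a definition against every branch of a function like this is a quick check that none of the cases has been forgotten.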

Do all of this, and do it in a friendly way, without condescension or finger-pointing, and you will be off to a very good start. The client will be reassured that you care about the actual problem and you will know what you are talking about.

Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.

The white stuff

Will human-produced art and entertainment - music, novels, movies, poetry - soon become a luxury product, only for the connoisseur? Will the rest of us be fed on mass-produced, machine-generated dross? Will media consumption soon be divided between the cultural equivalents of a farmer’s market and a budget supermarket? These are the questions hinted at towards the end of ChatGPT’s Poetry is Incompetent and Banal. This short but fascinating paper is itself a response to AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably, in which the authors ran two sets of experiments to determine how poetry produced by ChatGPT compares to the real thing. In the first they asked their respondents to pick out genuine poems by famous poets from among ChatGPT-produced imitations; in the second they asked their respondents to rate a selection of ChatGPT and human-produced poems without knowing which was which. The result was that in the identification exercise, “participants performed below chance levels in identifying AI-generated poems (46.6% accuracy, χ²(1, N = 16,340) = 75.13, p < 0.0001)”, and that in the preference experiment they overwhelmingly preferred the AI-generated poems.

It’s a nice irony that the person leaping to the defence of poetry in the second paper, Ernest Davis, is a professor of computer science at New York University, while the champions of ChatGPT are professors within the humanities; even better, while the latter run and analyse a set of statistical experiments, Professor Davis makes his points using textual criticism. You might think I’d be team stats, but no I agree with Davis, the ChatGPT efforts are undoubtedly crap. As he puts it:

All in all, the AI poems seem like imitations that might have been produced by a supremely untalented poet who had never read any of the poems he was tasked with imitating, but had read a one-sentence summary of what they were like.

But then why did the results in the first paper come back as they did? In short, Davis’s answer is that most people don’t understand poetry. They don’t expect it to be “difficult and spiky”. On top of this they have low expectations of AI output and so mistake the weirdness of real poetry for AI imperfections. Thus we cannot conclude from these results that there is no difference, qualitatively, between human-generated and AI-generated poetry. Davis argues that:

one could formulate reasonable, measurable, psychological and linguistic criteria under which the real poem is hands down more sophisticated, richer, thought-provoking, deeper, etc. But a preference for the cheery, shallow AI poem may be perfectly reasonable.

This is uncomfortable territory. On the one hand it might feel elitist to claim that most people just don’t get poetry, but at the same time it feels reassuring that, with a bit of effort to take us beyond the banal, humans can still comfortably outrun LLMs.

Davis ends his paper by referring us to a similar experiment conducted by the literary critic I. A. Richards in the 1920s and described in detail by George Orwell. In this version, students were presented with rarely seen poems by major poets and bad poems by minor poets and asked to evaluate them. The results were of course that supposed “lovers of poetry have no more notion of distinguishing between a good poem and a bad one than a dog has of arithmetic.” This leads Davis to conclude:

I also think it is a safe bet that the idea that, one hundred years later, scientists would write that drivel generated by an automaton is “indistinguishable” from Shakespeare and Whitman would not have occurred to I.A. Richards in his darkest dreams, and would have occurred to Orwell only in his darkest dreams.

Orwell’s darkest dreams famously took shape in 1984, where novels and song lyrics were written by machines “for the benefit of the proles”. Is that where we are heading? Is that what the first paper is pointing to? I suspect things are not that serious. In selecting poetry the authors have picked a genre which, let’s face it, most people either dislike or are indifferent to. Had they picked film or music or fiction, matching their participants to genres they genuinely cared about, then I think the results would have looked different. Still, in all the discussion about AI rising to approach human intelligence, not much is said about the possibility that, numbed by its output, we might meet it halfway.

The dunghill

This month’s dunghill is based on a hunch. Once again it’s market research in the frame but this time the topic is AI-driven sample size boosting. This is quite different from the synthetic respondents issue we covered back in the July issue. There we questioned the wisdom of using LLMs as a surrogate for real world respondents. There’s no suggestion here that LLMs are involved. Rather, or so it is claimed, techniques borrowed from image processing are used to “amplify the patterns in the data”.

Let’s look at what sample size boosting says it can do. We’ll pick on Toluna as they seem to be making a lot of noise, but there are a lot of competitors offering similar services:

Toluna HarmonAIze Boost, the first product in the new suite, empowers clients to conduct deep-dive analysis on small or niche subgroups where collecting enough real-world data would traditionally be time-consuming or impossible. By amplifying patterns in existing data, Toluna HarmonAIze Boost helps unlock valuable insights without the need for additional data collection.

Now I’m currently at the self-doubting stage where it still seems inconceivable that so many businesses would have invested so much money in something that doesn’t do what it says it does. But still I have a vague feeling that all is not as it should be. At a high level this is based on the following:

I can’t yet find any rigorous academic research to explain or back up the big ideas behind sample size boosting.
I’m pretty sure that where information is concerned there can be no free lunch: a small data set is a small data set.
I know that everyone is under enormous pressure to produce AI-driven tools.

But I also have some more detailed concerns. I’m going to stay on Toluna as I went to the trouble of watching the introductory video to HarmonAIze Boost. There we learn that sample-size boosting will unlock insights from small subsets within our data - hard-to-reach groups, or people we didn’t know we would be interested in (you might already be screaming QRPs!) These groups can be as small as 50 and they will be magnified to around a thousand, after which they can be fed into familiar analytical processes such as regression models for key driver analysis, clustering algorithms and factor analysis.

To help explain the sample-size boosting process the presenter used the example of digital image upscaling, a process which uses various statistical and machine learning-based algorithms to enlarge digital photos while avoiding pixelation. Just as you can pinch and zoom on a digital image, the presenter explained, so sample-size boosting allows you to pinch and zoom on your low-resolution (small sample) data set. The impression is that those “niche subgroups” that sample size boosting will allow us to zoom in on are like heavily pixelated figures in the background of a photo. I’m not sure this analogy works. If a pinky-brown pixel stands in an image where a person’s face should be (and would have been with a higher resolution camera) then no amount of modelling is going to bring back the face. In fact it’s more likely to be the opposite - the pixel will be erased as noise. However the de-pixelation is done (smoothing, VAEs, CNNs) the principle is the same. A model of some kind is fitted, which hopefully preserves the signal and throws away noise - you get a sharper but more basic shape. What the process does not do is reveal some previously hidden intricate details.

So much for the analogy, but then perhaps it failed to capture all the cleverness. What about the actual process used? The presenter mentions that a Gaussian copula model is being fitted and that their model was inspired by work in medical imaging. From that I assume she is referring to something like the process described in this paper. If you read closely you will see that here a Gaussian copula model is being used as a feature extraction process - it just so happens that data is being sampled from the learnt Gaussian copula model and fed into the classifier, which is a rather unorthodox method for passing learnt features to a classifier. Could someone within the market research sector have read this paper and concluded that this simulation step, instead of extracting simpler, more stable features, in fact created more detail?

As every good machine learning engineer knows, feature extraction is part of the classifier. It is not a magical step that creates a better data set prior to the learning. But I’m worried that this is what Toluna and their competitors are doing: learning a model from the original data, using that model to generate more data, and then suggesting we feed this data into more traditional models.

Why is that a problem? Well, it’s not if you bolt together the Gaussian copula model and factor analysis and call it a dimension reduction tool, or the Gaussian copula model and regression and call it a regressor in the machine-learning sense. But if you think you are somehow doing factor analysis, or regression in the classical sense (where statistical inference is used to estimate parameters) on an expanded sample, with all the joy that a higher N will bring to the certainty around your parameters, then you are sadly mistaken.

Could this all be a misunderstanding? The large sample coming out of the Gaussian copula model, despite its size, contains less detail - that’s the whole point: it’s not a sample from the original population, it’s a sample from a model fitted to the data, designed to bring out its broad features. That’s why in the paper I cited above it leads to more stable predictions. But researchers are raised on the mantra that more sample equals more detail. Not in this case.
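
A toy simulation illustrates the worry. It is emphatically not Toluna’s method, which I have not seen: a single normal distribution stands in for the Gaussian copula model, and the population values, sample sizes and function name are all assumptions. A real sample of 50 is “boosted” to 1,000 by fitting the model and sampling from it, and the standard error a classical analysis would then report is compared with how far the boosted mean actually lands from the truth across repeated experiments.

```python
import math
import random
import statistics


def boosted_mean_experiment(true_mean=5.0, true_sd=2.0, n_real=50,
                            n_boosted=1000, reps=300, seed=7):
    """Return (actual RMSE of the boosted mean, average naive standard error)."""
    rng = random.Random(seed)
    squared_errors, naive_ses = [], []
    for _ in range(reps):
        real = [rng.gauss(true_mean, true_sd) for _ in range(n_real)]
        # "Boosting": fit a model to the small sample, then sample from the model.
        fit_mean, fit_sd = statistics.fmean(real), statistics.stdev(real)
        boosted = [rng.gauss(fit_mean, fit_sd) for _ in range(n_boosted)]
        squared_errors.append((statistics.fmean(boosted) - true_mean) ** 2)
        # What classical inference would report if it trusted N = n_boosted.
        naive_ses.append(statistics.stdev(boosted) / math.sqrt(n_boosted))
    return math.sqrt(statistics.fmean(squared_errors)), statistics.fmean(naive_ses)


if __name__ == "__main__":
    actual, naive = boosted_mean_experiment()
    print("actual error of boosted mean:", round(actual, 3))
    print("naive standard error at N=1000:", round(naive, 3))
    print("standard error of the real N=50 sample:", round(2.0 / math.sqrt(50), 3))
```

In this set-up the actual error stays at roughly the level of the original 50 respondents while the reported standard error shrinks with the boosted N, which is exactly the false certainty described above.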

As I said at the beginning, I’m going to be a coward at this point and hedge wildly. I’ve only just started looking at this. If I’m wrong please educate me.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.

From Coppelia

Out with the old, in with the new

The toolbox has had a spring clean this month with three very successful substitutions:

Polars for pandas: I wholeheartedly recommend this one. Pandas has served us well but it is time. It’s a particular joy to be free from pandas indexes (no more reset_index), but I’m also enjoying the tidyverse style chaining and the overall simpler syntax.
Cursor for vscode: I was a bit late to the party here, but I’m glad I showed up. As everyone says, the collaboration with LLMs is almost frictionless. You can feel your brain changing!
Mkdocs for sphinx: I wanted to go from python docstrings to package documentation written in markdown. Sphinx, my go-to, seemed to struggle which led me to mkdocs. Fantastic, with lots of marketplace add-ons.

New year, new skills

Use your training budget effectively this year. Instead of spending it on generic, web-based courses that are forgotten the moment they’re over, let me prepare something specific for you and your team, based on problems you face right now. I cover just about any topic in AI, data science and statistics. See here (although a bit out of date) and here for some ideas.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.

I do accept that rigorous definitions are needed to construct databases, applications, etc. but once that has happened, a maddeningly imprecise language grows up around them.

To give a particularly extreme example, I recently reviewed documentation for a model optimising the flow of traffic through a network in which time was not mentioned once!

Glasseye

Mon, 09 Dec 2024 09:00:31 GMT

In this month’s issue:

Can you honestly say that you do AI? The dunghill looks at the shifting boundaries of a buzzword and the dilemmas it throws up.
Gödel and hallucinations in the white stuff.
What’s my ontology? This is the first question you should ask when tackling a difficult problem. Semi-supervised explains why.

And if you’d like to revisit some of this year’s content, I’ve put together a new page that brings it all together.

Semi-supervised

In last month’s dunghill, I did some reckless boasting about being able to solve problems in situations where the problem does not fit a well-known template (such as regression or straightforward optimisation). I said I could offer tips, and so for the next few months, I will make good on that promise. I have no overall method (be suspicious of anyone claiming to have one), just lots of angles. So here’s one of them.

One of the first questions I ask myself when given a new problem is: What’s my ontology? What are the relevant things - real, solid, three-dimensional things - in the problem domain? This might seem so obvious that it’s not worth stating, but it will introduce clarity to your thinking right from the very start. Why? Because:

It forces you to decide which things are part of the problem and which are not. A common but bone-headed reaction to a brief is to start gathering data about anything in any way related to the problem. Before long, you are drowning in a pool of data that you feel you must somehow make use of. If instead you think carefully about which entities matter for solving the problem, then you will start the process lean rather than bloated. Do digital devices matter to my measurement problem? Are stores important for forecasting sales? Does the problem demand that my basic entity is a person, or can I do just as well with groups?
It steers you away from abstractions. Businesses love abstract nouns - innovation, strategy, risk, engagement, customer satisfaction - they maximise impressiveness and minimise commitment. Your ontology will help clear away the fog by insisting that the conversation is about concrete things. Don’t tolerate talk of “customer engagement”. You need to look at people, doing things with other things, in a place, at a time.
It forces you to pin behaviours and attributes to things. Brand is a good example here. Without an ontology, it seems to float about, elusively - all things to all marketeers. But what really matters are psychological states, belonging to people, influencing their actions, and, in turn, influenceable by marketing.
It helps you to see other dimensions of a problem. As soon as you start thinking about concrete entities, the spatial and temporal considerations all come out. When will this machine learning model be applied to the customer? When they join? After a year? Every day? What do you mean by customer lifetime? When would it end? If a customer returns, is that another lifetime?
It brings out relationships and hierarchies. The classic case here is that of confounding factors in causal models. Hopefully, you will have considered such things when deciding which entities are part of the problem. (If we are interested in how X affects Y but decide that W also affects Y, then W had better be an entity in your ontology too.) But now that your entities are laid out in front of you, other relationship questions suggest themselves. Does W also affect X, and does that matter? Is W really just a type of X? And so on.

None of this is exactly new - for decades, software engineers have focused their efforts by building entity-relationship diagrams.[1] But it is not a common first step for data scientists and statisticians. It won’t solve your problem, but I guarantee it will make it clearer.
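As a rough illustration of how an ontology carries over into code, here is a minimal Python sketch. Everything in it is hypothetical and invented for the example: the entities (Person, Store, Purchase), the AFFECTS relationships and the confounders helper. It simply shows concrete things with attributes pinned to them, plus a set of "affects" relationships that can be interrogated for a W that influences both X and Y.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Person:
    person_id: str


@dataclass(frozen=True)
class Store:
    store_id: str
    city: str


@dataclass(frozen=True)
class Purchase:
    # A person doing something with a thing, in a place, at a time.
    person: Person
    store: Store
    product: str
    on: date


# Hypothesised relationships (illustrative only). Each pair (a, b) reads "a affects b".
AFFECTS = {
    ("advertising", "sales"),   # X -> Y, the effect we are interested in
    ("season", "sales"),        # W -> Y
    ("season", "advertising"),  # W -> X, which makes W a confounder
    ("store_size", "sales"),
}


def confounders(x: str, y: str, edges: set[tuple[str, str]]) -> set[str]:
    """Return every W that directly affects both x and y."""
    affects_x = {a for a, b in edges if b == x}
    affects_y = {a for a, b in edges if b == y}
    return affects_x & affects_y


purchase = Purchase(Person("p1"), Store("s1", "Leeds"), "coffee", date(2024, 3, 1))
print(confounders("advertising", "sales", AFFECTS))
```

With these made-up relationships, the only entry that affects both advertising and sales is season, so that is what the helper returns. The point is not the code itself but that the questions from the list above (what are the things, where, when, and what affects what) each have an obvious place to live.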

Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia.io.

The white stuff

After the discussion of hallucinations in last month’s semi-supervised, I came across two recent academic papers[2] on the topic. (There are actually hundreds. It is, for the moment, a very active research area.)

Both papers have the same aim: to prove mathematically that hallucinations are an inevitable feature of large language models. Both seem to be Gödel-inspired.

The first, Hallucination is Inevitable: An Innate Limitation of Large Language Models, defines a simple, formal world in which the relationship between LLM output and the ground truth is unproblematic. (As we saw last month, this definitely is not the case for the wider world.) This makes a hallucination easy to define: it is simply an inconsistency between the two. The authors then prove that hallucinations are inevitable even in this much simpler world, and since this simple world sits inside the real world, it follows that hallucinations are inevitable there too. Gödel did something similar when he proved that even something as basic as arithmetic on the natural numbers could not be grounded in the axioms of logic. This floored the much grander logicist project of grounding the whole of mathematics and science in logic.

The second paper, LLMs Will Always Hallucinate, and We Need to Live With This, takes a different approach. It shows how the causes of hallucinations inevitably enter the process of training an LLM and the process of generating its output. A crucial part of their argument - the part where they show that an LLM will hallucinate, regardless of how it is trained - is based on the halting problem, which is intimately related to Gödel’s First Incompleteness Theorem.

Which is interesting. Perhaps some parallels? Gödel’s theorems undermined the grandiose plans of a group of brilliant but quite arrogant men…

The dunghill

How long will your integrity hold out against the combined might of ten million marketing departments? At what point do you accept that the bandwagon is now the regular bus into town, and it’s time to get on board? If you have even a shred of decency and a job in data science, statistics, computer science or an adjacent discipline, then you will have asked yourself this question at least once in the last five years. More specifically, your thoughts will have run along roughly these lines: “Should I be describing what I’m doing as AI? After all, the techniques I use are part of AI. And people have been talking about part of AI (machine learning) as though it were the whole of AI for some time now. Has the word already changed its meaning? In which case insisting on the original meaning just makes me a pedant. (Remember all the soul-searching about “data science”.) Or is a perfectly good and well-defined concept under assault from cynical market forces, and if so shouldn’t I be leaping to its defence?”

For the record, I caved around 2020, partly because I felt I could justify the label but mostly to avoid commercial irrelevance. But I admire anyone in the business who lasted longer. I recently spoke to one such person. She runs a business building specialist optimisation models for agriculture. These models are mathematically sophisticated, and it was not too much of a stretch to describe them as making intelligent decisions. But she held out valiantly, way longer than me, until she passed a point where it seemed perverse to continue. But when the moment came to make the switch, to pin on the AI badge and smile nervously… something unexpected happened. Her client was ahead of the game. They knew exactly what AI was, and they knew that this wasn’t it. How did they know? Because nowhere, neither in the front end nor under the bonnet, was there any sign of a chatbot.

It is worth, if only for the sake of our collective sanity, retracing the path that got us here. I’m not talking about the recent history of AI (although that is fascinating) but rather the recent history of the term “AI” in the workplace. Here is my personal recollection of about fifteen years of verbal gymnastics.

A decade and a half ago, AI was (Still is! I hear someone scream) a multi-disciplinary field of research. It was the world of AI: A Modern Approach by Russell and Norvig (still the best introductory textbook on AI). Machine learning was just one of many sub-disciplines called upon to solve the problem of creating computer agents that could “operate autonomously, perceive their environment, persist over a long period of time, adapt to change, and create and pursue goals”.[3] (Hold onto this definition - you will see it again!) AI was on campus, where it belonged. In the workplace, we used bits of it for specialist jobs, but I tell you now, had I described my random forest model as artificially intelligent, I would have been led off quietly to HR.

All this changed with the arrival of deep learning. The part became the whole. Machine learning swelled up to fill the whole of AI, squeezing everything else (including, crucially, the goal of creating a fully autonomous agent) into the corners.

But it didn’t stop there. Over the next decade, AI as a buzzword blew up, supernova-like, into a fiery ball, many times its previous size, so that its boundaries easily encompassed statistical modelling, data engineering, data science and simulation. This was the moment in which anyone north of Excel could claim to be doing AI - the public knew no different.

But then, just as a supernova dissipates, leaving a tiny neutron star with an enormous gravitational pull, so the boundaries of AI (the buzzword) rapidly receded, falling back not just to machine learning, but narrower still, to the edges of a sub-discipline within a sub-discipline - large language models. Left out in the cold were not just my agriculture-optimising friends but all those who had reluctantly given way to the new terminology. Because now the public had a point of differentiation: AI is something you can chat with.

A final cruel twist for those who have watched their discipline swell up and contract like Grandma in George’s Marvellous Medicine can be found in the recent trend for something called “Agentic AI”. I found this definition online:

Unlike traditional AI systems that require human intervention for decision-making, Agentic AI operates independently, using its internal models, learning algorithms, and decision-making processes to navigate and interact with its surroundings.

This works only for those suffering from long-term memory loss. In that case, “traditional” would mean the last five years, and the idea of autonomous agents would indeed be something new.

If you have some particularly noxious bullshit that you would like to share, then I’d love to hear from you. DM me on Substack or email me at simon@coppelia.io.

From Coppelia

It’s time to remind you that Coppelia runs short courses and mentoring sessions covering in detail many of the topics we have touched on this year, including:

Bayesian statistical modelling
Machine learning
Best practice in statistics and data science
Proper A/B testing
Simulation

And, probably most relevant of all this year:

How to emerge from the LLM hype cycle unscathed!

If you’d like to be mentored by me, or would like me to run workshops for you and your team, then just drop me a line at simon@coppelia.io.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.


[1] And this way of thinking will set you up nicely when it comes to translating the problem into code!

[2] Preprints at the moment, I believe.

[3] Russell, S., Norvig, P., 2016. Artificial Intelligence: A Modern Approach. Pearson. p. 4.