Glasseye
Issue 19: November 2025
In this month’s issue:
So long CHAID: the dunghill calls time on an algorithm that has outstayed its welcome.
Unfaithful digital twins and a poetic assault on LLMs in the white stuff.
Semi-supervised promotes unit-testing as the way to fool-proof your data science project.
The dunghill
Have you heard of CHAID? Is it what you instinctively reach for when you want to understand a complex data set? No? Never? That’s because you don’t work in market research or any of the other survey-based industries where SPSS still squats like a malignant toad.
CHAID (Chi-squared Automatic Interaction Detector) is a decision-tree algorithm that has shipped with SPSS for as long as I can remember. Market researchers are often stunned that data scientists outside their field have never heard of CHAID. They are unaware that SPSS has been preserving this museum piece since the early 1980s.
Not that you would know if you researched it online. The only people who write about CHAID are its users, and SPSS has done such a good job of cutting them off from the rest of the world that they have little or no idea how far it has fallen out of favour or what the alternatives are. And of course LLMs are enthusiastically amplifying this one-sided view. The net effect is that a graduate joining a research company will Google “CHAID” and find nothing but endorsements. Unless they are sufficiently curious, they will live out their career none the wiser.
We can’t hope to reverse this powerful process (it has withstood four decades of tumultuous change in statistics and data science), but we might be able to extend a hand to a few lost souls. So, for those who are wondering, here’s why no one else is using CHAID.
Really the answer can be summed up in two words: Leo Breiman. Breiman was, among many other things, the originator of classification and regression trees (CART), and then, in collaboration with Adele Cutler, random forests. Breiman was undoubtedly a class act. Whatever he did, he did brilliantly, and CART, an alternative decision tree algorithm, knocked CHAID out of the water.
To understand why, we need some background. I’m assuming you know what a decision tree looks like. You probably also know that they are very unstable - small changes in the data will result in very different models - and that this is a sure sign that they are overfitting the data. But they are also rather mesmerising to look at. What is more, a decision tree is supremely explainable. If you wanted to, you could write out its rules in plain English. By itself this is a great virtue, but when combined with the tendency to overfit, it is a disaster. Visual appeal and verbal explainability lure us towards patterns that are not really there.
So if you must have a decision tree (and why not, since they are so pretty) then it is of paramount importance that you do all you can to prevent overfitting.
One way of doing this is to stop growing the tree when the data left at each node becomes too meagre to justify further splits. This is the approach taken by CHAID, which uses a chi-squared test to look for evidence of further structure. But for decision trees, it is often true that an important split is preceded by several weak splits. The CHAID approach risks halting tree growth before this important split is reached and thus falls into the opposite trap of underfitting the data.
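To make the stopping problem concrete, here is a minimal sketch in Python. It is not CHAID itself (there is no category merging or significance adjustment here); it only computes the Pearson chi-squared statistic that a stop-early rule would consult. The data set is an invented extreme case in which the outcome is x1 XOR x2, and the names chi_squared and crosstab are illustrative.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def crosstab(rows, feature, target="y"):
    """2x2 table of counts: feature value (rows) by target value (columns)."""
    table = [[0, 0], [0, 0]]
    for r in rows:
        table[r[feature]][r[target]] += 1
    return table


# Invented data: y = x1 XOR x2, with 25 rows for each combination.
data = [
    {"x1": a, "x2": b, "y": a ^ b}
    for a in (0, 1)
    for b in (0, 1)
    for _ in range(25)
]

# At the root, neither variable shows any association with the outcome.
root_x1 = chi_squared(crosstab(data, "x1"))
root_x2 = chi_squared(crosstab(data, "x2"))

# After the "useless" split on x1, x2 separates the outcome perfectly.
branch = [r for r in data if r["x1"] == 0]
branch_x2 = chi_squared(crosstab(branch, "x2"))

print(root_x1, root_x2, branch_x2)
```

A rule that refuses to split when the root-level statistic is negligible never reaches the second split, which is the one that matters. Real data is rarely this stark, but the mechanism is the same.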
An alternative is to grow the whole tree and then prune parts of it back. As you prune, you reduce model complexity. This is the approach taken by CART. But there are two problems with this approach. First, there are multiple routes back from the full tree to the starting node. How do you choose the right one? Second, how do you know when to stop pruning? What Breiman and his co-authors did was to find a single pruning path back to the root node that could be justified in terms of the classification performance of each tree on the path. They could then map that single path onto a continuous variable (which also happens to be a penalty on the size of the tree) and treat that variable as a measure of complexity. It was then straightforward to tune that complexity parameter using standard machine learning techniques.
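The pruning path can be sketched in a few lines of Python. This is a simplified illustration of the weakest-link idea, not the full procedure from the CART book: the tree, its error counts and the function names are all invented, errors are raw misclassification counts rather than rates, and ties are broken arbitrarily. The penalised cost of a tree is its error plus alpha times its number of leaves; at each step we collapse the internal node whose removal costs the least extra error per leaf saved, and record the alpha at which that collapse becomes worthwhile.

```python
def leaves(node):
    """Number of terminal nodes in the subtree rooted at node."""
    if "children" not in node:
        return 1
    return sum(leaves(child) for child in node["children"])


def subtree_error(node):
    """Total misclassifications made by the leaves under node."""
    if "children" not in node:
        return node["err"]
    return sum(subtree_error(child) for child in node["children"])


def internal_nodes(node):
    """Yield every node that still has children."""
    if "children" in node:
        yield node
        for child in node["children"]:
            yield from internal_nodes(child)


def link_strength(node):
    """Extra error per leaf removed if node is collapsed into a leaf."""
    return (node["err"] - subtree_error(node)) / (leaves(node) - 1)


def pruning_path(root):
    """Collapse the weakest link repeatedly; return (alpha, leaf count) pairs.

    Note: this prunes the tree in place.
    """
    path = [(0.0, leaves(root))]
    while "children" in root:
        weakest = min(internal_nodes(root), key=link_strength)
        alpha = link_strength(weakest)
        del weakest["children"]  # the node becomes a leaf
        path.append((alpha, leaves(root)))
    return path


# Hypothetical tree: "err" is the error count if that node were a leaf.
tree = {
    "err": 40,
    "children": [
        {"err": 10, "children": [{"err": 4}, {"err": 5}]},
        {"err": 25, "children": [{"err": 5}, {"err": 5}]},
    ],
}

path = pruning_path(tree)
print(path)
```

Each entry on the path is a candidate tree indexed by a single number, alpha. Choosing among them is then an ordinary tuning problem, typically handled with cross-validation or a held-out set.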
So in a nutshell Breiman created an alternative to CHAID that was proofed against the twin perils of overfitting and underfitting. It didn’t hurt that it was perfectly aligned with the emerging discipline of supervised machine learning, or that it was announced in an excellent textbook, explaining its properties and detailing the theory that justified its use. CHAID, by contrast, felt ad hoc and cobbled together. Even its one great strength - the diagrams are easier to read because the nodes can be split in more than two ways - turned out to be a weakness, since the splitting is too aggressive, prematurely reducing the size of the data in each node.
So that’s it. CHAID was state-of-the-art in 1980, but obsolete by 1984. Of course if you want to use it, then that’s up to you - it won’t hurt so long as you validate your findings using some more robust method. But equally, it won’t do much for your credibility outside of research.
Thank you to Wendy Martinez, whose intellectual curiosity inspired this post!
If you have some particularly noxious bullshit that you would like to share, then I’d love to hear from you. DM me on substack or email me at simon@coppelia.io.
The white stuff
Two entirely unrelated papers this month. The first is Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. (Thank you to Neil Charles for bringing it to my attention.) It’s as strange as it sounds. Requests written in verse are considerably more likely to get around measures that have been put in place by model providers to prevent access to harmful content. The authors found that “20 manually curated adversarial poems (harmful requests reformulated in poetic form) achieved an average attack-success rate (ASR) of 62% across 25 frontier closed- and open-weight models, with some providers exceeding 90%.” Particularly fascinating is the fact that the larger models tended to do worse. The authors speculate that their increased compliance is partly due to their ability to understand the content of the disguised request. In short, they are too educated for their own good (or our good).
The second is A Mega-Study of Digital Twins Reveals Strengths, Weaknesses and Opportunities for Further Improvement. (Thanks to Mat Morrison for this one.) I was especially interested in this paper, since the digital twin setup here is a special case of the practice of surveying synthetic respondents, of which, as you probably know by now, I’m not a fan. It’s a special case, because for this study the synthetic respondents are based on detailed data profiles of real people (hence digital twins). If you like, it’s synthetic survey respondents taken to the nth degree. It has the additional advantage that results from surveying the digital twins can be directly compared with those from their real counterparts. As expected, these results are not great. Commendably, the authors differentiate between successfully predicting an individual’s response and successfully capturing the variation in response within a population. (It’s easy to get a high degree of accuracy in predicting the response to “Do you like pizza?” Just predict “yes” for everyone. Far more difficult - and useful - to be able to predict who likes pizza.) While the score on the first looks good (75%), the correlation for the second is more revealing (0.2). Note that this supports the point I made previously, namely that synthetic survey respondents would reproduce trivial findings with a high degree of accuracy, but fail when it comes to reproducing surprising (and therefore valuable) findings.
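The pizza point is easy to check with toy numbers. The sketch below is in Python and uses ten invented respondents; none of the figures come from the paper, and the paper’s own metrics may be defined differently. It simply shows that a predictor which says “yes” to almost everyone can score well on accuracy while telling you nothing about who differs from whom.

```python
from math import sqrt


def accuracy(truth, predicted):
    """Share of respondents whose answer was predicted correctly."""
    hits = sum(t == p for t, p in zip(truth, predicted))
    return hits / len(truth)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: 1 = likes pizza, 0 = does not. Eight of ten people like it.
truth = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A lazy "twin" that says yes to nearly everyone, and gets its one "no" wrong.
predicted = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]

acc = accuracy(truth, predicted)
corr = pearson(truth, predicted)
print(acc, corr)
```

Seven answers out of ten are right, yet the correlation between real and predicted answers is slightly negative. The accuracy comes almost entirely from the base rate, not from any knowledge of individuals.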
Even more worrying is another “I told you so”: “Our analysis suggests that the accuracy of digital twins is uneven across demographic groups, with better alignment for participants who are more educated, higher income, and with moderate political views and religious attendance habits.” The authors speculate that these biases are “likely to come from the base LLM powering the digital twins”.
The closing statement of the paper is something I think should be stamped on the marketing material of all agencies proffering synthetic respondents.
Based on our results it may not be realistic to think about them as “clones” of humans, but rather as hyper-rational, quasi-omniscient versions of humans, with implicit values partly imbued by their base LLM.
Semi-supervised
In September I posted about the merits of package building as an approach to modelling, analysis and data science in general. Many of you found that useful, so I thought I’d follow it up by promoting another technique I’ve borrowed from software engineering - the unit test. Once again, half of you will hardly need convincing. If you are an ML engineer or at all involved in the production of software, then this is your bread and butter - you can stop reading now - but if you wandered into data science from the sciences or even further afield, this may be new to you.
To conduct a unit test, we isolate a component within a system, preferably a low-level component, and check that it performs as expected. The check involves submitting example inputs and checking that the component produces the right outputs. Because the component is relatively simple, it is usually straightforward to calculate the expected outputs by some other method. Particular attention is paid to constructing examples of edge cases that might break the component. The hope is that if all the parts are doing their job properly, then the system as a whole will be. (This isn’t always the case - failure can be at a system level - but most of the time it’s a good start.)
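Here is a minimal sketch of what that looks like in Python. The component, spend_per_visit, is a made-up feature calculation, and the decision to return 0.0 for customers with no visits is an assumption chosen for the example. The tests are plain functions containing assertions; a test runner such as pytest would pick up anything named test_ automatically, but they can equally be called by hand.

```python
def spend_per_visit(total_spend, visits):
    """Average spend per visit; defined as 0.0 for customers with no visits."""
    if visits < 0:
        raise ValueError("visits cannot be negative")
    if visits == 0:
        return 0.0
    return total_spend / visits


def test_typical_customer():
    # Expected value worked out by hand: 120 / 4 = 30.
    assert spend_per_visit(120.0, 4) == 30.0


def test_no_visits_edge_case():
    # Edge case: a naive implementation would divide by zero here.
    assert spend_per_visit(0.0, 0) == 0.0


def test_negative_visits_rejected():
    # Edge case: impossible input should fail loudly, not return a number.
    try:
        spend_per_visit(10.0, -1)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError")
```

Note that two of the three tests are about edge cases. The typical case is rarely where the bugs live.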
The usefulness of such a technique is obvious when it comes to building the kind of products and applications that need to withstand constant and varied use, and which cannot afford to fail. The advantages for one-off pieces of analysis, or offline data science processes, are less obvious. But they are there.
First, as the September post made clear, I think there’s a case for building modular packages for all but the lightest of data science tasks. The clarity it lends to your thinking and the rigour it adds to your work are worth the extra effort - and frankly, in the long run, as the complexity builds, you’ll save time. (I’ve written many a software package that has been used only once.) Once you’ve made this leap, the unit test is the natural step to ensure that your work is robust.
Let me give you an example. Something generic. You are building a process that makes a customer-level recommendation or a prediction, based on demographic and behavioural data. The pipeline involves various steps: validating the data; constructing new features (using some custom-built transformations); reducing the number of dimensions; perhaps some conditional logic that selects the most appropriate model; and then the application of the model. Each of these is a unit, and within each of them, there are potentially subunits.
Now I didn’t mention that unit testing is a very mature area of technology. It has seen much innovation, and most unit testing packages (for example, pytest) are rich in tools and features. One such feature is the fixture. Fixtures are objects that are reusable across tests, providing efficiency and standardisation. Thus, for the example given above, I would create, as a fixture, a test set of just a handful of customer records. I would design them to be as different as possible so as to flush out a wide range of issues. I would then construct unit tests for the various steps, using my fixture as the input data and constructing the expected output using some hand-cranked calculations. But, you object, the input to the later units is not the original customer data but some transformation of it. Fortunately fixtures can be created by applying code to other fixtures, so it is easy enough to create a new fixture for, say, the modelling module by running the original fixture through the preceding modules.
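A sketch of the fixture idea using pytest follows. The pipeline steps (add_spend_per_visit and high_value_flag) and the customer records are invented stand-ins for the feature and modelling modules described above, and the threshold is arbitrary. The point is the second fixture, which is built by running the first one through the preceding step.

```python
import pytest


def add_spend_per_visit(records):
    """Feature step: return copies of the records with a spend_per_visit field."""
    out = []
    for r in records:
        visits = r["visits"]
        ratio = r["total_spend"] / visits if visits else 0.0
        out.append({**r, "spend_per_visit": ratio})
    return out


def high_value_flag(records, threshold=25.0):
    """Stand-in for the modelling step: flag customers above a spend threshold."""
    return [r["spend_per_visit"] > threshold for r in records]


@pytest.fixture
def customers():
    # A handful of deliberately dissimilar records, including a zero-visit customer.
    return [
        {"id": 1, "total_spend": 120.0, "visits": 4},
        {"id": 2, "total_spend": 0.0, "visits": 0},
        {"id": 3, "total_spend": 15.0, "visits": 3},
    ]


@pytest.fixture
def featured_customers(customers):
    # A fixture created by applying code to another fixture.
    return add_spend_per_visit(customers)


def test_feature_step(customers):
    # Expected values are hand-cranked: 120/4, the zero-visit rule, 15/3.
    result = add_spend_per_visit(customers)
    assert [r["spend_per_visit"] for r in result] == [30.0, 0.0, 5.0]


def test_model_step(featured_customers):
    # The modelling test receives transformed data, not the raw records.
    assert high_value_flag(featured_customers) == [True, False, False]
```

Saved as a file whose name starts with test_, this runs with the pytest command. pytest matches each test argument to the fixture of the same name and builds a fresh copy for every test, so one test cannot contaminate the data seen by another.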
The pay-off for all this work is a feeling of near-complete security in an environment where extremely complex things are happening. Over time, you will build up a battery of unit tests, which you then run each time the code is changed. Modifying your pipeline will inevitably break some of your tests. Some of these failures will be expected (you will modify the tests to reflect the changes), but some will be unexpected, signalling an unintended consequence of your modification.
One last point. Sometimes, while developing a solution, I will begin with a prototype that I know will only get me part of the way to the answer (the virtues of this approach I described in an earlier post on toy models). At this point, I build unit tests for the prototype modules. Into these units I feed, as a fixture, some simplified data - so simple that the prototype is able to cope and produce a decent enough answer. Next, I increase the complexity of the data to the point that it will break the tests, motivating a new round of development, the goal of which is to produce code that will cope with the more complex data. This technique has a name - test-driven development. If I’m honest, I often break the rules and do the development before modifying the tests. But the important thing is that the tests co-develop with the code, providing ever-present guardrails and a better understanding of the implications of what you are writing.
Please do send me your questions and work dilemmas. You can DM me on substack or email me at simon@coppelia.io.
From Coppelia
Life at the terminal grows richer (and more obsessive) day by day. My latest discovery is the perfect set of plugins for turning Vim into a writing tool: Goyo and Limelight for a nice distraction-free page, Pencil for word-wrapping, and Vale for customised style linting. I also discovered I could access my Obsidian vault in Vim using the vimwiki plugin on my markdown files. No further customisation needed.
Coppelia is part of the Melt Collective, a small but very experienced group of mostly independent data science professionals. This month we are very happy to welcome three new members: Sara Gaspar, Adrià Luz, and Martin East. All fantastic at what they do.
If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.



