Glasseye

Issue 25: May 2026

May 26, 2026

In this month’s issue:

Writing is thinking, reading is thinking, coding is thinking. Ada knew it. You should too. The white stuff sets things straight.
Semi-supervised explains how to solve the scrappiest of problems by maximising your ignorance.
The dunghill sniffs out the bad practice behind “modelled” variables.

The dunghill

A couple of weeks ago the Guardian ran an article covering the tragic consequences of the Kenyan government's use of machine learning for means testing. The initiative was supposed to determine how much each Kenyan family should pay for healthcare. In the terminology of the World Bank, the model would enable proxy means testing; in other words, it would provide a prediction of household income based on other variables, most significantly household possessions. This prediction would then be used to work out how much the family is able to contribute to a government-subsidised scheme. Putting to one side the ethics of doing this at all (Is it ever acceptable to use a proxy when discrepancies between the proxy and the true value have such dire consequences?), there were, as investigators at Africa Uncensored revealed, serious flaws in the modelling methodology that went beyond just top-level predictive accuracy. As I have pointed out in previous posts, subjective decisions nearly always need to be made during the training and tuning of machine learning models, and these decisions reflect particular interests. In the case of the Kenyan government’s programme, the decision was to tune the algorithm to classify the rich more accurately than the poor, on the grounds that a rich person classified as poor is far less likely to report the error than a poor person classified as rich. While this was better for government finances, it was disastrous for the poorest families.

Now bear with me, as this might seem unrelated. After I posted last month’s Dictionary of bullshit for statistics, AI and data science, I asked subscribers what I had left out. Scott Thompson then kindly offered “modelled as an adjective for papering over all kinds of shoddy guesstimation.” Perhaps you are lucky enough never to have heard of modelled variables, but they are a mainstay of the data brokerage industry. A modelled variable is a proxy for a real variable, created in exactly the same way as the Kenyan government’s proxy income value. In fact, had the Kenyan government been looking for a term that made their proxy variable appear more respectable, then they might have gone for “modelled” income.

The consequences of modelled variables are of course usually trivial compared to what has happened in Kenya, but nevertheless the shoddiness is well worth an airing. Let me say first that of course there is nothing wrong with creating a machine learning model that predicts income or anything else. It’s what you do with it next that matters.

For one thing, the producers of modelled variables often forget to mention the fact that the values are not real; and the consumers often forget to ask (granted they might not even suspect it). This, in fact, was my first exposure to “modelled” as a bullshit term - used to bat back some concerns about the reliability of purchased data. No, the variables were not “actual”, but I needn’t worry, I was assured, since they were “modelled”. Knowing no better at this early stage in my career, I failed to come back with the correct response - something along the lines of: “Well in that case I really need to know how accurate they are because this will affect every conclusion I draw using this data.”

The most common fate for a modelled variable is to end up as an input for yet another predictive model. Modelled income, for example, might end up being used to predict customer lifetime value. But does this really work? There are three scenarios here:

The second model was trained using real income values as features; modelled income is used only as an input once the model is deployed. But now of course we don’t know how accurate the second model is, because any training-related performance measures assume that the income value is “actual”. The model could be wildly inaccurate.
The second model was trained using modelled income values; this will mean that the performance measures account for the uncertainty in such values, but the question then arises: why not just take the features that were used to predict income and use them directly in the second model? This will almost certainly yield a better outcome for the second model as it will allow interactions between those features to play a role in prediction.
The second model was trained on a mixture of real and modelled income values, the latter being used when the former are unavailable. Now we have a god-awful mess. The model performance measures are valid only for the particular balance of real and modelled income values present in the training set. Unless we can guarantee this balance in the wild, who knows how the model is going to perform.
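The first scenario can be made concrete with a toy simulation. This is only a sketch on synthetic data: the variables `possessions`, `real_income` and `lifetime_value`, the noise levels and the straight-line models are all invented for illustration, and the errors are measured in-sample to keep it short.

```python
import random
import statistics

random.seed(0)
N = 5000

# Synthetic data: every number here is invented for illustration.
possessions = [random.gauss(0, 1) for _ in range(N)]
real_income = [p + random.gauss(0, 1) for p in possessions]
lifetime_value = [2 * x + random.gauss(0, 1) for x in real_income]


def fit_line(xs, ys):
    """One-feature least squares; returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the line's predictions (in-sample)."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return statistics.fmean(squared) ** 0.5


# First model: "modelled" income, predicted from possessions.
a, b = fit_line(possessions, real_income)
modelled_income = [a * p + b for p in possessions]

# Second model: lifetime value, trained on real income (scenario one).
slope, intercept = fit_line(real_income, lifetime_value)

# The error we would report after training versus the error once the
# modelled values are swapped in at deployment.
reported_rmse = rmse(real_income, lifetime_value, slope, intercept)
deployed_rmse = rmse(modelled_income, lifetime_value, slope, intercept)
print(f"error measured in training: {reported_rmse:.2f}")
print(f"error with modelled income: {deployed_rmse:.2f}")
```

In this made-up setup the error in the modelled income passes straight through the second model, so the deployed error comes out well above the reported one. How large the gap is depends entirely on the invented noise levels; the point is only that the training-time figure says nothing about it.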

Since not a lot of attention is usually paid to features used for machine learning, what we end up with is a two-step process of obfuscation. First, nothing is said and nothing is asked during the handover of modelled variables. Second, these variables are tucked away in machine-learning pipelines, where they are safe from further interrogation, in a way they would absolutely not be if they cropped up in, say, a shareholder report. No one is any the wiser.

Thanks again to Volodymyr Fomichov for pointing me to the Guardian article and Scott Thompson for suggesting “modelled” as a bullshit term of art.

The white stuff

“Writing is thinking.” So says a short editorial in Nature Reviews Bioengineering, pleading the importance of “human-generated scientific writing” in the age of generative AI.

Writing scientific articles is an integral part of the scientific method and common practice to communicate research findings. However, writing is not only about reporting results; it also provides a tool to uncover new thoughts and ideas. Writing compels us to think — not in the chaotic, non-linear way our minds typically wander, but in a structured, intentional manner.

“Reading is thinking too”, says the author of a letter to the Annals of Biomedical Engineering, citing the Nature article. Of course it is.

I’m reliably informed that agencies are losing business because their pitch teams are presenting LLM-generated slides they neither authored nor properly read. Not having had the thoughts themselves, they are tongue-tied as soon as the prospective client asks a question.

As I said in Glasseye no. 23, I suspect we are now seeing proposals being generated from LLM-transcribed meeting notes, which then become contracts by a further act of generation, to be signed by parties who have no idea what they contain.

Coding is thinking too. A form of thinking that should be prized for the way it clearly separates out concepts that are muddled together in everyday thought. Ada Lovelace knew this (right at the very beginning) as you will learn from this month’s recommendation: Lovelace’s notes to her translation of Menabrea’s memoir, On the Mathematical Principles of the Analytical Engine. Here she advises that we introduce into our thinking the distinction between operations, objects operated upon, and the results of the operations performed upon those objects, a distinction she learnt from programming Babbage’s Analytical Engine:

It were much to be desired, that when mathematical processes pass through the human brain instead of through the medium of inanimate mechanism, it were equally a necessity of things that the reasonings connected with operations should hold the same just place as a clear and well-defined branch of the subject of analysis, a fundamental but yet independent ingredient in the science, which they must do in studying the engine. The confusion, the difficulties, the contradictions which, in consequence of a want of accurate distinctions in this particular, have up to even a recent period encumbered mathematics in all those branches involving the consideration of negative and impossible quantities, will at once occur to the reader who is at all versed in this science.

I enjoy watching Claude Code as much as the next person, but take care not to hand over the important stuff!

Semi-supervised

I have a useful tool for you. It was briefly mentioned in the white stuff of Glasseye no. 11, but it has come in handy so often recently that I thought it deserved more of a spotlight. This is the principle of maximum entropy as introduced by E.T. Jaynes in his 1957 paper, Information Theory and Statistical Mechanics. Compared to the concepts of statistical inference, the idea is relatively simple. To quote me:

The principle uses Shannon’s concept of information entropy - at the time only recently formulated - to update Laplace’s principle of insufficient reason (in the absence of any relevant information, we should assign equal probabilities to all possible outcomes). Jaynes redefines this in terms of entropy: in the absence of information, it is logical to assume the probability distribution that is maximally non-committal - in other words, the one that contains the least information, ergo the one that has the highest entropy.

Most usefully, Jaynes shows how we can constrain the optimisation process with whatever information is known about the distribution to obtain a final distribution that reflects the known facts while remaining maximally non-committal.
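For anyone who wants to see "maximally non-committal" as a number, here is a minimal sketch of Shannon entropy applied to two distributions over four outcomes. The distributions `uniform` and `skewed` are made up for illustration, and the entropy is measured in nats (natural logarithm), which is a choice of mine rather than anything the principle requires.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0)


# Two illustrative distributions over the same four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]

for name, dist in [("uniform", uniform), ("skewed", skewed)]:
    print(name, round(entropy(dist), 4))
```

The uniform distribution scores log 4, the highest entropy possible for four outcomes, which is Laplace's principle of insufficient reason restated in Jaynes's terms. The skewed one commits to something and scores lower.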

In order to sell it to you let me first explain why the principle of maximum entropy is useful. After that I will describe how it is done. There was a fashion a while back among consultancies for intimidating potential employees by posing irritating interview questions along the lines of: “How many trees are there in Hyde Park?” or “How many people can you fit in Wembley Stadium?” The point was, no matter how little you know, you should still be able to say something. The use case for the principle of maximum entropy is in a similar vein. It is, if you like, the opposite of a big data problem. It is the almost-no-data problem - certainly nothing at the level of the individual observations. Since this is not an unusual situation, especially when it comes to decision making, it is extremely useful to have a method that arrives at not just any guess, but the most rational one.

Specifically, the principle of maximum entropy comes into play when we are after a best guess at a probability distribution and all we have at our disposal is a few high-level facts about that distribution, facts that can be turned into constraints. To give an example, we might want a best guess at the distribution of UK citizens over gender, income band, and party voted for at the last general election. The scraps of information we have to hand are: a top-level percentage split of individuals by party voted for, a percentage split of men by income band, and a percentage split of women in the highest two income bands by party voted for. The answer, I hope you agree, is not completely obvious.

So how do we do this? Fortunately Jaynes provides a general solution for discrete distributions. If a probability distribution with probability mass function p is subject to m expectation-based constraints:

\(\sum_{i=1}^n p(x_i)f_k(x_i) = F_k \qquad k = 1,\ldots,m,\)

then the distribution which maximises entropy is given as:

\(p(x_i) = \frac{1}{Z(\lambda_1,\ldots, \lambda_m)} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],\)

where the \(\lambda_k\) are Lagrange multipliers. The normalisation constant is given as:

\(Z(\lambda_1,\ldots, \lambda_m) = \sum_{i=1}^n \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right].\)

To arrive at the solution we need to calculate the values of the \(\lambda_k\). This can be done by solving the system of equations:

\(F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1,\ldots, \lambda_m)\)
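A quick check, using only the definitions above, that this system is the original set of constraints in disguise. Differentiating \(\log Z\) with respect to \(\lambda_k\) and substituting the expression for \(p(x_i)\) gives:

\(\frac{\partial}{\partial \lambda_k} \log Z = \frac{1}{Z}\sum_{i=1}^n f_k(x_i)\exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right] = \sum_{i=1}^n p(x_i)f_k(x_i) = F_k.\)

So finding the \(\lambda_k\) amounts to choosing the multipliers that make the exponential-form distribution satisfy each constraint exactly.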

That might feel like a lot of work, but the key step is turning the constraints that we have into expectation-based constraints, and we do this, in each case, by choosing an indicator function \(f_k\) that picks out those discrete outcomes whose probabilities need to add up to whatever proportions we do happen to know about. Take the example above: if we know that 60% of the electorate voted Labour we would need an indicator function that picked out all the outcomes in which there was a vote for Labour, for example: male, income of 40 to 50K, votes Labour. After that it comes down to calculus and solving simultaneous equations (or you could just plug the objective function and the constraints into an optimiser).
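To show the mechanics end to end, here is a minimal sketch on a cut-down version of the example. Everything specific is an assumption made for illustration: two income bands instead of many, a single "Other" party, and invented proportions (including the 60% Labour figure used above). The multipliers are found by plain gradient descent on \(\log Z - \sum_k \lambda_k F_k\), which is one simple way of solving the system, not the only one.

```python
import itertools
import math

# Hypothetical outcome space: every combination of gender, income band and party.
GENDERS = ["male", "female"]
BANDS = ["low", "high"]
PARTIES = ["Labour", "Other"]
OUTCOMES = list(itertools.product(GENDERS, BANDS, PARTIES))

# Each constraint pairs an indicator function f_k with a known proportion F_k.
# All of the proportions are invented for illustration.
CONSTRAINTS = [
    (lambda o: o[2] == "Labour", 0.60),                   # 60% voted Labour
    (lambda o: o[0] == "male", 0.50),                     # half are men
    (lambda o: o[0] == "male" and o[1] == "high", 0.20),  # 40% of men are high band
]


def distribution(lambdas):
    """p(x_i) = exp(sum_k lambda_k * f_k(x_i)) / Z for every outcome x_i."""
    weights = [
        math.exp(sum(lam for lam, (f, _) in zip(lambdas, CONSTRAINTS) if f(o)))
        for o in OUTCOMES
    ]
    z = sum(weights)
    return [w / z for w in weights]


def solve(steps=2000, rate=1.0):
    """Find the multipliers by gradient descent on log Z - sum_k lambda_k * F_k."""
    lambdas = [0.0] * len(CONSTRAINTS)
    for _ in range(steps):
        probs = distribution(lambdas)
        for k, (f, target) in enumerate(CONSTRAINTS):
            expected = sum(p for o, p in zip(OUTCOMES, probs) if f(o))
            # The gradient is (expected - target); it vanishes exactly when
            # the constraint F_k = d log Z / d lambda_k is satisfied.
            lambdas[k] -= rate * (expected - target)
    return dict(zip(OUTCOMES, distribution(lambdas)))


result = solve()
for outcome, p in result.items():
    print(outcome, round(p, 4))
```

With these made-up inputs the Labour share comes out at 60% within every gender and income cell, and women are split evenly across the two income bands, because nothing we supplied says otherwise.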

I can almost guarantee that the end result will strike you as breathtakingly obvious in retrospect - the kind of conclusion that, had you thought about it long enough, you’d have reached intuitively. But intuition is one thing, justifying an intuition is something else. In fact the principle of maximum entropy is particularly good at proving things that do strike us as somehow obvious. For example, it provides a justification for the conclusion that, when nothing else is known, the distribution of a group over a variable is the best guess for the distribution of any of its subgroups over that same variable - an intuition that we draw on when we do certain kinds of sample weighting.
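A sketch of that last justification, in notation that is mine rather than Jaynes's. Let each outcome be a pair \((g, v)\) of subgroup \(g\) and variable value \(v\), and suppose all we know is the group-level split \(q_v\) and the subgroup sizes \(w_g\):

\(\sum_{g} p(g, v) = q_v, \qquad \sum_{v} p(g, v) = w_g.\)

Each constraint uses an indicator function that picks out one value of \(v\) or one subgroup \(g\), so the maximum entropy solution has the form:

\(p(g, v) = \frac{1}{Z} \exp\left[\lambda_v + \mu_g\right],\)

which is a product of a term in \(g\) and a term in \(v\). Imposing the two sets of constraints then forces:

\(p(g, v) = w_g\, q_v, \qquad \text{and so} \qquad p(v \mid g) = \frac{p(g, v)}{w_g} = q_v.\)

Every subgroup inherits the distribution of the whole group, which is exactly the intuition described above.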

One last point: it should be clear that this technique is not a form of statistical inference. We obtain only point estimates for the individual probabilities. We can say nothing about how accurate these estimates are. The principle of maximum entropy does however play a role in Bayesian statistical inference, as a principled method for selecting priors.

Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia

From Coppelia

After the slapdash, ego-driven unpreparedness of most business meetings, it was a nice change to be an external reviewer for Birkbeck’s School of Computing and Mathematical Sciences. It was expertly run at an orderly pace and felt genuinely meaningful. This could be a case of grass is always greener, but I love the place (did my post grads there) and will always recommend it.
Thank you to everyone who has sent me more examples of deranged propaganda for synthetic respondents (and thank you to the providers themselves who have spammed it straight to my inbox). An update on my favourite bugbear is coming soon, I promise!

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.
