Glasseye

Issue 6: October 2024

Oct 14, 2024

In this month’s issue:

Semi-supervised uses Python to take us inside the probability triple.
The Eureka moment is debunked as we look at the history of deep learning in The white stuff.
I am unforgivably mean about large consultancies in The dunghill.
Plus, to round off, some symbolic logic from sympy and some dodgy logic from the Tony Blair Institute.

Semi-supervised

Some time ago, I was asked one of those questions that rarely come up because they are about the very basics. These are high-risk questions: there’s a chance that, just by asking them, the questioner will seem to be ignorant of not only that one fact but - since that fact is elementary - the entire field of knowledge built on top of it. Unfortunately (or fortunately, since this is what makes it fascinating) statistics abounds in foundational concepts that are mind-melting. The question I was asked (by someone curious and brave) is: what is a random variable?

And that’s a fair question: the fact is that a random variable is neither random nor a variable. No, a random variable is a function. It maps outcomes in a sample space to a measurable space.

Now personally, I think code is under-used as an explanatory device for theoretical subjects: its virtues are that it is readable, intuitive, lends itself to experimentation, and leads naturally to uses and examples. Plus, an argument could be made that more people are now fluent in Python than in mathematical notation. So that’s what I’m going to try here.

Take a canonical example: the rolling of two six-sided dice. The sample space, usually represented by Ω, is the set of all the possible outcomes. In Python, for our example, we can code it like this:1

omega = [(i + 1, j + 1) for j in range(6) for i in range(6)]
print(omega)

[(1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (1, 2),
 (2, 2),
 ...
 (4, 6),
 (5, 6),
 (6, 6)]

Next we need to understand the event space (F), another confusingly named concept, since statistical events include what we would more naturally think of as disjunctions of physical events. For example, the statistical event of the first die being higher than the second die is said to occur when any one of fifteen physical events occurs.

F is the set2 of all possible subsets of Ω, which means there are an almost unbelievable 2³⁶ = 68,719,476,736 possible events for our two dice. In Python we can list the first few elements in the event space:

import itertools

italic_f = []
for i in range(4):  # Just the first few: all subsets of size 0, 1, 2 and 3
    italic_f.extend(itertools.combinations(omega, i))
italic_f

[(),
 ((1, 1),),
 ((2, 1),),
 ((3, 1),),
 ...
 ((1, 1), (2, 1)),
 ((1, 1), (3, 1)),
 ((1, 1), (4, 1)),
 ...
 ((1, 1), (2, 1), (3, 1)),
 ((1, 1), (2, 1), (4, 1)),
 ((1, 1), (2, 1), (5, 1)),
 ...]
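
As a rough sanity check on that figure, here is a minimal sketch that counts the events of each size without building them. It assumes nothing beyond the standard library, and the variable names are illustrative rather than part of the walkthrough above.

```python
import math

n_outcomes = 36  # size of omega for two six-sided dice

# Number of events (subsets of omega) of each size k = 0, 1, ..., 36
events_by_size = [math.comb(n_outcomes, k) for k in range(n_outcomes + 1)]

# Summing over every size gives the size of the full event space
total_events = sum(events_by_size)
print(total_events == 2 ** n_outcomes)

# The loop above stopped at size 3, so it built only this many events
first_few = sum(events_by_size[:4])
print(first_few)
```

The listing loop therefore holds 7,807 events, a vanishingly small corner of the full event space.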

Finally, we need a probability function (P). This maps each event in the event space to a number (the probability) between 0 and 1. Here is one such function corresponding to a world where both dice are fair:

def P(event):
    # Fair dice: every outcome is equally likely, so just count the outcomes
    return len(event) / 36

And here it is in action on a randomly chosen event:

event_1258 = italic_f[1257] 
print(P(event_1258))

0.08333333333333333
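
To connect P with a more recognisable event, here is a sketch that builds "the first die is higher than the second" as a subset of Ω. It redefines omega and P so that it runs on its own, and the name first_higher is my own invention.

```python
# Rebuild the sample space so the snippet runs on its own
omega = [(i + 1, j + 1) for j in range(6) for i in range(6)]

def P(event):
    # Fair dice: every outcome carries probability 1/36
    return len(event) / 36

# The event is the collection of outcomes where the first die beats the second
first_higher = tuple(o for o in omega if o[0] > o[1])

print(len(first_higher))
print(P(first_higher))
```

The event contains 15 outcomes, so under fair dice its probability is 15/36.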

Taken together, the two sets and a function (Ω, F, P) make up our probability space, sometimes called a probability triple.
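
Not every function will do as P. The standard requirements (the Kolmogorov axioms, which are not spelled out above) are that the empty event gets probability 0, the whole sample space gets probability 1, and the probabilities of disjoint events add up. A minimal sketch that spot-checks these for our fair-dice P, using two events chosen purely for illustration:

```python
omega = [(i + 1, j + 1) for j in range(6) for i in range(6)]

def P(event):
    return len(event) / 36

# Two disjoint events: doubles, and outcomes whose values sum to 3
doubles = tuple(o for o in omega if o[0] == o[1])
sum_is_three = tuple(o for o in omega if o[0] + o[1] == 3)
union = doubles + sum_is_three  # disjoint, so no outcome is repeated

print(P(()))            # the empty event
print(P(tuple(omega)))  # the certain event
# Additivity, compared with a tolerance because these are floats
print(abs(P(union) - (P(doubles) + P(sum_is_three))) < 1e-12)
```

This is a spot check on one pair of events, not a proof, but it shows what the triple has to satisfy.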

And now, back to our starting point: a random variable is a function mapping the sample space to the real numbers (or a subset of the real numbers).3 Here, for example, is a random variable which works by adding together the values on the two dice.

def X(outcome):
    return outcome[0]+outcome[1]

And here it is in action:

X(omega[5])

7

That feels a bit disappointing. But actually, what we are often most interested in is not the random variable but something quite different: the realisation of a random variable. Here all the parts come together. To obtain the realisation, we randomly sample from the sample space. We know how to do this because the outcomes exist in the event space, and P gives us the probability of each event. Having sampled our outcome, we can then apply the random variable (a function, remember) to obtain the realisation.

In our case, all 36 events corresponding to the 36 outcomes have the same probability, which makes life easier. I don’t know what kind of mathematical object such a realisation is, but in Python we can code it as:

import random

def realise_X():
    # Pick one outcome uniformly at random, then apply the random variable
    outcome = random.choice(omega)
    return X(outcome)

realise_X()

9

And of course a series of realisations of a random variable is a random sample:

[realise_X() for i in range(10)]

[12, 7, 11, 4, 8, 6, 8, 4, 4, 6]
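
One way to see that such a sample behaves as it should is to compare it with the exact distribution of X, obtained by pushing each outcome's 1/36 through the function. The following is a sketch only: it redefines omega and X so that it runs on its own, and the seed and sample size are arbitrary choices of mine.

```python
from collections import Counter
from fractions import Fraction
import random

omega = [(i + 1, j + 1) for j in range(6) for i in range(6)]

def X(outcome):
    return outcome[0] + outcome[1]

# Exact distribution: each outcome contributes 1/36 to the value X assigns it
counts = Counter(X(o) for o in omega)
pmf = {value: Fraction(n, len(omega)) for value, n in sorted(counts.items())}

# Empirical frequencies from a large random sample of realisations
random.seed(0)  # arbitrary seed, only so the run is repeatable
n_draws = 100_000
sample = Counter(X(random.choice(omega)) for _ in range(n_draws))

# Columns: value of X, exact probability, observed frequency
for value, p in pmf.items():
    print(value, p, round(sample[value] / n_draws, 3))
```

The exact column peaks at 7 with probability 1/6, and the observed frequencies should sit close to it.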

I’m not suggesting that any of the above would be useful in a Python application. This is about building something in order to understand it since, to quote the ever-quotable Richard Feynman, “what I cannot create, I do not understand”.

Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia.io.

The white stuff

I found this month’s paper in a recent post by Gary Marcus. He has some beef with the Royal Swedish Academy of Sciences over the reasons given for awarding Geoff Hinton this year’s Nobel Prize in Physics. Marcus mentions that the citation implies that Hinton invented back-propagation (that’s certainly what the press took from it). This, says Marcus, simply isn’t true, and he links to the paper, Deep Learning in Neural Networks: An Overview, to make his point.

Now, it’s not the most readable paper in the world - it can’t be with so many citations to dish out - but it is fascinating for three reasons: first, it shows just how far back the roots of deep learning go and how torturous the early days were; second, it gives a much-needed chronology for the last two decades; and third, it tears down the idea that scientific discovery is about one person having a breakthrough idea at one single moment. For a devastating debunking of the Eureka myth, I recommend The Invention of Science: A New History of the Scientific Revolution by David Wootton.

The dunghill

This month it is my turn to spill, and I will oblige with one of my favourite anecdotes. I tell this one a lot because (a) it makes me look clever, and (b) it makes a large and intimidating organisation look stupid. But there is a more serious point to it, which I will come on to.

The story is about good intentions and absurd results. The good intention was on the part of an employer who had insisted on using an external authority to confirm the accuracy of a classification algorithm. This wasn’t entirely driven by the love of truth: the customers of that business needed assurance, and no greater level of assurance could be had than the stamp of a Very Large Consultancy. This is why, soon after starting in the role, I found myself in a meeting with two serious men and a serious piece of paper. That piece of paper confirmed that all the checks had been done on the algorithm, and we were good to go.

Now for the part where I look clever - I don’t mind lingering over it. The precision and recall quoted in the report were based, as of course they should be, on a hold-out sample. But what had been overlooked was that both the training and the test data were oversampled - there was a higher rate of positives in the data than would be found in the population. What was needed was a Bayesian adjustment to the quoted figures that accounted for the fact that the algorithm would be deployed on the population.
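
The details of the adjustment are not given here, so what follows is only a sketch of one standard approach. It assumes that oversampling changed the share of positives but not how the classifier behaves within each class, in which case recall and the false positive rate carry over to the population and precision can be recomputed with Bayes’ theorem. The function name and all of the figures are made up for illustration.

```python
def adjust_precision(precision, recall, sample_rate, population_rate):
    """Re-express precision for a population with a different positive rate.

    Assumes recall and the false positive rate are unchanged by oversampling.
    """
    # Back out the false positive rate implied by the test-set figures
    fpr = recall * sample_rate * (1 - precision) / (precision * (1 - sample_rate))
    # Bayes' theorem: P(positive | flagged) at the population's positive rate
    true_flags = recall * population_rate
    false_flags = fpr * (1 - population_rate)
    return true_flags / (true_flags + false_flags)

# Hypothetical figures: a 50/50 test set, but only 2% positives in the population
adjusted = adjust_precision(precision=0.9, recall=0.8,
                            sample_rate=0.5, population_rate=0.02)
print(round(adjusted, 3))
```

With these invented numbers, a precision of 0.9 on the oversampled hold-out set falls to roughly 0.16 in deployment, while recall is untouched. That gap is why the quoted figures needed adjusting.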

To their credit, the serious men took it well and retreated to recalculate. Fine - at this point they could still claim to be auditing our work. What I didn’t expect was for them to ask me to check their audit of our work, because they were unfamiliar with the Bayesian adjustment. Yes, but … doesn’t that mean… the audit… ? Nope, apparently everyone was fine, boxes were ticked, and since I was happy with our work, on we went.

I find this fascinating. Despite my gloating, this wasn’t a difficult problem to spot. I know a great many statisticians and data scientists, none of whom work for Very Large Consultancies, who would have recognised the issue instantly. Why then do we defer to these organisations? Why do we farm out to them our most interesting projects - not just auditing but major undertakings in data science and AI?

Authority bias is doubtless a major factor; another is the cult of “smart” - we are assured that only the brightest kids will be looking at this: the straight-A students, trembling under the weight of their own intelligence.4 Unmentioned is the fact that they have never worked a day of their lives in a real business, have slept only three hours a night for the past week, and are out of their minds with anxiety over whether they will make the team for the next project.5

Meanwhile, somewhere in the basement (or these days at the end of their beds) sit a handful of stalwart employees, each with ten years’ experience in both the field and the industry, each expensively recruited and then somehow forgotten about. Why not put them on the project? I’ll tell you why - because they are too busy running SQL queries and dropping charts into PowerPoint. And even if they were to do the work without the cachet of a top consultancy, who would listen? To paraphrase a mad German philosopher: people are crazy, organisations are crazier.

If you have some particularly noxious bullshit that you would like to share, then I’d love to hear from you. DM me on Substack or email me at simon@coppelia.io.

From Coppelia

Some good times in September:

Old school, non-connectionist symbolic logic saved the day for me, in a project that involved making deductions about population groups. On the way I discovered sympy, which I’m going to be using a lot more.
Held an interesting seminar with a client to evaluate Tony Blair’s claim that 20% of public sector tasks could be automated using AI. A funny story: Blair’s report was researched by asking GPT-4 to decide whether it is possible to automate each public sector task. What could be wrong with that? At least it proves that some jobs at the Tony Blair Institute are ready for automation.
Thanks to Dominic Bates for directing me to pyro following the brief mention of TensorFlow Probability last month. It looks great, and it’s on the list.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.
