Glasseye

Issue 26: June 2026

Jun 29, 2026

In this month’s issue:

The dunghill scoops the goop in “marketing science” by comparing it to the wellness industry.
The Royal Society’s Transactions puts some rigour into data science. The white stuff is impressed.
Semi-supervised offers some advice on how to write well as a data scientist.

The dunghill

I doubt I’m the first to point out the similarities between the consultancy world of “marketing science” and the multi-billion-dollar wellness industry, but still I do think they are worth listing out - for this reason: anyone with GCSE science and a modicum of worldliness knows what the wellness industry is up to, and no one is surprised that it continues to make a fortune mining the gullible. But when it comes to the marketing science industry, its very existence, along with the cash it generates, is taken as proof that something serious is going on. It’s a veritable gaslighting extravaganza and can cause you to doubt your own good sense. Hence the value of a point-by-point comparison with the wellness industry. So here are my observations.

First thing to note: A similar, and I suspect overlapping, clientele. Rude - patronising even - but true nonetheless. In both cases the buyer tends to have little patience for hard science. Within the wellness industry this manifests itself as a hostility towards conventional medicine; within marketing, as a barely concealed indifference. A box needs to be ticked on the way to doing what was going to be done anyway. That is all.

Which leads us to the second similarity: both industries thrive by telling their customers exactly what they want to hear, whether that’s that their advertising campaign has delivered a sizeable return on investment or, far more tragically, that a fruit juice diet is more effective than a course of chemotherapy. This means that in both cases, the honest broker (usually the bearer of bad news) is hugely disadvantaged. Even more so when you consider that it’s so much easier to tell a compelling story when unconstrained by the truth. Simple, intuitively graspable narratives cut through: specially formulated teas wash away bad toxins; TikTok influencers rejuvenate boomer brands with Gen Z freshness.

Ironically, given the underlying contempt for actual science, these narratives are all the more appealing when decorated with pseudo-scientific paraphernalia. Fake journals and dodgy data are standard fare in the wellness industry. Unreviewed “white papers” are used as bait on the websites of agencies and consultancies. And now the introduction of made-to-order synthetic data sets, rich enough to accommodate any story or outcome.

Essential to the con, in marketing science as much as wellness, is that the practitioners believe what they are saying. Usually they pride themselves on bringing a little magic to the process. It’s a bit of art, a bit of intuition, some mystical know-how that lifts up the science and somehow delivers the goods.

Given the rampant over-promising you would be forgiven for thinking that the truth will eventually out, but somehow both industries manage to get away with it. How? Partly due to a well-known feature of pseudoscience, namely, that it is carefully framed to be unfalsifiable. Partly due to confirmation bias and group delusion, but also due to a marketplace full of endless novelty and reinvention. If a particular cure or solution fails there’s always another to try.

Have we reached the bottom? Can our cynicism go any deeper? It can, with the observation that for both industries there will always be a market for hokum and it will always be vast. I am regularly asked by research agencies whether they should start offering research based on synthetic respondents. This is like a pharmacy asking whether they should stock homeopathic remedies. No, they absolutely should not. Will they be missing out on a significant, highly lucrative market? Yes, they absolutely will.

To rescue the mood, one final, slightly more upbeat parallel. It’s not as if the entire wellness industry is a complete sham. There are plenty of therapists, physical trainers, registered dietitians, etc., doing their utmost to steer clear of the bullshit and ply an honest trade. Similarly, although massively outnumbered by the charlatans, there are some good data scientists working in marketing, advertising, and market research. I know this because I know some of them.

If you have some particularly noxious bullshit that you would like to share then I’d love to hear from you. DM me on Substack or email me at simon@coppelia.io.

The white stuff

Now from the worst practice to the best, with last month’s special edition of the Philosophical Transactions of the Royal Society on the topic of statistical and scientific workflow. That might sound dry, but truly it was all my Christmases come at once. (Thanks once again to Adria Luz for bringing it to my attention.) “We believe”, say the authors in the opening article,

that there are shared aspects of quantitative research that are obscured by the varieties of models, methods and even philosophical frameworks that are successfully employed in statistics and machine learning, and also because it seems that many of the most important aspects of statistical practice, in whatever form, do not make their way into the textbooks.

I’ve been waiting my whole career to see that final clause in print. Yes, much of what they describe is not new - on the industry side we have been bootstrapping and cross-validating our way out of the hole between these two disciplines for well over a decade. But what is new, and what I get a massive kick out of, is seeing our worst bugbears - researcher degrees of freedom, the Rashomon effect, Chatfield’s model selection problem, and all the “questionable research practices” that dog our workday lives - being treated seriously in an academic journal as respected as the Royal Society’s Transactions.

In a nutshell, their proposed workflow uses the best bits to come out of machine learning and modern Bayesian methods to wrap the process of data exploration and model selection up into something much more scientifically rigorous. The hundreds of minor data manipulation and model selection decisions, which were never properly acknowledged by traditional statistics as sources of uncertainty and bias, are incorporated into their prediction perturbation intervals. And the absurdity of selecting one out of many equally well-fitting models (the Rashomon effect) is addressed as part of predictive model averaging. (Again not itself new but benefiting hugely from being presented as part of a general framework.)
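To make the idea of a perturbation interval a little more concrete, here is a minimal sketch in Python. It is not the authors' procedure: the function names, the use of bootstrap resampling as the data perturbation, the choice of polynomial degrees as the competing models, and the equal-weight average are all assumptions made for illustration. A fuller version would presumably screen the candidate models on held-out predictive performance and weight them accordingly.

```python
import numpy as np

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns a prediction function."""
    coefs = np.polyfit(x, y, degree)
    return lambda x_new: np.polyval(coefs, x_new)

def perturbation_interval(x, y, x_new, degrees=(1, 2, 3),
                          n_boot=200, level=0.9, seed=0):
    """Spread of predictions across data and model perturbations.

    Hypothetical helper: bootstrap resamples stand in for the many small
    data decisions, polynomial degrees for the equally plausible models.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # data perturbation
        for d in degrees:                     # model perturbation
            preds.append(fit_poly(x[idx], y[idx], d)(x_new))
    preds = np.asarray(preds)                 # (n_boot * len(degrees), len(x_new))
    alpha = (1 - level) / 2
    lower, upper = np.quantile(preds, [alpha, 1 - alpha], axis=0)
    average = preds.mean(axis=0)              # equal-weight model average (assumption)
    return lower, average, upper

# Toy data: a noisy straight line.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
y = 2 * x + 1 + rng.normal(0, 0.3, 60)
x_new = np.array([0.25, 0.5, 0.75])

lower, average, upper = perturbation_interval(x, y, x_new)
for xn, lo, av, hi in zip(x_new, lower, average, upper):
    print(f"x={xn:.2f}: {lo:.2f} <= {av:.2f} <= {hi:.2f}")
```

The point of the sketch is that no single model is crowned the winner: the interval reflects disagreement between models as well as sampling noise.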

Some other highlights:

A nice passage on the sequential assembly of models, pointing out that “complicated models can often be best understood in relation to simpler special cases” and that a “benefit of building up models one step at a time is that sometimes we can reach the pleasant state of having a model that fits well and does the job”, and further that “at this point it can be helpful to try adding a bit more complexity, just to show that this additional step is not necessary. Again, even setting aside questions of model choice, this extra model can help our understanding.” This is the kind of basic know-how that rarely makes it into the journals.
A non-partisan, pragmatic, and grown-up approach to the Bayesian-frequentist debate that accepts that non-Bayesian statistical analysis is not disappearing any time soon. Instead the emphasis is on drawing out the equivalences and ensuring that much of the good stuff from Bayesian methods (the heavy reliance on simulation in particular) is mirrored in classical methods.
The idea of “thinking of any statistical procedures as not just producing a one‑time estimate but also as a mapping from data to inferences.” That strikes me as a good way to de-throne any statistical procedures that are getting too big for their boots.
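The last highlight, treating a procedure as a mapping from data to inferences, lends itself to a quick simulation: feed the procedure many datasets where the truth is known and see how often its inferences are right. The sketch below is an illustration under stated assumptions, not something taken from the article. The helper names (`mean_interval`, `coverage`) are hypothetical, and the procedure under test is a plain normal-approximation 95% interval for a mean.

```python
import numpy as np

def mean_interval(sample, z=1.96):
    """The procedure: maps one dataset to an interval estimate for the mean."""
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    return m - z * se, m + z * se

def coverage(procedure, simulate, truth, n_sims=2000, seed=0):
    """Apply the procedure to many simulated datasets and report the
    fraction of intervals that contain the known true value."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        lo, hi = procedure(simulate(rng))
        hits += int(lo <= truth <= hi)
    return hits / n_sims

# Well-behaved case: normal data, moderate sample size.
normal_cov = coverage(
    mean_interval,
    lambda rng: rng.normal(0.0, 1.0, size=50),
    truth=0.0,
)

# Awkward case: heavily skewed data, small sample.
# The mean of a lognormal(mu, sigma) is exp(mu + sigma**2 / 2).
skewed_cov = coverage(
    mean_interval,
    lambda rng: rng.lognormal(0.0, 1.5, size=15),
    truth=np.exp(1.5 ** 2 / 2),
)

print(f"coverage on normal data: {normal_cov:.3f}")
print(f"coverage on skewed data: {skewed_cov:.3f}")
```

One would expect the same "95%" procedure to cover the truth noticeably less often in the skewed case, which is exactly the kind of thing you only see once you judge the mapping rather than a single estimate.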

The complete process is explored more fully in a further article, Predictability–Computability-Stability workflow for veridical data science in the age of artificial intelligence (and there are others in this special edition which I have yet to read).

Admittedly the full workflow is a lot of extra work, but the pay-off is surely in the feeling of having done proper science, knowing that what, if anything, you have uncovered is real knowledge, rather than the bilge described in this month’s dunghill.

Semi-supervised

Let’s put to one side the possibility that you might now be using an agent to write your emails, reports, presentations, etc. (a risky move, in my opinion, as we haven’t yet decided the etiquette on this) and consider the problem of how to write well as a data scientist. By this I mean everything from a one-line message on Slack to a formal write-up of your work for publication. And since this is a skill that, far more than any mathematical ability, can hold you back or propel you forward in your career, it is worth more than a few moments of your time.

There are of course countless books on how to write good technical prose, the best of which I covered in the white stuff of Glasseye #17. So rather than regurgitate something that you can find covered comprehensively elsewhere, I’m going to give you some very specific tips, tailored to your role. In any case, this advice quite often runs counter to the more general advice. Unless you are pitching an idea, your goal is not to grab and hold someone’s attention; it is clear, information-rich communication.

Beware the pronoun: It’s so useful in everyday, context-rich conversation that it might seem like it can do no wrong, but remove the context or mishandle the scope and the reader is lost. “It’s fixed.” What’s fixed? “They fixed it.” Who fixed what, goddammit? This problem with pronouns extends to all ambiguous expressions that are narrowed down only in the head of the writer. “I found it in the code we looked at earlier.” But we looked at pages and pages of code earlier, and what counts as earlier? Earlier today? This week? This year?
Examples, examples, examples: They are the single most effective tool you have for clarifying your meaning. A thousand ambiguities are resolved in one stroke. So why are we bad at giving them? Several reasons. First, it’s hard work to turn your fluffy, abstract notions into something concrete. But hard for good reason: when formulating an example you are doing the real thinking, putting flesh onto the bone. Second, abstractions somehow feel more grown-up; examples, in contrast, a little childish - as though by giving an example we are admitting that we aren’t up to using the big words. Third, we underestimate the reader’s ability to generalise, worrying that giving an example might create the impression that this was all we meant. Give them some credit.
Avoid abstractions where possible: Steven Pinker and others have called them “zombie nouns”. Technology excels at producing abstract nouns, designed to be vague to the point of meaninglessness. Think of “platforms”, “frameworks”, “hubs”, “engagement”, “touchpoints”.
Don’t spare us the details: Include all the context you can think of and then some more. I would much rather you told me everything (without repeating yourself) and left it for me to skim the parts I know already. Writing is not talking: if you are boring me I can skip ahead. Which leads nicely to…
Verbosity is not a sin, unintelligibility is: The poor reader can waste more time attempting to decode a single badly phrased sentence than it would ever have taken to skim through a long but lucid paragraph. The notion that everything must be reduced to a headline or a bullet point comes from the newsroom and the advertising agency. But they are doing something entirely different with words.
If you can code well, you can write well: Think about it - so much is the same: you need to declare your terms upfront; break down complex ideas into easily understandable steps; organise a topic into self-contained but interlinked themes. And you wouldn’t dream of putting your first draft of code into production, so why press send without reviewing your prose?
Reread it from the reader’s perspective: Search for ambiguities - could they think I mean something else? - be paranoid; use the bastard check; anticipate objections, questions, the causing of offence.
Make it work for you: Rhetoric is not just for politicians. Language is a weapon. Use it to win arguments without even having them, to protect yourself, to get what you want.

Lastly, there’s no reason why you should tolerate bad writing from others, even your boss. If it’s meaningless blab, send it back, asking for clarification. Politely, of course.

Please do send me your questions and work dilemmas. You can DM me on Substack or email me at simon@coppelia.io.

From Coppelia

Coppelia is now offering workshops (half or full-day) on the proper (i.e. efficient, effective, and robust) use of coding agents in data science. It feels like we’ve got to that stage now. Let me know if you are interested.
I can report that Claude Code combined with Quarto and Python makes a formidable team. My workflow is a) do the analysis/modelling in Python, using the agent for the grunt work, b) write the report in markdown in the Quarto file, c) pull the whole thing together by using Claude Code to populate the code sections in the Quarto doc based on the previous Python work, instructing it to add various prettifying formatting tweaks.

If you’ve enjoyed this newsletter or have any other feedback, please leave a comment.
