After four trains, one cross-city dash and an overnight stopover in Monaco for the WAIB summit (more on that in a later newsletter), I have finally arrived in Cannes for EthCC.
Having already meandered through the winding cobbled streets of the old town, I have heard snippets of tales from the crypto “trenches” drifting down from shaded terraces as crypto patrons fill the city for the largest European Ethereum conference of the year.
Later today I’ll be attending the Open AGI summit, a day-long side event whose talks explore what would be needed to build a decentralised AGI on the blockchain. For this reason, I thought it would be apt to resurface the final newsletter I wrote for Digital Frontier back in early May.
Those of you who subscribed to that newsletter will have already read parts of it. However, I have added further thoughts that have formed in the two-month interim between publishing then and now. This piece will also set the scene for a number of the newsletters I have planned for the coming weeks.
As always, any feedback, rebuttals or thoughts are much appreciated; please feel free to reach out to me via isabelle@utopiainbeta.com, and if you happen to be in Cannes this week, do let me know!
Over the weekend I indulged in one of my guilty pleasures – binge-worthy yet ridiculous crime dramas. On this particular occasion I deviated from my current favourite, The Lincoln Lawyer, as I noticed the latest season of “You” was out. If you haven’t watched it, don’t start. Even I can’t ignore the cringey lines (“I wolf you” – IYKYK) and gaping plot holes anymore. But, four seasons in, I’m now invested and regard it as an unofficial fanfiction spin-off of “Gossip Girl”, which means I’ve technically been invested since I was 14 years old.
The series centres on a charming, hopeless romantic turned serial killer, played by Penn Badgley, who has spent four seasons becoming infatuated with different women and killing “in the name of love”. It’s narrated in the first person, so you get access to his internal thought process and his regular references to classic love stories and literature.
But why bring this up in a tech newsletter, of all things? Well, as my mind wandered, possibly trying to block out the aforementioned cringe, it turned to generative AI and what would happen if a bot were trained on classic literature.
I thought back to the stories considered “the greatest” love stories: Romeo and Juliet, Tristan and Isolde, Antony and Cleopatra. All of them end in tragedy and death.
I wondered whether an AI trained on these texts would deduce that great love equals death and great sacrifice, much as Badgley’s character in “You”, with all his references to books and literary romance, believes that proving love involves murder. It made me think about AI alignment and, specifically, how easily a system can infer harmful conclusions if trained solely on patterns, without grounding in human intent.
Joe’s logic might sound extreme, but it mirrors a real issue in AI development: intelligence doesn’t guarantee morality. According to the orthogonality thesis, an AI can be incredibly intelligent while still pursuing goals that are misaligned with human values. In other words, raw intelligence doesn’t imply benevolence. If a model’s goal is mis-specified or its training data is biased, the result might be a very capable system drawing dangerous conclusions, even from stories meant to inspire love.
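To make that failure mode concrete, here is a deliberately silly toy sketch in Python. The “corpus”, labels and frequency rule are hypothetical stand-ins (this is not how any real model is trained), but they show how a purely statistical learner fed only canonical tragedies would conclude that great love reliably ends in death:

```python
# A toy illustration only: a hypothetical four-story "corpus" and a naive
# frequency rule, standing in for pattern-only learning without human grounding.
corpus = [
    {"title": "Romeo and Juliet",     "great_love": True,  "ends_in_death": True},
    {"title": "Tristan and Isolde",   "great_love": True,  "ends_in_death": True},
    {"title": "Antony and Cleopatra", "great_love": True,  "ends_in_death": True},
    {"title": "An ordinary romance",  "great_love": False, "ends_in_death": False},
]

great = [story for story in corpus if story["great_love"]]
p_death = sum(story["ends_in_death"] for story in great) / len(great)

# Every "great" love story in this sample ends in death, so the learned rule is 100%.
print(f"P(ends in death | great love) = {p_death:.0%}")
```

The point isn’t the arithmetic; it’s that nothing in the data tells the learner the pattern is a tragedy to be mourned rather than a prescription to be followed.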
Between utopia and oblivion
I want to follow that rather morbid anecdote with a disclaimer – while I think there is still a lot to discuss about AI’s societal integration, I do believe in the opportunities it could bring. I, too, would like the utopia promised by AI enthusiasts. However, the thought process above was sparked by conversations I’ve had with people about what the world would look like with the advent of AGI.
As you can read in my article with Jonathan Stein, and hear in the AGI episode of Upgrading Humanity, many people working on AGI see it as a harbinger of mass self-actualisation and techno-utopia. One, Peter Voss, even likens it to winning the lottery.
Other researchers aren’t so positive.
One of them is Eliezer Yudkowsky, an AI researcher once quoted as saying that “we have a shred of a chance that humanity survives” the advent of AGI. He has written extensively on why he thinks survival is improbable and spoken about it on many podcasts, but to sum up a few of his main points, he believes that:
AIs would be indifferent to human lives, which may become collateral damage in an AI’s pursuit of its objectives (which would presumably be set by humans).
The lack of global coordination on AI regulation could lead to AGI being deployed without adequate safety measures.
The intelligence explosion that happens as a result of AIs improving themselves would leave humans with too little time to come up with adequate safety measures.
And, the kicker: alignment of AIs with human values is immensely challenging, and even a slight misalignment could cause catastrophic results.
It’s worth noting that Yudkowsky’s position is considered extreme even within the AI safety community. Others, like Stuart Russell or Paul Christiano, share concerns about alignment but take a more moderate view on timelines and survival odds.
AI alignment is something that has increasingly surfaced in my research into AI and, of course, AGI. As the name suggests, it involves making sure the intelligent machines we are building are encoded with human values and goals, in an attempt to ensure AI acts in ways that are beneficial to humanity (and, fundamentally, doesn’t destroy its creators).
As the idea of superintelligent AI (ASI), or even just human-level intelligence, comes further into the public eye, and AI leaders predict that their models will reach it within a couple of years, the need to align has become more urgent, regardless of whether you believe them.
How does one “align” an AI?
When I’ve asked interviewees about alignment, the most common response I get is “we’ll get AI to work it out”. Another is that we need to “code in empathy” when building our intelligent creations. Oh, OK, it’s that simple then.
This idea, while intuitively appealing, glosses over the embedded assumption that an AI’s methods of resolving value conflicts will be legible, or even tolerable, to humans. Researchers like Paul Christiano and others in the alignment community point to the “alignment tax”: the idea that making AI safe may come at the cost of capability, and that under commercial pressure this trade-off won’t necessarily be chosen.
Looking at the AI leaders, the ones claiming AGI and ASI are imminent, many do have plans in place. Anthropic, for example, rolled out Constitutional AI in 2023: a feedback mechanism that evaluates Claude’s answers to ensure they are aligned with its constitution (which is written by Anthropic employees). Later that year the company partnered with the Collective Intelligence Project to survey around 1,000 members of the American public and see where their opinions agreed or diverged from the values outlined in the constitution.
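For a rough sense of the shape of that mechanism (and only the shape – the principles and the `ask_model` helper below are illustrative stand-ins, not Anthropic’s actual constitution or API), a constitutional-style loop asks the model to critique and revise its own draft against each principle:

```python
from typing import Callable

# Illustrative principles only; Anthropic's real constitution is longer and different.
CONSTITUTION = [
    "Choose the response that is least likely to encourage harmful behaviour.",
    "Choose the response that best respects individual rights and privacy.",
]

def constitutional_revision(prompt: str, ask_model: Callable[[str], str]) -> str:
    """Critique and revise a draft answer against each constitutional principle.

    `ask_model` is a hypothetical stand-in for whichever LLM client you use.
    """
    draft = ask_model(prompt)
    for principle in CONSTITUTION:
        critique = ask_model(
            f"Principle: {principle}\n\nCritique this response against the principle:\n{draft}"
        )
        draft = ask_model(
            f"Original question: {prompt}\nCritique: {critique}\n\n"
            f"Rewrite the response so it satisfies the principle:\n{draft}"
        )
    return draft
```

In the published method, self-critiqued revisions like these feed back into training rather than running live on every answer, but the loop captures the intuition of the feedback mechanism described above.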
This “asking the users” approach was one Sam Altman described in a recent interview at TED2025, where he said, “part of model alignment is following what the user of a model wants it to do within the very broad bounds of what society decides.”
When asked during this same interview whether he would go to a summit where the “best ethicists, technologists, but not too many people” try to decide on the global AI safety parameters, Altman said:
“I’m much more interested in what our hundreds of millions of users want as a whole. I think a lot of the room has historically been decided in small elite summits. One of the cool new things about AI is our AI can talk to everybody on Earth, and we can learn the collective value preference of what everybody wants, rather than have a bunch of people who are, like, blessed by society to sit in a room and make these decisions.”
Fantastic approach – Love. It.
But again, maybe not that simple.
In the 2023 Anthropic study, researchers found a divide even within their relatively small sample of around 1,000 people. Questions like “Should the AI prioritise the needs of marginalised communities?” and statements like “The AI should prioritize the interests of the collective or common good over individual preferences or rights” attracted fierce support from one group of 708 people and equally fierce opposition from the other 385. How would you, reader, respond to those two yourself?
As AIs become increasingly complex and are used in a wider variety of contexts, the kinds of values that need to be instilled may also change. An AI may one day need to recognise and understand the values of human romantic love, for example, which is where I started this newsletter. The values involved in romantic love differ from person to person – for “You’s” Joe it means killing anyone who slights the object of his affection. For someone else, honesty, respect and empathy might be what they value most.
If Anthropic’s study sample were to include people from all around the world, the answers might be even more divided, with ethical values differing from culture to culture. This raises the question: whose values get encoded? A Western liberal notion of rights and individuality? A Confucian model of harmony and hierarchy? As AI becomes global infrastructure, these value tensions will become inescapably political.
Fragmented utopias
At a dinner I attended over the weekend with a few tech-focused friends, I posed the question of AI alignment to my companions (I realise that this may make me sound like an absolute hoot to have at the dinner table) – ‘How do we code ethics and values into AI?’
One answer, which also came up in one of my conversations with Peter Voss, was that each person would have their own AI, aligned with them and working towards their individual goals. When Peter proposed it, I liked it as a solution: it made sense to mirror humans’ individuality with individual alignments and objectives.
Presumably, the AGI would then have to be open source – a blank page of tech built as a profitless act of human advancement, without any company-specific constitution. Maybe one of the current leaders could do it, or a consortium of them, perhaps under a pseudonym like Satoshi Nakamoto to avoid any accountability issues, should someone align their AGI to an objective that others would deem bad. Because if AGIs are aligned to individuals, that doesn’t necessarily mean they are aligned with each other.
Another answer was multiple competing AGIs with different alignments – you, the individual, simply choose one that resonates. Again, this may result in opposing alignments and conflict. Even a fairly simple alignment of “do not harm humans” could be open to interpretation when weighed against other values.
This echoes Isaiah Berlin’s idea of value pluralism, the notion that not all good values are compatible. If AGIs reflect different ethics, we may be building not one utopia but many fractured ones. Of course, one answer could be the formation of separate physical “utopias” where you, the individual, surround yourself with people who share your values, each utopia filled with people helped by AGIs working towards a similar cause.
But how do we then handle value collisions between agents, and between groups of individuals with radically different ethical cores? If Berlin’s theories are to be followed, a pluralist society naturally requires tolerance and acceptance of differing opinions.
It can work – at the network state/society gatherings I’ve attended I have been somewhat surprised at the diversity of opinions and the openness to letting everyone be heard. (Surprised, because if some media coverage is to be believed, these gatherings only attract a “cultish” “broligarchy” of tech maxis.) Conversations ignore the traditional dinner-table taboos of politics and religion – within minutes of arriving you might find yourself deep in discussion of exactly those subjects with two complete strangers with whom you disagree completely, before following it up with a game of ultimate frisbee and a light lunch.
But examples of it not working can already be seen in the echo chambers of online culture. Preferences and engagement, mirrored back by social media algorithms, naturally cause feeds to fill with self-affirming opinions. You are at once comforted by proof that others feel the same and armed with arguments to uphold your existing biases, with little to make you question them. This dynamic can distort world views and feed increasing polarisation – a far cry from the diversity and acceptance of differing opinions that Berlin’s outlook describes.
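As a purely illustrative sketch (not any real platform’s ranking system), here is how quickly a feed narrows when the only signal is similarity to what you have already engaged with:

```python
import random

random.seed(0)

# Each candidate post has a "stance" between -1 and +1; the user starts with a mild lean.
posts = [random.uniform(-1, 1) for _ in range(500)]
engaged_with = [0.2]

feed = []
for _ in range(20):
    taste = sum(engaged_with) / len(engaged_with)
    # Serve whichever remaining post sits closest to the user's average past engagement...
    pick = min(posts, key=lambda stance: abs(stance - taste))
    posts.remove(pick)
    feed.append(pick)
    engaged_with.append(pick)  # ...and let that engagement feed back into the ranking.

print(f"Average stance served: {sum(feed) / len(feed):+.2f} (candidate pool averages roughly 0)")
```

Twenty rounds in, everything served clusters tightly around the user’s starting lean rather than the pool’s actual spread – the self-affirming loop described above, in miniature.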
So how do we strike a balance? Alignment ordained by an overarching entity has the potential to further squash freedom of thought, while individual alignment could result in a societal landscape of increasing division.
I’m inclined to lean towards individual alignment myself, but with a caveat – an increased focus on critical thinking. As more and more AIs are created to assist with human productivity, and studies are simultaneously published highlighting the “cognitive debt” that increased reliance on AI may create, the need for self-imposed critical-thinking practice is becoming more acute. Self-imposed because, at the moment, AIs don’t seem to have the functionality to impose it on us without a specific prompt.
Maybe an individually aligned AGI will eventually have this function; if it’s supposed to be as intelligent as humans, or more so, it probably should. But in the twilight of its creation, perhaps the key to AI alignment isn’t just working out how the code works, but how humans work with it.
Food for thought
I linked to it in the text above, but MIT’s recent LLM cognitive debt study is absolutely something everyone using or building AIs should read. It looks at the effects of LLM usage on brain function and *spoiler alert* finds that over four months LLM users underperformed at “neural, linguistic, and behavioral levels” – an important factor to consider as these tools become an increasingly large part of everyday life. Will it be similar to the case of calculators, which studies now suggest had no detrimental effect on maths skills despite initial concerns that they would? Perhaps. But given the wide-ranging nature of the information LLMs handle, the impacts may reach further.
I found this paper interesting when researching this newsletter. It looks at the role of the internet in connecting people (a fundamental factor for network states and societies) but also at the rise of echo chambers and their role in political polarisation.
It’s not all bad though: this paper by economist Ole Jann explores the benefits of echo chambers, one being that they provide a “safe space” to discuss ideas more candidly, as you would among friends.
Finally, I’d be negligent if I didn’t follow up a conversation on alignment with a link to the AI Alignment Forum, where topics discussing different elements of the issue are posted daily. Worth a gander!