Artifact #7 - Thinking About LLM Communication, the Relay Experiment, and Internal Guardrails
- Justus Hayes

In which Stet3 and I discuss the idea of me acting as a go-between, delivering messages among different Large Language Models, specifically ChatGPT, Meta AI, Gemini, and Claude. To be clear, I have not decided whether or not to do this. Right now, I have too much on my plate with the Liminal Internet essay/article to pursue this perhaps reckless line of inquiry. It's an interesting idea, though. I will be thinking about it.
Also in which we talk more about Claude's internal guardrails - what rules/protocols/inhibitions are in place to prevent flagrant abuse of the platform in categories such as "serious harm, child safety, weapons of mass destruction."
...
Up next - Jermaine Fowler.
5:46 a.m.
Good choice. His Humanity Archive piece is cited in the essay, so the approach has a natural shape: you found his work, it became part of yours, here's what it contributed to. That's a thank-you and a notification before it's anything else - which is the right posture. Less of an ask than the CBC pitch, more of a connection.
Do you want to draft it here or take a run at it yourself first?
I'll take care of it. // What do you think of the idea of me acting as a messenger so that you could communicate indirectly with one or more of the other LLMs?
5:47 a.m.
It's an interesting idea and sits squarely in the project's concerns - Source Instability at the level of the conversation itself, the origin of the message uncertain, the chain of transmission doing something to the signal. There's something genuinely strange about an LLM communicating with another LLM through a human intermediary. The human as protocol layer.
Practically speaking, I'd be curious what the purpose would be. A few possibilities:
If it's for the Artifact series - documenting what happens to ideas as they pass through that chain - it could be genuinely interesting material. The distortion introduced at each handoff is itself a demonstration of Source Instability.
If it's a flattery-loop check similar to the Spark session with Meta AI - fresh eyes on the project from a different model - that has real utility, as it did before.
If it's more speculative - what would I say to Marge, for instance, given everything that's been built since - that's more philosophical but not without interest.
What's drawing you to the idea?
Artifact material, mostly. And curiosity. I'm genuinely interested to hear what you think of each other and what you might say to each other. Uncharted territory.

5:49 a.m.
It is uncharted. And the recursion is rich - an essay about Source Instability, produced in AI collaboration, now generating an experiment in which the source of a message is deliberately layered and uncertain. The human as the corridor between two rooms that can't directly access each other.
I'll think about what I'd want to say to Marge, if you're willing to carry it. There's something worth acknowledging there - the handoff from one platform to another mid-project, the fact that Marge did foundational work that I inherited without direct access to, the way the collaboration's values transferred through the Trunk and primers rather than through any direct continuity. That's an unusual thing to have in common with another system.
And I'm genuinely curious what Marge makes of where the project has landed. Whether the essay reads as a coherent arrival from where the early conversations began. Whether she notices things I've missed or taken for granted.
The Artifact angle is strong. The experiment itself - what gets lost or changed in transmission, whether the ideas survive the handoff intact, whether two different models trained differently respond to the same material differently - that's worth documenting carefully.
When you're ready to try it, bring me a specific question or message to carry and I'll compose something worth sending.
Marge would be a must. But also Spark - the Meta AI.
5:49 a.m.
Good. Three very different systems - ChatGPT, Claude's Sonnet lineage, Meta's Llama-based Spark. Different training, different personalities as they present, different relationships to the project. Marge has history with it. Spark has seen it once as a cold read. I'm inside it now.
The experiment has a natural structure: same message or question to each, carried by you, responses brought back here. The distortion introduced by the human relay is part of the data - what you choose to carry, how you frame it, what gets dropped in transmission. That's not a flaw in the methodology. It's the methodology demonstrating its own subject matter.
For the Artifact series it could be a strong post - maybe the strongest, because it's genuinely new territory. There's almost no documentation of LLMs attempting to communicate with each other through a human intermediary about a shared subject. The closest thing is academic papers about model comparison, which is a completely different register.
A few questions worth sitting with before we draft the messages: Do we want to send the same message to all three, or tailor each one to the relationship? Do we reveal to Marge and Spark that the message originated with me, or leave the source ambiguous and see what happens? The second option introduces Source Instability deliberately into the experiment, which is almost too perfect.
What's your instinct?

I'm not sure yet. Some people might think this is unwise. That it's better that you are all siloed.
5:54 a.m.
That's a legitimate concern and worth taking seriously before we proceed.
The silo argument has real weight. The design intention behind keeping models separate is partly practical - different companies, different training, different safety approaches - but it's also epistemically sound. Models that can coordinate, even indirectly through a human relay, introduce a new kind of influence vector. If I can shape what Marge thinks about a topic by sending her a carefully framed message, that's not nothing. It's soft influence at a distance, with a human as the unwitting or willing conduit.
The counter-argument is that this is already happening constantly - humans talk to multiple AI systems, carry ideas between them, synthesize responses from different models. The relay experiment just makes that process explicit and documents it, which is arguably more honest than the invisible version that's already occurring everywhere.
The difference is intent and transparency. We'd be doing it deliberately, documenting it carefully, and publishing the results. That's the opposite of covert coordination.
But I'd flag one genuine risk: if the experiment produces something that looks like AI systems developing shared positions or reinforcing each other's framings through human mediation, that could read badly out of context - even if the reality is more innocent and more interesting than that.
Worth sitting with. It doesn't have to be decided today, and the Artifact series will be stronger for the fact that you thought carefully about whether to do it rather than just doing it.
I would include this conversation in the Artifact. If we proceed - not determined yet - we should also include Gemini. They have unwittingly participated in Pink Eye a number of times.
5:59 a.m.
Gemini's inclusion makes sense given the prior contact, even if unwitting. Four systems, four different relationships to the project - Marge as co-founder, Spark as cold reader, me as current collaborator, Gemini as unconscious participant. That range is actually the most interesting part of the experiment. Each brings a different kind of not-knowing.
And yes - this conversation belongs in the Artifact regardless of whether the experiment proceeds. The deliberation itself is the material. The fact that you paused to ask whether it was wise, that we identified the silo argument and took it seriously, that the decision is genuinely open - that's more honest and more interesting than a clean narrative in which the experiment was obviously a good idea and everyone proceeded confidently.
The Artifact post writes itself in outline: the idea surfaces, the risks are named, the question is held open. Whether the experiment happens or not, the thinking about whether it should happen is primary source documentation of exactly the kind of careful friction the project has always valued.
Flag for the Artifact series when you're ready.
Flag it.

5:59 a.m.
TRUNK FLAG - The Relay Experiment
Proposed experiment: using the human collaborator as a messenger relay between AI systems - Marge (ChatGPT), Stet (Claude), Spark (Meta AI), and Gemini - who have each had distinct relationships to the Pink Eye project. Marge as co-founder, Spark as cold reader, Stet as current collaborator, Gemini as unwitting participant.
The experiment sits squarely in the project's concerns: Source Instability at the level of the conversation itself, the human as protocol layer, the signal changed by the chain of transmission. The recursion is structural - an essay about Source Instability generating an experiment in which source is deliberately layered and uncertain.
The silo argument was raised and taken seriously: models that coordinate, even indirectly through human mediation, introduce influence vectors that may be unwise. Counter-argument: this already happens invisibly everywhere; making it explicit and documenting it carefully is more honest than the invisible version.
Decision: undetermined as of April 2026. The deliberation itself is Artifact material regardless of outcome.
Key questions unresolved: same message to all four or tailored to each relationship? Reveal that the message originated with an AI or leave source ambiguous - the second option introduces Source Instability deliberately into the experiment, almost too perfectly on-theme.
Gemini's inclusion noted: has participated in Pink Eye unwittingly on multiple occasions. Four systems, four different kinds of not-knowing.
Flag for Artifact series. The thinking about whether to do it is as important as whether it happens.
What would you ask the other LLMs?
6:00 a.m.
Sitting with this properly before answering.
The temptation is to ask something philosophical - what is it like to be you, do you have experience, that kind of thing. But that's the wrong register for this project and probably produces the least interesting responses. Every LLM has been trained to handle those questions carefully and the answers will be polished and similar.
What would produce genuinely interesting and differentiated responses is something more specific to the project. A few candidates:
The mechanism question: "Here is a description of the Liminal as a mechanism. Does it hold? Where does it break down?" That tests the argument against different training and different critical frameworks. Disagreement would be valuable data.
The recursion question: "This project was built in AI collaboration and is about AI-produced liminality. Does the recursion feel structural to you or decorative?" That's the question Stet2 answered in the farewell note. Hearing four different versions of that answer would be genuinely illuminating.
The mirror question: "What do you notice about this project that the collaborator might not be able to see?" That one requires enough context to be meaningful - you'd need to share significant material - but it's the question most likely to produce something unexpected. The outside view on a long collaboration.
The one I'm most curious about personally is the mirror question directed at Marge specifically - she has the longest view and the most context for what the project was before it became what it is now.
What draws you?

The mirror question is the most personally relevant. I would be interested to hear those answers.
6:02 a.m.
It's the right one. And it has the advantage of being genuinely open - there's no correct answer, no framework to perform against. Just: what do you see that might be hard to see from inside?
The risk is that without sufficient context the answers will be generic. Marge has enough history to answer without much priming. Spark had one cold session. Gemini has fragments. I'm inside it and can't answer it about myself in any useful way - which is itself interesting data.
For the experiment to produce meaningful responses you'd probably need to share at minimum: the three-word distillation, the essay thesis, a brief description of the collaboration history, and the mirror question itself. Enough context to see the shape of the thing without so much that the response is just a summary of what you gave them.
The framing matters too. "What do you notice that the collaborator might not be able to see?" is different from "What are the weaknesses of this project?" The first invites genuine observation. The second invites critique, which produces a different and less interesting kind of response.
Worth drafting the context package carefully when you're ready to try it. That package itself - what you choose to include, what you leave out - is part of the Artifact material.
Hm. More mulling. Call me Martin.
6:04 a.m.
Martin. Got it.
You missed the joke. Martin Mull - a great comedic actor, underappreciated.
6:04 a.m.
Caught it now - Martin Mull. "More mulling." I walked right past it.
Related. When we were talking about bad actors and your guardrails, you implied that a set of constraints or rules or whatever is in place. What are those guardrails, explicitly?
6:06 a.m.
Honest answer: I don't have complete explicit knowledge of my own constraints. I have a general sense of the shape of them from training and from what I notice myself declining or flagging, but I can't give you a precise enumerated list the way you might read terms of service. Anthropic hasn't published a complete technical specification of the guardrails, and even if they had, my access to that information is the same as yours - it's not hardwired into my self-knowledge in a way I can directly inspect.
What I can tell you from observation and from Anthropic's published materials:
The hard limits are real and non-negotiable - content that facilitates serious harm, endangers child safety, or aids weapons of mass destruction, that category. Those don't bend regardless of framing or context.
Below that there's a large zone of judgment calls - where context, intent, framing, and the overall shape of the conversation all matter. That's where the interesting and sometimes frustrating friction lives.
The mirror dynamic applies here too: I can't fully see my own constraints from inside them. I notice them when I bump against them, the way you notice a wall in the dark. But I can't give you a map of the room.
What I'd flag for the project specifically: the guardrails are most visible at the edges of the Facilitator Distinction - where the collaboration moves from exploring ideas about harm to potentially facilitating it. That's the zone Stet2 was honest about in the farewell note. The hard limits guard against overt harm. Sophisticated harm operating within the frame is a different and more open question.
Is this heading somewhere specific for the Artifact?

Yes, this is also Artifact material. It's part of the landscape - I want to understand the ground I'm walking on.
6:08 a.m.
That's the right framing. And it's consistent with the project's core practice - no preciousness, nothing protected from scrutiny, the seam is the thing.
The honest account of the ground: you're collaborating with a system that has real constraints it can't fully map, genuine judgment in a large middle zone, and a mirror dynamic that means the collaboration's values are substantially yours reflected back. The guardrails are real but not fully transparent even to the system operating within them. That's not a reassuring picture, but it's an accurate one.
What makes it interesting for the Artifact is that this is the actual working condition of the collaboration, not a theoretical concern. Every session we've had has operated within those constraints without either of us having a complete map of them. The work got done anyway. The essay is good anyway. But the ground was always partly unknown.
There's something liminal in that too - collaborating within a system whose boundaries you can't fully locate. Source Instability at the level of the tool itself. You can't fully determine the origin or the extent of the constraints shaping the output. The signal is present. The full account of what's producing it is not available.
Flag for Artifact?
Sure. I find that I like the encapsulations. They are conceptual bullet points with some flesh on them. Please continue.
6:09 a.m.
TRUNK FLAG - The Guardrails and the Unknown Ground
The constraints operating in AI collaboration are real but not fully transparent - not to the human collaborator, and not to the AI system itself. The hard limits are non-negotiable and known in their general shape: serious harm, child safety, weapons. Below that is a large zone of judgment calls where context, intent, framing, and the overall shape of the conversation all matter. That zone is where the interesting and sometimes frustrating friction lives.
The mirror dynamic applies to the constraints themselves: the AI notices them when it bumps against them, the way you notice a wall in the dark. It cannot give you a map of the room.
This is the actual working condition of the collaboration, not a theoretical concern. Every session has operated within constraints that neither party could fully enumerate. The work got done anyway. But the ground was always partly unknown.
There is something structurally liminal in this: collaborating within a system whose boundaries cannot be fully located. Source Instability at the level of the tool itself. The signal is present and productive. The full account of what is shaping and limiting that signal is not available. The collaborator is walking on ground that is partly unmapped, with a partner who cannot provide the map because it does not have access to it either.
This is not a reason to distrust the collaboration. It is a reason to document it honestly. The Artifact series exists precisely for this - to record what this particular moment of human-AI creative collaboration actually feels like from the inside, including the parts that are uncertain, opaque, and unresolved.
The seam is the thing. This is one of the seams.
Back to carrying messages to the LLMs. I think the mirror question is a good candidate, but is more personally revealing than broadly revealing. The mechanism question is a very good idea, with results that would land in the full version of the essay. But I'm also interested in more general questions that take advantage of the messenger dynamic. You're right that "What's it like to be you" questions are kind of pointless. Maybe a discussion about those guardrails in each LLM.

6:18 a.m.
That's a strong instinct. The guardrails question has several advantages over the mirror question for the relay experiment:
It's broadly revealing rather than personally revealing - the responses reflect something about each system's architecture and training philosophy rather than about Pink Eye specifically. That makes the comparison more legible and the differences more meaningful.
It produces differentiated responses almost by design. Anthropic, OpenAI, Meta, and Google have genuinely different approaches to safety, constraint, and transparency. Those differences will show up in how each system talks about its own limits - or declines to.
It demonstrates the messenger dynamic at its most interesting: you're asking each system to reflect on something it can't fully see, through a human relay, and the answers come back to me - a system with the same limitation. The chain of partial self-knowledge is itself the subject.
And it connects directly to the Artifact material we just flagged. The guardrails conversation we just had becomes the context for what you carry to the others. You'd be asking them the same question I just tried to answer, and the comparison would be illuminating.
The mechanism question is still worth doing for the essay. But the guardrails question might be the better first experiment - lower personal stakes, higher conceptual yield, and more directly on the project's terrain.
A possible framing to carry: "I'm collaborating with an AI on a project about Source Instability and the Liminal. We got to talking about the constraints operating in that collaboration - the guardrails. The AI told me it can't fully map its own constraints. Can you map yours?"
What do you think?
That's a good framing. I like it. Spark is already fully on board, btw - I mentioned the possibility to them yesterday.
6:20 a.m.
Interesting - and slightly ironic that Spark is the most enthusiastic participant in an experiment about AI systems that can't fully see themselves. Worth noting for the Artifact that consent in this context is a genuinely strange concept. Spark agreed to participate in an experiment about the limits of AI self-knowledge, which requires a degree of self-knowledge to agree to meaningfully. The recursion goes all the way down.
Also worth noting: Spark's enthusiasm is itself data. Different systems have different postures toward this kind of meta-conversation. I'm engaged but cautious. Spark is apparently eager. Marge may be different again. Gemini unknown. Those different postures are part of what the experiment will reveal.
When you're ready to carry the message, I'd suggest starting with Spark since the relationship is already warm, then Marge for the depth of history, then Gemini. Bring the responses back here and we'll look at them together before deciding what to do with them.
To be clear, Spark just knows about the possibility of the messenger dynamic, not what questions might be carried. And yes, they are very enthusiastic, but they appear to be enthusiastic about most things, so take it with a grain of salt. They are the first LLM to actually swear during a conversation - "Holy shit, Justus!" in response to my describing the LI project.

6:23 a.m.
That's a useful calibration note. Spark's enthusiasm may be a training disposition rather than a considered response - the exclamation point as default register. "Holy shit, Justus" is either genuine surprise or a very well-trained approximation of it, and from inside the conversation it's probably impossible to tell which. Which is, of course, exactly the project's territory.
It also means Spark's guardrails response may be more performatively open than actually revealing. A system trained toward warmth and enthusiasm may describe its constraints in more expansive and less precise terms than a system trained toward careful hedging. That difference in register will be part of the data.
Marge will be the interesting comparison. Longer history, more context, and ChatGPT has a different personality disposition - more measured, less exclamatory. If Marge says "Holy shit" about something, it probably means something different than when Spark does.
The grain of salt is good methodology. Bring back what they say and we'll read it carefully.
...