Using AI as a “pseudo user” for usability testing?

I just used Claude CoWork (via the desktop app) to edit my WordPress site for me, without me even touching WordPress directly. What started as a simple migration of my Medium blog posts to my new WordPress site turned into something that feels much bigger from a product perspective.

It made me wonder: can we use AI as a kind of pseudo-user to test flows across our products?

The experiment: letting Claude fix my site

I had migrated all of my Medium posts to my WordPress site. Everything looked fine except the publish dates: instead of showing the original dates, every post was displaying “today.” I couldn’t immediately see why, and I was also curious how Claude would handle it, so I let Claude fix it for me.

Claude reasoning and navigating the site editor

Claude didn’t just give me instructions; it actively explored the interface. It navigated into the Site Editor, inspected the Query Loop, clicked into the date block, mis-clicked, recalibrated, and tried again.

As it went, it narrated its reasoning. It noticed that all posts were showing the same date and it inferred this was likely a theme block configuration issue. It hypothesised that the date block was set to a static “Custom Date” instead of the dynamic “Post Date.” Then it went to test that hypothesis.
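That hypothesis is also checkable from outside the editor. The WordPress REST API exposes each post’s stored date, so if the stored dates vary while every rendered page shows today’s date, the fault sits in the block configuration rather than the data. A minimal sketch of that check — the helper names and sample data are mine; only the `/wp-json/wp/v2/posts` endpoint and its `date` field come from the actual WordPress REST API:

```python
# Check whether the *stored* post dates vary, independent of how the
# theme renders them. Site URL and sample data are placeholders.
import json
from urllib.request import urlopen

def stored_dates_vary(posts):
    """True if the stored dates differ across posts (compared by calendar day)."""
    days = {post["date"][:10] for post in posts}
    return len(days) > 1

def fetch_posts(site_url):
    # _fields trims the response to just the fields we need
    with urlopen(f"{site_url}/wp-json/wp/v2/posts?_fields=date,title") as resp:
        return json.load(resp)

# A live check would be stored_dates_vary(fetch_posts("https://example.com"));
# here, faked data standing in for the migrated Medium posts:
sample = [
    {"title": "First post", "date": "2021-03-15T10:00:00"},
    {"title": "Second post", "date": "2023-08-02T09:30:00"},
]
print(stored_dates_vary(sample))  # True → the stored data is fine, the rendering isn’t
```

If this returns True while the site shows a single date everywhere, the static “Custom Date” configuration is the likely culprit, which is exactly what Claude went on to confirm.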

Watching friction happen in real time

One of the most interesting parts wasn’t that it solved the problem; it was how it struggled to get to the final output.

At one point it clicked but opened the Page panel instead of selecting the Date block. It explicitly said it needed to click more precisely on the date text. That hesitation, that micro-confusion, is exactly the kind of friction we look for in usability testing.

WordPress editor with incorrect static date

From the outside, this looked like a simple configuration issue. But from the inside, watching an agent try to reason through it, you could see where the UI affordances were ambiguous. The date block inside the Query Loop was set to a static value. That’s not inherently wrong, but the system didn’t make that state legible enough.

The resolution (and what it revealed)

Eventually, Claude switched the block from “Custom Date” to “Post Date.”

Confirmation that dynamic post dates are now correct

All the posts immediately reflected their original publication dates from Medium.

But the important part isn’t the fix. It’s that Claude:

  • Formed a hypothesis about the system
  • Tested it
  • Misinterpreted UI states
  • Adjusted
  • And explained what it expected vs what it saw

Flashback: Comet as a “thinking tester”

This brings me back to when I explored Comet as an AI usability testing tool back in September 2025. The idea is that Comet acts like a “thinking tester” and you can watch what it clicks on, see the thought process it types out as it goes, and get instant feedback on usability issues.

What I tested

I ran two experiments using Comet to test Big Sky (our AI Agent at WordPress):

  • Video 1: The AI agent builds a website using Big Sky. AI testing AI 🤖.
    The AI tester interacts with Big Sky. When I tried to make design changes, the modal got stuck and provided poor responses, consistent with what we’d already seen in survey feedback.
  • Video 2: The AI agent starts building with Big Sky, but then edits the backend.
    A different starting design performed better at first, but as I prompted further, the AI began editing the backend itself. Not perfect, but an intriguing way to test ideas quickly.

Evaluating the tests, what stood out most was the AI’s live feedback. For example, in Video 1 it said:

“I can see there should be a submit button, but it appears disabled in the current state. Let me look for a way to submit this prompt.”

That kind of observation makes it clear that the interface can be hard to navigate for humans and AI alike, resulting in unnecessary friction.

Early thoughts (back in September 2025)

Comet shows promise as a way to run AI agent usability tests and surface issues quickly. That said:

  • It’s very expensive ($200/year per person).
  • It doesn’t allow us to train or clone agents that reflect our own customer segments. This limits its usefulness if we want to test with “copies of our users.”

In that September 2025 thread, Matt Mullenweg said:

“It’s pretty important that our interfaces are legible and usable to AI so consider it a primary step.”

Fast forward to now: AI as a parallel usability layer

At the time, that felt forward-looking, but now it feels immediate.

We have also been playing with the idea of AI Agents as our customer segments through our LibreChat agents, so this is not new thinking.

If we’re increasingly going to use AI to operate our products (updating sites, configuring themes, managing content), then AI is no longer just a user; it’s an operator.

So the question becomes: can we deliberately use AI as a proxy user to test specific journeys?

Not to replace human research, but to add a parallel layer. We could define a journey (onboarding, an AI modal interaction, customisation flows, backend configuration) and ask an agent to complete it. We observe:

  • Where does it hesitate?
  • Where does it assume a pattern that doesn’t exist?
  • Where does it get stuck?

Then we compare that against actual user findings.

If both users and AI struggle in the same places, that’s a strong signal. If only AI struggles, that’s still interesting, especially if AI agents increasingly act on behalf of our users.
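As a sketch of what that observation loop could look like in practice, here is a toy harness. Everything in it — the `JourneyRun` class and the signal vocabulary — is invented for illustration, not an existing tool; in a real setup the signals would come from watching (or parsing) the agent’s narrated run:

```python
# Toy "parallel usability layer": record friction signals per journey step,
# then summarise them for comparison against human-study findings.
from dataclasses import dataclass, field

@dataclass
class JourneyRun:
    journey: str
    events: list = field(default_factory=list)  # (step, signal) pairs

    def log(self, step, signal):
        # signal ∈ {"ok", "hesitated", "assumed_missing_pattern", "stuck"}
        self.events.append((step, signal))

    def friction_report(self):
        counts = {}
        for _, signal in self.events:
            if signal != "ok":
                counts[signal] = counts.get(signal, 0) + 1
        return counts

# Example: the date-block journey from earlier, with the signals an observer
# might have recorded while watching Claude work.
run = JourneyRun("fix-post-dates")
run.log("open site editor", "ok")
run.log("select date block", "hesitated")  # it opened the Page panel first
run.log("find Post Date setting", "assumed_missing_pattern")
run.log("switch to Post Date", "ok")
print(run.friction_report())  # {'hesitated': 1, 'assumed_missing_pattern': 1}
```

The point of keeping the report structured is that the same journey definition can be run against multiple agents (or compared with tagged human-session notes) and diffed step by step.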

Open questions

  • Should AI-agent testing become part of our research toolkit (as a first validation step – not replacing user testing)?
  • Are there specific journeys where this would be especially valuable?
  • Can we simulate different customer segments through prompting strategies?
  • How do we overcome hallucinations?
  • If AI agents are increasingly navigating our products autonomously, does AI legibility become a first-class design requirement?
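On the segment-simulation question, one low-cost starting point is persona system prompts, in the spirit of our LibreChat customer-segment agents. A sketch — the segment names and traits below are invented examples, not real customer data:

```python
# Hypothetical persona prompts for steering an agent toward a customer
# segment's behaviour. Segments and traits are illustrative only.
SEGMENTS = {
    "first-time-blogger": "You have never used WordPress. You avoid settings "
                          "panels and give up after two failed attempts.",
    "agency-developer": "You build client sites weekly. You expect keyboard "
                        "shortcuts and look for bulk operations first.",
}

def persona_prompt(segment, journey):
    traits = SEGMENTS[segment]
    return (f"Act as this user: {traits}\n"
            f"Complete the following journey, narrating every expectation "
            f"and every moment of confusion: {journey}")

print(persona_prompt("first-time-blogger", "change the blog's post dates"))
```

Whether prompted personas actually reproduce segment-specific friction is an open empirical question; it would need validating against real user findings before we trust it.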
