UX Roundup: User Testing | AI Ads | Users Like AI | Multiple Analytics Experiments | Low-Fi Prototyping | AI for User Research
- Jakob Nielsen
Summary: User Testing 12-step process | AI advertisements outperformed human-created ads | Users give high NPS scores to the main AI models | Running multiple A/B tests simultaneously | Does vibe design mean the end of paper prototyping? | Study of how UXR uses AI in their daily work

UX Roundup for September 15, 2025. Many styles for this week’s hero image. Which one do you prefer? (GPT Image-1)
User Testing: 12 Steps, Two Songs
I made a new song about my 12-step process for usability testing:
User testing for solo piano (YouTube, 4 min.)
User testing, full orchestration (YouTube, 5 min.)
I made both songs with Suno 4.5+. My original concept for the song was “1930s Hollywood nightclub,” which is why I came up with the idea of setting the song to a solo-piano score. (I have been listening to old Scott Joplin recordings recently.)
However, as I was working on the song in Suno, I felt it was too simple with only a solo singer accompanied by a solo pianist. Frankly, while Suno is insanely great for music creation compared with anything we had just two years ago, its singers and pianists are not as good as the very best human performers on record. This matters less for full, rich orchestration, but in my new song the focus is very much on the performance skills of those two soloists.
Because of this thinking, I made the rare decision to produce a second version of the same song, but now with full orchestration for a complete band. (The lead singer even gets backup vocals on the chorus, though we don’t see these backup singers in my music video, because HeyGen, which I used for avatar animation, is bad at lip-synching multiple characters.)
After designing the avatars and animating them to produce two complete music videos (including B-roll clips and dance breaks), I am not so sure that the solo piano version is too simple. I actually prefer the solo version, but take a look at both videos and let me know which one you like best.
The YouTube audience prefers the solo piano version, according to video analytics: the average view duration is 31% longer, which is a substantial difference. The longer people continue watching a video, the more they like it.
It’s an old finding that the different media forms in a multimedia production can influence how the other media forms are perceived by users. A classic example is video games, where upgrading the sound quality in old console games led users to rate the game visuals higher, even when the visuals themselves remained unchanged. The human brain is a highly associative computer, with neurons interconnecting in mysterious ways.

I describe a 12-step process for user testing in a recent article that is very long and detailed. For a faster and more entertaining overview of the process, listen to my two new music videos. (GPT Image-1)
Two lessons from this project:
Looking back on the old world of human-performed media, I have a newfound respect for the executive producers at record labels who were charged with deciding which new bands to invest in: digging through a slush pile of songs often made with low-end recording equipment, without ever seeing each band perform on stage.
Looking forward to the new world where we are all multimedia creators, producing with a workflow that cuts across many specialized AI tools, it may be necessary to take a concept through multiple stages (more work!) to judge its potential fairly.
If you want to see a more professionally produced AI video, watch Panasonic’s recent Chinese commercial for its new laundry machine designed by Porsche. I prefer my own songs to Panasonic’s soundtrack, but their visuals are superior.
AI Advertisements Outperformed Human-Created Ads
My reporting on human-factors studies of AI has focused on questions such as how much AI improves employees’ productivity, enhances creators’ creativity and expressiveness, and the degree to which AI exhibits superior empathy relative to humans (usually in healthcare). However, design is not just about helping users, even though that’s the angle I personally prefer.
Most design aims to influence people rather than help them. This is particularly true for advertising. For such design projects, our usability metrics must change from the traditional ones I just mentioned: evaluation extends beyond mere correctness to how AI-mediated interactions affect the user’s internal state.
A recent neuroscience study of a series of AI-generated advertisements is a good example of this dimension. It did not measure whether each advertisement was “correct,” but rather how deeply it captured a user’s attention, how emotionally engaging it was, and how little cognitive load (confusion) it imposed. Success here is defined by the ability to generate a specific, desired neurological and psychological response. In other words, we are broadening the goals of user research and design from usability to persuasion.
This research, conducted by Rubén Carbayo Jiménez and colleagues, and published in the AIRSI 2025 conference proceedings, employed neuroscience to measure the effectiveness of AI-augmented creativity objectively. The research aimed to bridge a critical gap: while generative AI is being rapidly adopted in marketing, empirical data is scarce on the actual effectiveness of the assets it helps create. The study directly compared advertisements created solely by human marketers with advertisements created through a collaborative process where AI assisted the marketers.
To move beyond subjective self-reports, the researchers used biometric tools. An electroencephalogram (EEG) with 20 channels measured participants’ neural signals to quantify levels of engagement and confusion, while an eyetracker generated visual heatmaps to measure attention. The AI-assisted advertisements were found to be significantly more effective across all key metrics. They generated, on average, 70% higher engagement and 6% higher visual attention within key areas of interest, such as the advertiser’s logo. Perhaps most impressively, they produced 72% lower levels of confusion in viewers. These findings provide some of the first evidence that human-AI collaboration can produce creative work that is not only different but measurably more effective at capturing audience attention and conveying a clear message, compared with traditional human-produced advertising creative.

In a neuroscience study, AI-generated advertisements outperformed human-created ads. The specific findings are maybe less interesting than the example of using such new measurement methods. (GPT Image-1)
Users Like AI
MeasuringU (the leading experts in quantitative user research) collected subjective satisfaction scores for three leading AI products (ChatGPT, Gemini, and Claude) in January and February 2025. The mean Net Promoter Score (NPS) received from people who have used these AI products ranged from +39% to +43%. Given the sample size, the differences between the three products’ mean scores were not statistically significant.
In general, I am not a huge NPS fan, but this metric has the advantage of being easy to interpret: it simply measures how many more people are promoters of a product (rate it highly) than detractors (rate it poorly).

NPS is simply the percentage of promoters minus the percentage of detractors. Neutral scores are ignored. This metric is simple and widely used, but it does overlook much of the nuance in user opinions. And worst of all, it’s a purely opinion-based metric that doesn’t score how usable the product actually is, only how it performs relative to customer expectations. (GPT Image-1)
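To make the arithmetic concrete, here is a minimal Python sketch of the standard NPS calculation from raw 0–10 “how likely are you to recommend” ratings. (The ratings in the example are made up for illustration, not MeasuringU’s data.)

```python
def net_promoter_score(ratings):
    """Compute NPS from 0-10 likelihood-to-recommend ratings.

    Promoters rate 9-10, detractors rate 0-6; passives (7-8) are ignored.
    Returns the score as a percentage, from -100 to +100.
    """
    n = len(ratings)
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / n

# Hypothetical sample: 50 promoters, 30 passives, 20 detractors out of 100
ratings = [10] * 50 + [8] * 30 + [3] * 20
print(net_promoter_score(ratings))  # 30.0, i.e., an NPS of +30%
```

Note how the 30 passive respondents drop out of the score entirely, which is exactly the nuance the metric ignores.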
Scores around +40% are quite good, though not excellent.
Remember that these satisfaction scores were collected in January and February 2025, when AI was significantly less advanced than it is now. The top models were GPT 4.5, Gemini 2.0 Pro, and Claude Sonnet 3.7.
Would current AI (GPT 5 Pro, Gemini 2.5 Pro, Claude Opus 4.1) score better? The products are clearly much better, but that doesn’t mean that users’ subjective satisfaction scores would necessarily be higher. Satisfaction is the ratio of what you experience to what you expect. In general, people’s expectations for UX quality tend to rise at about the same speed as usability improves across the industry. This echoes Jakob’s Law: user expectations for your design are set by the combined impact of everything else they use.
It's an open research question whether AI is accelerating so fast that it’s outpacing the leveling effect of Jakob’s Law on user satisfaction ratings. It will be very exciting to see the results from the next round of AI UX metrics.

AI improves at a much faster rate than traditional software products and websites ever did. We don’t know if user expectations simply rise at that new speed, resulting in fairly constant satisfaction levels, or whether user expectations change at about the same pace as they always have, in which case AI satisfaction scores should improve considerably. (GPT Image-1)
Since I haven’t done this research, I can only speak for myself, but I am about as frustrated with the current AI user experience as I was a year ago. I recognize that it’s objectively much better, but I am now attempting much more advanced tasks, and I want AI to do so much more for me that I still subjectively feel that I suffer from poor usability.
Running Multiple A/B Tests Simultaneously
Ron Kohavi wrote a very interesting article about whether one should run multiple A/B tests in parallel. (Kohavi is probably the world’s leading expert on using analytics in product design. I strongly recommend following him — and reading his articles if you conduct analytics experiments.)
There’s a very simple argument against parallel experiments: there’s a risk of interference between the tests. When you’re changing several design elements at the same time, you don’t have a completely clean A/B test.
As a simple example, let’s say you want to test whether it’s better for the “Buy” button to be red or orange. And you also want to test whether big or small buttons are better. Well, it could be the case that orange is better for big buttons and red is better for small buttons.
For a completely clean study, you would need a multivariate experiment rather than multiple separate A/B tests. However, most real-world websites don’t have enough traffic for each of the many cells in a multivariate study to reach sufficient statistical power for valid conclusions.
For analytics experiments to be valid, you need observations from at least 200,000 users. No problem for a big site, but for a mid-sized site, it might take more than a week to accumulate 200,000 users. (For my own small site, I would need to run an experiment for two months.)
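To see why this becomes a bottleneck, here is a rough back-of-the-envelope sketch. The 200,000-user threshold is the rule of thumb quoted above; the daily-traffic figure is a hypothetical mid-sized site, not real data.

```python
# Back-of-the-envelope arithmetic behind the case for parallel A/B tests.
# 200,000 users is the rule of thumb quoted above; 25,000 daily users
# is a hypothetical mid-sized site.

USERS_NEEDED = 200_000

def days_to_reach(users_needed: int, daily_users: int) -> float:
    """Days of traffic needed to accumulate the target number of users."""
    return users_needed / daily_users

daily_users = 25_000  # hypothetical mid-sized site

print(f"One experiment: {days_to_reach(USERS_NEEDED, daily_users):.0f} days")
# -> 8 days. Several overlapping A/B tests take no longer, because each
#    test still sees every one of the 25,000 daily users.

# A 2x2 multivariate test (button color x button size) splits the same
# traffic into 4 cells, so each cell fills up at only a quarter of the rate.
cells = 4
print(f"Users per multivariate cell per day: {daily_users / cells:,.0f}")
```

The point of the sketch: running tests in parallel costs you nothing in duration, whereas adding cells to a single multivariate test directly multiplies how long you must wait.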
Running experiments for a long time has two problems:
If you do not allow different experiments to run in parallel, you will only be able to run a small number of experiments per year, slowing down your improvement curve.
If experiments run for months at a time, they suffer drift for many reasons that ultimately make the conclusions invalid even after you hit the raw number of sessions (for example, cookie drift, the same user using multiple devices, changing seasons, or other factors changing user behavior over time).
Luckily, Kohavi says that in his experience, it’s rare for different design experiments to have serious interaction effects.
It’s good to be aware of the risk of interactions, but in practice, this risk is minuscule compared with the benefits of scaling an experimentation culture to the point where the company continuously runs a large set of experiments.

If you’re a medium- or high-traffic website, run many A/B tests in parallel. Multivariate is only for huge sites. (GPT Image-1)
For example, Kohavi notes that Meta (including Facebook and Instagram) conducts approximately 20,000 concurrent experiments at any given time, while Bing runs hundreds of concurrent experiments on a single webpage. Of course, these digital properties have the traffic for large-scale analytics experiments, and you probably don’t. But the conclusion remains: if you can use A/B testing, don’t shy away from running several experiments in parallel.
End of Paper Prototyping
Dr. Nick Fine recently wrote a highly thought-provoking article, leading with the statement, “Prototyping is back with a vengeance!” (The good doctor has a lot of other good posts and is worth a “follow.”)
His first insight is useful, and probably not very controversial: UI prototyping has always been a great early stage in the UX design process, to get early user feedback. Now, with “vibe design,” AI tools make UI prototypes virtually free, so there is no excuse for not using them.
However, vibe designing your UI prototypes raises a new question: The tools can make great-looking screens right out of the box, resulting in high-fidelity prototypes. Is there still a role for the old-school low-fidelity prototypes, which we used to make as paper prototypes instead of computer-based prototypes?
Paper prototyping became essential to UX design for five key reasons:
1. Speed and Low Cost: Using just paper and markers, teams can visualize and adjust interfaces in real time, experimenting freely without software constraints. Failed ideas are easily discarded, enabling rapid exploration of multiple approaches. You could whip up a revised design during the one-hour break between two user-testing sessions.
2. Risk Reduction: Following “fail early, fail cheap,” paper prototypes identify fundamental issues before costly development begins. Discovering navigation problems at the sketch stage prevents expensive code revisions later.
3. Team-Building: Anyone can sketch ideas without technical skills, enabling cross-functional collaboration. This low barrier to entry brings wider perspectives and builds team buy-in through shared creation.
4. Psychological Benefits: The rough, unfinished nature reduces emotional attachment, making teams and users more willing to provide (and accept!) honest criticism. This openness ensures critical feedback emerges early when changes are easiest.
5. Function Over Form: By stripping away visual polish, paper prototypes focus discussions on layout, content, and interaction flow rather than aesthetics. Users address core usability issues instead of debating colors or fonts.
Paper prototyping endured because it enables rapid user feedback, minimizes sunk costs, and keeps teams focused on fundamental design decisions before investing in high-fidelity development.
Now, vibe design and vibe coding allow designers to input aesthetic preferences, mood boards, and functional requirements, and receive multiple, distinct high-fidelity interface options almost instantaneously. Even better, these visuals automatically become functional front-end code.
This acceleration collapses the timeline between concept and a testable, realistic artifact. Furthermore, generative AI eliminates the need for “lorem ipsum” placeholders. Prototypes can now be populated with contextually relevant copy and unique imagery, creating a more immersive and realistic testing experience from the outset.

Our beloved paper-prototyping supplies are becoming exhibits in the Museum of UX History. (GPT Image-1)
The argument for the obsolescence of paper prototyping is compelling. If the goal is rapid iteration, and AI can generate polished, interactive digital prototypes faster than a designer can sketch them, reasons 1 through 3 from my list evaporate, as does half of reason 4:
Design changes can be made faster with vibe design tools than when having to draw new screens with a marker.
Vibe design is even faster than human paper design, allowing wider use of parallel design, where many different design alternatives are explored simultaneously.
Anybody can vibe design. In fact, it’s one of the key benefits of this approach that users can create their own applications, which UX professionals can then refine if they gain traction.
Humans have very little attachment to designs created by AI, making the design team more willing to make radical changes than if the change involved trashing the lead designer’s creative brilliance that turned out not to work.
High-fidelity prototypes allow teams to test not just functional flow, but also interaction nuances, animations, and the emotional resonance of the visual design much earlier. By presenting users with a product that looks and feels real, teams may gather more accurate data on engagement and usability. Furthermore, a computer-based prototype is easier to use in remote usability studies than a paper-based physical prototype.
That said, Dr. Fine points out (and I agree) that there are still reasons to employ low-fidelity prototypes in the UX design process: reason 5 for paper prototyping remains relevant, as does the user-oriented half of reason 4. (But since we can now vibe design the prototypes with AI, we might retire the colored markers, sticky notes, and scissors from our toolkit.)
Low-fidelity prototypes offer critical benefits:
Preventing Visual Distraction: High-fidelity prototypes risk stakeholders fixating on colors and fonts rather than core functionality. Lo-fi prototypes focus attention on fundamental design aspects, such as content, workflow, and features, without cosmetic distractions.
Encouraging Honest Feedback: Rough prototypes invite candid criticism since they’re clearly unfinished. Users feel freer suggesting major changes to sketches versus polished designs that appear final. This “safe space for criticism” uncovers fundamental issues early.
Context-Specific Needs: Complex design problems and experimental interfaces still benefit from manual sketching. The act of drawing itself spurs creativity in ways that AI tools do not currently replicate.
Skill Development: Sketching builds critical thinking, especially for newer designers. Manually drawing each state ensures a thorough understanding of user journeys that AI shortcuts might skip.
A hybrid approach may be best for now, until the vibe design tools improve further. One counterintuitive tactic: start high-fidelity for user testing (realistic reactions), then present low-fidelity versions to stakeholders (strategic focus without visual distraction). This reversal keeps discussions on core functionality while gathering authentic user data.
AI challenges paper prototyping’s old dominance but doesn’t eliminate the purpose of low-fidelity prototyping. We need to separate the experience from the implementation, which is a very old UX insight. The mantra becomes “high fidelity, held loosely.” We should abandon paper prototyping but retain some use of low-fidelity prototypes, and even manual sketching. The core mission persists: understand users, test early, iterate often, at whatever level of UI fidelity serves the UX process best.
How Researchers Use AI for User Research
One of my favorite user-research experts, Nikki Anderson, has conducted a study where she analyzed how 70 user researchers use AI in their daily work. She will present the findings in a one-hour seminar tomorrow, Tuesday, September 16, at 10:00 AM USA Pacific time (6pm London, 19:00 Berlin, check other time zones).
The talk is free, but advance registration is required.

What UXR thinks about AI: free seminar tomorrow. (Seedream 4)

Chinese image model Seedream just released version 4. I am very impressed, and it’s currently ranking at the top of the leaderboard. For comparison, I asked Seedream 4 (left) and GPT Image-1 (right) to each draw a comic strip about the seminar announcement, using animals as characters. I probably still prefer GPT’s art style (right), but I must admit that I need to adapt my prompts for Seedream to leverage its strengths. For this contest, I unfairly used a prompt that has worked well with GPT’s image model in the past.