
UX Roundup: AI Music and Video Upgrades | Survey Length | Self-Driving Safety | AI GDP Metric | Amazon Dark Design

  • Writer: Jakob Nielsen
  • 1 day ago
  • 11 min read
Summary: New versions of leading AI music and video models Suno and Kling | Shorter surveys have higher validity | Safety update for Waymo’s autonomous cars | Economically valuable AI metrics | Amazon Dark Design Lawsuit Resolved

UX Roundup for September 29, 2025. (GPT Image-1)


AI Music and Video Upgrades

The leading AI music tool, Suno, just released version 5. As it turned out, I had just made a music video with the previous version (Suno 4.5+) a week before, so I made the same song again with Suno 5 to allow you to compare the music quality:



Suno does offer the ability to simply remaster an existing song with its new music model, which would have created a soundtrack with all the lyrics sung at the same timestamps. That, in turn, would have allowed me to reuse the lip-synced avatar animations I had already made and save $30 in HeyGen credits.


However, for the sake of the experiment, I wanted to really hear what Suno 5 could do, so I generated a new song from scratch with the new model, rather than limiting it to singing along with an old generation.


The new song required me to generate a new lip sync, which is why the singer doesn’t look exactly the same in the two music videos. Additionally, Suno 5 generated a very long 21-second dance break, which necessitated new B-roll footage of the avatar dancing. Luckily, my preferred video generation tool for dance breaks, Kling, also just released a new version, Kling 2.5 Turbo, which I used for the new dance break footage.



2025 is the year of AI video, which is progressing faster than other AI media forms. Kling 2.5 Turbo is a solid release, and I am really hoping for Sora 2 in October. (Seedream 4)


If you watch the new music video, you’ll notice several continuity errors in the dance breaks. In particular, the avatar changes boots every 10 seconds. This is because the original avatar image from Mystic 2.5, which I used as Kling’s start frame, is not a full-body shot that includes footwear. Since the photo is clearly a winter shot, Kling correctly inferred that the avatar should wear boots and not, say, sandals. It got this right in every generation.

However, it designed new boots for every clip! For now, there is no character consistency between AI generations for anything that’s not within the start frame or very clearly extrapolated when outpainting that image.



I now have the same avatar performing the same song in two different versions, made with different Suno models. This concert image was created with Seedream 4, featuring my avatar (designed with Mystic 4.5 and originally singing outdoors against a snowfall in Copenhagen to match the song lyrics) at a concert with a new band that doesn’t appear in the music video. Seedream is currently my preferred tool for reimagining or restyling existing images.


To my ear, the new version of my song has a better, more expressive singing voice and richer instrumentation than the old version. However, I prefer the melody in the old song. (I have simple tastes in music: I like Mozart and Chuck Berry.) Which version do you prefer? Suno 4.5+ or Suno 5? Let me know in the comments.


In fairness, I should point out that I used the same prompt to make the songs with both Suno 4.5+ and 5: this made it easy to compare the songs, but may have favored Suno 4.5+ since I had refined the prompt over several months to create songs I like with Suno 4.5+. We know that new AI models require prompts and workflows to be adapted for optimal results with the new model.


Shorter Surveys = Higher Validity


The number one rule for user surveys: Cut out questions, even if you don’t have a friendly pangolin on hand to do the job. I guarantee that the first draft of any survey will have too many questions. (GPT Image-1)


The most common usability sin in user research is not a confusing interface or a hidden feature; it’s the bloated, self-indulgent user survey. The rule is simple: the number of survey questions is inversely proportional to the validity of the responses. If you want results that accurately represent your actual user base, you must narrow your list of questions. Then cut it again.


Long surveys are doomed from the start. They suffer from massive abandonment rates. More insidiously, they create a profound self-selection bias. The average user (one who is reasonably content and makes up the bulk of your audience) will not spend 15 minutes answering your endless queries. They have better things to do.


Who will finish your 40-question monstrosity? Only the most motivated users, which means the outliers. You’ll hear from the superfans who love everything you do and the enraged users who are on a mission to report every grievance. Basing your strategy on these two extremes is like designing a family car based solely on feedback from Formula 1 drivers and demolition derby champions. The resulting data is not just skewed; it’s garbage. You will be led to make decisions that serve a tiny, unrepresentative fraction of your audience.


The path to a better survey is ruthless prioritization. For every question you consider asking, apply this simple test:


  • Imagine two dramatically different outcomes. For instance, what if 20% of users say A and 80% say B? Now, flip it: what if 80% say A and 20% say B?

  • Then, ask yourself the critical follow-up: What specific, concrete action would my team take differently in each scenario?


If the answer is “nothing” or “we’d have a meeting to discuss it,” then you delete the question. It is a vanity question. It does not provide actionable insight; it only provides data points for charts that make stakeholders feel informed without actually driving change. A question is only valuable if its potential answers lead to different paths. If all roads lead to the same destination, the signpost is worthless.


Respect your users' time. Keep surveys brutally short; five questions is on the high end. A high response rate from a short survey provides valid, projectable data. A low response rate from a long survey provides a distorted fantasy. The choice is clear.


Each additional question typically reduces the response rate by 5–10%, so even cutting one question will substantially improve the validity of your survey results.
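To see how quickly validity erodes, here's a quick sketch of the compounding effect. The 40% base response rate and the 7% per-question drop (the midpoint of the 5–10% range above) are illustrative assumptions, not measured values:

```python
# Sketch of how response rate compounds as questions are added.
# Assumes a 40% base rate and a 7% relative drop per extra question
# (midpoint of the 5-10% range); both numbers are illustrative.
def response_rate(num_questions, base_rate=0.40, drop_per_question=0.07):
    """Estimated response rate for a survey of the given length."""
    return base_rate * (1 - drop_per_question) ** num_questions

for n in (5, 10, 20, 40):
    print(f"{n:>2} questions: {response_rate(n):.1%}")
```

Under these assumptions, a 5-question survey retains roughly a 28% response rate, while a 40-question monstrosity drops to around 2%, exactly the territory where only superfans and the enraged remain.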


Autonomous Car Safety

Waymo (Google’s self-driving car-on-demand service) has released its latest safety update.



Waymo is the best way to get around San Francisco and much of Silicon Valley. In fact, getting a Waymo ride is a tourist attraction for visitors from Europe, who lack advanced cars of their own. For Chinese visitors, of course, AI-driven taxi services are old hat. (Seedream 4)


Across the several cities where Waymo operates, they have now driven passengers for 96 million miles (154 million km) with fully autonomous driving (AI-driven without a human driver). Across this large dataset, Waymo had 91% fewer accidents with serious injury, compared with human-driven cars in the same cities. Waymo also had large reductions in accidents causing injury to pedestrians (down by 92%), cyclists (down 78%), and motorcyclists (down 89%), again compared to the accidents caused by human-driven cars.

While any traffic accident is bad, accidents causing serious injury are clearly the worst. Any reduction should be celebrated, and a 91% drop is extraordinary.


Currently, about 40,000 people are killed in traffic accidents each year in the United States alone, meaning that replacing all cars with Waymos would save roughly 36,400 American lives annually.

Traffic fatalities are lower in the EU, where people drive shorter distances (about 20,000 people killed annually), but depressingly large worldwide, since more than a million people are killed in traffic each year, often due to aggressive drivers and bad road conditions.

If Waymo’s safety record in the United States were generalized to other countries, it would mean that AI cars would save more than 900,000 lives per year. I actually think that accident reduction rate from AI will be larger worldwide than in the United States, due to the prevalence of truly atrocious drivers and roads in many countries.
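The arithmetic behind these claims is simple enough to show directly. This assumes, as the article does, that Waymo's 91% reduction in serious-injury accidents would apply directly to fatalities:

```python
# Back-of-the-envelope lives-saved arithmetic from the paragraphs above.
# Assumes the 91% serious-injury reduction applies directly to fatalities.
US_DEATHS = 40_000        # annual US traffic deaths
WORLD_DEATHS = 1_000_000  # annual worldwide traffic deaths (lower bound)
REDUCTION = 0.91

us_saved = US_DEATHS * REDUCTION        # ~36,400 lives per year
world_saved = WORLD_DEATHS * REDUCTION  # ~910,000, i.e. "more than 900,000"
print(f"US lives saved: {us_saved:,.0f}")
print(f"Worldwide lives saved: {world_saved:,.0f}")
```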


These 900,000 lives will bloody the hands of any politician who delays the rapid rollout of autonomous car services.


In the United States, the main competing autonomous car service is Robotaxi, which has just started providing commercial service. Robotaxi is less forthcoming than Waymo with safety statistics, and I would not be surprised if they are not quite as favorable currently, since it is indeed a new service that has accumulated less training data for its AI.


On the other hand, Waymo and Robotaxi will both rapidly gain more training data, leading both services to improve their AI. Who will be best in, say, two years? Impossible to predict, but even if we grant, for the sake of argument, that Waymo will retain the safety lead as both services improve, that doesn’t necessarily mean that it will save the most lives.


There is a second criterion that is more critical for saving lives than the pure accident-reduction percentage, now that AI driving has become this safe: rollout speed will dominate. How fast can each service build more cars, and how fast can they gain market share?


Here, Robotaxi may have the lead, since it’s part of Tesla, which already has highly efficient manufacturing operations.


For example, let's say that in two years (which equals a full generation of AI progress), Waymo achieves a 95% injury reduction compared to human drivers, whereas Robotaxi achieves “only” a 91% injury reduction, equaling Waymo’s current record.


That means that if human drivers were banned and all cars in the United States were required to be AI-driven, traffic deaths in 2027 would be 2,000 if all these cars were Waymo vehicles, and 3,600 if they were all Robotaxi vehicles: 1,600 more people killed by Robotaxi in this thought experiment.


But my thought experiment is not realistic. It will take more than two years to eliminate human drivers and all the human-driven cars currently on the road.


Let’s say that the Robotaxi rollout proceeds at twice the speed of Waymo. Let’s further assume that Waymo will account for 5% of all driving in the United States in 2027 (which is probably an optimistic estimate), and that Robotaxi thus accounts for 10% of all driving.


Under these assumptions, traffic deaths in the United States in 2027 will be:


  • Killed by human drivers: 34,000 people

  • Killed by Waymo: 100 people, saving 1,900 lives

  • Killed by Robotaxi: 360 people, saving 3,640 lives


Thus, even if Robotaxi has worse performance than Waymo on a per-ride basis in my scenario, it ultimately saves 1,740 more lives because AI is incredibly superior to human drivers; therefore, anything that expedites the rollout of AI will be a significant lifesaver.
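The 2027 scenario above can be checked as a small calculation; all the inputs are the article's stated assumptions (5% Waymo share at 95% reduction, 10% Robotaxi share at 91% reduction, 40,000 baseline deaths):

```python
# The 2027 thought experiment above, as a small calculation.
TOTAL_DEATHS = 40_000  # annual US traffic deaths if all drivers were human

def fleet_deaths(share, reduction):
    """Deaths attributable to a fleet covering `share` of all driving,
    given its injury-reduction rate versus human drivers."""
    return TOTAL_DEATHS * share * (1 - reduction)

waymo = fleet_deaths(0.05, 0.95)     # 100 deaths
robotaxi = fleet_deaths(0.10, 0.91)  # 360 deaths
human = TOTAL_DEATHS * 0.85          # 34,000 deaths

waymo_saved = TOTAL_DEATHS * 0.05 - waymo        # 1,900 lives
robotaxi_saved = TOTAL_DEATHS * 0.10 - robotaxi  # 3,640 lives
print(f"Robotaxi saves {robotaxi_saved - waymo_saved:,.0f} more lives")
```

The per-ride safety gap (95% vs. 91%) is swamped by the market-share gap (10% vs. 5%), which is the article's point: rollout speed dominates once both services are vastly safer than humans.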


Economically Valuable AI Metrics

OpenAI released a new benchmark to measure AI performance. This time, the target is “economically valuable tasks” performed by substantial human job functions, as assessed by the contribution of those humans to GDP. The new metric is called “GDPval.”


They first identified 44 human knowledge-based occupations, ranging from higher-end, such as nurse practitioner, lawyer, journalist, and mechanical engineer, to mid-level, such as inventory clerks, medical secretaries, and video technicians. For each occupation, they defined 30 typical tasks, resulting in a total benchmark of 1,320 tasks, of which 220 have been made public. (They need to keep the majority of the tasks secret to guard against future AI models simply being trained on the tasks, which would guarantee good performance on the test tasks without generalizing.)


Here’s a summary of one of the published tasks (the full task is at the above link): “You are a Manufacturing Engineer, in an automobile assembly line. The product is a cable spooling truck for underground mining operations, and you are reviewing the final testing step. […]  Develop a jig/fixture to simplify reel in and reel out of the cable reel spool, so the test can be done by one person. […] Design a jig using 3d modelling software and create a presentation using Microsoft PowerPoint.”


For each of the tasks, human experts who “averaged 14 years of experience, with strong records of advancement” developed their preferred solution. Other experts in each profession then compared the human solution with the AI solution (without knowing which was which) to decide which was best, or to declare a tie.


Across current frontier AI models, the winner was Claude Opus 4.1, which won 43.6% of the tasks and tied an additional 4.0%. Humans still won 52.4% of the tasks.


I applaud OpenAI for publishing these results, even though a competitor’s model currently outperforms its own.


GPT-5-high also scored well, winning 35.5% of the time and being tied 3.3% of the time, with humans being better at 61.2% of the tasks.


Gemini 2.5 Pro and Grok 4 scored about the same, with both below GPT-5-high.


The current rankings are almost irrelevant for the long term, since the imminent release of Gemini 3 is hotly rumored and Grok 5 starts training soon on the world’s largest training cluster. (I’m sure OpenAI and Anthropic also have new releases in the works.)

More interesting is the performance of OpenAI’s own recent models. Here’s how three of them performed:


  • GPT-4o, released June 2024: won 10% of tasks over the human experts

  • GPT o3-high, released April 2025: won 31% of tasks

  • GPT-5-high, released August 2025: won 36% of tasks


In slightly more than a year (14 months), AI improved by more than a factor of three.
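The improvement factor follows directly from the win rates listed above:

```python
# Win rates for the three OpenAI models listed above, and the implied
# improvement factor over roughly 14 months.
win_rates = {
    "GPT-4o (Jun 2024)": 0.10,
    "o3-high (Apr 2025)": 0.31,
    "GPT-5-high (Aug 2025)": 0.36,
}

improvement = win_rates["GPT-5-high (Aug 2025)"] / win_rates["GPT-4o (Jun 2024)"]
print(f"Improvement factor: {improvement:.1f}x")  # 3.6x, i.e. more than 3x
```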



Top AI models already scored well in the new benchmark, comparing them with expert human performance, but the most impressive finding is how much frontier AI has improved over the last year. If you’re still using AI from 2024, you don’t know what AI can do. But if you experienced the best AI in 2024 and use the best current AI now, you have an inkling of where AI will likely be soon. (GPT Image-1)


Although humans currently (barely) outperform AI in terms of work product deliverables for benchmark tasks, the actual cost of having AI do the work is only 1% of the cost of human experts. AI is also about 100 times faster at delivering its solution. Many practical cases depend on a cost–benefit analysis and not solely on the quality regardless of costs.

For example, in the case of nurse practitioners, there are many remote villages in developing countries with zero access to that level of healthcare expertise. These billions of people could be reasonably well served by AI-provided healthcare. (Current estimates are that 4.5 billion humans lack adequate healthcare access.) For a sick person, it’s irrelevant whether a human healthcare professional who is not available would hypothetically have performed a little better than an AI that’s available and can help immediately when needed. We should compare AI to the realistic alternative, not to an ideal alternative that’s infeasible.


It’s a curse of AI benchmarks that AI improves so fast that any meaningful benchmark is rapidly saturated. The new GDPval may only be relevant for the next two years or so, before frontier AI models are so advanced that they beat human experts on nearly 100% of these economically valuable tasks.


As we approach superintelligence by 2030, the question is not whether AI is better than humans at almost anything important, because it surely will be; rather, it is how much better it becomes and how well humans adapt to exploiting these new capabilities as the world economy transforms beyond recognition. New benchmarks will be needed.


Amazon Dark Design Lawsuit Resolved

The Federal Trade Commission (FTC) lawsuit against Amazon.com for using dark design patterns to make it hard to unsubscribe has now been settled. Amazon agreed to pay $2.5 billion in damages and to design its future unsubscribe UI with better usability.

In my review of the case, when it was filed two years ago, I deemed some of Amazon’s dark design patterns sufficiently egregious to warrant 4 of 5 skulls, whereas others only earned 2 of 5 skulls.


In total, 6 out of 10 skulls translated into $2.5 billion, meaning that each skull cost Amazon $417 million.
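The per-skull figure is just the settlement divided by my original skull count:

```python
# The settlement-per-skull arithmetic from the paragraph above.
settlement = 2_500_000_000  # $2.5 billion
skulls = 6                  # 4 skulls + 2 skulls from the original review
per_skull = settlement / skulls
print(f"${per_skull / 1e6:,.0f} million per skull")  # $417 million
```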



I gave Amazon 6 skulls for dark design when the case was filed. It’s now been settled, with Amazon agreeing to pay $417 million per skull. (Seedream 4)


Is this a fair settlement? I have not followed the case closely enough to say, but according to my original analysis, Amazon was at least partly guilty, even if not as aggressive an offender as some websites that exploit dark patterns to their fullest extent.


(See also my song about dark patterns. I made this music video in July 2024, so the animation is not up to my current standards, but the actual song is still a banger.)

 
