
UX Roundup: AI Judgment | Heuristic Evaluation | Consistent Brand Assets | New Music Model

  • Writer: Jakob Nielsen
  • 9 min read
Summary: AI judgment may follow a scaling law | Scaling AI’s judgment of usability heuristics | Generating consistent visual design assets with AI | New AI music model: Mureka O2

 

UX Roundup for February 2, 2026. (Nano Banana Pro)


AI Judgment May Follow a Scaling Law

The “bitter lesson” just became more bitter for those humans who still believe in meatware supremacy and their “unique” ability to have “taste” or judgment about what’s best among the ceaseless AI creations. AI can also exhibit judgment, and it appears that its judgment gets better with more compute, meaning that it will likely become superior to human judgment in a few years, as AI compute keeps scaling up.


(In general, “the bitter lesson” for AI is that throwing more compute at a problem consistently outperforms approaches built on human knowledge and domain expertise. Time and again across chess, Go, speech recognition, and computer vision, researchers initially made progress by encoding human understanding into systems, but these approaches were ultimately surpassed by simpler methods that scaled with available compute. The “bitter” part is that this lesson is psychologically hard for researchers to accept: we want to believe our insights about the structure of problems matter, but history shows that betting on more compute and letting AI systems learn and work on their own wins in the long run.)


A new research paper by Bingyang Ye from Harvard and several colleagues shows that AI scales its judgment abilities with more compute. It’s too early to declare this a new AI scaling law because Ye only studied a single domain: AI’s ability to predict which scientific papers would later be seen as the most important.


The study cleverly utilized the fact that there is a broadly accepted metric for the importance of academic papers: how many other scientists later cite each paper. In the study, the researchers limited their AI to run offline, using only knowledge as of a given date in the past, meaning the AI could not know how many citations each paper would later receive. (But the researchers knew the citation count, meaning that they could score the AI’s judgments.)
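The paper’s exact scoring metric is not spelled out here, so as a purely illustrative sketch (my assumption, not necessarily the authors’ method), one could score such judgments by rank-correlating the AI’s predicted importance with the citation counts each paper later accumulated:

```python
# Illustrative only: one plausible way to score AI "judgment" of paper importance.
# This is NOT necessarily the metric used by Ye et al.; the data below is made up.
from scipy.stats import spearmanr

# Hypothetical data: the AI's predicted importance score for each paper (judged
# using only knowledge up to the cutoff date), and the citations it later received.
predicted_importance = [0.9, 0.2, 0.7, 0.4, 0.1]
actual_citations     = [310,  12, 145,  40,   3]

# Spearman rank correlation: 1.0 would mean the AI ranked the papers
# in exactly the order in which they were later cited.
rho, p_value = spearmanr(predicted_importance, actual_citations)
print(f"Rank correlation between AI judgment and later citations: {rho:.2f}")
```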


The interesting point here is not how good AI was at judging which research papers would become the most influential, but rather that this judgment improved with more compute, in two ways:


  • Model training: For all three frontier AI model families in the study (Google, OpenAI, Anthropic), the newest and biggest models performed better than the older or distilled models in the same family. For example, Gemini 3 Pro (the winner among the 11 models in the study) did better than Gemini 2.5 Pro, which in turn did better than Gemini 2.5 Flash (a smaller model).

  • Think-time compute: Each of the 11 models was given low, medium, and high reasoning budgets, and it was usually the case (though not every single time) that thinking more resulted in better judgment.


I applaud the authors of this study for actually using the most recent AI models (Gemini 3 Pro, GPT 5.2, and Claude Opus 4.5), in addition to testing year-old models. Too often, we read research performed with older AI models, meaning that the findings are already obsolete, given the pace of AI improvements.


I hope other researchers will extend this study in other domains, including areas where judgment is less clear-cut than for the citation count of academic papers. In particular, AI needs to be able to judge the quality of generative user interfaces and the many content types it produces (writing, images, videos, etc.).


Pending such research, we can’t say for sure whether there is, in fact, a scaling law for AI judgment abilities, but I think this will very likely turn out to be the case.


AI’s judgment (sometimes called “taste”) likely improves with more compute, raising hopes that it will scale to unprecedented heights in the coming years and soon surpass human judgment. (Nano Banana Pro)


Scaling AI’s Judgment of Usability Heuristics

Following up on the previous news item, there are signs that AI is improving at heuristic evaluation, which relies heavily on judgment based on vague criteria. I have previously discussed the potential for a usability scaling law, and even though the new data is insufficient to declare a law yet, I am getting more hopeful.


The Baymard Institute publishes numerous usability guidelines specifically for e-commerce websites, and on January 20, 2026, it announced an AI service that performs heuristic evaluations based on 154 of these guidelines at 95% accuracy, which it claims is comparable to that of human UX experts.


Of more interest is the point that this accuracy level was only reached for 39 guidelines in the previous version of the tool, announced May 20, 2025. Thus, in 8 months, AI’s ability to perform a particular type of heuristic evaluation (according to Baymard’s guidelines, not mine) improved by a factor of 154/39 = 3.95x. Getting roughly 4 times better in 8 months is the same as doubling every 4 months, which is exactly the current pace of AI improvements for a general set of “economically valuable tasks” tracked by METR and discussed in my 2026 predictions article.
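For readers who want to check the arithmetic, here is the back-of-the-envelope calculation as a small Python snippet. The guideline counts and dates come from Baymard’s announcements quoted above; the code itself is only an illustration:

```python
import math

# Observed improvement in Baymard's AI heuristic-evaluation tool:
guidelines_may_2025 = 39    # guidelines matched at ~95% accuracy (May 20, 2025)
guidelines_jan_2026 = 154   # guidelines matched at ~95% accuracy (January 20, 2026)
months_elapsed = 8

improvement = guidelines_jan_2026 / guidelines_may_2025   # ~3.95x
doublings = math.log2(improvement)                         # ~1.98 doublings
doubling_time = months_elapsed / doublings                 # ~4.0 months

print(f"Improvement factor: {improvement:.2f}x")
print(f"Implied doubling time: {doubling_time:.1f} months")
```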


To be honest, I had expected AI to improve more slowly at heuristic evaluation than it does at general knowledge work, because usability is so intensely dependent on judgment and contextual understanding. We all know that AI’s skills are “jagged” (i.e., better at some things than others), and it’s still true that AI is much better at programming than at usability. But if it actually doubles every 4 months, there’s light at the end of the AI-usability tunnel.

Of course, two data points are not enough to declare a trend, let alone a scaling law, so I encourage other researchers to keep measuring the quality of AI’s performance with the full range of UX methods and processes.


As an aside, Baymard has published 769 usability guidelines for e-commerce sites, meaning that the full set is about 5x the number of guidelines its AI can currently handle at the level of human UX experts. If the usability scaling law holds up and AI doubles its UX skills every 4 months, we won’t be able to have AI conduct a full heuristic evaluation of an e-commerce site for almost 10 more months, or approximately until December 1, 2026.
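The projection uses the same arithmetic, under the big assumption that the 4-month doubling time keeps holding; here is a sketch of the calculation:

```python
import math

# Projecting when the AI tool could cover all 769 Baymard guidelines,
# assuming the ~4-month doubling time holds (a big assumption).
current_coverage = 154
full_guideline_set = 769
doubling_time_months = 4

remaining_factor = full_guideline_set / current_coverage              # ~5x
months_needed = math.log2(remaining_factor) * doubling_time_months    # ~9.3 months

# Roughly 9 to 10 months from the January 2026 announcement,
# i.e., around the end of 2026.
print(f"Roughly {months_needed:.1f} more months")
```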


Let’s say that this happens, and AI can handle all of Baymard’s guidelines by the end of 2026. This won’t mean that it will be equally good at a general heuristic evaluation of other forms of user interfaces beyond e-commerce sites. General heuristic evaluation is more complex than applying a highly specific set of design guidelines. My guess is that it may take one or two more years (that is, until late 2028) before AI has fully cracked the general heuristic evaluation problem.


We know from my original research into the heuristic evaluation method that the quality of a heuristic evaluation is highly dependent on the evaluator’s level of usability expertise, and also that what I dubbed “double experts” do even better. Double experts are people who are simultaneously experts in usability and in the application domain. This is why you should hire UX professionals with extensive experience in your domain if you want to use design reviews or other heuristic methods alongside user testing. User testing also benefits from being done by usability staff with domain knowledge, but this is less critical, since the test participants supply their own domain knowledge if your recruiting screener is good. (Recruiting is step 4 of my 12-step process for sound usability studies.)


AI’s ability to perform usability work may follow a scaling law similar to that shown for other forms of knowledge work. For sure, AI heuristic evaluations improved impressively over the last year. (Nano Banana Pro)


Generating Consistent Visual Design Assets with AI

AI-generated images have advanced to the point where anyone can create thousands of attractive illustrations for a few cents each. However, for many design projects, it’s not enough that the pictures are pretty. They must also be on brand. (I don’t care about this for my own content: I prefer exploring a range of styles since I create for the joy of it, not to build a business. But companies need to consider branding.)


AI is becoming increasingly steerable, and a new experiment by Luke Wroblewski demonstrates its ability to generate brand-consistent assets on demand. Luke has long used a green man with a big, round head as a consistent design element in his articles and presentations. I’m sure he drew these illustrations manually in the old days, but he has now launched a service called the LukeW Character Maker, which draws illustrations in his exact style with AI.


Luke Wroblewski’s green avatar greets my tiger mascot in an image I made in a few seconds with the LukeW Character Maker.


The tool follows a simple process:


  1. Asset requests are analyzed and rewritten by a language model that aligns them with the brand style and guidelines.

  2. The rewritten prompt is sent to an image model (I’m guessing Nano Banana Pro) together with several reference images.

  3. The resulting images are subjected to a verification process that analyzes them and rejects them if they do not comply with the brand guidelines. (Luke says that Google ignores the uploaded reference images about 10–20% of the time. My experience with Nano Banana Pro’s compliance with reference images is substantially worse.)

  4. If the image fails verification, the process resets to step 2, and the tool tries to generate a new image. Better luck this time!


Step 3 is the most interesting to me, since it requires the AI to exercise judgment over the generated image. For now, it seems to simply check that the image contains a green man, but as AI’s design judgment improves with the hoped-for scaling law, it’s easy to imagine that it will also score the image’s quality according to a much wider set of criteria. Depending on the circumstances, the AI could restrict itself to delivering final artwork that meets the highest quality standards for important clients, or artwork that is “good enough” for users on lower-priced subscription levels.
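Based on Luke’s description, the whole pipeline is essentially a generate-verify-retry loop. Here is a minimal sketch of that control flow in Python; every function body is a hypothetical stand-in of my own, not the actual LukeW Character Maker code or any specific vendor API:

```python
# Minimal sketch of the generate-verify-retry loop described above.
# All function bodies are hypothetical stand-ins, not real APIs.
import random

MAX_ATTEMPTS = 5

def rewrite_prompt(user_request: str) -> str:
    """Step 1 (stub): a language model rewrites the request to match the brand style."""
    return f"Flat green character with a big round head, white background: {user_request}"

def generate_image(prompt: str, reference_images: list[str]) -> bytes:
    """Step 2 (stub): an image model renders the prompt, conditioned on reference images."""
    return prompt.encode()  # placeholder for real image bytes

def verify_brand_compliance(image: bytes) -> bool:
    """Step 3 (stub): an AI check that the image follows the brand guidelines.
    Models the ~10-20% failure rate Luke reports when reference images are ignored."""
    return random.random() > 0.15

def make_brand_asset(user_request: str, reference_images: list[str]) -> bytes | None:
    prompt = rewrite_prompt(user_request)
    for _ in range(MAX_ATTEMPTS):
        image = generate_image(prompt, reference_images)
        if verify_brand_compliance(image):
            return image
        # Step 4: verification failed, so loop back and generate a fresh candidate.
    return None  # give up after repeated failures

asset = make_brand_asset("green man waving hello", ["reference1.png"])
print("Got an on-brand image" if asset else "Verification kept failing")
```

The notable design choice is that the verifier, not the generator, acts as the gatekeeper: tightening the criteria in the verification step raises output quality without touching the generation step at all.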


New AI Music Model: Mureka O2

I came across a new song-creation service: Mureka. (This is an affiliate link, so I will get a referral fee if you use it to sign up.)


I have been using Suno for virtually all my songs for at least a year: Suno 4, 4.5, 4.5 Plus, and 5 were the music models for 25 of the 26 songs in my 2025 highlights reel. As you can see from this list, Suno released 4 different models in a single year and has also released several useful UX innovations to facilitate song editing and variations.


However, several AI influencers were impressed with Mureka’s recent upgrade, claiming that it offers richer musical sound. So I gave it a try: Jazz song about my top predictions for 2026, made with Mureka (YouTube, 7 min.)


I used the same C-pop avatar and lyrics as in my original Suno version of this song (YouTube, 5 min.), so you can compare the two versions to see which you prefer. (Please let me know in the YouTube comments!)


It’s possible that the influencers who, well, influenced me to try Mureka are right that it delivers richer music. However, on balance, I prefer Suno. Part of the reason is that Suno’s user interface shows greater maturity, especially in its editing capabilities. Generative AI is a roll of the dice (which is why it produces the powerful dopamine hit of operant conditioning), so you rarely get the perfect result in one shot, but the ability to steer variations in a journey through the latent design space helps you get what you want. Steered revisions and directed editing also promote a stronger feeling of authorship and creative ownership.


Regarding the specific jazz song about my 2026 predictions, Mureka had two weaknesses, regardless of how much you like the lushness of its music. First, I don’t think the singing voice was stable throughout the song: fast-forward from the first verse to the last, and it feels like two different singers. Second, Mureka inserted an 18-second dance break in the middle of my verse about apprenticeship, breaking the flow of the lyrics. (Dance breaks are fine in music videos and allow creators to showcase their avatars’ dance performances, but they should be positioned immediately before or after a chorus, not in the middle of a verse.)


For these reasons, I still prefer Suno and will likely use it for most of my upcoming songs. But since I have now paid for a month of Mureka, I will give it a few more chances. If you track which songs I publish in February, you’ll see which model I end up preferring.


On balance, I probably prefer Suno for creating songs, but Mureka is a strong contender. (Nano Banana Pro)


If Mureka can gain traction and paying subscribers, I hope it will add more robust features in the future. It would be great to have real competition for Suno, now that Udio has thrown in the towel, abandoning individual creators in favor of kowtowing to corporate music labels. (Even worse, there are signs that Suno may also turn traitor to indie music. The rumors about their possible upcoming releases are worrisome, but of course not proven.)

 
