
UX Roundup: Image Models Compared | Writing for the Web | AI vs. Human Cognition | E-Commerce Usability | Multimodal Music Models

  • Writer: Jakob Nielsen
  • 13 min read
Summary: New Image Models: Microsoft MAI-Image-1 vs. Grok Image 0.9 | Writing digital content: Julius Caesar as a role model | Should AI mimic human cognition? | AI helps e-commerce sell more | AI singers understand the lyrics

 


UX Roundup for November 17, 2025. (GPT Image-1)


New Image Models: Microsoft MAI-Image-1 vs. Grok Image 0.9

Microsoft has released a new text-to-image model on www.bing.com/create, called MAI-Image-1. I tested it with the same prompt I had used with Seedream 4 to make one of the illustrations in my article “Estimating AI’s Rate of Progress with UX Design and Research Methods.”



Test image of human–AI handshake, made with Microsoft’s new MAI-Image-1 image model.


I am disappointed. This is the AI image quality we expected in 2024, not in late 2025. Compare with the images generated by Grok’s experimental image model 0.9:



Grok 0.9 gave me 12 alternatives to choose from in less time than MAI-Image-1 needed to generate its single image. Some of these are also not that great, but the best of this bunch handily beats Microsoft. Furthermore, with Grok, a simple scroll provides me with 12 more versions in a few seconds. (These fast response times encourage broader exploration of the latent solution space. Speed matters for AI-assisted creativity!)



My preferred human–AI handshake from Grok Image 0.9. Grok got the image credit footer right (even if the font size is a little bigger than I prefer in a footer), whereas MAI-Image-1 garbled this text.


To be fair to Microsoft, this is its first independently developed image model, and I believe it will improve with each new release. AI advances quickly these days. However, it would have benefited from following Grok’s example and using a version number that clearly marks this as an early trial. In any case, more competition is welcome, as we don’t want to be stuck with a few monopolistic AI models from the big labs with their love of censorship.


(I am not so naïve as to expect Microsoft to defend creator freedom vigorously in the face of pressure from controlling politicians or pearl-clutching puritans. I am too small fry for even the worst politicians to censor, and I don’t care how many pearls get clutched because of my content. But Microsoft needs to get along with powerful special interests, which will likely cause it to opt for censorship in some cases. That said, of all the major AI providers, Microsoft probably has respect for individual user freedom most deeply embedded in its corporate DNA, thanks to its history as a personal-computer champion creating products to empower individual users.)


Julius Caesar as Web Writer

Use the inverted pyramid style when writing digital content: place the most crucial information at the very beginning. The “bottom line” shouldn’t be at the bottom of your content, but at the top. Follow with supporting details, and conclude with more general background information.


This structure originated in journalism during the 19th century, largely due to the invention of the telegraph. Transmitting stories over telegraph lines was expensive (often priced per word) and unreliable; connections could drop at any moment. To ensure the most critical facts of a story (the famous who, what, where, when, and why) made it through, reporters learned to put them in the very first sentences. If the transmission was cut off, the editor would still have a usable, albeit brief, story.



[Left:] The inverted pyramid was good in the 19th Century, and it’s still good in the 21st Century. [Right:] Internet wits invented the term TL;DR (too long; didn’t read) for the same idea of starting with the bottom line. (GPT Image-1)


This journalistic technique is now a cornerstone of website usability because it directly supports how people read online. Users don’t read web pages word-for-word; they scan. They are task-oriented and impatient, often deciding in seconds whether a page has the information they need.


Using the inverted pyramid improves usability in three key ways:


  1. It respects the user’s time: It immediately answers the user's primary question, delivering value above the fold (the part of the screen visible without scrolling).

  2. It aids scannability: Users can scan the headline and the first paragraph to get the main idea and decide if they want to invest time in reading the details.

  3. It improves comprehension: Front-loading the conclusion provides a clear context, making the supporting details that follow easier to understand and process.


Let’s look at a very old example: Gaius Julius Caesar (100–44 BC) was one of the top three military geniuses of antiquity (the others being Alexander the Great and Hannibal). However, he was also one of the most accomplished writers in ancient Rome. Two of his most famous lines illustrate the difference between the writing style you likely learned in school and what’s recommended for digital content.



Even while leading his legions to defeat masses of Gallic warriors, Julius Caesar found time to write. (Seedream 4)


  • Gallia est omnis divisa in partes tres. (All Gaul is divided into three parts, with Gaul being mostly equivalent to France.) This is the first line of Caesar’s book Commentarii de Bello Gallico (Commentaries on the Gallic War), which little kids such as myself learned to translate in the early days of Latin class.

  • Veni, vidi, vici. (I came, I saw, I conquered.) Said by Caesar in 47 BC after his swift and decisive victory over Pharnaces II of Pontus at the Battle of Zela.


When you read Gallia est omnis divisa in partes tres, your likely first conclusion is that the book will be a snooze-fest. In fact, Caesar’s Commentaries are worth reading, but they are not exactly written in the style of a modern action novel, even though he details many battles. This first line is a classic example of the writing style that starts by laying a solid foundation of theory and only later moves to explaining what the entire thing means to the reader.


In contrast, Veni, vidi, vici is short and snappy: it tells the complete story in only 3 short words (in the original Latin). This is what we should aim for in writing for the web and other digital content.



“I came, I saw, I conquered” is one of Julius Caesar’s most famous quotes, mostly because it’s so pithy. Even though it’s more than 2,000 years old, use Caesar’s copywriting as a role model when writing for the web. (Seedream 4)


Should AI Mimic Human Cognition?

We know fairly high general intelligence is possible, since we possess it. We also know that intelligence far surpassing that of the average human is feasible, as evidenced by the few exceptional individuals who possess such brains. Should we build and measure AI according to this role model?



Superintelligence has been achieved several times in human history. However, the best way to achieve it in machines is likely to be different from the way a few human brains became so smart. (Seedream 4)


I don’t think so, for two reasons. First, it is unlikely that the best way to build intelligence that runs on meatware is the best way to build intelligence that runs on computer chips. Many early airplane inventors tried to emulate the way birds fly, with flapping wings. Made sense at the time, since birds were the only known heavier-than-air design to fly. But since the Wright Brothers, we have known that flapping wings are bad for airplanes.



Airplanes don’t fly like birds. AI brains shouldn’t be like human brains. (Seedream 4)


The second reason is that human brains are suboptimal in many ways for the tasks we want AI to perform. Human cognition is the result of an evolutionary process that optimized our chances of survival in the ancestral environment. Avoiding being eaten by saber-toothed tigers scores high for how our brains work. Evaluating business strategies for a company’s next decade: less so.



Most of our brains evolved with the goal of not being eaten: all our long-ago ancestors were exactly the individuals who were not eaten, and therefore left us with genes that make us think in ways optimized to avoid predators. (Seedream 4)


Cialdini’s influence principles are a good example of these evolution-determined human biases: why do we trust pretty and tall people more than plain or short ones? Because looking good and being tall were indications that you (or your parents, which amounts to the same, genetically) were good providers and proficient hunters who brought home lots of meat. You want to follow the advice of those people, not of people who eat less meat. In fact, preferring the advice of tall people would have added a few percentage points to a person’s survival chances and thus perpetuated genes that make the brain comply with Cialdini’s “liking” principle. (This is illogical and can backfire today, when listening to a short nerd like me might improve your business success more than whatever tall salespeople say.)



DNA that encouraged trust in tall and attractive individuals would have been selected for in ancestral times and thus spread through the gene pool. Today, such cognitive biases backfire because we live in a different world. (Seedream 4)


This brings up the question of IQ, where we frequently see posts that a new AI model has scored, say, 130 on an IQ test, which would place it in the top 2% of the human population in countries like the United States. Except that it doesn’t.


IQ tests are in fact great at measuring the intelligence of humans. More than a century of psychometric research has honed IQ tests to the point where they have a high correlation with Spearman’s g (general intelligence), which again is highly correlated with measures of human achievement, such as educational attainment and achievement (how many years of education a person has completed, and how much they have learned, respectively, which are different metrics), job promotions, salary, staying out of jail, patents awarded, highly-cited research papers published, and so forth.


But the actual IQ test is a bunch of seemingly irrelevant challenges, such as repeating back a sequence of random words, forward or backward. Humans can only do this for a small number of words, and it turns out that the more chunks you can keep in short-term memory, the smarter you are and the better you do on real-world tasks that take decades to measure, whereas the “repeat these words back” test takes a minute. Other IQ test elements score things like the ability to move and rotate colored cubes to replicate a pattern, or present a tic-tac-toe–like matrix of 9 fields, where 8 are filled with weird squiggles and the last is left empty, and you have to identify the underlying system that dictates which squiggle completes the sequence.


Score enough of these silly games, and you get a great IQ estimate for a human brain. Those same IQ test components often have zero relevance for the performance of an AI. For example, even the most stupid AI has a context window capable of holding thousands of words, which would equate to far beyond an Einstein-level IQ if the test were meaningful for machine intelligence. (I don’t know if it was ever measured how many random words Einstein could remember, but it would likely have been less than ten.)


Because AI is so different from HI (human intelligence), it makes no sense to measure it the same way. AI has what’s often called “jagged” skills, meaning that it performs much better at some things than others. That’s true for humans as well, but the differences are greater for AI.



Meat and chips. Different ways to achieve intelligence require different ways to measure IQ and its machine equivalent. (Seedream 4)


We should indeed measure the “smartness level” of AI, but this is not an IQ score; it is a new concept. Currently, the best measure of this capability is AI’s ability to complete real-world tasks, particularly when comparing its performance with the same tasks performed by humans. (I particularly like METR’s measures of AI task durations, which show that the duration of tasks AI can complete correctly doubles every 7 months.)
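METR’s doubling claim is easy to project forward. Here is a minimal sketch (my own illustration, not METR’s code; the 60-minute starting horizon and the time spans are hypothetical round numbers):

```python
def projected_task_minutes(start_minutes: float, months_elapsed: float,
                           doubling_months: float = 7.0) -> float:
    """Task duration AI can complete after months_elapsed months,
    assuming exponential growth with a fixed doubling time."""
    return start_minutes * 2 ** (months_elapsed / doubling_months)

# A hypothetical 60-minute task horizon today becomes an 8-hour horizon
# after 21 months (three 7-month doublings):
print(projected_task_minutes(60, 21))  # 480.0
```

The exponential form is why the trend, if it holds, compounds so quickly: three doublings turn an hour-long task horizon into a full workday.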


AI Helps E-Commerce Sell More

Most studies of AI’s productivity impact are time-on-task measures: how fast can people accomplish a task with and without AI? Studies often find productivity gains of around 40% in terms of the amount of work an office worker can accomplish in a day. Some studies also measure quality and usually find that the quality of the work products is either the same or often better, despite being produced faster.


However, office work is often busywork. What about the impact on companies’ bottom line? Data about this is much scarcer.


A new research study by Lu Fang and colleagues from Zhejiang and Columbia Universities sheds some light on AI’s impact on firm performance in the e-commerce sector, reporting on randomized field experiments conducted at a major cross‑border online‑retail platform. (Likely a Chinese e-commerce site like Alibaba.)


Unfortunately, academic publishing is a slow process, and the studies were conducted between September 2023 and June 2024, when the state-of-the-art AI models were GPT-4 and GPT-4o, respectively. I would expect better results from a study using a model like Gemini 3 Pro, which is likely to be the best AI model by late 2025.



Due to the time lag in academic research and publishing, we’re only now seeing research results with GPT-4, which has long been considered an obsolete AI model. (Seedream 4)


The authors studied 7 different workflows in that e-commerce firm, including answering pre-sales customer inquiries, search query refinement, product descriptions, marketing push messages, Google advertising copy, and credit card chargeback disputes.


The sample sizes for most of the experiments were huge, with data from between 1.2 and 13.7 million customers. (The pre-sales inquiry data set was smaller, with only 44,614 customers, and the chargeback case experiment covered only 30,000 cases.)


Business metrics improvements were as follows, for the AI use condition compared with the control group condition of business as usual (without AI):


  • Pre-sale service: 22% conversion rate lift (significant at p < 0.01)

  • Credit card chargeback defense: 15% lift in defense success rate (estimated by the data science team)

  • Marketing push messages: 3.0% conversion rate lift (significant at p < 0.05)

  • Product descriptions: 1.3% conversion rate lift (significant at p < 0.05)

  • Search query refinement: 1.2% conversion rate lift (significant at p < 0.05)

  • Google advertisement: 3.3% lower conversion rate, but this difference was not statistically significant, so it might also have been zero


These business gains from AI are smaller than what we’ve seen for office worker productivity, but they directly translate into cash for an e-commerce company.

The paper estimates the dollar value of these conversion rate increases, accounting for the number of times per year an average customer encounters each process. This yields the following value of AI per process:


  1. Search query refinement: $2.6 per customer per year

  2. Pre-sales service: Between $1.3 and $1.6 per customer per year

  3. Product descriptions: $0.5 per customer per year

  4. Marketing push messages: $0.1 per customer per year


Even though pre-sale service saw the largest improvement from AI, it ranked only second in business value, because customers, after all, don’t ask for service that often. In contrast, e-commerce customers search incessantly, so even though the improvement from the primitive AI at the time of the study was only 1.2%, the financial gains from search were the largest in this study.


This analysis should serve as a lesson to anyone involved in using AI to enhance business processes. We need to measure (or at least estimate) the annualized bottom line impact and not just single-use cases.
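The annualization logic behind this ranking can be sketched in a few lines. All input numbers below are hypothetical placeholders (the paper’s raw encounter frequencies and per-conversion values are not given in this article); the point is only that a small lift on a frequent touchpoint can outweigh a large lift on a rare one:

```python
def annual_value_per_customer(lift: float, encounters_per_year: float,
                              value_per_conversion: float) -> float:
    """Annualized value of an AI-driven conversion-rate lift:
    lift x how often a customer hits the process x value per extra conversion."""
    return lift * encounters_per_year * value_per_conversion

# Hypothetical inputs: search has a tiny lift but is used constantly;
# pre-sales service has a big lift but is rarely invoked.
search = annual_value_per_customer(lift=0.012, encounters_per_year=200,
                                   value_per_conversion=1.0)
service = annual_value_per_customer(lift=0.22, encounters_per_year=2,
                                    value_per_conversion=3.0)
print(search > service)  # the frequent touchpoint wins despite the smaller lift
```

With these placeholder inputs, search is worth roughly $2.4 per customer per year versus about $1.3 for service, mirroring the ordering reported in the study.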



AI practically rained dollars on the e-commerce firm in this study. To assess which workflows to improve with AI, we should measure the actual money to be gained, not just traditional usability metrics like time on task. (Seedream 4)


An interesting tidbit in the paper: the (unnamed) e-commerce company had 50 million AI API calls per day in mid-2024, but a year later, in mid-2025, this had increased to more than 1 billion AI API calls daily. More than 20x growth in AI use in one year. This is a more pragmatic indication that company management did see value from AI, because AI tokens are not free.



The e-commerce firm in this study increased its AI use by 20x in a single year, running a cool billion AI API calls daily by mid-2025. This is the most visible indication that company management found AI to be worth the cost. (Seedream 4)


The e-commerce platform hosts a highly diverse population of sellers, who vary substantially in firm characteristics. (One of the clues that makes me suspect it’s Alibaba.) The effect of using AI to improve search queries differed dramatically, depending on the size of these sellers:


  • Small sellers: 1.7% increase in conversion rate

  • Big sellers: 0.2% increase in conversion rate


A factor of 8x difference in how much AI helped, depending on the seller's size. (Statistically significant at p < 0.01.)


It is likely that small sellers are less sophisticated than big sellers in their ability to write copy that performs well in search, and that this is why their sales were lifted more when AI was used to improve the matching efficiency of consumer queries.



Many studies have found that AI narrows the gap between top and bottom performers. It helps everybody, but often gives a bigger boost to people at the bottom. In this new study, AI helped less professional e-commerce sellers more than it helped big firms. (Seedream 4)


Interestingly, there was a similar effect on the customer side. The researchers divided the users into high and low experience, based on their years since registering with the platform, their number of logins during the 30 days prior, and their total spending on the platform during this period. For all three ways of estimating customer experience, the less experienced customers saw higher conversion rate lifts than the more experienced customers. One likely explanation is that people who shop a lot online are better able to deal with cumbersome search engines and poorly written product descriptions.


Overall, this study adds a fresh angle to the findings from previous studies that current AI tends to be of more help to lower-end users than to higher-end users.



Annoyingly, this newly published study is based on old data, and AI models have improved substantially since the experiments were conducted. We should get much better results in the future. (Seedream 4)


AI Singers Understand the Lyrics

I am very impressed with Suno’s version 5 upgrade that I used for my song last week, Disabled GUI Buttons: Show, Disable, or Hide?



Disabled buttons: my latest music video.


In particular, notice the way the vocalist sings the words “look like they’re alive” (0:30) versus “just hide it all” (1:05), representing two of the three design patterns (always showing features, even when they are inactive, versus completely removing a feature from the screen when it’s inactive).


The vocalization implies an understanding of what the lyrics actually mean. Of course, this is child’s play for a language model (these are not complex ideas), but new for a music model. Under the hood, I suspect that Suno is indeed using a language model to analyze the lyrics and inform the vocalization to inject more feeling into the song.


Adding “soul” to AI content has long been a goal, and Suno is taking a step in the right direction. Getting emotional singing right would not be surprising for a love song, of which there must be hundreds of thousands in the training data. (My only feeble attempt at a love song was “Be My UX Valentine,” which was made in February with Suno 4 and is less impressive.) Adding proper emotional singing to a song about UX design is more impressive.



One way AI music is getting better is through increased understanding of the meaning of the lyrics before vocalizing them. (Seedream 4)

