Unreliability of AI in Evaluating UX Screenshots
Summary: In a case study of ChatGPT evaluating 12 e-commerce screenshots, most of the AI-driven redesign suggestions were inconsistent and untrustworthy. Human UX expertise must be employed to judge UX advice from AI.
ChatGPT 4 tantalizes us with the promise of instantaneous UX critiques. With just an uploaded screenshot, whether from a live product or a design prototype, we can prompt the AI with a seemingly simple question: “What UX improvements can be made to this page?” While the AI’s response is immediate, its utility proves inconsistent and heavily contingent upon your own level of UX acumen.
The Baymard Institute took ChatGPT 4 for a test drive to analyze 12 screenshots of e-commerce websites for which they already knew the usability problems based on prior work. The outcome? Discouraging, to say the least.
(The Baymard Institute is the world’s leading authority on e-commerce usability, having conducted a staggering amount of usability testing of e-commerce websites. They claim 140,000 hours of user research. Suffice it to say that they are not hallucinating their guidelines for e-commerce UX design, which are highly credible due to this extensive research.)
Should AI be at the table when your UX team conducts a screenshot design review? (Image by Dall-E.)
More Bad Than Good: AI’s Dubious Redesign Suggestions
What do you get from an AI usability analysis? The following pie chart shows the Baymard folks’ analysis of the UX redesign recommendations from ChatGPT. Only 19% of the recommendations were judged to be sound UX advice, meaning that implementing them in a redesign would improve the website. The vast majority of AI redesign recommendations were junk, according to Baymard.
A few redesign recommendations (9%) were directly harmful and would likely make the website worse. A huge number of recommendations (72%) would be a waste of time if they were implemented. They wouldn’t make the site worse but wouldn’t improve it, so why spend resources on them?
What should we make of these findings? It depends on how good and efficient you are at judging the AI’s redesign recommendations. If you are sufficiently experienced and knowledgeable that you can correctly identify the harmful and useless suggestions, these hallucinations won’t make it into your redesign project. In that case, the only question is whether the time you spend judging and rejecting so many bad recommendations is more or less than the time you save in getting a list of good UX recommendations in a few seconds.
Baymard reports that their human experts spent between 2 and 10 hours analyzing each of the screenshots in the study. Let’s take the midpoint of that range and say it takes a human expert 6 hours to produce a complete usability analysis of an e-commerce screenshot. In this study, the ChatGPT analysis yielded an average of 11 hallucinations (bad recommendations) per screenshot. How long does it take to read and reject that lousy content? Hard to say, but I think a strong UX expert might be able to do it in an hour. A waste of time, but maybe not that bad. It’s always easier to react to something you’re shown than to discover insights from scratch.
The Hidden Gap: AI’s Blind Spots in Usability Assessment
A second, and maybe more important, question is what we do not get from an AI usability analysis. The following pie chart shows the Baymard analysis of the disposition of all the static usability problems in the screenshots. Did AI find them or not?
AI only identified 24% of the usability problems in the screenshots, whereas 76% of the problems were overlooked. This analysis is generous because I am only considering static UX issues related to individual screenshots.
Many additional UX problems are related to the interaction flow between screens. By definition, ChatGPT has no chance of discovering problems stemming from the movement between pages as long as it is only presented with individual screenshots seen in isolation. We can certainly hope that future AI products will gain the ability to analyze workflow problems, perhaps from uploaded video recordings of real use.
In Baymard’s analysis of the e-commerce sites in the experiment, only 57% of the overall usability problems were isolated to individual screens, whereas 43% came from moving between screens. The above pie chart only visualizes the disposition of these static usability problems. To consider the total user experience of the websites in question, we would have to add a hulking pie slice for the unaddressed dynamic issues.
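To put a rough number on that combined view, here is a minimal Python sketch that multiplies the two percentages reported above. It assumes that ChatGPT found none of the between-screen problems, which follows from it only seeing isolated screenshots. The function and constant names are illustrative and do not come from Baymard.

```python
def overall_detection_share(static_share: float, static_found: float) -> float:
    """Share of ALL usability problems the AI found.

    Assumption: the AI finds zero dynamic (between-screen) problems,
    because it was only shown individual screenshots.
    """
    return static_share * static_found


STATIC_SHARE = 0.57  # problems isolated to individual screens (from the article)
STATIC_FOUND = 0.24  # share of static problems ChatGPT identified (from the article)

found = overall_detection_share(STATIC_SHARE, STATIC_FOUND)
missed = 1 - found

print(f"Found by AI:  {found:.1%} of all usability problems")   # 13.7%
print(f"Missed by AI: {missed:.1%} of all usability problems")  # 86.3%
```

Under that assumption, the AI found roughly one in seven of all the usability problems on these sites.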
When Old Meets New: The Problem with AI’s Aged UX Advice
We all know that the training-data cutoff for the current version of ChatGPT was over two years ago. This is one of the reasons that Perplexity.AI has gained popularity by better integrating current data from the web. Usually, being based on old data is not a problem when using ChatGPT for UX work because the field is so stable: the human brain doesn’t change, and therefore, most usability findings remain the same year after year.
However, while broad user behaviors are robust, some detailed usability guidelines can change as users get accustomed to specific UI design patterns. Such detail-level changes caused a few ChatGPT recommendations to move from the “good advice” column to the “useless advice” column. The advice was based on user observations from past decades, as presumably documented in various pre-2021 articles in the AI training data. But Baymard’s user research from 2023 demonstrated that users now understand several details in checkout forms that previously caused trouble. Thus, those elements can be retained now, even if it might have been good advice in the past to modify them.
ROI Says: Not Worth the Time
Now, let’s do the ROI calculations: It takes 6 hours to analyze a screenshot comprehensively. Let’s say that the time spent on static and dynamic issues is proportional to their number. This means it will take a human expert 3.4 hours (57% of 6 hours) to analyze the static issues in a screenshot. AI will identify 24% of the problems, which would have taken the human expert 0.8 hours (24% of 3.4 hours) to find.
Thus, using AI saves our human expert 0.8 hours. (He or she will still have to spend 2.6 hours to analyze the screenshot further to discover the issues AI overlooked and will also have to spend a further 2.6 hours analyzing the workflow issues.) I am not sure this calculation holds entirely because it assumes no overhead in integrating the AI findings and the subsequent human findings. But let’s use this number for now.
Bottom line: AI saves our human expert 0.8 hours to find usability problems, but it costs an additional 1.0 hour to identify and reject its hallucinations — net outcome: minus 0.2 hours.
If all these assumptions hold, we get negative ROI from asking AI to perform screenshot analysis.
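The ROI arithmetic above can be retraced in a few lines of Python. This is only a sketch under the article's stated assumptions: 6 hours per screenshot, time proportional to the number of issues, and a 1-hour guess for rejecting hallucinations. The function name and dictionary keys are made up for illustration.

```python
def screenshot_roi(total_hours: float, static_share: float,
                   ai_found_share: float, review_hours: float) -> dict:
    """Estimate the hours saved and lost by asking AI to analyze one screenshot.

    Assumes expert time is proportional to the number of issues, with no
    overhead for merging AI findings with human findings.
    """
    static_hours = total_hours * static_share      # time on single-screen issues
    dynamic_hours = total_hours - static_hours     # time on between-screen issues
    saved_hours = static_hours * ai_found_share    # work the AI did for us
    return {
        "static_hours": static_hours,
        "dynamic_hours": dynamic_hours,
        "saved_hours": saved_hours,
        "remaining_static_hours": static_hours - saved_hours,
        "net_hours": saved_hours - review_hours,   # negative means a net loss
    }


result = screenshot_roi(
    total_hours=6.0,      # midpoint of the reported 2 to 10 hours
    static_share=0.57,    # share of problems isolated to single screens
    ai_found_share=0.24,  # share of static problems ChatGPT identified
    review_hours=1.0,     # the author's estimate for rejecting bad advice
)

for name, hours in result.items():
    print(f"{name:>24}: {hours:+.1f} h")
```

Rounded to one decimal, the results match the figures in the text: 3.4, 2.6, 0.8, 2.6, and a net of minus 0.2 hours. The sketch also makes the break-even point visible: the AI only pays off if rejecting its bad recommendations takes less than the saved 0.8 hours, or roughly 49 minutes per screenshot.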
One last point: my calculations assumed we have an experienced UX expert at hand. The times will likely be longer if we only have junior UX staff. But the big problem is that junior UX folks are likely to overlook several of the hallucinations, leading to even more wasted time as the team implements useless redesign suggestions.
AI is prone to reporting hallucinations: was the violinist really a pineapple at this concert? Such bad outcomes currently make AI dubious for analyzing screenshot usability. (Hallucination by Dall-E.)
AI’s UX Potential: Tap the Good, Avoid the Bad
This case study found that it’s not worth having ChatGPT 4 make redesign recommendations for e-commerce website screenshots. It might even be dangerous to do so if a senior UX professional does not review the AI output.
This doesn’t mean that AI is useless for UX design or redesign projects, just that we should not employ it to discover usability problems. AI is highly beneficial for ideation because it can produce many highly diverse ideas in no time. Many of these ideas will be poor, but there will also be ideas that you did not think of yourself and that are worth pursuing. AI can give you ideas for issues to investigate in user research and for new design directions. This user research must be conducted with real (live) customers. Still, AI can vastly improve any UX research team’s work with anything from drafting consent forms to writing tasks for studying the research questions. (Check the tasks, of course, but you can get 8 suitable tasks written in no time if you ask AI to give you 20. The rest should be rejected, but they were free, so what do you care.)
Finally, looking ahead, let’s remember Jakob’s First Law of AI: Today’s AI is the worst we’ll ever have. ChatGPT 4 fails as a skilled usability analyst because its good insights are not worth the time it takes to guard against its false insights. That said, the first time you have a new UX intern analyze a design, do you forge ahead and implement all of his or her recommendations immediately? No, you take your mentorship responsibility seriously and walk through the report to point out its weaknesses. And, quite likely, the green intern has one or two fresh insights that you hadn’t thought of.
ChatGPT 4 is the greenest of interns when it comes to UX analysis. However, ChatGPT 5 might be more like an intern at the end of the internship: much more mature but still not a senior UX professional. ChatGPT 6 could conceivably reach the level of a true UX professional’s ability to evaluate user interface designs.
And a word to the wise: no UX expert, however seasoned, is fail-proof. That’s one of the oldest lessons from my research on heuristic evaluation: no single evaluator, no matter how skilled, will catch every usability hiccup. That’s OK because you need a hybrid approach incorporating heuristic evaluations and user testing for iterative improvement of your UX design.
A great use of AI in UX: rapid ideation. For this concluding image, I uploaded this entire article to ChatGPT 4 and asked it to recommend illustrations. It gave me these 4 ideas, and I used the top left as the lead image (though I’m using the pineapple-hallucinating robot as the social media image because it’s more likely to halt users who’re doomscrolling their feed). I almost picked the usability engineer in a lab coat in the lower right as a less busy image. I am not afraid of leaning on a cliché, but two clichés in one image were too much, so upper left it was.