Summary: Specifying emotional intent when designing AI voices | AI narrows skill gaps | New music video using animation from Kling 1.5 | Talk to Santa as an AI-generated voice | Misadventures in image prompting | Jeff Bezos spends 95% of his time at Amazon on AI | Combining multiple AI tools into a single workflow to create a music video | 13 AI-UX job openings at Microsoft | New AI podcast creation
UX Roundup for December 16, 2024. (Midjourney) This image inspired me to make a UX Christmas song (YouTube, 2 min.)
Specifying Emotional Intent When Designing AI Voices
Voice-generation AI service Hume now allows users to design a custom voice by adjusting sliders along 10 emotional dimensions. Examples include “assertiveness” (from timid to bold) and “enthusiasm” (from calm to enthusiastic).
I doubt I personally will dabble in voice design for my avatars: it seems like too much work compared with picking a predesigned voice. But for major brands or creative storytelling projects, you absolutely want custom voices that convey exactly the desired emotional characteristics.
Note that this new way of designing voices exemplifies AI’s ability to support intent-based outcome specification. The user tells the AI what effect he or she wants to achieve, and the AI figures out how to deliver this.
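To make the idea concrete, here is a minimal sketch of what such an intent-based voice specification could look like as a request payload. This is my own illustration, not Hume’s actual API: the base-voice name and the 0-to-1 slider scale are assumptions, and only the two dimensions named above come from the product description.

```python
# Hypothetical sketch (not Hume's actual API): intent-based voice design
# expressed as a request payload. Slider values are assumed to run from
# 0.0 (e.g., timid / calm) to 1.0 (e.g., bold / enthusiastic).
import json

voice_spec = {
    "base_voice": "neutral-narrator-1",   # placeholder name, not a real voice ID
    "emotional_dimensions": {
        "assertiveness": 0.8,  # timid -> bold
        "enthusiasm": 0.6,     # calm -> enthusiastic
        # ...the real product exposes 10 such dimensions in total
    },
}

print(json.dumps(voice_spec, indent=2))
```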
Voice-driven interactions can have more impact when the AI voices are designed based on emotional parameters instead of traditional voice-generation parameters. (Ideogram)
AI Narrows Skill Gaps, Take 10
I haven’t actually counted the number of research studies that have produced this same conclusion, but it’s certainly at least 10, if not more. Virtually every study that has measured user gains from AI has found that the less-skilled users benefit more than the more-skilled users.
AI narrows skill gaps. It doesn’t close the skill gap between the best and the worst users: some people will always perform better than others. But the difference is smaller with AI than without it.
AI improves the performance of the best workers, so everybody gains. But it improves performance even more for lower-ranked workers.
Chengcheng Liao and colleagues from the Sichuan University Business School in China recently published another study with the same conclusion. In a controlled experiment with 1,090 salespeople, the sales staff who used an AI tool sold 5.5% more than those without AI help. This is not an impressive lift from AI, but we don't know what AI tool was used. Of more interest is the finding that “inexperienced agents benefited nearly six times more than their experienced counterparts.”
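To see how a lift like this narrows (but does not close) the gap, here is a back-of-the-envelope calculation. The baseline sales figures and the individual lifts below are invented for illustration; only the roughly 6:1 ratio of benefits reflects the study’s finding.

```python
# Back-of-the-envelope illustration of skill-gap narrowing. Baseline numbers
# are invented; only the ~6:1 ratio of AI benefits mirrors the study.
experienced_before, inexperienced_before = 100.0, 70.0  # assumed monthly sales (units)
experienced_lift, inexperienced_lift = 0.02, 0.12       # assumed lifts, roughly 6:1

experienced_after = experienced_before * (1 + experienced_lift)
inexperienced_after = inexperienced_before * (1 + inexperienced_lift)

print(f"Gap without AI: {experienced_before - inexperienced_before:.1f} units")
print(f"Gap with AI:    {experienced_after - inexperienced_after:.1f} units")
# The experienced agent still sells more, but the gap shrinks (30.0 -> 23.6 units).
```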
AI is a seniority accelerant that uplifts users who would have performed the worst without AI assistance.
My metaphor for this is that AI is a forklift for the mind: it helps users lift heavy cognitive burdens. In a real warehouse, introducing forklifts means that weak workers can lift pallets just as heavy as those the strong workers lift. Similarly, for knowledge workers, AI helps untrained or stupid staff narrow the gap to highly trained, smart people.
I report on this new Chinese study not because the findings are remarkable but precisely because the Sichuan researchers replicate the findings from several Western studies. The more we see the same general conclusions emerge from highly disparate research projects, the more credible those conclusions become.
AI is a forklift for the mind. (Leonardo)
New Music Video Using Animation from Kling 1.5
I have produced a new version of my music video, Future-Proofing Education in the Age of AI. (YouTube, 3 min. video)
This version uses animations made with Kling 1.5. The song uses new music made with Suno 4.
Upgrading my music video about the future of education with improved animation from Kling 1.5. (Base image for I2V made with Ideogram.)
Compare with the video I made in June 2024 for the same song with Luma Dream Machine. Immense progress in AI video in just 6 months. 2024 truly was the year of AI video. (The two video versions were both made with image-to-video based on the same image of a singer on stage with a robot band, which I made with Ideogram. This makes it fair to compare the two videos to assess progress in AI video generation.)
Is this version perfect? No, there’s still room for much improvement in AI video in 2025.
(To appreciate advances in AI video since early 2024, see my music video about dark design from April 2024: truly primitive animation, so I redid that video in July 2024.)
AI video quality progressed at lightning speed in 2024. AI still can’t beat a Taylor Swift music video, but I wouldn’t be so sure in a year or two. (Ideogram)
Talk to Santa
ElevenLabs, a leader in AI voice generation, has launched a free service where you can “talk to Santa Claus.” While this is just for fun, it’s a way for you (and maybe your kids) to experience multimodal AI.
Call Santa and talk to him — or rather, his AI-generated voice. (Ideogram)
Misadventures in Image Prompting
Come on, Midjourney. You clearly know how an old-fashioned telephone looks. How come you can’t draw Santa Claus talking on the phone? (I guess that’s why I keep my Ideogram subscription active.)
Jeff Bezos Spends 95% of His Time at Amazon on AI
In a recent interview, Jeff Bezos (the founder of Amazon.com) stated that 95% of his time at Amazon is now spent on AI. Hardcore founder mode: focus on the one thing that matters for the future, while leaving the small stuff to others! He explained his AI emphasis by calling AI a “horizontally enabling layer,” like electricity, that makes everything else in the company better. Thus, if you get this one thing right, many other things will follow and be good as well.
Jeff Bezos is all-in on AI. Not just “AI First” (as any decent business leader should be by now), but “Only AI” to a first approximation, with only a 5% margin of error. (Grok)
Combining Multiple AI Tools Into a Single Workflow to Create a Music Video
I find it fascinating how the leading members of the AI creator community wield various combinations of AI tools in their creative processes.
The latest example is a music video (YouTube, 4:32 min.) by Marco van Hylckama Vlieg. He posted another video (18 min.) to demonstrate his workflow for creating this music video:
Generate the song on Suno and then split it into separate “stems” for the vocals and the instrumentals. Export the vocal stem as an audio file.
Generate multiple still images of the singer, using Midjourney’s character reference feature and/or RenderNet.
Import the vocal stem into HeyGen together with many different still images of the singer.
Use that same vocal track multiple times to generate a set of lip-synched avatar videos of the singer performing the song, basing each avatar video on a different still image.
Import the original full soundtrack of the song into CapCut together with all the avatar videos showing different views of the singer performing the song. (Also, as a separate workflow, generate short B-roll video clips and import those into CapCut as well.)
In CapCut, mute all the avatar videos and use only the original song from Suno for the audio. Put all the avatar videos onto the CapCut timeline and cut between them. Since they were all generated to lip-synch the same audio file, they will be perfectly aligned, both with each other and with the full soundtrack of the song. (See the sketch after this list.)
Add B-roll clips instead of the singer during song sequences without vocals.
For additional visual interest, use CapCut features such as “AI movement” to add zooms, shakes, or other movement to the original avatar shots from HeyGen, which tend to be rather static.
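To make the alignment trick concrete, here is a small sketch of the underlying idea (my own illustration, not Marco’s actual edit and not a CapCut API): because every avatar take is lip-synched to the same audio file, a cut only changes which take is visible, never the source time, so lip-synch survives every cut.

```python
# Toy edit decision list (EDL) illustrating why cutting between avatar takes
# preserves lip-synch: every take shares the song's timebase, so each cut just
# picks which take is shown for a stretch of the timeline without shifting time.
from dataclasses import dataclass

@dataclass
class Cut:
    take: str        # which avatar video is shown
    start_s: float   # timeline start (seconds) == source start in that take
    end_s: float     # timeline end (seconds)   == source end in that take

def build_edl(takes: list[str], song_length_s: float, shot_length_s: float) -> list[Cut]:
    """Alternate between takes in fixed-length shots, never shifting source time."""
    edl, t, i = [], 0.0, 0
    while t < song_length_s:
        end = min(t + shot_length_s, song_length_s)
        edl.append(Cut(take=takes[i % len(takes)], start_s=t, end_s=end))
        t, i = end, i + 1
    return edl

for cut in build_edl(["closeup.mp4", "wide.mp4", "profile.mp4"], song_length_s=30, shot_length_s=8):
    print(cut)
```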
In summary, making a 4-minute music video took a lot of work. In contrast, I spent less than 2 hours making my latest music video about Future-Proofing Education in the Age of AI. (About an hour to make clips of my singer and the B-roll in Kling, followed by about an hour to edit it all together in CapCut.) There’s certainly no contest between my video and Marco’s: his has vastly more visual interest, and I never bothered trying to lip-synch my singer. (I like my music better than Marco’s “trance/techno” song, but that’s a matter of musical preference.)
Making a nice music video with AI currently requires the user to orchestrate a combination of different AI tools, each of which is good for one step in a long workflow. (Leonardo)
Most of my readers will probably not bother making music videos, though I recommend trying a few as personal projects, for example, for a birthday party. The more fundamental insights from this case study are:
The current state of the art in AI is primitive, even if it’s advancing rapidly. You often have to resort to convoluted workflows with too many steps to get the desired results. You also need to combine a wide range of AI tools, because each shines in a few areas, whereas no single tool does everything well.
We need one-step AI products to create music videos — and other outcomes users want. (Midjourney)
There’s immense business potential for building integrated AI products that handle the full workflow for users so that they can focus on their creative intent and not on how to fit the puzzle pieces together. After all, AI is intent-based outcome specification, as opposed to the legacy user interface style of command-following. However, the full potential of outcome specification only materializes when the user can specify his or her intent at a higher level. As an example, “I want a 2-minute video of my preferred avatar explaining Jakob’s Law for an audience of UX professionals” (or “for business stakeholders who do not know much about UX but need to be convinced about the benefits of complying with Internet-wide design standards”).
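As a sketch of what such a higher-level intent specification might look like as a single request to an integrated service, consider the following. The service and all field names are hypothetical; no current product accepts a request like this.

```python
# Hypothetical request to an imagined integrated video-creation service.
# No such API exists; the field names are invented for illustration.
import json

intent = {
    "outcome": "video",
    "length": "2 minutes",
    "presenter": "my preferred avatar",
    "topic": "Jakob's Law",
    "audience": "business stakeholders unfamiliar with UX who need to be "
                "convinced of the benefits of Internet-wide design standards",
}

# The user specifies only the desired outcome; the service would be responsible
# for scripting, voice, lip-synch, B-roll, and editing behind the scenes.
print(json.dumps(intent, indent=2))
```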
There are rich business opportunities in building AI tools based on a task analysis of user needs so that a single tool offers an integrated workflow where users can specify their high-level intent and let the AI figure out how to create the desired outcome. (Ideogram)
Microsoft AI UX Jobs
Microsoft has 13 job openings to work on design and user research for its AI products. I haven’t checked all the listings, but the application deadlines seem to vary between December 31, 2024, and January 4, 2025, so get cracking on your applications!
Most positions are in Redmond, WA (the Seattle area where MS has its headquarters), with a few in Mountain View, CA (a secondary MS location with about 3,000 employees that I believe started when they acquired PowerPoint in 1987 — Mountain View is also the home of Google, in case you get tired of working at Microsoft).
As I pointed out in my discussion of OpenAI’s Sora launch event last week, we’ve now reached the point for AI products where UX becomes a major differentiator. What matters is less the capabilities of the underlying AI model than users’ abilities to control the AI to produce results that meet their goals. Usability as a competitive advantage again incentivizes AI vendors (like OpenAI and Microsoft) to boost their AI design teams, and we can finally see this happening.
Microsoft has abandoned its nerdy legacy and is now hiring UX professionals by the dozen to improve the usability of its AI products. (Leonardo)
New AI Podcast Creation
Google made a splash with NotebookLM’s ability to create incredibly natural-sounding podcasts based on written materials. You can watch a podcast I made about why UX leaders should go “founder mode” (YouTube, 5 minutes).
Other AI vendors have jumped onto the podcasting bandwagon and launched their own tools for creating AI-generated podcasts. I recently made another version of a “Founder Mode” podcast with ElevenLabs’ podcast feature (YouTube, 8 minutes).
NotebookLM and ElevenLabs are audio-only, so they make the soundtrack but not the full video. I added the video with Kling and HeyGen for the two experiments I linked above. Only HeyGen offered an easy way to lip-synch the speakers with the audio.
Now, HeyGen has launched its own podcast-creation feature, which I used to make a podcast about the four metaphors for working with AI: intern, coworker, teacher, and coach (Instagram, 4 minutes). I also added some B-roll with animations of the 4 AI roles, made with Kling 1.5. Watching current avatars for 4 minutes straight is boring, so we need to intersperse additional animation for visual interest.
HeyGen is the only one of the three services to create audio and video simultaneously, and it makes good use of its lip-synch capabilities. I also appreciate that HeyGen is attempting to animate more of the avatars than just their faces. However, the gesture animation is truly primitive at this time. (Of course, HeyGen’s podcasting feature is still labeled as experimental and lives in the “Labs” section of the website, so we shouldn’t be too critical.)
It’s great that HeyGen is moving toward having avatars with gestures, but the gestures currently don’t match what the “person” is saying and look more awkward than emotionally engaging. Give them a few more months to work on gesture animation. (Leonardo)