Jakob Nielsen

New AI Models: GPT 4o mini and Llama 3.1 405B

Summary: AI evolves. Prices drop. Performance rises. Giants compete. Open-source challenges proprietary. Future uncertain. Progress accelerates. No likely long-term winner.

 

Buckle up for a thrilling ride through the latest AI breakthroughs. We’re talking cheaper models, beefed-up performance, and a wild race between tech giants. Get ready to explore what's hot in the world of artificial intelligence!


AI Prices Drop with GPT 4o mini

OpenAI has launched a smaller version of GPT that is almost as good as its flagship model, GPT 4o. The new model is called GPT 4o mini. Names like this confirm my often-stated belief that the Abominable Snowman is responsible for the product naming strategy in the OpenAI marketing department.


The Abominable Snowman is hard at work on his assignment from OpenAI: To coin a product name for “GPT 4o mini.” The “o” supposedly stands for “omni” so “o mini” somewhat translates into “a little everywhere.” (Leonardo)


Prices are as follows per 1 million input tokens (the equivalent of 2,500 book pages, or more than all the books I’ve written put together):


  • GPT 4o: $5.00

  • GPT 4o mini: $0.15

  • GPT 3.5 Turbo: $0.50


With these prices, there doesn’t seem to be any reason for most people to continue using GPT 3.5, since 4o mini supposedly scores much better on the benchmarks.


The full-featured 4o is 33x more expensive than the mini model. Is slightly better AI performance worth that much? For some applications, clearly yes, but for most, probably not. The OpenAI press release (linked above) compares 4o mini favorably with several other AI models, but disingenuously doesn’t include Claude 3.5 Sonnet in the comparison charts. Claude 3.5 Sonnet is almost certainly better than mini, but at $3.00 per 1M tokens, it’s also 20x the price. (I really want to see how much better the forthcoming Claude 3.5 Opus will be.)
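To make those multiples concrete, here is a minimal cost-comparison sketch in Python. The model labels are informal, and the prices are the per-1M-input-token rates listed above (output tokens, which cost more, are ignored for simplicity):

```python
# Per-1M-input-token prices quoted in this article (July 2024).
PRICE_PER_1M_INPUT = {
    "gpt-4o": 5.00,
    "gpt-4o-mini": 0.15,
    "gpt-3.5-turbo": 0.50,
    "claude-3.5-sonnet": 3.00,
}

def cost(model: str, input_tokens: int) -> float:
    """Dollar cost of sending input_tokens tokens to model."""
    return PRICE_PER_1M_INPUT[model] * input_tokens / 1_000_000

mini_rate = PRICE_PER_1M_INPUT["gpt-4o-mini"]
for model, price in PRICE_PER_1M_INPUT.items():
    print(f"{model}: ${price:.2f}/1M tokens ({price / mini_rate:.0f}x the mini rate)")
# gpt-4o comes out at 33x and claude-3.5-sonnet at 20x the mini rate,
# matching the multiples in the text.

print(f"A 10,000-token request to gpt-4o-mini costs ${cost('gpt-4o-mini', 10_000):.4f}")
```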


This huge price drop reflects enhanced algorithmic efficiency on the backend. AI software was initially horribly inefficient, but the old code is rapidly getting replaced by better code that consumes less inference compute. This follows the long-term trend discussed in my article The Decade of AI Super-Acceleration.


AI is getting cheaper, rapidly. (Ideogram)


Llama 3.1 405B Released

Facebook/Meta released the newest version of its frontier model: Llama 3.1 405B. It’s finally appropriate to refer to Llama as a “frontier” model, because measurements show that it performs on par with GPT 4o and Claude 3.5 Sonnet.


Llama 3.1 unleashed. (Leonardo)


Meta published a 92-page (!) paper about the new model, entertainingly titled “The Llama 3 Herd of Models”. (The “herd” includes some smaller and cheaper models, but I’ll focus on the flagship model in these comments.)


Meta’s Llama AI now jumps about as high as the other frontier AI models. However, this bar is still low compared to where AI will be in a few years. (Leonardo)


The paper has several interesting tidbits:



The AI scaling law continues to hold, as Llama climbs the peaks of AI performance. Llama 3.1 required about 50x as much training compute as Llama 2. Llama 4 may require between 10^27 and 10^28 FLOPs to train, corresponding to 9-10 months of the xAI Supercluster’s compute. Meta, of course, is building its own super data centers. (Ideogram)
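As a rough sanity check on the caption’s 9-10 month figure, here is a back-of-the-envelope sketch. The cluster size, per-GPU throughput, and utilization are my assumptions (roughly matching public descriptions of a 100,000-H100 cluster), not numbers from Meta’s paper:

```python
# Sustained compute of a hypothetical 100,000-H100 cluster.
GPUS = 100_000
PEAK_FLOP_PER_SEC_PER_GPU = 1e15  # ~1 PFLOP/s dense BF16 per H100 (order of magnitude)
UTILIZATION = 0.40                # typical large-scale training efficiency

cluster_flop_per_sec = GPUS * PEAK_FLOP_PER_SEC_PER_GPU * UTILIZATION

for training_flops in (1e27, 1e28):
    months = training_flops / cluster_flop_per_sec / (3600 * 24 * 30)
    print(f"{training_flops:.0e} FLOPs -> {months:.1f} months")
# 1e27 FLOPs works out to ~9.6 months, matching the caption's estimate;
# 1e28 would take ~96 months on the same cluster.
```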


  • Meta and xAI are both talking tough about how much better their next AI models will be. OpenAI is more soft-spoken. We’ll see who wins the next round. My prediction is that there will likely not be a single AI winner in the long run, with multiple frontier models slugging it out for temporary AI supremacy until the next releases come around and shake up the picture.

  • The new Llama 3.1 405B performs about as well as GPT 4o and other frontier models, though each model outshines the others slightly on specific metrics. For example, both models score 93% on AP exams, averaged across 15 topics. AP (“Advanced Placement”) courses are fairly advanced high school courses in the United States. About 20% of American high school graduates pass at least one AP exam, so the ability of AI to score high on all of them matches the perception that “current AI is as good as a smart high school graduate.”


Here's a comparison of the AP exam scores for GPT 4o and Llama 3.1 405B. There’s a fairly high correlation of r = 0.53 (p < 0.05) between the two models’ exam scores, so there’s rough (but not perfect) agreement about what topics AI is good at. The most prominent outlier is AP Physics, where GPT scored only 71% (the only score below 80% in the dataset) while Llama scored 93%.


Exam scores on 15 AP exams for two leading AI models. (Data from Meta’s paper linked above)
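For readers curious how a number like r = 0.53 is produced: it’s a standard Pearson correlation over the 15 paired exam scores. A minimal sketch follows; the two score lists are illustrative placeholders (only the AP Physics pair, 71% vs. 93%, is quoted above), so running this will not reproduce r = 0.53 exactly:

```python
from scipy.stats import pearsonr

# Hypothetical per-exam scores standing in for the 15 AP results
# reported in Meta's paper (the real data would go here).
gpt_4o = [71, 95, 92, 97, 88, 96, 90, 98, 94, 91, 99, 95, 93, 96, 92]
llama  = [93, 94, 90, 96, 85, 92, 88, 97, 93, 90, 98, 96, 91, 95, 94]

r, p = pearsonr(gpt_4o, llama)  # Pearson r and its two-sided p-value
print(f"r = {r:.2f}, p = {p:.3f}")
```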


Llama is an open-source AI model. (In contrast, OpenAI, despite its name, keeps its models proprietary, with hidden weights.) Mark Zuckerberg makes a big point about the benefits of open models: they let others distill the big model down to small, specialized models for local or embedded use, or use it as the base for advanced models that are fine-tuned for domain-specific or company-specific purposes. These arguments sound credible, but we shall see how many tier-2 AI labs actually produce something useful on top of Llama.
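To show what open weights mean in practice, here is a minimal sketch of downloading a Llama 3.1 model for local use via Hugging Face transformers. I use the small 8B member of the herd, since 405B needs a multi-GPU server; the model ID is the one Meta published at launch, and access requires accepting Meta’s license first:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # gated: accept Meta's license first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# With the weights in hand, the usual next steps are LoRA-style fine-tuning
# on domain data or distillation into a smaller student model -- neither of
# which is possible with a proprietary, API-only model.
inputs = tokenizer("What do open weights let developers do?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```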


Meta, OpenAI, xAI, and Claude are in the ring, fighting for AI supremacy, and more entrants are eager to join the fight. I don’t believe there will be a single long-term winner. (Leonardo)


As an aside, the new Llama model is 4x as scary as the EU allows: the EU’s AI Act deems any AI trained with more than 10^25 FLOPs a systemic risk, and Llama 3.1 was trained with about 4x that much compute. After playing with the new model, I find it clearly ridiculous to consider it a risk to society. Come on, it's only a smart high school graduate! It’s truly bad for Europe that the EU specified a limit in FLOPs for outlawing AI instead of regulating real risk in the form of nasty use cases (e.g., creating deepfakes, which absolutely should be illegal, except for clearly defined satire). Things are only a little better in the U.S., where Federal Government regulations kick in at 10^26 FLOPs. So AI can only consume a little more than 2x the Llama training compute before the Feds start limiting our ability to help humanity through better AI.
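The arithmetic behind those multiples, using the roughly 3.8×10^25 FLOPs of training compute that Meta reports in the paper:

```python
LLAMA_31_TRAINING_FLOPS = 3.8e25  # reported in "The Llama 3 Herd of Models"
EU_AI_ACT_THRESHOLD = 1e25        # "systemic risk" line in the EU AI Act
US_THRESHOLD = 1e26               # reporting threshold under US federal rules

print(f"Llama 3.1 vs. EU threshold: {LLAMA_31_TRAINING_FLOPS / EU_AI_ACT_THRESHOLD:.1f}x over")
print(f"Headroom under US threshold: {US_THRESHOLD / LLAMA_31_TRAINING_FLOPS:.1f}x")
# -> 3.8x over the EU line, and only ~2.6x headroom before the US line.
```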


To give you an idea of the new model's capabilities, I used Llama-3.1-405B-Instruct (AI model names are just terrible) to produce a bullet-list summary of my newsletter for Monday that uses emojis for the bullets. Usually, I use Claude 3.5 or GPT 4o to create my bullet lists, and I only post them on LinkedIn, not in the newsletter itself.


💡 Innovation through simplicity: lessons from LEGO's Creative Play Lab

📊 How LEGO's matrix org chart drives creativity and innovation

📚 How mockups books can bring your ideas to life

🤖 Marc Andreessen's 2016 predictions on AI: how did they hold up?

💼 The impact of AI on jobs: what UX designers need to know

📈 The need for a growth mindset in UX design

🤔 How to design for an AI-driven future where jobs are changing

🌟 The transformative power of AI on the economy

🌐 The social implications of AI: how to think about the impact of AI on society

👥 Limitations of role-based prompting in AI systems

🎯 Specifying the target audience can improve AI prompts

🔮 The future of prompt engineering: what can we expect?


I feel that these bullet points are not as good as what I usually get from Claude 3.5 or GPT 4o. In particular, Llama used a very limited set of emojis that were less creative and fun than what I usually get from AI. As an example, here’s the first bullet point I got when running this newsletter through Claude 3.5:


🧱 LEGO's simplicity: Pick up a brick and start playing


Honestly, that’s better than Llama’s light bulb.
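For readers who want to reproduce the experiment above: many hosts serve Llama 3.1 405B through OpenAI-compatible APIs, so a minimal sketch looks like the following. The base URL and model ID here are Together AI’s (an assumption; check your provider’s documentation), and the prompt wording is approximate:

```python
from openai import OpenAI

# Llama 3.1 405B served through an OpenAI-compatible endpoint.
# Base URL and model ID below are Together AI's; other hosts differ.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_API_KEY",  # placeholder
)

newsletter_text = "..."  # paste the article text here

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    messages=[{
        "role": "user",
        "content": "Summarize this newsletter as a bullet list, using a "
                   "fitting emoji as each bullet:\n\n" + newsletter_text,
    }],
)
print(response.choices[0].message.content)
```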

 
