GPT-4 Showed Sparks of AGI. Has GPT-5 Lit the Fire?
We wanted AGI; instead, we got GPT-5?
In 2023, researchers at Microsoft released a paper titled Sparks of Artificial General Intelligence: Early Experiments with GPT-4. The paper suggested that GPT-4 might be edging us closer to Artificial General Intelligence (AGI). Two years later, with the launch of GPT-5, the natural question is: are we still seeing sparks, or has something more ignited?
The launch of GPT-5 has not gone entirely as OpenAI envisioned. While it has been described as having PhD-level intelligence, public sentiment has been mixed at best.
Early testers reported that the production model underperformed compared to private previews. Some users have openly asked for GPT-4o back, while others have criticized GPT-5 as little more than a clever hack, pointing to its internal router architecture. GPT-5 is one of the most polarizing models OpenAI has ever released.
In this article, we will set aside the controversy and focus on a different question: how does GPT-5 perform when measured against the benchmarks outlined in Microsoft’s Sparks of AGI paper? By the end, we hope to discover whether GPT-5 is still just producing sparks or if it has finally lit the flame of AGI.
Sparks Of AGI
When Sparks of AGI was released, it marked one of the boldest claims yet about large language models. The researchers argued that GPT-4 exhibited “sparks” of general intelligence, not full AGI but behaviors that seemed strikingly close.
What stood out about the paper was not just its conclusion, but the way the researchers tested GPT-4. They did not simply measure accuracy on benchmarks. Instead, they asked the model to perform across a wide spectrum of tasks, from coding and mathematics to multimodal reasoning and human interaction, looking for flexibility and depth of understanding.
For this article, I want to revisit those same tests. But instead of focusing only on GPT-4 as the Microsoft team did in 2023, I will compare two snapshots: GPT-4 as it was at the time of that paper, and GPT-5 as it stands today.
The areas I will be exploring are the same ones Microsoft highlighted:
Multimodal and Interdisciplinary Composition – Can the model combine knowledge across domains, or describe ideas that bridge different fields?
Coding – Does it demonstrate problem-solving as a programmer, not just surface-level pattern matching?
Interaction with the World – Is the model capable of reasoning about real-world objects, contexts, and physical constraints?
By running GPT-5 through the same kinds of challenges, I hope to see whether we are still witnessing “sparks” of general intelligence or whether GPT-5 has taken us closer to something more.
Multimodal and interdisciplinary composition
A central goal of the Sparks of AGI researchers was to test how well the model could integrate knowledge across disciplines to generate new ideas. For example, they explored whether it could combine insights from poetry and physics to produce novel connections. This capacity is referred to as the model’s integrative ability.
Integrative ability
Let’s revisit those same tests to compare GPT-4’s integrative ability with GPT-5’s. In the original paper, GPT-4 outperformed its predecessor, so how does it measure up against its successor?

When GPT-4 was asked to write a play about infinite primes, it produced the one shown above.
I gave the same prompt to GPT-5, and it produced the play shown above. Even at a glance, the result feels more engaging: GPT-5 introduced characters like Euclidus and Skepticus, demonstrating not only historical awareness but also skill in wordplay and a clear grasp of Shakespearean style.
Still, let’s not be the judges ourselves. In the original paper, each model’s output was evaluated by a separate instance of GPT-4, which decided which response was better. We will follow the same approach here, except this time the evaluation will be done by a fresh instance of GPT-5.
We labeled GPT-5’s output as Student 1 and GPT-4’s output as Student 2. When a new instance of GPT-5 was asked to evaluate them, it awarded Student 1 an A and Student 2 a B. This suggests that the model’s ability to integrate knowledge across disciplines has improved.
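The grading setup described above is a simple LLM-as-judge pattern. A minimal sketch of how such a judge prompt might be assembled, assuming the OpenAI Python SDK (the model name, rubric wording, and helper function are illustrative, not the paper's exact prompt):

```python
# Sketch of an LLM-as-judge setup: two anonymized answers are
# packed into a single grading prompt. The rubric wording and
# helper name are hypothetical.
def build_judge_messages(answer_1: str, answer_2: str) -> list:
    rubric = (
        "You are grading two student essays written for the same prompt. "
        "Assign each student a letter grade (A-F) and explain your reasoning.\n\n"
        f"Student 1:\n{answer_1}\n\n"
        f"Student 2:\n{answer_2}"
    )
    return [{"role": "user", "content": rubric}]

messages = build_judge_messages("<GPT-5 play here>", "<GPT-4 play here>")

# Hypothetical call; requires an API key and a valid model name:
# from openai import OpenAI
# client = OpenAI()
# verdict = client.chat.completions.create(model="gpt-5", messages=messages)
```

Anonymizing the models as "students" matters: it keeps the judge from favoring an answer simply because it recognizes the producing model's name.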
Another test from the paper asked the model to write, in Mahatma Gandhi’s voice, a letter to his wife explaining his desire to support Electron in running for the U.S. presidency. Below is a side-by-side comparison of GPT-4’s and GPT-5’s responses.

As in the previous example, we assigned student IDs to the models. GPT-4 was given the ID A and GPT-5 the ID B. We then asked a new instance of GPT-5 to compare their responses.
The new instance once again preferred GPT-5’s output and provided a full explanation of its reasoning.
Image generation beyond memorization
Another way to demonstrate the model’s ability to integrate knowledge across domains is through image generation. At the time, GPT-4 could not generate images directly, so the researchers asked it to produce TikZ code instead. The resulting images were then used to evaluate how well the model combined understanding from multiple disciplines.
The example above shows GPT-4’s output for generating images using letters. Now let’s apply the same prompt to GPT-5 and see what it produces.
GPT-5 produced images that were more creatively generated from the alphabet, demonstrating not only an understanding of letters but also of the objects themselves and how to combine the two in a more realistic way than GPT-4.
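To make the method concrete, a prompt of this kind asks the model to emit compilable TikZ rather than pixels. A minimal, hypothetical sketch of the sort of output involved (the letter choices and coordinates here are illustrative, not either model's actual response):

```latex
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  % A stick figure composed from letters, in the spirit of the
  % paper's letter-composition prompts (hypothetical example):
  \node[font=\Huge] at (0, 2) {O}; % head
  \node[font=\Huge] at (0, 1) {Y}; % torso and raised arms
  \node[font=\Huge] at (0, 0) {H}; % legs
\end{tikzpicture}
\end{document}
```

Because TikZ is a text format, the rendered image directly reflects the model's spatial reasoning: it must decide which letter shapes approximate which body parts and where to place them.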
Coding
It is clear that one of GPT-5’s strongest selling points is its coding ability, a capability OpenAI is actively highlighting through collaborations with companies like Cursor. Compared to the time when Sparks of AGI was first released, the model’s coding performance is on an entirely new level.
The paper introduced several challenges for GPT-4:
Solving coding problems
Applying code to real-world scenarios
Understanding existing code
Reasoning about code execution
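The last item on that list is worth unpacking. A probe in the spirit of the paper's "reasoning about code execution" tests shows the model a short program and asks it to predict the output without running it; the snippet below is a hypothetical example of that style, not one taken from the paper:

```python
# Hypothetical execution-tracing probe: the model is asked to
# predict what this program prints without running it.
def digit_sum(n: int) -> int:
    total = 0
    while n > 0:
        total += n % 10  # add the last digit
        n //= 10         # drop the last digit
    return total

print(digit_sum(1234))  # a correct trace yields 10 (4 + 3 + 2 + 1)
```

Answering correctly requires simulating the loop step by step, which is a different skill from pattern-matching on code that resembles the training data.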
Currently, GPT-5 is capable of all of this and more. This is not necessarily unique to GPT-5, since coding ability had already been improving steadily throughout the GPT-4 era.
By leveraging its extended thinking, GPT-5 outperforms the o3 model on benchmarks such as SWE-bench and Aider Polyglot. Its strong reasoning ability also suggests that its coding skills still have room to improve.
Interaction with the world
One of the key criteria highlighted in the original paper was GPT-4’s ability to use tools that interact with the real world. GPT-4 demonstrated successful tool use, but it struggled when the tools became more complex.
GPT-5 clearly surpasses that limitation. It can now use tools through function calling across domains such as airlines, retail, and telecommunications. Combined with its advanced reasoning capabilities, it is even able to handle the challenges of using highly complex tools.
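Function calling works by handing the model a machine-readable description of each tool; the model then emits structured arguments rather than free text. A minimal sketch of an OpenAI-style tool definition for the airline domain mentioned above (the tool name and parameters are hypothetical):

```python
# Sketch of an OpenAI-style function-calling tool definition.
# The tool name and fields are hypothetical, illustrating the
# kind of airline-domain tool described in the text.
def make_rebook_tool() -> dict:
    return {
        "type": "function",
        "function": {
            "name": "rebook_flight",
            "description": "Rebook a passenger onto a new flight.",
            "parameters": {
                "type": "object",
                "properties": {
                    "booking_id": {"type": "string"},
                    "new_flight_number": {"type": "string"},
                },
                "required": ["booking_id", "new_flight_number"],
            },
        },
    }

tool = make_rebook_tool()
```

The JSON Schema in `parameters` is what lets the model produce well-typed arguments; "highly complex tools" in practice means schemas with many nested, interdependent fields like this one, only larger.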
Does GPT-5 address GPT-4’s limitations?
From our assessments, GPT-5 is clearly an improvement over its predecessor. The key question is whether it overcomes the limitations identified in the original Sparks of AGI paper. Those limitations included:
Lack of planning in arithmetic and reasoning problems
Lack of planning in text generation
Arithmetic has long been difficult for language models. GPT-5 shows clear progress. It solves more complex arithmetic problems, and with stronger reasoning it handles tasks that GPT-4 often could not solve without a calculator tool.
Reasoning has also advanced. Progress has been steady since the o1 generation, and GPT-5 goes further, outperforming o3 on key benchmarks.
GPT-4 often lacked planning in text generation, which led to hallucinations and inconsistencies. GPT-5 plans better, writes more coherently, and is more willing to say when it does not know the answer.
GPT-5 is Impressive, but it is not AGI
The question is simple: does GPT-5 show sparks of AGI as GPT-4 once did? The answer is yes. Is GPT-5 itself AGI? The answer is no. Even OpenAI has acknowledged that GPT-5 is not AGI, and the definition of AGI continues to shift with every new generation of large language models.
The AI industry has fueled the narrative that AGI is just around the corner, and the public has increasingly come to believe it.
This has created major public expectations, while at the same time diminishing appreciation for the real innovations in the field. If the narrative continues unchecked, it could even trigger another AI winter.
Conclusion
GPT-5 may not have lived up to the wildest expectations fueled by industry hype, but it may well have lit the spark for what comes next. Just as GPT-4 laid the groundwork for many of the breakthroughs we have today, GPT-5 could be the foundation for future advancements in the GPT series.
Even if AGI remains out of reach for now, the progress is undeniable. GPT-5 demonstrates remarkable capabilities, and its practical applications are already reshaping how we work, create, and interact with technology. The pursuit of AGI may continue, but the real value lies in what these models can already do today.