- Unaligned Newsletter
- Posts
- The Limits of LLMs on the Road to AGI
The Limits of LLMs on the Road to AGI
Thank you to our Sponsor: Grow Max Value

LLMs are powerful. LLMs alone are unlikely to produce Artificial General Intelligence (AGI) because LLMs mostly learn statistical structure in language and then predict the next token. That approach can create impressive fluency, strong pattern completion, and useful problem solving in many settings. That approach also leaves major gaps in grounding, planning, memory, verification, causality, and safe agency. AGI needs a system that can reliably pursue goals across unfamiliar situations while staying correct, stable, and safe. A text predictor can imitate pieces of that. Imitation is not the same as robust competence.
LLMs do not have grounded contact with the world by default.
Language describes reality. Language does not guarantee contact with reality. A model trained mostly on text can learn many facts, many narratives, and many patterns about cause and effect, because humans write about cause and effect all the time. A text trained model still lacks direct experience of the physical world. That gap shows up in tasks that depend on geometry, timing, sensor uncertainty, and physical constraints.
A robot arm cannot grasp a cup by producing a good paragraph about grasping. A factory cannot run purely on persuasive language about throughput. A court case cannot be resolved purely on convincing text about what might have happened. Grounding requires measurements, sensors, and interaction loops that connect decisions to outcomes. Without that, a system can produce confident output that sounds true but has no anchor.
• Text only training learns descriptions of reality rather than direct interaction.
• Grounding requires perception streams, feedback, and repeated action outcome cycles.
• Accuracy improves when claims are tied to measurements and logged evidence.
LLMs struggle with long horizon planning without extra machinery.
AGI level performance requires pursuing goals over long time windows. Long time windows require state tracking. Long time windows require checking progress. Long time windows require recovery from failure. LLMs can generate a plan and can revise a plan in conversation. A generated plan is not the same as a plan that is executed step by step with monitoring and correction.
A system that truly plans needs to represent goals, subgoals, constraints, resources, and deadlines. A system that truly plans needs to observe what changed after each action. A system that truly plans needs to detect when an action failed and then choose a fallback. LLMs can describe these behaviors. LLMs do not guarantee these behaviors unless paired with explicit planning and execution modules.
• Planning needs structured state, checkpoints, and progress measures.
• Execution needs verification and rollback rather than one pass output.
• Reliability improves when plans are tested in safe dry runs.
LLMs do not provide durable, trustworthy memory on their own.
A context window is not long-term memory. A long conversation can feel like memory, but the underlying mechanism is limited and recency weighted. AGI requires persistent memory that stores facts, preferences, task history, and outcomes across weeks or months. AGI also requires memory that can be queried precisely, updated safely, and audited.
A durable memory system needs versioning. A durable memory system needs provenance so a system can show where information came from. A durable memory system needs retrieval that is accurate rather than merely plausible. LLMs can help organize memory. LLMs alone do not guarantee faithful recall.
• Persistent memory requires storage beyond a conversation window.
• Good retrieval requires indexing, provenance, and stability controls.
• Auditable memory reduces confident errors in high stakes use.
LLMs can sound like careful reasoners while remaining fragile.
LLMs can produce structured arguments and step-by-step explanations. Those outputs can still contain contradictions, missing steps, and hidden assumptions. Fluent reasoning can mask uncertainty. Fluent reasoning can also mask that nothing was verified.
AGI needs reasoning that is coupled to checks. AGI needs reasoning that is coupled to stable representations of facts. AGI needs reasoning that can detect contradictions and repair them. In practice, that means integrating verifiers, tests, constraint solvers, and sanity checks. For code, tests catch many errors. For math, symbolic checks catch inconsistencies. For real world claims, data sources and measurement pipelines catch mismatches.
• Fluency does not equal correctness when no validation loop exists.
• Consistency requires constraints, tests, and repeated checking.
• Self correction needs explicit mismatch detection, not only rewriting.
LLMs are weak at causality and counterfactual stability.
AGI needs an internal model of how actions change outcomes. Many LLM failures come from confusing correlation with causation. A text-trained model sees many descriptions of causes. A text-trained model does not necessarily learn the underlying causal structure that would allow reliable interventions.
Counterfactual reasoning is a stress test. Counterfactual reasoning asks what would happen if one factor changed while others stayed the same. That requires a model that supports intervention, not only storytelling. A system can describe a counterfactual. A system that uses counterfactuals to plan needs causal representations and mechanisms to test competing explanations.
• Causal competence requires intervention data and explicit causal structure.
• Counterfactual reliability requires separating narrative from mechanism.
• Robust generalization improves when dynamics are learned rather than only described.
LLMs do not produce stable safe agency by scaling alone.
Agency means choosing actions, maintaining goals, allocating resources, and handling tradeoffs. An LLM by itself produces text. Tool use can turn an LLM into an agent, because the text can trigger actions. That increases capability and also increases risk. Once an agent can send messages, run commands, move money, or change systems, errors stop being harmless.
AGI needs goal pursuit that remains aligned under pressure. AGI needs guardrails that hold under adversarial inputs. AGI needs constraints that stay stable even when the system learns new tactics. Scaling a language model does not guarantee this. Strong agency requires governance layers, approval flows, permission boundaries, and continuous monitoring.
• Tools turn words into effects, which raises stakes sharply.
• Safety requires narrow permissions, explicit approvals, and strong logging.
• Alignment requires monitoring for drift and manipulation attempts.
Thank you to our Sponsor: FlashLabs

Yann LeCun, Former Meta AI Chief and a proponent of why LLMs won’t lead to AGI
LLMs are vulnerable to adversarial pressure and instruction hijacking.
Prompt injection is the simplest example. A malicious message can hide instructions that steer an agent into unsafe behavior. This becomes more severe when agents read content from the open internet or from untrusted users and then act automatically.
A safe agent must treat untrusted content as data, not as instruction. A safe agent must separate reading from acting. A safe agent must require explicit planning and approval before any high- risk action. These are architecture problems. These are not solved by simply scaling LLMs.
• Untrusted text should never directly trigger actions.
• High risk actions need confirmation and constrained permissions.
• Sandboxing and monitoring reduce blast radius during experimentation.
A more realistic AGI path looks like systems, not one model.
LLMs can remain central as the interface and the coordinator. AGI requires additional subsystems that LLMs alone do not provide reliably. A plausible stack includes planning, memory, verification, and governance.
Planning means explicit representations of goals and constraints, plus search or optimization across action sequences. Memory means persistent stores with provenance and access controls. Verification means tests, checkers, and cross validation that turn plausible output into reliable output. Governance means policy enforcement, auditability, and escalation paths that keep humans in control of high impact decisions.
This systems view explains why LLMs alone likely will not lead to AGI. LLMs can supply language intelligence. AGI requires reliable end-to-end competence across perception, action, learning, planning, and safety. The gap is not a missing trick. The gap is missing infrastructure that turns language capability into dependable agency.
• LLMs provide broad language competence and coordination.
• Planning modules provide long horizon goal pursuit and recovery behavior.
• Memory modules provide continuity across time and tasks.
• Verification modules provide correctness and reduce brittle failures.
• Governance modules provide safety boundaries and accountable tool use.
Thank you to our Sponsor: EezyCollab
Why “what comes after LLMs” is more about architecture than a single new model type.
The next major wave is likely not a replacement of LLMs. The next wave is a shift in what the core product is. The core product becomes a reliable agentic system with multiple components, where an LLM is one component among several. Progress will be measured less by how beautiful output sounds and more by whether systems can carry out tasks safely, repeatably, and with auditable justification.
In practical terms, the “after LLMs” stack often includes a controller layer that decides when to call tools, when to ask for human input, and when to stop. The stack includes a memory layer that stores long term facts and that can be searched. The stack includes a verification layer that tests outputs. The stack includes a policy layer that blocks unsafe actions. The stack includes monitoring that watches for drift and abuse. None of these are optional at scale.
• Controller layers choose actions rather than only generating text.
• Memory layers provide continuity and support long term projects.
• Verification layers make outputs dependable enough for operations.
• Policy layers keep actions within approved boundaries.
• Monitoring layers catch failures early and support accountability.
What “after LLMs” might look like in concrete technology terms.
Several technology directions matter here. Each direction solves a weakness that LLMs alone do not solve reliably. Each direction also raises new engineering and governance challenges.
One direction is hybrid neuro symbolic systems. Hybrid systems combine a neural model with explicit rules, constraints, or structured reasoning modules. Rules can enforce compliance constraints. Solvers can ensure plans satisfy hard requirements. Symbolic checks can catch contradictions. Neural language can still handle messy inputs and ambiguous phrasing. The result can be less flexible in some ways and far more dependable in others.
• Rules can encode hard constraints and legal requirements.
• Solvers can ensure consistency across complex plans.
• Neural components still interpret messy human inputs.
A second direction is search and optimization at inference time. LLMs produce one sequence at a time. Search methods explore many candidate sequences and select those that satisfy goals. Tool augmented search can test candidates. This is how many strong systems already work in code and math. The same pattern can expand into business workflows, compliance workflows, and planning tasks.
• Search reduces reliance on one shot generation.
• Candidate testing improves reliability for high stakes tasks.
• Selection criteria can include cost, risk, and uncertainty.
A third direction is agentic orchestration with explicit state. This means representing tasks as structured objects. Subtasks have owners. Subtasks have deadlines. Tools have permissions. Decisions have logged rationales. This is less like chatting and more like running a control system for work. This approach is boring in the best way. Boring systems are often the safest systems.
• State makes progress measurable rather than conversational.
• Orchestration enables multi step work across tools.
• Logs support auditing and error analysis.
A fourth direction is stronger verification and evaluation loops. LLMs improve when feedback is tight. Coding is easier because tests give clear feedback. Many real tasks lack tests. So the future involves building test harnesses for more domains. Contracts can be checked for missing clauses. Financial workflows can be checked for reconciliation consistency. Customer support can be checked for policy compliance. Scientific writing can be checked for citation validity. Verification becomes a product feature, not a research detail.
• Domain test harnesses reduce silent failure.
• Verification converts plausibility into operational trust.
• Benchmarks shift toward long running tasks and auditability.
A fifth direction is better learning from interaction, not only from text. Systems can learn by acting in controlled environments. Systems can learn from user corrections. Systems can learn from logged outcomes. This is where deployment becomes part of training. This also raises governance issues because learning from users can import bias, manipulation, or private data unless controls exist.
• Interaction data improves grounding and practical competence.
• Feedback improves alignment with real goals rather than imagined goals.
• Privacy and manipulation risks rise without strict controls.
Why LLMs probably will not “scale” into AGI by themselves.
Some people argue that scale will solve everything. Scale helps. Scale also amplifies. Scale amplifies good pattern learning. Scale amplifies confident mistakes. Scale amplifies prompt sensitivity if no guardrails exist. Scale also tends to produce systems that are harder to interpret and harder to constrain.
The key point is that AGI is not only a capability target. AGI is a reliability target. AGI is a safety target. AGI is an accountability target. A system that sometimes acts like a genius and sometimes makes a basic error is not AGI in the practical sense that matters. Real general intelligence has to be dependable across long time horizons and unfamiliar settings. That dependability usually comes from structure, verification, and governance, not only from a bigger text predictor.
• Scale improves fluency and broad knowledge.
• Scale alone does not guarantee stability, grounding, or safety.
• Structure and verification are required for dependable autonomy.
A compact way to summarize “after LLMs.”
The future is likely a shift from models to systems. The future is likely a shift from chat to control. The future is likely a shift from plausible output to verified output. LLMs remain central as the language and coordination layer. AGI requires additional layers that handle planning, memory, verification, and safe agency. Those layers are what come after LLMs in practice.
• Systems matter more than a single model type.
• Reliability comes from planning, memory, and verification layers.
• Safe agency comes from permissions, policies, and monitoring.
• AGI requires dependable end-to-end behavior, not only strong text.
Looking to sponsor our Newsletter and Scoble’s X audience?
By sponsoring our newsletter, your company gains exposure to a curated group of AI-focused subscribers which is an audience already engaged in the latest developments and opportunities within the industry. This creates a cost-effective and impactful way to grow awareness, build trust, and position your brand as a leader in AI.
Sponsorship packages include:
Dedicated ad placements in the Unaligned newsletter
Product highlights shared with Scoble’s 500,000+ X followers
Curated video features and exclusive content opportunities
Flexible formats for creative brand storytelling
📩 Interested? Contact [email protected], @samlevin on X, +1-415-827-3870
Just Three Things
According to Scoble and Cronin, the top three relevant and recent happenings
Claude in Combat: Report Says U.S. Military Used Anthropic AI in Venezuela Raid
A report citing the Wall Street Journal says the U.S. military used Anthropic’s Claude during a raid in Venezuela via Anthropic’s partnership with Palantir, raising questions because Anthropic’s policies ban violent or weapons related use. Anthropic and the Pentagon did not confirm operational details, and the report has intensified debate over how AI models are being deployed in defense and classified operations. The Guardian
Paramount and Disney Hit ByteDance With AI IP Cease-and-Desists
Paramount Skydance sent ByteDance a cease and desist letter accusing its Seedance video and Seedream image generators of producing AI outputs that closely mimic Paramount franchises and characters, including titles like South Park, Star Trek, The Godfather, and more. Disney also issued a legal demand alleging ByteDance’s tools enable a “pirated library” of Disney characters, as Hollywood groups criticize the model’s viral deepfake style clips; ByteDance says it is strengthening safeguards to curb unauthorized IP and likeness use. Variety
LightBar’s AI “Bounty Hunters” Target Hollywood IP Misuse
LightBar, a new startup profiled by Deadline, aims to help Hollywood studios detect and document potential misuse of copyrighted film and TV content by AI models by recruiting paid “researchers” to probe tools like ChatGPT and Copilot for outputs that resemble protected characters or styles. The company says it will verify submissions, build evidence packages for studios to support lawsuits, settlements, or licensing deals, and may even position itself as a monitoring middleman for partnerships, using recognition tech and prompt tracking to ensure studios are compensated and IP is not exploited. Deadline
Scoble’s Top Five X Posts







