2025 — A Year of Paradigm Shifts in LLM Research
2025 was a watershed year for large language models (LLMs). A confluence of new research directions, scaling laws, and product releases dramatically reshaped the field. Below I outline the most consequential developments and what they mean for the future of AI.
1. Reinforcement Learning from Verifiable Rewards (RLVR)
In 2025, RLVR emerged as a powerful new training stage that sits alongside the traditional supervised fine-tuning (SFT) and reinforcement-learning-from-human-feedback (RLHF) phases. By training LLMs against automatically verifiable rewards across many environments (e.g., math or code puzzles), the models spontaneously develop strategies that look like “reasoning” to humans: they learn to break problems into intermediate steps, iterate, and converge on answers (see the DeepSeek R1 paper for examples).
Unlike SFT and RLHF (both short fine-tuning phases with modest compute), RLVR uses an objective, hard-to-game reward function, which allows much longer optimization runs. It turned out to offer a very high capability-to-cost ratio, absorbing compute that would otherwise have been allocated to pretraining. Consequently, most of the capability gains of 2025 came from LLM labs digesting this surplus compute in the new stage: we saw LLMs of roughly the same scale but with much longer RL runs. Moreover, the new stage introduced a brand-new control knob (and associated scaling laws): capability can now be modulated by generating longer reasoning traces and increasing “thinking time,” i.e., as a function of compute at inference. OpenAI’s o1 (late 2024) was the first demonstration of an RLVR model, but the release of o3 (early 2025) marked a clear inflection point; you can intuitively feel the difference.
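To make “verifiable reward” concrete, here is a minimal Python sketch of the kind of reward functions such environments use. The function names and details are my own illustration, not any lab’s actual reward code; the point is that the answer either checks out mechanically or it does not, so there is no reward model to exploit.

```python
import re
import subprocess
import tempfile

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Reward 1.0 iff the last number in the model's answer matches the known one."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", model_answer)
    return 1.0 if nums and nums[-1] == ground_truth else 0.0

def code_reward(generated_code: str, test_code: str) -> float:
    """Reward 1.0 iff the generated code passes a hidden unit test."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # hung or looped forever: no reward
```

An RL loop then samples many reasoning traces per problem and reinforces the ones that end in reward 1.0, which is also why capabilities spike precisely in the domains where such checkers can be written.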
2. Ghosts vs. Animals / Sawtooth Intelligence
2025 was the first year, at least for me and I think for the industry at large, that we began to internalize the “shape” of LLM intelligence more intuitively. We are not “evolving/cultivating animals”; we are “summoning ghosts.” Everything in the LLM stack is different (neural architecture, training data, training algorithms, and especially optimization pressures), so the entities we get in intelligence space are fundamentally different, and it is misleading to think of them through an animal lens. In terms of supervision signals, human neural networks were optimized for tribal survival in a jungle, whereas LLM neural networks are optimized to imitate human text, collect rewards on math puzzles, and earn human upvotes on LM Arena. Because RLVR only works in verifiable domains, LLM capabilities “spike” near those domains, yielding an interesting sawtooth performance profile: these models are simultaneously brilliant generalists and confused, cognitively limited elementary schoolers, and at any moment a jailbreak trick might fool them into leaking your data.
[Meme: human intelligence plotted in blue, AI intelligence in red. I like this version (sorry, I lost the original post on X) because it points out that human intelligence is also sawtooth in its own way.]
Relatedly, in 2025 I observed a general apathy toward, and loss of trust in, benchmarks. The core issue is that benchmarks are essentially verifiable environments, which makes them immediately susceptible to RLVR and its weaker variants (such as synthetic data generation). In the typical “bench-maxxing” process, LLM lab teams inevitably build environments near the small region of embedding space that a benchmark occupies and grow sawtooth spikes to cover it. Training on the test set has become a new art form.
What does it look like to crush every benchmark yet still not reach AGI?
I have written more on this topic elsewhere.
3. Cursor / A New Tier of LLM Applications
The most noteworthy thing about Cursor (aside from its meteoric rise this year) is that it compellingly reveals a new tier of “LLM applications”—people are now talking about “Cursor for X.” As I emphasized in my Y Combinator talk this year (slides and video), LLM applications like Cursor bundle and orchestrate LLM calls for specific verticals:
- They perform “context engineering.”
- They orchestrate multiple LLM calls under the hood, forming increasingly complex DAGs while carefully balancing performance and cost trade-offs (see the sketch after this list).
- They provide a GUI tailored to the specific application for human interaction.
- They offer an “autonomy slider.”
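As a toy illustration of the DAG bullet above, here is a minimal orchestration sketch in Python. Everything in it is hypothetical (the call_llm stub, the node names, the “small”/“large” model tiers); it shows the shape of the idea, not how Cursor actually works.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str, model: str) -> str:
    # Placeholder for a real provider call (HTTP request, SDK, etc.).
    return f"[{model} model's answer to: {prompt[:40]}...]"

@dataclass
class Node:
    prompt_template: str
    model: str                        # cheap vs. expensive model = cost/perf trade-off
    deps: list[str] = field(default_factory=list)

def run_dag(nodes: dict[str, Node], query: str) -> dict[str, str]:
    """Execute LLM-call nodes in dependency order (assumes the graph is acyclic)."""
    results: dict[str, str] = {}
    remaining = dict(nodes)
    while remaining:
        for name, node in list(remaining.items()):
            if all(dep in results for dep in node.deps):
                context = "\n".join(results[dep] for dep in node.deps)
                prompt = node.prompt_template.format(query=query, context=context)
                results[name] = call_llm(prompt, node.model)
                del remaining[name]
    return results

# A tiny Cursor-like pipeline: retrieve with a cheap model, plan/edit with a big one.
nodes = {
    "retrieve": Node("Find code relevant to: {query}", "small"),
    "plan": Node("Plan an edit for: {query}\n{context}", "large", deps=["retrieve"]),
    "edit": Node("Write the edit.\n{context}", "large", deps=["plan"]),
}
print(run_dag(nodes, "rename the User class"))
```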
In 2025 there was a lot of discussion about the “thickness” of this new application tier. Will the LLM labs capture all the applications, or will there be ample greenfield for LLM apps? Personally, I suspect the labs will keep producing general-capability “undergraduates,” while LLM applications will take those models and, by supplying private data, sensors, actuators, and feedback loops, organize and fine-tune them into specialized deployment teams that actually ship in particular verticals.
4. Claude Code / AI Living on Your Computer
Claude Code (CC) is the first compelling demonstration of an LLM agent: it chains tool use and reasoning in a loop to perform extended problem solving. What stands out to me about CC is that it runs on your own computer, together with your private environment, data, and context. I think OpenAI got this wrong by focusing its Codex/agent efforts on cloud-container deployments orchestrated from ChatGPT rather than on localhost. While a swarm of cloud agents feels like the “AGI endgame,” we live in a middle world of sawtooth capabilities, where takeoff is slow enough that simply running an agent on a personal machine, hand in hand with developers and their specific setups, makes more sense. CC got this prioritization right and packaged it in a beautiful, minimal, engaging CLI form factor, changing what AI looks like: no longer just a website like Google, but a “sprite/ghost” that “lives” on your computer. This is a new paradigm for interacting with AI.
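The core loop behind such an agent is surprisingly small. Below is a sketch in Python; the llm() stub, the message format, and the two tools are assumptions for illustration, not CC’s actual implementation. The structure is what matters: the model reasons, requests a tool, the tool runs locally against your files and shell, and the result is fed back until the model produces a final answer.

```python
import subprocess

def llm(messages: list) -> dict:
    """Stand-in for a chat-completion call. A real agent would call a provider
    and parse tool requests (tool name + args) out of the response."""
    return {"tool": None, "content": "stub answer"}

# Tools run locally, against the developer's own files and shell.
TOOLS = {
    "read_file": lambda path: open(path).read(),
    "run_shell": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True).stdout,
}

def agent_loop(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm(messages)                 # model decides: act or answer
        if reply.get("tool") is None:         # no tool requested -> final answer
            return reply["content"]
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": str(result)})  # feed back
    return "step budget exhausted"
```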
5. Vibe Coding
2025 was the year AI crossed a capability threshold where people could build impressive programs in plain English while forgetting about the code itself. Funnily enough, I coined the term “vibe coding” in this tweet without expecting it to spread as far as it did.
With vibe coding, programming is no longer strictly the domain of highly trained professionals; anyone can do it. In that sense it is another example of what I wrote about in “Power to the people: How LLMs flip the script on technology diffusion”: in stark contrast to essentially all prior technologies, ordinary people benefit from LLMs far more than professionals, corporations, and governments do. But vibe coding doesn’t just let laypeople approach programming; it also lets trained professionals write more software that otherwise wouldn’t have been written. In nanochat, I vibe-coded my own custom high-efficiency BPE tokenizer in Rust instead of using an existing library or deeply learning Rust. This year I vibe-coded many projects as quick demos of things I wanted to exist (see, for example, menugen, llm-council, reader3, hacker news time capsule). I even vibe-coded an entire throwaway app just to find a bug, because why not: code suddenly became free, temporary, malleable, and disposable. Vibe coding will reshape software and change job descriptions.
6. Nano Banana / LLM GUI
Google’s Gemini Nano Banana is one of the most astonishing, paradigm-shifting models of 2025. In my worldview, LLMs are the next major computing paradigm, analogous to the personal computers of the 1970s and 1980s. We should therefore expect similar innovations, driven by analogous forces: personal computing, microcontrollers (cognitive cores), an internet for agents, and so on.
From a UI/UX perspective, chatting with an LLM today is a bit like typing commands into a computer console in the 1980s. Text is the raw, native data representation for computers (and LLMs), but it is not the format humans prefer, especially when consuming information. People don’t actually enjoy reading text; it is slow and laborious. People prefer visual and spatial ways of consuming information, which is why GUIs were invented in traditional computing. Likewise, LLMs should converse with us in formats we like: images, infographics, slides, whiteboards, animations/videos, web apps, and so on. The early, current versions of this are emojis and Markdown, which “dress up” and visually lay out text with headings, bold, italics, lists, tables, and so on to make it easier to consume.
But who will actually build the LLM GUI? In this worldview, …
TL;DR: 2025 was a turning point for LLMs, largely because of the emergence of RLVR. It let LLMs exhibit human-like reasoning on verifiable tasks, gave us a more intuitive grasp of their “shape,” and sparked a wave of new applications, from ghost-like agents living on your computer to vibe-coded software, that are reshaping how we build and interact with AI.