Andrej Karpathy - 2025 LLM Year in Review

2025 — A Year of Paradigm Shifts in LLM Research

2025 was a watershed year for large language models (LLMs). A confluence of new research directions, scaling laws, and product releases dramatically reshaped the field. Below I outline the most consequential developments and what they mean for the future of AI.

1. Reinforcement Learning from Verifiable Rewards (RLVR)

In 2025, RLVR emerged as a new, powerful training stage that sits alongside the traditional supervised‑fine‑tuning (SFT) and RLHF phases. By training LLMs on automatically verifiable rewards across many environments (e.g., math or code puzzles), the models spontaneously develop strategies that look like “reasoning” to humans—they learn to break problems into intermediate steps, iterate, and converge on answers (see the DeepSeek R1 paper for examples).

Unlike SFT and RLHF (both short fine‑tuning phases with modest compute), RLVR uses an objective, non‑cheatable reward function, which allows much longer optimization. It turned out to offer a very high capability‑to‑cost ratio, absorbing compute that would otherwise have been allocated to pretraining. Consequently, most of the capability gains in 2025 came from LLM labs digesting this surplus compute in the new stage; we saw LLMs of roughly the same scale but with much longer RL training runs. Moreover, this new stage introduced a brand‑new control knob (and associated scaling laws): capability can now be modulated by generating longer reasoning traces and increasing “thinking time,” i.e., as a function of inference‑time compute. OpenAI’s o1 (late 2024) was the first demonstration of an RLVR model, but the release of o3 (early 2025) marked a clear inflection point—you can intuitively feel the difference.
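To make the “verifiable reward” idea concrete, here is a minimal sketch (my illustration, not any lab’s actual setup) of a reward function for a math environment: the model’s sampled answer is checked programmatically against a known solution, so the reward cannot be gamed the way a learned preference model can. The answer format, the `sample_completion` helper, and the update step are all hypothetical.

```python
import re

def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 iff the final stated answer matches the known solution.

    Unlike an RLHF preference model, this check is objective: the policy
    cannot please it without actually producing the right answer.
    """
    # Hypothetical convention: the model ends its trace with "Answer: <number>".
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth.strip() else 0.0

# Sketch of the outer RL loop (policy update deliberately omitted):
# for problem, truth in dataset:
#     completion = sample_completion(policy, problem)    # long "thinking" trace
#     reward = verifiable_math_reward(completion, truth)
#     update_policy(policy, completion, reward)          # e.g., a PPO/GRPO-style step
```

The point of the sketch is only that the signal is objective and cheap to compute, which is what lets the optimization run far longer than RLHF.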

2. Ghosts vs. Animals / Sawtooth Intelligence

2025 was the first year—at least for me and, I think, for the whole industry—that we began to internalize the “shape” of LLM intelligence more intuitively. We are not “evolving/cultivating animals,” but rather “summoning ghosts.” Everything in the LLM stack is different (neural architecture, training data, training algorithms, especially optimization pressures), so the entities we get in the intelligence space are fundamentally different, and it is inappropriate to think of them through an animal lens. From the perspective of supervision signals, human neural networks are optimized for tribal survival in a jungle, whereas LLM neural networks are optimized to imitate human text, collect rewards on math puzzles, and earn human up‑votes on LM Arena. Because verifiable domains enable RLVR, LLM capabilities near those domains “spike,” yielding an interesting sawtooth performance pattern—they are simultaneously brilliant generalists and confused, cognitively limited elementary‑schoolers, and can be fooled by jailbreak tricks to leak your data at any moment.

Human intelligence: blue, AI intelligence: red. I like this version of the meme (sorry, I lost the original post on X), which points out that human intelligence is also sawtooth in its own way.

Related to this, I observed a general apathy toward and loss of trust in benchmarks in 2025. The core issue is that benchmarks are essentially verifiable environments, making them immediately susceptible to RLVR and its weaker forms (such as synthetic data generation). In the typical “bench‑maxxing” process, LLM lab teams inevitably build environments near the small region of embedding space occupied by the benchmark and grow sawtooth patterns to cover them. Training on the test set has become a new art form.

What does it look like to crush every benchmark yet still not reach AGI?

I have written more on this topic:

3. Cursor / A New Tier of LLM Applications

The most noteworthy thing about Cursor (aside from its meteoric rise this year) is that it compellingly reveals a new tier of “LLM applications”—people are now talking about “Cursor for X.” As I emphasized in my Y Combinator talk this year (slides and video), LLM applications like Cursor bundle and orchestrate LLM calls for specific verticals (see the sketch after this list):

  1. They perform “context engineering.”
  2. They orchestrate multiple LLM calls underneath, forming increasingly complex DAGs while carefully balancing performance and cost trade‑offs.
  3. They provide a GUI tailored to the specific application for human interaction.
  4. They offer “autonomy sliders.”
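As referenced above, here is a minimal sketch of what those four ingredients can look like in code: a tiny DAG of LLM calls with some context engineering and an autonomy slider. The function names, model tiers, and thresholds are illustrative assumptions on my part, not Cursor’s actual architecture.

```python
from dataclasses import dataclass

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real chat-completions call; swap in any provider SDK."""
    return f"[{model} response to {len(prompt)} chars of prompt]"

def gather_context(task: str) -> str:
    """Placeholder for context engineering: pull in only the relevant files/symbols."""
    return "<relevant code snippets for: " + task + ">"

@dataclass
class EditProposal:
    diff: str
    needs_human_review: bool

def propose_edit(task: str, autonomy: float) -> EditProposal:
    # A small DAG of LLM calls, trading off cost and quality per step:
    context = gather_context(task)
    plan = call_llm("cheap-model", f"Plan the change.\nTask: {task}\n{context}")
    diff = call_llm("strong-model", f"Write a unified diff.\nPlan: {plan}\n{context}")
    check = call_llm("cheap-model", f"Flag obvious problems in this diff:\n{diff}")
    # Autonomy slider: low settings always route the result through the human GUI.
    needs_review = autonomy < 0.5 or "problem" in check.lower()
    return EditProposal(diff=diff, needs_human_review=needs_review)
```

The design point is that the application, not the model, decides which step uses a cheap model, which uses a strong one, and when the result must be routed back through the human in the GUI.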

In 2025 there was a lot of discussion about the “thickness” of this new application tier. Will LLM labs capture all applications, or will there be ample greenfield for LLM applications? Personally, I suspect LLM labs will tend to produce general‑capability “undergraduates,” while LLM applications, by supplying private data, sensors, actuators, and feedback loops, will organize and fine‑tune them into specialized teams actually deployed in particular verticals.

4. Claude Code / AI Living on Your Computer

Claude Code (CC) is the first compelling demonstration of an LLM agent—it chains tool use and reasoning in a loop to perform extended problem solving. Moreover, what stands out to me about CC is that it runs on your own computer, together with your private environment, data, and context. I think OpenAI got this wrong, as they focused their Codex/agent efforts on cloud‑container deployments orchestrated by ChatGPT rather than on localhost. While a swarm of cloud‑run agents feels like an “AGI endgame,” we live in a middle world of sawtooth capabilities where take‑off speed is slow enough that simply running an agent on a personal machine, hand‑in‑hand with developers and their specific setups, makes more sense. CC got the prioritization right and packaged it in a beautiful, minimal, engaging CLI‑style form factor, changing the look of AI—it’s no longer just a website like Google, but a “sprite/ghost” that “lives” on your computer. This is a new paradigm for interacting with AI.

5. Vibe Coding

2025 was the year AI crossed a capability threshold that allowed people to build impressive programs in plain English while forgetting about the code itself. Interestingly, I coined the term “vibe coding” in this tweet without expecting it to spread so far :) With vibe coding, programming is no longer strictly limited to highly trained professionals—anyone can do it. In that sense, it is another example of what I wrote about in “Power to the People: How LLMs Flip the Script on Technology Diffusion,” showing (in stark contrast to all prior technologies) that ordinary people benefit far more from LLMs than professionals, companies, and governments. But vibe coding not only empowers laypeople to approach programming; it also enables trained professionals to write more software that wouldn’t otherwise have been written. In nanochat I vibe‑coded my own custom high‑efficiency BPE tokenizer in Rust instead of using an existing library or deeply learning Rust. This year I vibe‑coded many projects as quick demos of things I wanted to exist (see, for example, menugen, llm‑council, reader3, hacker news time capsule). I even vibe‑coded an entire temporary app just to find a bug, because why not—code suddenly became free, temporary, malleable, and disposable. Vibe coding will reshape software and change job descriptions.

6. Nano Banana / LLM GUI

Google Gemini Nano Banana is one of the most astonishing, paradigm‑shifting models of 2025. In my worldview, LLMs represent the next major computing paradigm, analogous to the personal computers of the 1970s and 1980s. Thus we will see similar innovations emerge for analogous reasons: personal computing, micro‑controllers (cognitive cores), an agent‑based internet, and so on.

From a UI/UX perspective, chatting with an LLM is a bit like issuing commands to a computer console in the 1980s. Text is the raw/preferred data representation for computers (and LLMs), but it is not the format humans prefer, especially when taking information in. People don’t actually enjoy reading text—it’s slow and laborious. Instead, people prefer visual and spatial ways of consuming information, which is why GUIs were invented in traditional computing. Likewise, LLMs should converse with us in formats we like—images, infographics, slides, whiteboards, animations/videos, web apps, etc. Early and current versions of this include emojis and Markdown, which are ways to “dress up” and visually lay out text using headings, bold, italics, lists, tables, and so on, to make consumption easier.

But who will actually build the LLM GUI? In this worldview, …


TL;DR: 2025 was a turning point for LLMs, largely because of the emergence of RLVR. It enabled LLMs to exhibit human‑like reasoning on verifiable tasks, gave us a more intuitive grasp of their “shape,” and sparked a wave of new applications—from ghost‑like agents living on your computer to vibe‑coded software—that are reshaping how we build and interact with AI.

Andrej Karpathy: Software Is Changing (Again)


Hello everyone.

Wow, there are a lot of people here. Hello.

I’m excited to be here today to talk about software in the age of AI. I’ve heard that many of you are students—undergraduates, master’s, PhDs, etc.—about to enter the industry. I think now is an extremely unique and fascinating time to get into it. The fundamental reason is that software is changing again.

I say “again” because I’ve actually given this talk before. The point is, software is constantly evolving, so I always have new material for a new talk. And this time the change is very fundamental. In my view, over the past 70 years software hasn’t changed much at this deep level, and then in the last few years it has changed dramatically about twice. So there is a huge amount of work to do, a huge amount of software to write and rewrite.

Let’s look at the software landscape. Imagine this as a map of software—a cool tool called the “GitHub map.” It contains all the software that has been written—code that tells computers what to do in digital space. If you zoom in, you see all the different repositories, which is essentially all the code that has ever been written.

A few years ago I observed that software was changing, and a new kind of software was emerging, which I called “Software 2.0.” The idea was: Software 1.0 is the code you write for a computer; Software 2.0 is basically neural networks, specifically the weights of neural networks. You don’t write those weights directly; you mostly adjust the dataset and then run an optimizer to create the network’s parameters. At that time, neural networks were viewed as classifiers, like decision trees, so this framework made sense. Now we actually have a GitHub‑like equivalent for the Software 2.0 world—Hugging Face is essentially the GitHub for Software 2.0. There’s also Model Atlas, where you can visualize all the code written there. By the way, the huge dot in the middle is the parameter set of the Flux image generator. Every time someone fine‑tunes a Flux model, it’s like committing a git change in this space, creating a different image generator.

So we have: Software 1.0 = code that programs a computer; Software 2.0 = weights that program a neural network. Here is an example of an AlexNet image‑recognition network.

Up until recently, all the neural networks we were familiar with were essentially fixed‑function computers—e.g., mapping an image to a class. The recent change—what I think is a very fundamental shift—is that neural networks have become programmable via large language models. To me, this is a brand‑new kind of computer that deserves a new name: Software 3.0. Basically, your prompt is now the program for an LLM. And the striking thing is that these prompts are written in English. It’s a very interesting programming language.

A simple way to see the difference: for sentiment classification you could imagine writing some Python code, or training a neural network, or prompting a large language model. Here’s a short prompt; you can imagine tweaking it to program the computer in slightly different ways. So we have Software 1.0, Software 2.0, and now I think we’re seeing—many of you have probably noticed that a lot of GitHub code is no longer just code, it’s interspersed with a lot of English—a new category of code is growing. It’s not just a new programming paradigm; what’s amazing is that it’s written in our native language, English.
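To make the 1.0‑versus‑3.0 contrast concrete, here is a minimal sketch (mine, not from the talk) of sentiment classification in both styles; the keyword lists and prompt wording are illustrative, and Software 2.0 (training a small classifier on labeled examples) is omitted for brevity.

```python
# Software 1.0: explicit rules written by a programmer.
def sentiment_v1(text: str) -> str:
    positive = {"great", "love", "excellent"}
    negative = {"terrible", "hate", "awful"}
    words = set(text.lower().split())
    score = len(words & positive) - len(words & negative)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Software 3.0: the "program" is an English prompt handed to an LLM.
SENTIMENT_PROMPT = (
    "Classify the sentiment of the following text as positive, negative, or neutral. "
    "Reply with a single word.\n\nText: {text}"
)

def sentiment_v3(text: str, llm_call) -> str:
    # llm_call is any function that sends a prompt to an LLM and returns its reply.
    return llm_call(SENTIMENT_PROMPT.format(text=text)).strip().lower()

print(sentiment_v1("I love this great talk"))  # -> "positive"
```

Tweaking the English in `SENTIMENT_PROMPT` is the Software 3.0 equivalent of editing the rules in `sentiment_v1`.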

A few years ago, when this idea blew my mind, I posted a tweet that I’ve pinned ever since: “It’s amazing that we’re now programming computers in English.”

When I was at Tesla we were working on Autopilot, trying to get the car to drive itself. I showed a slide: the car’s inputs at the bottom, and the software stack producing steering and acceleration. I observed that Autopilot contained a lot of C++ code (Software 1.0) and some image‑recognition neural networks. As we kept improving Autopilot, the neural networks grew in capability and scale, and all the C++ code was being eliminated. Many capabilities that were originally written in 1.0 migrated to 2.0. For example, a lot of cross‑camera and cross‑time stitching information is now done by neural networks, and we deleted a lot of code. So the Software 2.0 stack actually ate the Autopilot software stack. I thought that was awesome, and now we’re seeing the same thing again: a new kind of software is swallowing the stack. We have three completely different programming paradigms. If you’re about to enter the industry, being fluent in all three is a great idea because each has its own strengths and weaknesses. You might want to use 1.0, 2.0, or 3.0 to achieve a certain function: train a neural network? Just prompt an LLM? Or write explicit code? We all have to make those decisions and may need to fluidly switch between paradigms.

Next I want to talk about LLMs and how to think about this new paradigm, the ecosystem—what this new computer looks like and what its ecosystem is like. I was moved by a line Andrew Ng said many years ago (Andrew is about to speak next): “AI is the new electricity.” I do think that captures something interesting: LLMs truly have utility‑like characteristics.

LLM labs such as OpenAI, Google Gemini, Anthropic, etc., spend capital to train LLMs—this is analogous to building a power grid; then they have operating expenses, providing intelligence via APIs billed per million tokens (metered access). We have many utility‑like requirements for these APIs: low latency, high availability, consistent quality, etc. In electricity you have transfer switches to flip between the grid, solar, battery, or generator; in LLMs we have OpenRouter to easily switch between different LLM providers. Because LLMs are software, they don’t compete for physical space, so you can have six “electricity suppliers” and switch at any time. In the past few days we saw several LLM outages that stalled work. When the most advanced LLM goes down, it’s like a global intelligence brown‑out—the planet gets dumber. Our dependence on these models is already huge and will keep growing.
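As an illustration of the “transfer switch” point, here is a minimal sketch of swapping providers behind a single interface. It assumes OpenRouter’s OpenAI‑compatible endpoint; the API key and model identifiers are placeholders.

```python
from openai import OpenAI  # pip install openai

# OpenRouter exposes many providers behind one OpenAI-compatible API,
# so switching "electricity suppliers" is just a change of model string.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

def ask(model: str, question: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# The same call, routed to two different providers (model names illustrative):
print(ask("openai/gpt-4o", "Summarize the power-grid analogy in one sentence."))
print(ask("anthropic/claude-sonnet-4", "Summarize the power-grid analogy in one sentence."))
```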

But LLMs also have some “fab” (semiconductor‑foundry) characteristics. The capital expense to train them is massive, the technology tree evolves quickly, and labs concentrate deep technical knowledge and R&D secrets. The analogy is a bit fuzzy because this is software, which is more flexible and harder to protect.

The analogy I find most apt is: LLMs are very much like operating systems. It’s not just a commodity like electricity or water; it’s an increasingly complex software ecosystem. The way ecosystems form is similar: a few closed‑source providers (like Windows or macOS) and an open‑source alternative (like Linux). For LLMs we also have a few competing closed‑source providers, and the Llama ecosystem is currently the closest thing to what might grow into a Linux‑like platform. It’s still early; these are simple LLMs now, but we’re seeing them become more complex—not just the LLM itself, but tools, multimodality, etc.

I once sketched this: LLMs are like a new OS, the LLM is the CPU, the context window is memory, and the LLM coordinates memory and compute to solve problems—very OS‑like. Another analogy: you can download an app (e.g., VS Code) and run it on Windows, Linux, or Mac; similarly you can run an LLM app (e.g., Cursor) on GPT, Claude, or Gemini, just by switching a dropdown.

We’re a bit like the 1960s: LLM compute is still expensive, which forces LLMs to live in the cloud. We’re all thin clients interacting over the network; no one can fully own that compute, so time‑sharing makes sense—we each get a slice of a batch‑processing system. That’s similar to early computing: the OS ran on a centralized machine, everything went over the network, and jobs were batch‑processed. The personal‑computer revolution hasn’t happened yet here because it isn’t economical. Some people are already experimenting—e.g., a Mac Mini is suitable for some LLM batch‑1 inference (which is mostly memory‑bound). Those might be early signs of personal compute, but it’s unclear what it will look like; maybe some of you will invent it.

Another analogy: when I talk to ChatGPT or another LLM via plain text, it feels like talking to an OS through a terminal. There isn’t a truly universal GUI yet—chat is just text bubbles. Some apps have GUIs, but there’s no universal GUI that works across all tasks.

LLMs differ from early computing and OSes in some unique ways. I wrote a post about this characteristic: LLMs invert the usual diffusion of technology. Typically, new tech (electricity, cryptography, computing, flight, internet, GPS) is first adopted by governments and enterprises because it’s new and expensive, then later spreads to consumers. LLMs are the opposite: the earliest adopters were people figuring out how to boil an egg. It’s fascinating—we have a magical new computer that helps me boil an egg, not a government doing ballistic calculations. Enterprises and governments are actually lagging behind ordinary people in adoption.

To sum up so far: LLM labs produce LLMs, and the utility framing is accurate as far as it goes, but LLMs are also complex operating systems; they’re in the “1960s” of computing, and we’re re‑doing the computing stack, currently delivered via time‑sharing and utility models. The brand‑new, unprecedented thing is that they’re not in the hands of a few governments or enterprises, but in the hands of all of us—because we all have computers, it’s just software, and ChatGPT is instantly delivered to billions of devices. It’s insane. Now is the time for us to enter the industry and program these computers.

Before we start programming LLMs, we need to spend some time thinking about what they are. I especially like to talk about their “psychology.” (The subtitles were cut off here, but the talk continued discussing LLM characteristics, prompt engineering, agents, etc.)

LLMs are like beings with a mind of their own—they hallucinate and make mistakes. So when building with them, keep humans in the loop: the generate‑validate loop must be extremely fast. Prompts need to be specific to avoid vague failures. Often we need intermediate artifacts to constrain the AI; for example, in education we first have a teacher create a curriculum (which can be audited), then let the AI teach students based on that curriculum, so the AI is “tethered” and doesn’t drift.
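Here is a minimal sketch of that generate‑validate loop with a human in it, plus the “tethering” idea of an audited intermediate artifact; `llm_generate` is a hypothetical stand‑in for a real model call.

```python
def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    raise NotImplementedError

def generate_with_human_check(task: str, max_rounds: int = 3) -> str | None:
    """Keep the human in the loop: generate, let a person verify, retry if rejected."""
    feedback = ""
    for _ in range(max_rounds):
        draft = llm_generate(f"{task}\n{feedback}")
        print(draft)
        verdict = input("Accept? [y, or type corrections]: ")
        if verdict.strip().lower() == "y":
            return draft
        feedback = f"The previous attempt was rejected with this feedback: {verdict}"
    return None  # hand the task back to the human entirely

# Tethering via an intermediate artifact: a human-audited curriculum constrains
# what the tutoring model is allowed to teach in each lesson.
# curriculum = generate_with_human_check("Draft a 10-lesson syllabus on linear algebra.")
# lesson = llm_generate(f"Teach lesson 1, staying strictly within:\n{curriculum}")
```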

I’m not unfamiliar with partial autonomy—I spent five years at Tesla on Autopilot, which is a partially autonomous product. The dashboard shows the Autopilot GUI, visualizing what the neural network sees, and there’s an autonomy slider—we gradually let the system do more for the user. My first experience with autonomous driving was in 2013 on a Waymo car around Palo Alto for 30 minutes, perfect with no intervention. At the time I thought autonomous driving was just around the corner. Twelve years later we’re still working on it. Waymo looks driver‑less, but there’s still a lot of remote operation and human intervention. We haven’t truly solved it yet, but we will, it just takes a long time. Software is hard, driving is hard. So when people say “2025 is the year of agents,” I’m worried—it should be “the decade of agents.” We need humans in the loop, and we must be careful; this is software.

I like the Iron Man analogy: the suit is both augmentation (Tony Stark can control it) and an agent (it can fly on its own sometimes). We have an autonomy slider. At this stage, using unreliable LLMs is more about building an “Iron Man suit” than a full robot: building partially autonomous products with custom GUIs and UI/UX, making the human generate‑validate loop extremely fast. But never forget that full automation is possible in principle; have the autonomy slider and think about how to move it gradually.

Another unique dimension: the programming language is English, and suddenly everyone is a programmer because everyone already speaks a natural language. This is incredibly optimistic and unprecedented. Previously you needed 5–10 years of study to do anything in software; now you don’t.

“Vibe coding” – describing ideas in natural language and letting AI generate the code – has become a meme. Kids are using it to program; it’s healthy, and it’s the gateway drug to software development. I tried it myself: I built an iOS app (I don’t know Swift) and got it running in a day; I also built MenuGen (menugen.app): take a photo of a menu and it generates images of the dishes (image generation is now my main cost center; I’ve lost a fair amount of money on it).

Interestingly, the coding part is easy; the hardest part is making it “real”: authentication, payments, domains, deployment. Those are DevOps tasks, clicking around in a browser, painfully slow. Docs tell me “go to this URL, click this dropdown, select that…”—the computer is telling me what to click, why doesn’t it just do it itself?

So the final part: can we build things directly for agents? I don’t want to do all these clicks—can an agent do it?

Now there’s a new kind of digital information consumer and manipulator: before it was humans via GUI or computers via API. Now there are agents—human‑like souls on the internet. Can we build infrastructure for them?

For example, robots.txt tells crawlers how to behave; analogously, we could have llms.txt, a simple markdown file that tells an LLM what a domain is for. Lots of docs are written for humans (lists, bold, images), which LLMs struggle with. Some services now provide LLM‑specific docs in markdown (Vercel, Stripe). “Click here” doesn’t help an LLM; you need to replace it with an equivalent curl command. Anthropic’s Model Context Protocol is another protocol for speaking directly to agents.
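As a minimal sketch of the idea, assuming a domain publishes an llms.txt in a plain‑markdown format, an agent could pull it straight into context instead of scraping human‑oriented pages; the URL scheme and helper names here are illustrative.

```python
import urllib.request

def load_llms_txt(domain: str) -> str:
    """Fetch a site's agent-facing description, analogous to robots.txt for crawlers."""
    with urllib.request.urlopen(f"https://{domain}/llms.txt") as resp:
        return resp.read().decode("utf-8")

def build_agent_context(domain: str, task: str) -> str:
    # Instead of scraping human-oriented pages ("click here", screenshots, menus),
    # hand the model a plain-markdown summary it can actually act on.
    site_notes = load_llms_txt(domain)
    return f"You are operating on {domain}.\nSite notes:\n{site_notes}\n\nTask: {task}"
```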

There are also small tools that make data LLM‑friendly: change github.com to gitingest.com in a repo URL to get a single plain‑text concatenation of all the files; DeepWiki lets an AI analyze a repo and generate documentation for it.

Even if LLMs eventually can click themselves, it’s still worth taking the first half‑step—making information easier to access—because visual/click interactions remain expensive and error‑prone.

Summary: It’s a great time to enter the industry. We need to rewrite massive amounts of code—both professional programmers and “vibe coders” will write it. LLMs are like utilities, like fabs, but especially like operating systems—now in their 1960s. They are fallible human‑like agents; we must learn to collaborate with them, adjust the infrastructure, and build LLM applications with ultra‑fast generate‑validate loops, create partially autonomous products, and even write code directly for agents.

Returning to the Iron Man metaphor: over the next decade we’ll slide the knob from left (augmentation) to right (agency). I can’t wait to build that with you.

Thank you all.
