LLMs excel at programming—how can they be so bad at it?
My explanation for the mystery of why LLMs can be both exceptionally good and quite terrible at programming.
Despite the overall hype in all things AI, in particular among the tech crowd, we have not yet seen much in terms of product–market fit and genuine commercial success for AIs—or more specifically, LLMs—outside a fairly narrow range of application areas. Other than sycophantic chatbots, AI girlfriends, and maybe efficient document search, the main application of LLMs seems to be computer programming. LLMs can be really good at programming. And yet, also, they are awful. Andrej Karpathy, the inventor of the term “vibe coding,” expressed in a recent interview that there continue to be major limitations in what kind of programming problems LLMs can tackle. So what’s going on here? How can LLMs be both great at programming and terrible? How can vibe coding sometimes succeed beyond our wildest imagination and at other times fail entirely?

I think there is a simple explanation for this seemingly paradoxical observation. And if you listen carefully to Andrej Karpathy’s interview, you will notice that he is aware of the explanation. Here is what I think is happening: There are two entirely distinct skillsets that both exist under the umbrella of being “good at programming.” Most people don’t distinguish between them. That’s because most people don’t have either skillset. They’re not even aware of the distinction. And the people who have exceptional command of one skillset typically are also at least comfortable with the other and consequently don’t think much about the distinction either. But LLMs only have one of the two skillsets. And for the one that they have, they by far exceed even the best human programmers. This can make them appear remarkably good at programming, in particular to less experienced developers. But whenever the other skillset is required, the one they lack, LLMs fail miserably.
So what are these two skillsets? The first is being able to reason deeply and innovatively about algorithms, data structures, or software architecture. This is the one LLMs lack. The second is being able to read, process, and memorize large amounts of API documentation, tutorial materials, and other existing code examples. This is the one LLMs excel at. For humans, it tends to be the reverse. Good programmers tend to be exceptional at conceptual thought, whereas reading large amounts of documentation is hard for anyone. However, experienced programmers can make up for their relative inability to absorb massive amounts of text by memorizing the relevant parts (through repeated use), and also by searching on Stack Overflow[1] or reading the relevant documentation on the fly.
When Karpathy talks about LLMs being good at “boilerplate,” this is exactly what he means. LLMs excel at copying basic setup code from the documentation or from introductory tutorials. But LLMs can go beyond just boilerplate. They are definitely able to string API calls together, or to take the logic for a common problem and adapt it to a different programming language, or a different library, or even a somewhat modified use case. To people with little programming experience, this can appear magical, and it can convince them that an LLM can program anything a user may want. And to experienced programmers, this can save huge amounts of time and effort, in particular when working with a language or library or codebase they are not that familiar with.
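To make "boilerplate" concrete, think of the canonical pytorch training loop, which appears with minor variations in essentially every tutorial. The sketch below is a generic, made-up illustration of that kind of code, not an example from any specific tutorial or from my class:

```python
import torch
import torch.nn as nn

# A generic training-loop skeleton -- the kind of tutorial boilerplate
# an LLM reproduces effortlessly. Model and data are toy stand-ins.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(256, 10)  # fake inputs
y = torch.randn(256, 1)   # fake targets

for epoch in range(10):
    optimizer.zero_grad()        # reset gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass and loss
    loss.backward()              # backpropagate
    optimizer.step()             # update parameters
```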
But, as useful as this skill is, there comes a time in any programming project where deep conceptual thought is more important. Sometimes you do need to develop a novel algorithm that solves a tricky problem. Or you have to hunt down that weird bug that somehow, for no obvious reason, seems to involve three unrelated components in a large software project. Or you have to architect a new project and there are complex tradeoffs that need to be balanced carefully to arrive at a working solution. In 2025, no LLM can reliably tackle these types of problems.
Maybe eventually LLMs or some other form of AI will achieve proficiency in both skillsets. At that point, AI will be able to program truly autonomously. But we are not there today. Nevertheless, LLMs are tremendously useful. They just need to be understood as a more sophisticated version of Stack Overflow, not as an autonomous junior software developer.
I recently had a personal experience where I lacked exactly the knowledge that LLMs can provide. As a consequence, I got huge time savings and increased efficiency out of LLM use.[2] It was in the context of the graduate class I am teaching this fall, about AI models in molecular biology. The class covers both (i) the conceptual underpinnings of widely used models and (ii) practical, hands-on experience with building, training, and modifying various AI models, as well as analyzing and visualizing model outputs. I know a lot conceptually about how AI models work. I can explain attention and feed-forward layers and linear projections and activation functions till the cows come home. But I never actually code myself in pytorch.[3] And similarly, I know a lot about data analysis and data visualization, but I only have experience doing these kinds of things in R, not in python.[4]
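To illustrate the gap between knowing the concept and knowing the API: single-head scaled dot-product attention is conceptually simple, and the pytorch version is short once you know which calls to make. Here is a minimal sketch of my own, purely for illustration and not taken from the class materials:

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Minimal scaled dot-product attention, for illustration only."""
    def __init__(self, d_model):
        super().__init__()
        # linear projections for queries, keys, and values
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5
        weights = torch.softmax(scores, dim=-1)  # attention weights
        return weights @ v

x = torch.randn(2, 5, 16)         # batch of 2, 5 tokens, 16 dimensions
out = SingleHeadAttention(16)(x)  # output has the same shape as x
```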
So with my deep conceptual knowledge about how things work in principle and complete ignorance about how any of this is done in practice, I’d ordinarily have to buckle down, read a ton of documentation and tutorials, and then painstakingly put together my demonstrations and hands-on experiences. It could easily take me two full work days for every one hour of practical in-class material. However, LLMs are exceptional at writing little code examples for a class. All I had to do was ask the AI for code that did what I wanted to do, and the AI would generally deliver useful results within one or sometimes a few tries. You can see an example of the type of prompts I would use here. This made preparing my in-class materials so much simpler and faster. I read every line of code the AI produced and I verified it did what I wanted it to do, but I didn’t have to also read hundreds of pages of documentation to find the exact function calls that would solve my specific problems.
Also, I had various existing code examples that used pandas and matplotlib. I think both libraries have major conceptual flaws, and I didn't want to teach them. So I needed to convert all these code examples to polars and plotnine. This is a perfect application area for LLMs. Paste the existing pandas/matplotlib code into the prompt box, ask the LLM to translate it to polars/plotnine, and it'll zero-shot the answer every time.
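To give a flavor of what such a translation looks like, here is a made-up toy example (not one of my actual class examples) that computes a group mean and plots it, first with pandas/matplotlib and then with polars/plotnine:

```python
import pandas as pd
import matplotlib.pyplot as plt
import polars as pl
from plotnine import ggplot, aes, geom_col

# pandas/matplotlib version: mean value per group, shown as a bar chart
pdf = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
means_pd = pdf.groupby("group", as_index=False)["value"].mean()
fig, ax = plt.subplots()
ax.bar(means_pd["group"], means_pd["value"])

# polars/plotnine translation of the same analysis
# (recent polars uses group_by(); older versions spelled it groupby())
pldf = pl.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
means_pl = pldf.group_by("group").agg(pl.col("value").mean())
# converting back to pandas for plotting, which plotnine always accepts
p = ggplot(means_pl.to_pandas(), aes(x="group", y="value")) + geom_col()
```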
Results were a bit more mixed when it came to fixing bugs. For simple bugs, things often worked out very well. I just pasted the error message into the prompt box and the model corrected the code. Typical use cases were situations where the model had hallucinated an API call or a function parameter or a return value, and when it saw the error message it recognized the problem and often came up with the right way to fix the issue. But sometimes this process could go haywire. Just the other day I asked for a fairly simple (I thought) function that could load two protein structures and align them. And the model just couldn't figure out how to correctly call the superimpose() function from the biotite package. We went through six or seven iterations where the model would give me code, the code wouldn't run, I'd paste in the error message, the model would respond with new code, which again wouldn't run, and so on. At some point it felt like we were going in circles, where I got the exact same error messages I had seen in earlier iterations. Eventually we solved the issue and arrived at ten simple lines of working code. But the process felt painful, and in this particular case I suspect that if I had just read the documentation and coded this by hand, it would have been faster.
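For context, the general shape of the task looks roughly like the sketch below. This is a hypothetical reconstruction, not the code we ended up with, and biotite's superimpose() signature and return values have changed across versions, so consult the documentation for the version you actually have:

```python
import biotite.structure as struc
import biotite.structure.io as strucio

# Hypothetical sketch of aligning two structures; details vary by
# biotite version, so treat this as an outline rather than a recipe.
fixed = strucio.load_structure("structure1.pdb")
mobile = strucio.load_structure("structure2.pdb")

# restrict to CA atoms so both selections have matching atom counts
fixed_ca = fixed[fixed.atom_name == "CA"]
mobile_ca = mobile[mobile.atom_name == "CA"]

fitted, transform = struc.superimpose(fixed_ca, mobile_ca)
print(struc.rmsd(fixed_ca, fitted))  # RMSD after superposition
```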
This last example shows how quickly I reached the limits of what even state-of-the-art coding models can do today. Things work great when the task consists of reproducing or slightly modifying existing code examples, but when things go wrong and we need to find a subtle bug the models clearly don’t think. They end up flailing around like a beginner programmer, just trying things out until hopefully something works. In those moments it doesn’t feel like there’s a deep intellect on the other side that is carefully reasoning through the problem and systematically homing in on the root cause of the bug. This task is still on the human user. And more generally, it’s on the human user to realize when the model has gotten stuck, is going in circles, is hallucinating, or otherwise is no longer making useful suggestions.
I believe programming is a niche where LLMs can find product–market fit exactly because so much of programming is reading the documentation and tutorials and code examples. It is an application domain where, for specific tasks, LLMs are definitely better than humans, and therefore humans who know how to use LLMs appropriately in this context can derive great value. However, I think it is dangerous to get bamboozled by an LLM's ability to spit out massive amounts of lightly transformed example code and to conclude that the model can reason deeply about complex algorithmic or architectural issues. A human who could write straightforward code examples at the speed of an LLM would likely be a superstar programmer, with all the other qualities superstar programmers have, but LLMs work differently. They don't have those other qualities. They can generate code, but they can't program.
[1] Yeah, I know, that is quickly fading into irrelevance. Let's just memorialize, for the younger generations for whom this will be completely alien, that during the 2010s the number one skill a programmer needed to have was the ability to search Stack Overflow for the specific problems they needed to solve.
[2] I will use the generic term LLM throughout. But if you're wondering, the specific model I used for programming assistance was Claude Sonnet 4.5.
[3] In my lab, the actual coding is mostly done by my graduate students.
[4] In my AI class, because we're already programming in pytorch, all data analysis and data visualization is done in python, to simplify things for the students. I continue to maintain that python is not a good language for data analysis. But that's a topic for another post.



I think your notion in this article is excellent and goes beyond programming. LLMs are great at fuzzily reproducing big fields (provided they're not too niche), but they don't have the ability to reason about how these facts really connect.
Why don’t you like Python for data analysis? And what are your issues with pandas and matplotlib? I don’t do much Python anymore, so I’m a bit behind the curve and am curious how things have progressed.