But what if human 2 was wrong?
What if both were wrong and human 3 simply said ‘I don’t know’?
LoC is a measure ripe for ignorance-driven managerial abuse.
We’ve all seen senior devs explain concepts to junior devs, increasing their understanding and productivity while they themselves ‘produced’ zero lines of code.
Yes, zero LoC may point to laziness, or to proper preparation.
All this is so obvious. LoC are easy to count but otherwise have hardly any value.
About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.
We decided to compile some of the most interesting findings into a report. This is the first time we've done this, so any feedback would be great, especially around what analytics we should include next time.
WHY ARE THEY STILL TALKING ABOUT ADDING LINES OF CODE IF THEY KNOW HOW SOFTWARE COMPLEXITY SCALES.
I could not put it more simply: you don't get the benefit of the doubt anymore. Too many asinine things like this line-of-code-counting BS have been done for me to not see it as attempted fraud.
Something we know for sure is that the most productive engineers are usually neutral or negative on lines of code. Bad ones who are costing your business money by cranking out debt: those amp up your number of lines.
It's like they all forgot how to think, or forgot that other people can spot right where and when they stopped thinking critically and started to go with the hype.
It’s definitely a race to the bottom scenario, but that was already the scenario we lived in before LLMs.
plenty of tickets are never written because they don't seem worth tracking. an llm speeding up development can have the opposite effect - increasing the number of tickets because more fixes look possible than before
Would be interested in seeing the breakdown between uplift vs company size.
e.g. I work in a FAANG and have seen an uptick in the number of lines on PRs, partially due to AI coding tools and partially due to incentives for performance reviews.
An interesting subtrend is that Devin and other full async agents write the highest proportion of code at the largest companies. Ticket-to-PR hasn't worked nearly as well for startups as it has for the F500.
I'm a bit of an AI coding skeptic btw, but I'm open to being convinced as the technology matures.
I actually think LOC is a useful metric. It may or may not be a positive thing to have more LOC, but it's data, and that's great.
I would be interested in seeing how AI has changed coding trends. Are some languages not being used as much because they work poorly with AI? How much is the average script length changing over time? Stuff like that. Also how often is code being deleted and rewritten - that might not be easy to figure out, but it would be interesting.
So, do you have any quality metrics to go with these?
Which stats in the report come from such analysis? I see that most metrics are based on either data from your internal teams or publicly available stats from npm and PyPI.
Regardless of the source, it's still an interesting report, thank you for this!
Super interesting report though.
Also I notice it when the LLMs are offline. It feels a bit like when the internet connection fails. You remember the old days of lower productivity.
Of course, there are a lot of junk/silly ways to approach these tools, but all tools are just a lever, and need judgement/skill to use them well.
How maintainable is this code output? I saw a SPA HTML file produced by a model that looked almost like assembly code. If the code can only be maintained by a model, then an appropriate metric should be based on the long-term maintainability achieved, not on the instant generation of code.
As a dev I very much subscribe to this line of thought, but I also have to admit most of the business class people would disagree.
so from a business standpoint, if equivalent expertise amongst staff is assumed then productivity comes down to lines of code created. Just like how you might measure productivity of a warehouse employee by the number of items moved per hour. Of course if someone just throws things across the warehouse or moves things that don't need to be moved they will maximize this metric, but that would be doing the job wrong - which is not a productivity measurement problem. though admittedly the incentive structures and competition often make these things related
the bigger issue to highlight, imo, is that the business side of things has no idea if coders are doing the job sufficiently well or not, and the lack of understanding is amplified by the reality that productivity contribution varies wildly per line, some requiring much more work to conjure than others. The person they need to rely on to validate this difference per instance is the same person who is responsible for creating the lines. So there is a catch-22 on the business side. An unproductive employee can claim productivity no matter what the measurement is.
if the variance of work required per line could be understood by the business side then it could be managed for. I used to manage productivity metrics for a medical coding company, and some charts are more dense and harder to code than others. I did not know how to code a medical chart but I could still manage productivity by charts per hour while still understanding this caveat
the point isn't to use the productivity metric as a one stop shop for promoting and firing people but as a filter for attention, where all the middle of the pack stuff will more or less even out and not require too much direct attention. you then just need to get an understanding of how the average difficulty per item varies by product/project.
that said, maybe lines edited is still a step better - so that refactoring in a way that reduces the size of the codebase can still be seen as productive. 1 point for each line deleted and 1 point for each line added.
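A minimal sketch of that "1 point per line added, 1 point per line deleted" score, assuming the input is the text printed by `git diff --numstat` (tab-separated added, deleted, path). The function name and the sample file names are hypothetical.

```python
def lines_edited_score(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` style output."""
    score = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        added, deleted = parts[0], parts[1]
        # git reports binary files as "-" for both counts; ignore them.
        if added == "-" or deleted == "-":
            continue
        score += int(added) + int(deleted)
    return score


# Hypothetical diff: one edited file, one binary file, one pure deletion.
sample = "33\t13\tfeature.py\n-\t-\tlogo.png\n0\t40\told_module.py\n"
print(lines_edited_score(sample))  # 86
```

Under this scoring, a refactor that only deletes 40 lines counts the same as writing 40 new ones.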
I understand that every line should be viewed as a liability, not an asset, but that's the job responsibility of the hired expert to figure out how many need to exist. it's not the job of the business side of things to manage.
I wouldn't tell my foundation guys how much concrete to use, or my electrician how much wire to use, but if one team can handle more concrete per hour than another and they are both qualified professionals, it really doesn't seem unreasonable to start off conversations with an assumption that one is more productive than the other. Lazy people do exist everywhere, it's usually a matter of magnitude of laziness between people more than it is a matter of actual full earnest capability
I fail to see how having a measurement that clearly doesn't measure what is actually produced isn't exactly a productivity measurement problem. If your measurement is defeated by someone doing their job badly, what use is it?
I'm kind of baffled that "lines of code" seems to have come back; by the 1980s people were beginning to figure out that it didn't make any sense.
Unfortunately I’m not sure there are good metrics.
Also, my anecdotal experience is that LLM code is flat wrong sometimes. Like a significant percentage. I can't quote a number really, because I rarely do the same thing/similar thing twice. But it's a double digit percentage.
I feel like we humans try to separate things and keep things short. We do this not because we think it's pretty, we do it so our human brains can still reason about a big system. As a result LOC is a bad measure as being concise then hurts your productivity????
What a lot of us must be wondering though is:
- how maintainable is the code being outputted
- how much is this newfound productivity saving (costing) on compute, given that we are definitely seeing more code
- how many livesite/security incidents will be caused by AI generated code that hasn't been reviewed properly
So no, I don't think persistence-through-time is a good metric. Probably better to look at cyclomatic complexity, and maybe for a given code path or module or class hierarchy, how many calls it makes within itself vs to things outside the hierarchy - some measure of how many files you need to jump between to understand it
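For the cyclomatic complexity part, here is a rough sketch using Python's standard `ast` module. It is a simplification of McCabe's definition (it counts branch points plus one for a whole source string, not per function) and it does not attempt the inside-vs-outside call count. The choice of which node types count as branches is an assumption.

```python
import ast

# Node types treated as one extra decision point each (a simplification).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands and adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


example = '''
def classify(n):
    if n < 0 and n % 2:
        return "negative odd"
    for i in range(n):
        if i == 3:
            return "has three"
    return "other"
'''
print(cyclomatic_complexity(example))  # 5
```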
Of course, feeding the code to an LLM makes it really go to town. And break every test in the process. Then you start babying it to do smaller and smaller changes, but at that point it’s faster to just do it manually.
- Change in number of revisions made between open and merge before vs. after greptile
- Percentage of greptile's PR comments that cause the developer to change the flagged lines
Assuming the author will only change their PR for the better, this tells us if we're impacting quality.
We haven't yet found a way to measure absolute quality, beyond that.
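As an illustration only, the second metric (share of review comments that led to a change in the flagged lines) could be computed along these lines. The `ReviewComment` type and the shape of `changed_lines` are hypothetical, not Greptile's actual data model.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    path: str  # file the comment was left on
    line: int  # line number flagged by the comment


def addressed_rate(comments, changed_lines):
    """Percentage of comments whose flagged line was modified afterwards.

    changed_lines maps a file path to the set of line numbers that
    changed between the review and the merge (hypothetical input).
    """
    if not comments:
        return 0.0
    hits = sum(1 for c in comments
               if c.line in changed_lines.get(c.path, set()))
    return 100.0 * hits / len(comments)


comments = [
    ReviewComment("app.py", 10),
    ReviewComment("app.py", 42),
    ReviewComment("db.py", 7),
    ReviewComment("db.py", 8),
]
changed = {"app.py": {10, 11}, "db.py": {7}}
print(addressed_rate(comments, changed))  # 50.0
```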
You might respond that ultimately, developers need to stay in charge of the review process, but tracking that kind of thing reflects how the product is actually getting used. If you can prove it helps to ship features faster as opposed to just allowing more LOC to get past review (these are not the same thing!) then your product has a much stronger demonstrable value.
For reference I work in finance/econometrics and the code is often about numerical analysis written in SQL and python. More often than not I end up wasting a lot of time fixing issues with AI generated code. None of these nuances ever gets captured by metrics like these and it makes me question people (mostly sales and top execs) that push for "AI" at work.
Recently Apple have released beasties with up to 512GB of RAM. Apples have unified RAM (both for general use and GPU) so that 512GB looks a bit handy, and they have quite a lot of CPU cores too. They are of the order of £10,000. You should be able to run some pretty large models on that.
I've just blown a fair bit of money on network infra (yum: more switches that boot Linux for the control plane and shuffle packets at incredible speeds) at work so will need to wait a bit or perhaps persuade wifie that we really do need a really expensive Apple box at home.
The snag I have is getting over my mild distaste for Apple! I'm sure I'll manage it.
Is that a per-year number?
If a year has 200 working days that's still only about 40 lines of code a day.
When I'm in full-blown work mode with a decent coding agent (usually Claude Code) I'm genuinely producing 1,000+ lines of (good, tested, reviewed) code a day.
Maybe there is something to those absurd 10x multiplier claims after all!
(I still think there's plenty of work done by software engineers that isn't crunching out code, much of which isn't accelerated by AI assistance nearly as much. 40 lines of code per day felt about right for me a few years ago.)
A lot of people are oblivious to Zipf distributions in effort and output, and if you ever catch on to it as a productive person, it really reframes ideas about fairness and policy and good or bad management.
It also means that you can recognize a good team, and when a bunch of high performers are pushing and supporting each other and being held to account out in the open, amazing things happen that just make other workplaces look ridiculous.
My hope for AI is that instead of 20% of the humans doing 80% of the work, you end up with force multipliers, and a ramping up, so that more workplaces look like high-functioning teams, making everything more fair and engaging and productive. But I suspect that once people get better with AI, at least up to the point of AGI, we're going to see the same distribution but at 10x or 50x the productivity.
Usually, you have a lot of time to think on the side while coding on what to do next, strategize, etc. But if you work in small increments with an LLM agent, this time is reduced and you have to be ready for the next thing once one increment is done.
So I don't see this as an equalizer. Rather, those who can constantly push forward are getting much more than those who don't.
An example from earlier today: https://github.com/simonw/llm-gemini/commit/fa6d147f5cff9ea9...
That commit added 33 lines and removed 13 - so I'm already at a 20-lines-a-day level just from that one commit (and I shipped a few more plus a release of llm-gemini: https://github.com/simonw/llm-gemini/commits/a2bdec13e03ca8a...)
It took about 3.5 minutes. I started from this issue someone had filed against my repo:
Then I opened Claude Code and said:
Run this command: uv run llm -m gemma-3-27b-it hi
That ran the command and returned the error message. I then said: Yes, fix that - the gemma models do not support media resolution
Which was enough for it to figure out the fix and run the tests to confirm it hadn't broken anything. I ran "git diff", thought about the change it had made for a moment, then committed and pushed it.
Here's the full Claude Code transcript: https://gistpreview.github.io/?62d090551ff26676dfbe54d8eebbc...
I verified the fix myself by running:
uv run llm -m gemma-3-27b-it hi
I pasted the result into an issue comment to prove to myself (and anyone else who cares) that I had manually verified the fix: https://github.com/simonw/llm-gemini/issues/116#issuecomment...
Here's a more detailed version of the transcript including timestamps, showing my first prompt at 10:01:13am and the final response at 10:04:55am. https://tools.simonwillison.net/claude-code-timeline?url=htt...
I built that claude-code-timeline application this morning too, and that thing is 2284 lines of code: https://github.com/simonw/tools/commits/main/claude-code-tim... - but that was much more of a vibe-coded thing, I hardly reviewed the code that was written at all and shipped it as soon as it appeared to work correctly. Since it's a standalone HTML file there's not too much that can go wrong if it has bugs in it.
I don't know if code quality really matters to most people or to the bottom line, but a good software engineer writes better code than Claude. It is a testament to library maintainers that Claude is able to code at all, in my opinion. One reason is that Claude uses APIs in wacky ways. For instance, by reading the SDL2 documentation I was able to find many ways that Claude writes SDL2 using archaic patterns from the SDL days.
I think there are a lot of hidden ways AI booster types benefit from basic software engineering practices that they actively promote damaging ideas about. Maybe it will only be 10 years from now that we learn that having good engineers is actually important.
Same here. So I tell it what improvements I want to make and watch it make them.
I've gained enough experience at prompting it that it genuinely is faster for me to tell it the change I want to make than it is for me to make that change myself, 90% of the time.
You actually missed the point in two ways, because my response had little or nothing to do with speed of producing code. I'm not sure why you felt the need to express that irrelevant objection.
7,839 / 30 = 261 lines of code per day.
(Given that mistake, I'm slightly amused at the number of replies my post here drew defending that incorrect 40-per-day number. AI-haters-gonna-hate.)
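For the record, the two readings of the 7,839 figure work out as follows, assuming a 200-working-day year (as in the earlier comment) and a 30-day month. The variable names are just for illustration.

```python
LINES = 7_839

per_day_if_yearly = LINES / 200   # the mistaken per-year reading
per_day_if_monthly = LINES / 30   # the corrected per-month reading

print(f"{per_day_if_yearly:.1f}")   # 39.2
print(f"{per_day_if_monthly:.1f}")  # 261.3
```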
If I was hacking on the Linux kernel I would be delighted with myself for producing 40 lines of landed code in a single day.
Kernel / database / systems engineers are a pretty rare breed.
I would expect code that continually changes and deprecates and creates new features is still looking for a good problem domain fit.
I guess you can already derive this value if you sum the total lines changed by all PRs and divide it by (SLOC end - SLOC start). Ideally it should be a value slightly greater than 1.
fyi: You headline with "cross-industry", lead with fancy engineering productivity graphics, then caption it with small print saying it's from your internal team data. Unless I'm completely missing something, it comes off as a little misleading and disingenuous. Maybe intro with what your company does and your data collection approach.
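A quick sketch of that ratio, with made-up numbers. The function name is hypothetical, and it assumes "lines changed" means added plus deleted per PR.

```python
def churn_ratio(pr_lines_changed, sloc_start, sloc_end):
    """Total lines changed across PRs divided by net codebase growth."""
    growth = sloc_end - sloc_start
    if growth == 0:
        raise ValueError("codebase size did not change; ratio is undefined")
    return sum(pr_lines_changed) / growth


# Three hypothetical PRs touching 250 lines in total, net growth of 200 SLOC.
print(churn_ratio([120, 80, 50], 10_000, 10_200))  # 1.25
```

A value near 1 would mean almost every changed line stuck around as net new code. Higher values mean more rewriting per line of growth, and a shrinking codebase gives a negative ratio, so the number needs reading in context.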
I wrote zero lines of code today. I read some code and some emails, and wrote a few lines of markdown and some short emails.
All of the code I've written in the past couple weeks was meant to be thrown away. I used it to make some notes, which ended up condensed into those few lines of markdown.
I mean, come on, now. "Force multiplier" is hardly ambiguous.
We have known that this is a useless way to measure productivity since before most people on this site were born.
Those numbers should be seen as a giant red flag, not as any kind of positive.
KLOC's KLOC's KLOC's
Even Steve Ballmer was smart enough to realize LOC was a dumb metric.
To add some substance: many attribute a great deal of IBM's decline to management's near-obsession with developer LOC metrics, driving out skilled employees.
This makes me metaphorically stabby.