I think most software developers would find that performance on SWE-Bench (wherein an AI system has to resolve a real-world GitHub issue) is much more relevant to their day-to-day than the raw algorithmic problem-solving capabilities of an AI system.
Competitive programming questions are challenging, obviously, but they are tightly defined with prescribed inputs and outputs and the solutions are usually compact, though hard to arrive at.
In contrast, real-world programming is much fuzzier and more ambiguous. The solutions that one must implement are usually much larger in scope and require much greater project-specific context to build. And of course real-world tasks are generally not things that demand lots of formal data structures and algorithms knowledge, like one would need to do well at Codeforces.
What's interesting is that you can have systems like o1 or AlphaCodium that are much, much better than the median software developer at solving tricky algorithmic puzzles, but that also can't do very well at SWE-Bench, which largely comprises GitHub issues that have been estimated to take <1h for a human developer to do. Even though o1 would absolutely dominate Claude 3.5 Sonnet at competitive programming questions, it seems to basically be a wash on real-world tasks.
So I suppose my question to Qodo would be: Why the emphasis on Codeforces as a benchmark? It's quite clear -- and has been for some time, since e.g. AlphaCode in 2022 -- that AI systems can be really powerful when it comes to solving Codeforces-style questions. What seems much harder (for now) is making significant progress on real-world development tasks.
Set up an experiment: select 100 random software developers around the world and test this hypothesis. You're in for a surprise.
Nevertheless, I bet that most developers who couldn't "solve" a LeetCode challenge in a couple of hours, even with access to Google, would still perform much better than o1 on real-world GitHub issues in their technical domain.
This gets at the essence of OP's question about why the focus is on Codeforces. And it shows me that "intelligence" involves a dimension that isn't purely logical and that we don't understand yet.
> AlphaProof, a new reinforcement-learning based system for formal math reasoning
> AlphaGeometry 2, an improved version of our geometry-solving system
"There are 7 cats in a playground and 61 plushies, ..." (insert basically the same problem, requiring the same solution)
Well... then the LLM will be able to solve it.
And many people will consider it a novel problem, and hence a resounding success.
I mean: it is a success, but not anywhere near as impressive as most think.
Now, given the token encoding, I think naive letter counting is not something we should expect from LLMs, but it still serves as a nice reminder to actually ensure the test/validation data is not part of the training data.
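For the curious, here's a minimal sketch of why letter counting is awkward for LLMs, using OpenAI's open-source tiktoken tokenizer (the encoding name is my assumption about which one a given model uses):

    # Sketch: LLMs see BPE tokens, not characters, so counting letters
    # means reasoning across opaque token boundaries.
    # pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # assumed GPT-4-era encoding
    word = "strawberry"
    tokens = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t) for t in tokens]
    print(pieces)  # a few multi-letter chunks, e.g. b'str', b'aw', b'berry'
    # The model never directly "sees" the individual r's it is asked to count.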
With real-world questions, everyone can have a different opinion. Let's say you must classify photos as dogs, cats, or other. What about an apple tree? A hyena? An AI-generated version of a dog with cat fur? A lion? Hello Kitty?
It's interesting that it has the property of always returning _something_, so you have to be careful how you phrase things. And the something returned will be optimized for looking right, but might only be right by accident.
In an apples-to-apples comparison, 4o can overlook important things while focusing on the kernel of the problem, while o1 is often more comprehensive.
it doesn't swear or yell at me, it's verbose, apologetic, overly correct in its language, gives me endless bulleted lists that convey little information.
unbearable.
Btw: How are people working on multiple code files in Claude?
Firstly, at claude.ai you can upload multiple files, so Claude will take those into account and even suggest changes to multiple files. You are then, however, still copy/pasting from a web interface.
Enter Cursor (https://www.cursor.com/): you can either use a Claude API key (though it will warn you that all the features they want you to pay for then don't work), or just use the free version, as I currently do. It gets me enough prompts per day to improve my life.
Or you could pay for it, but I have a feeling that this is a sort of WinRAR situation...
https://aider.chat/docs/usage.html (not a VS Code plugin, but with other advantages even when using this editor; see the scripting sketch after this list)
https://docs.continue.dev/getting-started/overview
https://github.com/cline/cline
was pretty good too, but it had taken on too much in its latest version 2.0.0. In other words, it is too unstable at the moment (but probably worth looking at again later).
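On aider: beyond the interactive CLI, it also exposes a Python scripting interface. A minimal sketch based on its scripting docs (https://aider.chat/docs/scripting.html); the model name and file names here are placeholder assumptions, not recommendations:

    # Sketch of aider's Python scripting API; run inside a git repo.
    from aider.coders import Coder
    from aider.models import Model

    model = Model("claude-3-5-sonnet-20240620")  # placeholder model id
    # fnames lists the files aider may read and edit
    coder = Coder.create(main_model=model, fnames=["app.py", "utils.py"])
    coder.run("move the parsing helpers from app.py into utils.py")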
1. You can configure which LLMs you want to use, whereas Copilot just supports OpenAI models. I just use Claude 3.5 for everything.
2. Chatting with the LLM can produce file edits that you can directly apply to your files. Cursor's experimental "Composer" UI lets you prompt to make changes to multiple files, and then you can apply all the changes with one click. This is way more powerful than just tab-complete or a chat interface. For example, I can prompt something like "Factor out the selected code into a new file" and it does everything properly.
3. Cursor lets you tune what's in LLM context much more precisely. You can @-mention specific files or folders, attach images, etc.
Note I have no affiliation whatsoever with Cursor, I've just really enjoyed using it. If you're interested, I wrote a blog post about my switch to Cursor here: https://www.vipshek.com/blog/cursor. My specific setup tips are at the bottom of that post.
I'd recommend just trying it, because it's hard to summarise how much it's not copilot.
As it is though, I suspect whatever model Cursor Tab is using under the hood has a fairly small context window, so the range of that "tab to move" feature ends up being pretty limited.
Overall my takeaway after months of using Cursor is that it has some really promising features but their LLM engine is too underpowered to deliver.
Hopefully that will change over the next few years. The potential is definitely there.
Honestly, once you learn the Copilot-specific hotkeys you can do all of what Cursor does and more. In fact, there were times that I felt the VS Code team clearly could have added features that Cursor has, but chose not to because they led to more unwanted code slipping through.
I did like the edit tab completions from Cursor, but they're not worth $20/month and guaranteed enshittification.
We will see more of these frameworks for different use cases
What improvements am I missing? Honest question.
I also tried Replit and it felt amazing, but quickly got into a place it couldn't escape and it felt like it took too much effort to get it to change direction once it had committed to a plan. This was early, so it's almost definitely better by now.
I assume you get the latest model, plus some goodies that GH is working on
I don't use o1 simply because I work on one small problem at a time, and LLMs tend to go off the rails when given multi-step tasks. o1 is not a silver bullet for this either.
Recently it doesn't seem to spend much time thinking, and honestly my results from o1 have been disappointing. I've been sticking with 4o and Claude 3.5 Sonnet still.
My current ranking would be Cursor > Continue > Codium (haven't yet tried Copilot).
Codium seems to specialize in enterprise right now (where someone might be told to not use Cursor).
Aider is the best of them all, but I spend too much money... Like, I can easily spend $10-$20 in a day. Which is still a great deal for the added productivity it gives me, but it's $200-$400/mo, which is salty.
Cursor didn't impress me.