Hosting through z.ai and synthetic.new. Both good experiences. z.ai even answers their support emails!! 5-stars ;)
I've had much less luck with other agentic software, including Claude Code. For these kinds of tasks, only Codex seems to come close.
</rant>
When z.ai launched GLM-4.6, I subscribed to their Coding Pro plan. Although I haven't been coding as heavily this month as in the prior two months, I used to hit Claude limits almost daily, often twice a day. That was with both the $20 and $100 plans. I have yet to hit a limit with z.ai, and the server response is at least as good as Claude's.
I mention synthetic.new because it's good to have options, and I do appreciate them sponsoring the dev of Octofriend. z.ai is a Chinese company and, I think, hosts in Singapore. That could be a blocker for some.
I cancelled Claude two weeks ago. Pure GLM-4.6 now, plus a tad of Codex with my ChatGPT Pro subscription. I sometimes use ChatGPT for extended research stuff and non-tech things.
I could deal with the limits, but holy shit is Sonnet 4.5 chatty. It produces as much useless crap as Opus 4.1 did. It might feel fun for Vibe Coders when the model pumps out tons of crap, but I want it to do what I asked, not try to get extra credit with "advanced" solutions and 500+ line "reports" after it's done. FFS.
Been testing crush + z.ai GLM-4.6 through OpenRouter (turns out I had some credits in there =) this evening and I'm kinda loving it.
> “These entities advance the People’s Republic of China’s military modernization through the development and integration of advanced artificial intelligence research. This activity is contrary to the national security and foreign policy interests of the United States under Section 744.11 of the EAR.”
https://medium.com/ai-disruption/zhipu-ai-chinas-leading-lar...
At $6/month it's still pretty reasonable, IMO, and chucking less than $10 at it for three months probably gets you to the next pop-up token retailer offering introductory pricing, so long as the bubble doesn't burst before then.
Writing short runs of code and tests after I give a clear description of the expected behavior (because I have done the homework). I want to save the keystrokes and the mental energy spent on bookkeeping code, not have it think about the big problem for me.
Think short algorithms/transformations/scripts, and "smart" autocomplete.
No writing entire systems/features, and no heavily interpolated output from underspecified prompts - I'm not interested in those.
If you're looking for a cheap, practical tool and don't care that it's not local, DeepSeek's non-reasoning model via OpenRouter is by far the most cost-efficient for the work you describe.
I put 10 dollars in my account about 6 months ago and still haven't gotten through it, despite semi-regular heavy use.
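In case it helps anyone: OpenRouter exposes an OpenAI-compatible endpoint, so using DeepSeek's non-reasoning model is basically just a base-URL swap. A minimal sketch; the model slug and env var name are assumptions, so check OpenRouter's model list for the current id:

    # Sketch: DeepSeek's non-reasoning model via OpenRouter's OpenAI-compatible API.
    # The "deepseek/deepseek-chat" slug is an assumption; verify it on OpenRouter.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],  # your OpenRouter key
    )

    resp = client.chat.completions.create(
        model="deepseek/deepseek-chat",  # assumed slug for the non-reasoning model
        messages=[{"role": "user", "content": "Write a one-liner that reverses a string."}],
    )
    print(resp.choices[0].message.content)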
I haven't really stayed up on all the AI specific GPUs, but are there really cards with 300GB of VRAM?
My local HPC went for the 120GB version, though, with 4 per node.
We are in this together! Hoping for more models to come from the labs in varying sizes that will fit on devices.
1. Donationware - Let's be real, tokens are expensive, and if they asked everyone to chip in voluntarily, people wouldn't, and Ollama would go bust quickly.
2. Subscriptions (bootstrapped, no VCs) - as with 1, people would have to pay for the cloud service as a subscription for it to be sustainable (would you?), or it goes bust.
3. Ads - Ollama could put ads in the free version and let users pay for a higher tier to remove them, a somewhat reasonable compromise, except developers don't like ads and don't like paying for their tools unless their company does it for them. No users = Ollama goes bust.
4. VCs - This is the current model, which is why they have a cloud product, and it keeps the main product free (for now). Again, if they cannot make money or sell to another company, Ollama goes bust.
5. Fully open source (and 100% free) with Linux Foundation funding - Ollama could also go this route, but it would mean they're no longer a business for investors, and they'd rely on the Linux Foundation's sponsors (Google, IBM, etc.) to keep the LF, and thus the project, sustainable. The cloud product might stay for enterprises.
Ollama has already taken money from investors, so they need to produce a return for them, which means 5 isn't an option in the long term.
6. Acquisition by another company - Ollama could get acquired and the product wouldn't change* (until the acquirer jacks up prices or messes with the product) which ultimately kills it anyway as the community moves on.
I don't see any other path where Ollama avoids enshittification, short of someone making a quick buck along the way.
You just need to avoid VC-backed tools and pay for bootstrapped ones without any ties to investors.
Ollama gives me, essentially, a wrapper for llama.cpp and convenient hosting where I can download models.
I'm happy to pay for the bandwidth, plus a premium to cover their running this service.
I'm furthermore happy to pay a small charge to cover the development that they've done and continue to do to make local-inference easy for me.
Me neither. The mistake they made was taking outside investment; now they're no longer in full control and will eventually have to at least give the impression they give a shit about the investors, and it'll come at the cost of the users one way or another.
Please pay for the tools you use that are independently developed; we really need more community funding of projects so we can avoid this never-ending spiral of VC-fueled-and-killed tools.
But I understand that the added zeros on the (maybe) future payout when you take VC funds are hard to ignore; I'm not blaming them for anything, really.
I don’t know how much Ollama contributes to llama.cpp
If nothing else, Ollama is free publicity for llama.cpp, at least when they acknowledge they're mostly using the work of llama.cpp, which has happened at least once! I found llama.cpp by first finding Ollama and then figured I'd rather avoid the lock-in of Ollama's registry, so ended up using llama.cpp for everything.
Could you share a bit more of what you do with llama.cpp? I'd rather use llama-server, but it seems to require a good amount of fiddling with the parameters to get good performance.
We end up fiddling with other parameters because they give better performance for a particular setup, so it's well worth it. One example is the recent --n-cpu-moe switch, which offloads experts to the CPU while still filling all available VRAM and can give a 50% boost on models like gpt-oss-120b (rough sketch below).
After tasting this, not using it is a no-go. Meanwhile on Ollama there's an open issue asking for this: https://github.com/ollama/ollama/issues/11772
Finally, llama-swap separately provides the auto-loading/unloading feature for multiple models.
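For anyone curious what that looks like in practice, here's a rough sketch of such an invocation, wrapped in Python only to keep it self-contained; the GGUF path, layer counts, and port are placeholders, and the right --n-cpu-moe value depends on how much VRAM you have to fill:

    # Rough sketch: llama-server with MoE experts kept on the CPU.
    # Paths and numbers are placeholders to tune, not recommendations.
    import subprocess

    subprocess.run(
        [
            "llama-server",
            "-m", "/models/gpt-oss-120b-Q4_K_M.gguf",  # placeholder GGUF path
            "--n-gpu-layers", "999",   # offload as many layers as fit onto the GPU
            "--n-cpu-moe", "24",       # keep the MoE experts of the first N layers on CPU
            "--ctx-size", "16384",
            "--port", "8080",
        ],
        check=True,
    )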
Luckily, we have Jan.ai and LM Studio, which are happy to run GGUF models at full tilt on various hardware configs. Added bonus: both include a very nice API server as well.
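As a side note, those built-in servers speak the same OpenAI-style API, so pointing a client at them is a one-line change. A tiny sketch; the port below is LM Studio's usual default and just an assumption, so use whatever your app reports:

    # Sketch: talking to a local LM Studio / Jan server over its OpenAI-compatible API.
    # Base URL, port, and model id are assumptions; adjust to your own setup.
    from openai import OpenAI

    local = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
    resp = local.chat.completions.create(
        model="local-model",  # placeholder; use the id your server lists
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)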
GGUF/GGML was something like the 4th iteration of llama.cpp's quantized file format, and I remember having to consciously start watching the bandwidth usage reported by my ISP. Up to that point, I had never received an email warning me about hitting the limit of my 2TB cap. All for the same models, just in different formats. TheBloke was pumping out models like he had unlimited time and energy.
I say all that to say: llama.cpp was still trying, dare I say inventing, all of these things throughout those transitions. Ollama came in to make the running part easier and less dependent on CLI flags, building off of llama.cpp. Awesome.
GG and company are down in the trenches of model architectures with CUDA, Vulkan, CPU, ROCm, etc. They are working on perplexity and token processing/generation, and just look at the 'bin' folder when you compile the project. There are so many different aspects to making the whole thing work as well as it does. It's amazing that we have llama-server at all, given the amount of work that has gone into making llama.cpp.
All that to say, Ollama shit the bed on attribution. They were called out on r/localllama very early on for not really giving credit to llama.cpp. They have a soiled reputation with the people who participate in that subreddit, at least. If I remember correctly, they were also called out for not contributing back, which further stained their reputation among the folks who hang out there.
So it's not about how "easy" it was to build what Ollama built... At least from the perspective of someone who has been paying close attention on r/localllama, the problem was/is simply the perception (right or wrong) captured by the meme: Person 1 builds a thing -> Person 2: "You built this?" -> takes it -> holds it up -> "I built this." A simple act that really pissed off the community in general.
Supporting models, so Ollama can then "support" them too.
If you use the llama.cpp server, it's quite a nice experience. You can even download models directly from Hugging Face.
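To make that concrete, recent llama.cpp builds let llama-server pull a GGUF straight from a Hugging Face repo with -hf and then serve it over the built-in OpenAI-compatible endpoint. A quick sketch, wrapped in Python to keep it self-contained; the repo id is just an example, and the flag spelling is worth double-checking against your build's --help:

    # Sketch: llama-server fetching a GGUF from Hugging Face and serving it locally.
    # The repo id is an example placeholder; -hf expects <user>/<repo>[:quant].
    import subprocess

    subprocess.run(
        [
            "llama-server",
            "-hf", "ggml-org/gemma-3-1b-it-GGUF",  # example HF repo id
            "--port", "8080",
        ],
        check=True,
    )
    # Once it's up, http://localhost:8080/v1/chat/completions speaks the OpenAI API.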