"how does a smart car compare to a ford f150? its different in its intent and intended audience.
Ollama is someone who goes to walmart and buys a $100 huffy mountain bike because they heard bikes are cool. Torchchat is someone who built a mountain bike out of high quality components chosen for a specific task/outcome with the understanding of how each component in the platform functions and interacts with the others to achieve an end goal." https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
A longer answer, with a few more details:
If you don't care which quant you're using and just want easy integration with desktop/laptop-based projects, use Ollama. If you want to run on mobile, integrate natively into your own apps or projects, don't want to use GGUF, want to do your own quantization, or want to extend your PyTorch-based solution, use torchchat.
Right now Ollama (based on llama.cpp) is a faster way to get good performance on a laptop/desktop, and a number of projects are pre-integrated with Ollama thanks to the OpenAI spec. It's also more mature, with more fit and polish. That said, the commands that make everything easy use 4-bit quant models, and you have to do extra work to go find a GGUF model with a higher (or lower) bit quant and load it into Ollama. Also worth noting is that Ollama "containerizes" the models on disk, so you can't share them with other projects without going through Ollama, which is a hard pass for some users and use cases since duplicating model files on disk isn't great. https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
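To make the "pre-integrated thanks to the OpenAI spec" point concrete, here's a minimal sketch of talking to a local Ollama server through the standard OpenAI Python client, assuming Ollama is already running on its default port and a model has been pulled (the "llama3" tag is just an example):

    # Minimal sketch: any OpenAI-compatible client can point at a local Ollama server.
    # Assumes Ollama is running on its default port and a model has already been
    # pulled; the "llama3" tag below is just an example.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # required by the client, ignored by Ollama
    )

    resp = client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": "Summarize the GGUF format in one sentence."}],
    )
    print(resp.choices[0].message.content)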
So you can browse locally, and remotely too if you're able to expose the service by adjusting your router.
Cloudflare will also expose services remotely if you wish: https://developers.cloudflare.com/cloudflare-one/connections...
You can also run any LLM privately with Ollama, LM Studio, and/or LLamaSharp on Windows, Mac, and iPhone; all are open source, customizable, user friendly, and frequently maintained.
This allows running models on GPU as well.
I've been considering buying one of those powerful Ryzen mini PCs to use as an LLM server in my LAN, but I've read before that the AMD backend (ROCm IIRC) is kinda buggy.
But it seems like integrated GPUs are not supported.
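If you do go the Ryzen/ROCm route, a quick sanity check is to ask PyTorch itself whether it sees the GPU; a minimal sketch, assuming a ROCm build of PyTorch is installed (ROCm builds reuse the torch.cuda API):

    # Minimal sketch: check whether this PyTorch build can see a GPU.
    # ROCm builds reuse the torch.cuda API, so is_available() covers both cases.
    import torch

    print("GPU available:", torch.cuda.is_available())
    print("ROCm/HIP build:", torch.version.hip is not None)
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))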
I'm not saying the ability to swap isn't impressive, but I have trouble understanding how this integrates into my workflow, and I don't really want to put much effort into exploring it given that there are so many things to explore these days.
Of course, it could be that I've just gotten used to GPT-4 and my prompting has been optimized for GPT-4, and I try to apply the same techniques to other models where those prompts don't work as well.
And if you work on something where the commercial models are trained to refuse answers and lecture the user instead, some of the freely available models are much more pleasant to work with. With 70B models you even get a decent amount of reasoning capability.
What models did you try? There's a ton of new ones every month these days.
I also tried to hook it up to Claude, and so far it's flawless (I didn't do a lot of testing though).
Most "well-known-name" open-source ML models, are very much "base models" — they are meant to be flexible and generic, so that they can be fine-tuned with additional training for task-specific purposes.
Mind you, you don't have to do that work yourself. There are open-source fine-tunes as well, for all sorts of specific purposes, that can be easily found on HuggingFace / found linked on applicable subreddits / etc — but these don't "make the news" like the releases of new open-source base models do, so they won't be top-of-mind when doing a search for a model to solve a task. You have to actively look for them.
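As a hedged sketch of the "grab an existing fine-tune" path: with the transformers library, swapping the base model for a task-specific fine-tune is usually just a different repo id (the repo id below is a placeholder, not a recommendation):

    # Minimal sketch: loading a task-specific fine-tune instead of a base model.
    # The repo id is a placeholder; substitute whatever fine-tune you find on
    # HuggingFace for your task. device_map="auto" needs the accelerate package.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "some-org/some-task-finetune"  # placeholder repo id
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

    prompt = "Classify the sentiment of: 'great battery life'"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))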
Heck, even focusing on the proprietary-model Inference-as-a-Service space, it's only really OpenAI that purports to have a "general" model that can be set to every task with only prompting. All the other proprietary-model Inf-aaS providers also sell Fine-Tuning-as-a-Service of their models, because they know people will need it.
---
Also, if you're comparing e.g. ChatGPT-4o (~200b) with a local model you can run on your PC (probably 7b, or maybe 13b if you have a 4090) then obviously the latter is going to be "dumber" — it's (either literally, or effectively) had 95+% of its connections stripped out!
For production deployment of an open-source model with "smart thinking" requirements (e.g. a customer-support chatbot), the best-practice open-source-model approach would be to pay for dedicated and/or serverless hosting where the instances have direct-attached, dedicated server-class GPUs and can therefore host the largest-parameter-size variants of the open-source models. Larger-parameter-size open-source models fare much better against the proprietary hosted models.
IMHO, the models in the "hostable on a PC" parameter-size range mainly exist for two use cases:
• doing local development and testing of LLM-based backend systems (Due to the way pruning+quantizing parameters works, a smaller spin of a larger model will be probabilistically similar in behavior to its larger cousin — giving you the "smart" answer some percentage of the time, and a "dumb" answer the rest of the time. For iterative development, this is no problem — regenerate responses until it works, and if it never does, then you've got the wrong model/prompt.)
• "shrinking" an AI system that doesn't require so much "smart thinking", to decrease its compute requirements and thus OpEx. You start with the largest spin of the model; then you keep taking it down in size until it stops producing acceptable results; and then you take one step back.
The models of this size-range don't exist to "prove out" the applicability of a model family to a given ML task. You can do it with them — especially if there's an existing fine-tuned model perfectly suited to the use-case — but it'll be frustrating, because "the absence of evidence is not evidence of absence." You won't know whether you've chosen a bad model, or your prompt is badly structured, or your prompt is impossible for any model, etc.
When proving out a task, test with the largest spin of each model you can get your hands on, using e.g. a serverless Inf-aaS like Runpod. Once you know the model family can do that task to your satisfaction, then pull a local model spin from that family for development.
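A rough sketch of that workflow, assuming both the hosted and local endpoints speak the OpenAI API (the URLs, key, and model names below are placeholders):

    # Sketch of "prove out on the largest spin, develop on a small local spin".
    # The endpoint URLs, API key, and model names are placeholders for whatever
    # hosted Inf-aaS and local runtime you actually use.
    from openai import OpenAI

    PROMPT = [{"role": "user", "content": "Extract the order id from: 'order #4412 is late'"}]

    # 1. Prove out the task against the largest spin of the model family, hosted remotely.
    hosted = OpenAI(base_url="https://example-inference-provider/v1", api_key="PLACEHOLDER")
    resp = hosted.chat.completions.create(model="big-model-70b", messages=PROMPT)
    print(resp.choices[0].message.content)

    # 2. Once the family clearly handles the task, iterate locally against a small
    #    spin of the same family (e.g. served by Ollama or another local runtime).
    local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    resp = local.chat.completions.create(model="small-model-7b", messages=PROMPT)
    print(resp.choices[0].message.content)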
Have you had good results from any of these? I've not tried a model that's been fine-tuned for a specific purpose yet, I've just worked with the general purpose ones.
Last time I played with LLMs on CPU with PyTorch, you had to replace some stuff with libraries from Intel, otherwise your performance would be really bad.
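For reference, that Intel swap is mostly packaged up these days as intel-extension-for-pytorch; a minimal sketch, assuming it's installed alongside a CPU build of PyTorch:

    # Minimal sketch: CPU inference with Intel Extension for PyTorch (IPEX).
    # Assumes `pip install intel-extension-for-pytorch` next to a CPU PyTorch build.
    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
    model = ipex.optimize(model, dtype=torch.bfloat16)  # apply Intel CPU optimizations

    with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
        out = model(torch.randn(1, 512))
    print(out.shape)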
How about making libtorch a first class citizen without crashes and memory leaks? What happened to the "one tool, one job" philosophy?
As an interesting thought experiment: Should PyTorch be integrated into systemd or should systemd be integrated into PyTorch? Both seem to absorb everything else like a black hole.
It's just a showcase of existing PyTorch features (including libtorch) as an end-to-end example.
On the server side it uses libtorch, and on mobile it uses PyTorch's ExecuTorch runtime (which is optimized for edge devices).
What makes this cool is that you can use the same model and the same library and apply them to server, desktop, laptop, and mobile (iOS and Android), with a variety of quantization schemes and other features.
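For flavor, a generic PyTorch 2 export sketch (not torchchat's exact pipeline): torch.export captures a model into one portable graph, which is the kind of artifact that server-side (libtorch/AOT) and edge-side (ExecuTorch) runtimes are built to consume:

    # Generic PyTorch 2 export sketch (not torchchat's exact pipeline): capture a
    # model into a portable ExportedProgram, the common starting point for both
    # server-side and ExecuTorch lowering paths.
    import torch

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(16, 4)

        def forward(self, x):
            return torch.relu(self.linear(x))

    example_input = (torch.randn(2, 16),)
    exported = torch.export.export(TinyModel().eval(), example_input)

    # The captured graph can be re-materialized and run wherever PyTorch runs.
    print(exported.module()(*example_input).shape)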
Definitely still some rough edges as I'd expect from any first software release!