"how does a smart car compare to a ford f150? its different in its intent and intended audience.
Ollama is someone who goes to walmart and buys a $100 huffy mountain bike because they heard bikes are cool. Torchchat is someone who built a mountain bike out of high quality components chosen for a specific task/outcome with the understanding of how each component in the platform functions and interacts with the others to achieve an end goal." https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
A longer answer, with a few more details:
If you don't care which quant you're using and just want easy integration with desktop/laptop-based projects, use Ollama. If you want to run on mobile, integrate natively into your own apps or projects, don't want to use GGUF, want to do your own quantization, or want to extend your PyTorch-based solution, use torchchat.
Right now Ollama (based on llama.cpp) is a faster way to get good performance on a laptop/desktop, and a number of projects are pre-integrated with Ollama thanks to the OpenAI spec. It's also more mature, with more fit and polish. That said, the commands that make everything easy use 4-bit quant models, and you have to do extra work to go find a GGUF model with a higher (or lower) bit quant and load it into Ollama. Also worth noting is that Ollama "containerizes" the models on disk, so you can't share them with other projects without going through Ollama, which is a hard pass for some users and use cases since duplicating model files on disk isn't great. https://www.reddit.com/r/LocalLLaMA/comments/1eh6xmq/comment...
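To make the "pre-integrated thanks to the OpenAI spec" point concrete, here's a minimal sketch of talking to a local Ollama server through the standard OpenAI Python client, assuming Ollama is already running on its default port and a model has been pulled (the "llama3" tag is just an example):

    # Minimal sketch: any OpenAI-compatible client can point at a local Ollama server.
    # Assumes Ollama is running on its default port and a model has already been
    # pulled; the "llama3" tag below is just an example.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # required by the client, ignored by Ollama
    )

    resp = client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": "Summarize the GGUF format in one sentence."}],
    )
    print(resp.choices[0].message.content)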
So you can browse locally, and remotely too if you're able to expose the service by adjusting your router.
Cloudflare will also expose services remotely if you wish: https://developers.cloudflare.com/cloudflare-one/connections...
You can also run any LLM privately with Ollama, LM Studio, and/or LLamaSharp on Windows, Mac, and iPhone; all are open source, customizable, user friendly, and frequently maintained.
This allows running models on GPU as well.
I've been considering buying one of those powerful Ryzen mini PCs to use as an LLM server in my LAN, but I've read before that the AMD backend (ROCm IIRC) is kinda buggy.
But it seems like integrated GPUs are not supported.
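If you do go the Ryzen/ROCm route, a quick sanity check is to ask PyTorch itself whether it sees the GPU; a minimal sketch, assuming a ROCm build of PyTorch is installed (ROCm builds reuse the torch.cuda API):

    # Minimal sketch: check whether this PyTorch build can see a GPU.
    # ROCm builds reuse the torch.cuda API, so is_available() covers both cases.
    import torch

    print("GPU available:", torch.cuda.is_available())
    print("ROCm/HIP build:", torch.version.hip is not None)
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))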
I'm not saying the ability to swap isn't impressive, but I have trouble understanding how this integrates into my workflow, and I don't really want to put much effort into exploring it given that there are so many things to explore these days.
Of course, it could be that I've just gotten used to GPT-4 and my prompting has been optimized for GPT-4, and I try to apply the same techniques to other models where those prompts don't work as well.
And if you work on something where the commercial models are trained to refuse answers and lecture the user instead, some of the freely available models are much more pleasant to work with. With 70B models you even get a decent amount of reasoning capability.
What models did you try? There's a ton of new ones every month these days.
I also tried to hook it up to Claude, and so far it's flawless (I didn't do a lot of testing though).
Most "well-known-name" open-source ML models, are very much "base models" — they are meant to be flexible and generic, so that they can be fine-tuned with additional training for task-specific purposes.
Mind you, you don't have to do that work yourself. There are open-source fine-tunes as well, for all sorts of specific purposes, that can be easily found on HuggingFace / found linked on applicable subreddits / etc — but these don't "make the news" like the releases of new open-source base models do, so they won't be top-of-mind when doing a search for a model to solve a task. You have to actively look for them.
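As a hedged sketch of the "grab an existing fine-tune" path: with the transformers library, swapping the base model for a task-specific fine-tune is usually just a different repo id (the repo id below is a placeholder, not a recommendation):

    # Minimal sketch: loading a task-specific fine-tune instead of a base model.
    # The repo id is a placeholder; substitute whatever fine-tune you find on
    # HuggingFace for your task. device_map="auto" needs the accelerate package.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "some-org/some-task-finetune"  # placeholder repo id
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

    prompt = "Classify the sentiment of: 'great battery life'"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))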
Heck, even focusing on the proprietary-model Inference-as-a-Service space, it's only really OpenAI that purports to have a "general" model that can be set to every task with only prompting. All the other proprietary-model Inf-aaS providers also sell Fine-Tuning-as-a-Service of their models, because they know people will need it.
---
Also, if you're comparing e.g. ChatGPT-4o (~200b) with a local model you can run on your PC (probably 7b, or maybe 13b if you have a 4090) then obviously the latter is going to be "dumber" — it's (either literally, or effectively) had 95+% of its connections stripped out!
For production deployment of an open-source model with "smart thinking" requirements (e.g. a customer-support chatbot), the best-practice open-source-model approach would be to pay for dedicated and/or serverless hosting where the instances have direct-attached, dedicated server-class GPUs and can therefore host the largest-parameter-size variants of the open-source models. Larger-parameter-size open-source models fare much better against the proprietary hosted models.
IMHO, the models in the "hostable on a PC" parameter-size range mainly exist for two use cases:
• doing local development and testing of LLM-based backend systems (Due to the way pruning+quantizing parameters works, a smaller spin of a larger model will be probabilistically similar in behavior to its larger cousin — giving you the "smart" answer some percentage of the time, and a "dumb" answer the rest of the time. For iterative development, this is no problem — regenerate responses until it works, and if it never does, then you've got the wrong model/prompt.)
• "shrinking" an AI system that doesn't require so much "smart thinking", to decrease its compute requirements and thus OpEx. You start with the largest spin of the model; then you keep taking it down in size until it stops producing acceptable results; and then you take one step back.
The models of this size-range don't exist to "prove out" the applicability of a model family to a given ML task. You can do it with them — especially if there's an existing fine-tuned model perfectly suited to the use-case — but it'll be frustrating, because "the absence of evidence is not evidence of absence." You won't know whether you've chosen a bad model, or your prompt is badly structured, or your prompt is impossible for any model, etc.
When proving out a task, test with the largest spin of each model you can get your hands on, using e.g. a serverless Inf-aaS like Runpod. Once you know the model family can do that task to your satisfaction, then pull a local model spin from that family for development.
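A rough sketch of that workflow, assuming both the hosted and local endpoints speak the OpenAI API (the URLs, key, and model names below are placeholders):

    # Sketch of "prove out on the largest spin, develop on a small local spin".
    # The endpoint URLs, API key, and model names are placeholders for whatever
    # hosted Inf-aaS and local runtime you actually use.
    from openai import OpenAI

    PROMPT = [{"role": "user", "content": "Extract the order id from: 'order #4412 is late'"}]

    # 1. Prove out the task against the largest spin of the model family, hosted remotely.
    hosted = OpenAI(base_url="https://example-inference-provider/v1", api_key="PLACEHOLDER")
    resp = hosted.chat.completions.create(model="big-model-70b", messages=PROMPT)
    print(resp.choices[0].message.content)

    # 2. Once the family clearly handles the task, iterate locally against a small
    #    spin of the same family (e.g. served by Ollama or another local runtime).
    local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    resp = local.chat.completions.create(model="small-model-7b", messages=PROMPT)
    print(resp.choices[0].message.content)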
Have you had good results from any of these? I've not tried a model that's been fine-tuned for a specific purpose yet, I've just worked with the general purpose ones.
Last time I played with LLMs on CPU with PyTorch, you had to replace some stuff with libraries from Intel, otherwise your performance would be really bad.
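For reference, that Intel swap is mostly packaged up these days as intel-extension-for-pytorch; a minimal sketch, assuming it's installed alongside a CPU build of PyTorch:

    # Minimal sketch: CPU inference with Intel Extension for PyTorch (IPEX).
    # Assumes `pip install intel-extension-for-pytorch` next to a CPU PyTorch build.
    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
    model = ipex.optimize(model, dtype=torch.bfloat16)  # apply Intel CPU optimizations

    with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
        out = model(torch.randn(1, 512))
    print(out.shape)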
How about making libtorch a first class citizen without crashes and memory leaks? What happened to the "one tool, one job" philosophy?
As an interesting thought experiment: Should PyTorch be integrated into systemd or should systemd be integrated into PyTorch? Both seem to absorb everything else like a black hole.
It's just a showcase of existing PyTorch features (including libtorch) as an end-to-end example.
On the server side it uses libtorch, and on mobile it uses PyTorch's ExecuTorch runtime (which is optimized for edge devices).
What makes this cool is that you can use the same model and the same library and apply them to server, desktop, laptop, and mobile (iOS and Android), with a variety of quantization schemes and other features.
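For flavor, a generic PyTorch 2 export sketch (not torchchat's exact pipeline): torch.export captures a model into one portable graph, which is the kind of artifact that server-side (libtorch/AOT) and edge-side (ExecuTorch) runtimes are built to consume:

    # Generic PyTorch 2 export sketch (not torchchat's exact pipeline): capture a
    # model into a portable ExportedProgram, the common starting point for both
    # server-side and ExecuTorch lowering paths.
    import torch

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(16, 4)

        def forward(self, x):
            return torch.relu(self.linear(x))

    example_input = (torch.randn(2, 16),)
    exported = torch.export.export(TinyModel().eval(), example_input)

    # The captured graph can be re-materialized and run wherever PyTorch runs.
    print(exported.module()(*example_input).shape)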
Definitely still some rough edges as I'd expect from any first software release!