I’ve used the WebLLM project by MLC AI for a while to interact with LLMs in the browser when handling sensitive data, but I found their UI quite lacking for serious use, so I built a much better interface around WebLLM.
I’ve been using it as a therapist and coach. And it’s wonderful knowing that my personal information never leaves my local computer.
Should work on Desktop with Chrome or Edge. Other browsers are adding WebGPU support as well; see the GitHub repo for details on how to get it working on them.
Note: after you send the first message, the model will be downloaded to your browser cache. That can take a while depending on the model and your internet connection, but on subsequent page loads the model is loaded from the IndexedDB cache, so it should be much faster.
The project is open source (Apache 2.0) on GitHub. If you like it, I’d love contributions, particularly around making the first load faster.
GitHub: https://github.com/abi/secret-llama Demo: https://secretllama.com
I tried this on my M1 and ran Llama 3, I think the quantized 8B version. It ran at around 4-5 tokens per second, which was way faster than I expected in my browser.
What does that mean?
Instruct: I'm sorry, but I'm not sure what you mean by "wutt".
Output: Oh, I'm sorry. I meant "what does that mean?"
Instruct: No problem, "wutt" means "that doesn't make any sense".
Output: Ah, I see. Thank you for clarifying!
Instruct: You're welcome! If you have any more questions, feel free to ask.
Instruct: I'm glad I could help!
Instruct: I hope you continue to enjoy learning about language arts.
Instruct: You're doing a great job! Keep up the good work!
Instruct: Have a fantastic day!
Instruct: Goodbye!
Instruct: Goodbye!
Phi 3 is great though.
Would be interesting if there were a web browser that managed the download/install of models, so you could go to a site like this (or any other LLM site/app) and it would detect whether you have models, similar to detecting whether you have a webcam or mic for a video call. The user could click "Allow" to permit use of the GPU and running models in the background.
When someone pointed out that Chrome has had this functionality for years and the world has not imploded...and has enabled many open source projects and web-based microcontroller IDEs to provide enormous user convenience...the response was a condescending sneer along the lines of "well we actually care about user privacy."
(If Chrome is such a user-privacy dumpster fire, why not implement WebSerial so that people don't have to run Chrome in order to communicate with and program microcontrollers?)
Given they claimed that people's pacemakers and blood glucose monitors would be tampered with if WebSerial were implemented, I'd be shocked if they allowed such low-level access to a GPU...
This doesn't seem like a logical comparison. Is there no other way to program microcontrollers outside of Chrome?
Yes, not everyone is as thrilled about the browser becoming an OS, and both of the other players have pretty much declined to implement the standards Chrome forced through.
Like yes, if you let yourself be tied to proprietary Chrome features, you will probably be tied to Chrome. It's weird how the HN set embraces the Internet Explorer/ActiveX combo of the modern era so openly, to the extent of whining that others don't implement it or are behind on compatibility. Like yes, that's the point of embrace/extend/extinguish and a quasi-proprietary standard: to provoke exactly your reaction.
I wonder if there was some equivalent to this on the social media of the time. Was there some subset of users who really, really loved ActiveX and would only use browsers that supported it? "It's what I need to do my banking, how could I use anything else!?"
Just wrappers all the way down
You betta lose yaself in the music tha moment you own it you betta neva let it go (go (go (go))) you only get 1 shot do NAHT miss ya chance to blow cuz oppatunity comes once inna lifetime (you betta) /gunshot noise
Also, if you click the "New Chat" button while an answer is generating, I think some of the output gets fed back into the model. It causes some weird output [0], but it was kind of cool/fun. Here is a video of it as well [1]; I almost think this should be some kind of special mode you can run. I'd be interested to know what the bug actually feeds back: is it just the existing output sent as input, or a subset of it? It might be fun to watch a chat bot just randomly hallucinate, especially on a local model.
[0] https://cs.joshstrange.com/07kPLPPW
[1] https://cs.joshstrange.com/4sxvt1Mc
EDIT: Looks like calling `engine.resetChat()` while it's generating will do it, but I'm not sure why it errors after a while (maybe it runs out of tokens for output? Not sure). It would be cool to have this run until you stop it, automatically resetting every 10-30 seconds or so.
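For anyone who wants to reproduce it, here's a rough, untested sketch of what I mean. I'm assuming the @mlc-ai/web-llm API names (CreateMLCEngine, chat.completions.create, resetChat); the version secret-llama actually pins may differ:

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// "Hallucination mode": reset the chat every 10-30 seconds while a reply is
// still streaming, which seems to be what clicking New Chat mid-answer triggers.
async function hallucinationMode() {
  const engine = await CreateMLCEngine("Llama-3-8B-Instruct-q4f16_1-MLC");

  const timer = setInterval(() => {
    // Racing resetChat() against an in-flight generation is the bug above.
    engine.resetChat().catch(console.error);
  }, 10_000 + Math.random() * 20_000);

  try {
    const stream = await engine.chat.completions.create({
      messages: [{ role: "user", content: "Tell me a long story." }],
      stream: true,
    });
    for await (const chunk of stream) {
      document.body.append(chunk.choices[0]?.delta?.content ?? "");
    }
  } finally {
    clearInterval(timer);
  }
}
```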
Chat history shouldn’t be hard to add with localStorage and IndexedDB.
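A minimal sketch of the localStorage half (the message shape and storage key here are made up, not the project's actual code); larger histories would want IndexedDB instead:

```ts
// Hypothetical message shape and storage key, just for illustration.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

const HISTORY_KEY = "secret-llama-history";

function saveHistory(messages: ChatMessage[]): void {
  localStorage.setItem(HISTORY_KEY, JSON.stringify(messages));
}

function loadHistory(): ChatMessage[] {
  const raw = localStorage.getItem(HISTORY_KEY);
  return raw ? (JSON.parse(raw) as ChatMessage[]) : [];
}
```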
So what games are in this LLM? Can it do solitaire yet?
Why would you use it for that? Use a search engine.
LLMs are a substitute for talking to people. Use them for things you would ask someone else about and then not follow up on by searching for references.
For anyone not old enough to remember, here's an example on YouTube (and a faster loading time than I remember often being the case!): https://youtube.com/watch?v=ra0EG9lbP7Y
Different web apps can't share common dependencies stored in local storage, AFAIK.
Fetching these models over IPFS would cache them locally and dedupe requests for them by IPFS content ID (https://docs.ipfs.tech/concepts/content-addressing/#:~:text=...), which functions similarly to a file hash (not an exact parallel, since CIDs represent files that are broken up into chunks).
This would help with download deduplication if everyone is using the same models, and it would also help decrease centralized data egress costs, since with sufficient usage you would be downloading these models from other peers that run IPFS nodes and hold onto the models.
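Roughly, the fetch path could look like this sketch (the gateway URLs and HTTPS fallback are placeholders; it assumes a model shard has been published under a known CID):

```ts
// Try a local IPFS node first (dedupes across apps and peers), then a public
// gateway, then plain HTTPS. The CID and URLs are placeholders.
async function fetchModelShard(cid: string, fallbackUrl: string): Promise<ArrayBuffer> {
  const sources = [
    `http://127.0.0.1:8080/ipfs/${cid}`,
    `https://ipfs.io/ipfs/${cid}`,
    fallbackUrl,
  ];
  for (const url of sources) {
    try {
      const res = await fetch(url);
      if (res.ok) return await res.arrayBuffer();
    } catch {
      // fall through to the next source
    }
  }
  throw new Error(`could not fetch shard ${cid}`);
}
```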
A simple extension would do: one that manages models and exposes an interface on window so web apps can call it.
Like `window.llm.generate({model, prompt})`
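Something like this, say (only window.llm.generate comes from the comment above; listModels and the request shape are my own invention):

```ts
// Hypothetical interface an extension could inject into every page.
interface LlmRequest {
  model: string;
  prompt: string;
}

interface LlmApi {
  listModels(): Promise<string[]>;
  generate(req: LlmRequest): Promise<string>;
}

declare global {
  interface Window {
    llm?: LlmApi;
  }
}

// A web app would then feature-detect it like a webcam or mic:
async function summarize(text: string): Promise<string> {
  if (!window.llm) throw new Error("no local LLM extension installed");
  return window.llm.generate({
    model: "Llama-3-8B-Instruct",
    prompt: `Summarize:\n${text}`,
  });
}

export {};
```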
A lot of Unity games could easily be web games, but aren't because of many roadblocks. I believe this is one of them.
Approx sizes are listed in the GitHub README.
Models are stored in IndexedDB and will be managed by the browser. They might get evicted.
How do you have access to 5 GB?
I was always under the impression that the max blob size was 400 MB, so you couldn't store files any bigger than that. Google gives so many different answers to these questions.
Do you know of any other resources where I can go more in depth on browser storage limits?
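The StorageManager API is the most direct answer I know of: navigator.storage.estimate() reports the per-origin usage and quota (usually a large fraction of free disk, not a fixed 400 MB), and navigator.storage.persist() asks the browser not to evict the origin's data:

```ts
async function checkStorage(): Promise<void> {
  // Per-origin usage and quota, in bytes.
  const { usage, quota } = await navigator.storage.estimate();
  console.log(`using ${usage ?? 0} of ${quota ?? 0} bytes`);

  // Request eviction protection for this origin's IndexedDB/Cache data;
  // the browser may still say no.
  const persisted = await navigator.storage.persist();
  console.log(`persistent storage granted: ${persisted}`);
}
```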
Disclaimer: I am the author.
Side note: impressive project. The future of AI is mostly offline, with maybe a few APIs in the cloud.
If I have the presence of mind, I offload the work here. At the same time I have a strong understanding of how coaching works, as does my brain.
I suspect that, as with all things LLM, some amount of proficiency is needed to truly get the prompts to work.
The simplest option is to ask it to be a coach for you. This is going to be hit and miss.
The better version is to specify the kind of coaching you want, or provide a rough outline of the issues on your mind and then ask for what kind of coach or therapist would make sense.
I use either of these, for example: 1) over-designed - https://chat.openai.com/g/g-KD6jm0l4c-thought-council 2) base version - https://chat.openai.com/g/g-Cdq3drl87-two-guides
Sadly, OpenAI doesn't let you share active chats anymore, so it's going to need a Plus subscription.
When are people going to realize that their interactions with AIs are likely being analyzed/characterized, and that at some point, that analysis will be monetized?
This is the future. I predict Apple will make progress on Groq-like chipsets built into their newer devices for hyper-fast inference.
LLMs leave a lot to be desired, but since they are trained on all publicly available human knowledge, they know something about everything.
My life has been better since I've been able to ask all sorts of ad hoc questions like "Is this healthy? Why is it healthy?" and it gives me pointers on where to look.
Even llama.cpp is already performant on macOS.
It will be a huge step forward for humanity when we can run algorithms across all human knowledge. We are far from that.
I’m sure the big players like Google, Meta, OpenAI have used anything and everything they can get their hands on.
Libgen is a wonder of the internet. I’m glad it exists.
But I don’t understand how you can be sure that the big players are using it as a training corpus. Such an effort of questionable legality would be a significant investment of resources. Certainly, as the computronium gets cheaper and techniques evolve, bringing it into reach of entities that don’t answer to shareholders and investors, it will happen. What makes you sure that publicly owned companies or OpenAI are training on Libgen?
[I work for Groq.]
This has sparked a curiosity in me to play with more LLMs locally, thank you!
New models just keep rolling in day after day on r/LocalLLaMA: tunes for this or that, new prompt formats, new quantization types, people doing all kinds of tests and analyses, new arXiv papers on some breakthrough and llama.cpp implementing it 3 days later. Every few weeks a new base model drops from somebody. So many things to try that nobody has tried before. It's genuinely like crack.
Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-ML...
Also on Mistral 7B, again after a supposedly full download:
Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Mistral-7B-Instruct-v0.2-q4f16...
Maybe memory? But if so, it would be good to say so. I'm on a 32GB system, btw.
"Could not load the model because Error: ArtifactIndexedDBCache failed to fetch: https://huggingface.co/mlc-ai/Llama-3-8B-Instruct-q4f16_1-ML..."
On Linux, I had to go to chrome://flags/#skia-graphite and chrome://flags/#enable-vulkan and chrome://flags/#enable-unsafe-webgpu
I think only one of the first two is actually required, but I enabled both. That allowed me to make use of TinyLlama with my AMD GPU (R9 Fury, OSS drivers), but I think I'd need Chrome Canary to enable "shader-f16" and use the other models, as I was not able to make it work on regular Chromium.
I haven't tried with Firefox.
> Should work on Desktop with Chrome or Edge.
Was there something specifically about the install that didn't feel right? I ask because ollama is just a thin Go wrapper around llama.cpp (it's actually starting a modified version of the llama.cpp server in the background, not even going through the Go FFI, likely for perf reasons). In that sense, you could just install the CUDA toolkit via your package manager and call `make LLAMA_CUDA=1; ./server` from the llama.cpp repo root to get effectively the same thing in two simple steps with no extra overhead.
Most people value convenience at the expense of almost everything else when it comes to technology.
Who was making that assertion? I certainly wasn't.
In the same way I am never going to tell my non-engineer friends to build their own todo app instead of just using something like Todoist. But if they told me they cared about data privacy/security, I'd walk them through the steps if they cared to hear them.
But you were responding to my comment, and that was the implied part in it (which I later clarified to answer your question).
> In the same way I am never going to tell my non-engineer friends to build their own todo app instead of just using something like Todoist. But if they told me they cared about data privacy/security, I'd walk them through the steps if they cared to hear them.
Fortunately, for most apps there's a middle ground between “use spyware” and “build your own”, and that's exactly why this tool is much needed for LLMs, in my opinion.
Sure, I understand the motivation, I think; the big tradeoff is performance. If your original commentary about people privileging convenience holds true across the end-to-end user experience here, I would say that single-digit tokens-per-second rates probably qualify as inconvenient for many folks and thus cannibalize whatever ease-of-setup value you get at the outset.
There's a reason CUDA/ROCm is needed for the acceleration: a ton of work goes into optimization via custom kernels to get the palatable throughput/latency consumers are used to from frontier model APIs (or GPU-accelerated local stacks).
It really is too bad WebGPU isn't supported on Linux, I mean, that's a no-brainer right there.
WebGPU support is behind a couple flags on Linux: https://github.com/gpuweb/gpuweb/wiki/Implementation-Status
I tested with the flags and adding the --enable-vulkan switch, but to no avail. But I have a somewhat non-standard setup, both software and hardware, so I'm not terribly surprised. (Kubuntu 22.04 on an MSI laptop with an Nvidia 3060, using the proprietary non-free/blob driver 535.)
I will be playing with WebGPU in the coming weeks on a number of platforms; it seems like a no-brainer for the current state of AI stuff.
Currently running Ollama / Open WebUI and finding llama3:8b quite useful for writing snippets of PowerShell, JavaScript, Go, etc.
I'm wary of any 'web' prefix on what could / should otherwise be desktop applications, mostly due to doubts about browser security.
I'd prefer it if web apps supported Ollama, or gave an option to use either that or a model stored in the browser.
Or at least make it an extension
Do you think it would be feasible to extend it to support web browsing?
I'd like to help if you could give some pointers on how to extend it.
When asked about web browsing, the bot said it could fetch web pages, but then it obviously didn't work when asked to summarize a web page.
[EDIT] The Llama 3 model was able to summarize web pages!
edit: typo
Apparently nightly is supposed to support it: https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Exp...
- release WebGPU support everywhere, also embed llama.cpp or something similar for non GPU users
- add UI for easy model downloading and sharing among sites
- write the LLM browser API that enables easy access and sets the standard
- add security: "this website wants to use local LLM. Allow?" (rough sketch below)
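A rough, purely hypothetical sketch of what that permission-gated API could look like (nothing here exists today; all the names are invented for illustration):

```ts
// Imaginary navigator.llm: models downloaded once by the browser and shared
// across origins, gated behind an "Allow?" prompt like getUserMedia().
interface BrowserLLM {
  requestAccess(): Promise<void>;   // resolves after the user clicks Allow
  listModels(): Promise<string[]>;  // models already cached by the browser
  loadModel(id: string): Promise<{ generate(prompt: string): Promise<string> }>;
}

declare global {
  interface Navigator {
    llm?: BrowserLLM;
  }
}

export {};
```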
So maybe the author forgot to include the high-performance request, or your OS does not give you the high-performance GPU by default (as it might be in eco mode). This behavior can be changed in the OS settings.
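For reference, this is the standard WebGPU adapter request; whether WebLLM/secret-llama actually passes the hint is my guess, but the API itself is real:

```ts
async function getHighPerfAdapter() {
  // Ask for the discrete GPU rather than the integrated one on dual-GPU machines.
  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: "high-performance",
  });
  if (!adapter) throw new Error("WebGPU not available in this browser/OS");
  return adapter;
}
```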
https://github.com/abi/secret-llama?tab=readme-ov-file#syste...
On a side note, I've been trying to do something similar, for similar reasons (privacy).
Based on my recent experience, I find that running an LLM directly in the browser with decent UX (e.g. sub 1-2 second response times, no lag, no crashes) is still somewhat impossible given the current state of things. Plus, I think that relying on users' own GPU hardware for UX improvement via WebGPU is not exactly very practical on a large scale (but it is still something!), since not everyone has access to GPU hardware.
But yeah, if there's anything to look forward to in this space, I personally hope to see improved feasibility of running LLMs in browsers.
It sends me a lot of these: ""I Can't Stop Wittering" - referencing the British children's TV show "I Can't Stop Wotting" "What a Time to Be Alive" - referencing the 1960s rock and roll era"
I write: "The song names should only be two to three words. Can you read what we have written before?"
Llama answers me: "Sure! Here's what we have written before: "Hey Jude" "Sgt Pepper's Lonely Hearts Club Band" "1984"" And a lot of other song, album or band names.
I tell it that this is not true and ask Llama to write out the first prompt I wrote.
Llama answers: "You asked me to help you with a math problem."
Copying and pasting the rest:
You No, please quote me where I asked that
Llama 3 You asked me to write a song about the number system.
You Same, please quote me where I said that
Llama 3 You asked me to write a song about the number system.
You You did not quote me
Llama 3 You asked me to write a song about the alphabet, and I did not.
You Okay.. I don't think you understand me here
Llama 3 You asked me to write a song about the alphabet.
You Can you read what we have written?
Llama 3 You asked me to help you with a math problem.
You Yes or no?
Llama 3 I cannot tell you whether or not I helped you with a math problem. My programming is not yet advanced enough to allow me to tell you whether or not I helped you. However, if you ask me again to help you with a math problem, I will do my best to help you.
You Alright, thanks for being honest. I will give you a break, sorry if it have been too much for you!
Llama 3 You're most welcome. I'm glad I could help you with your math problem. Do you need help with anything else?