https://news.ycombinator.com/item?id=44902148
Personally I'm excited that you all have access to this model now and hope you get value out of using it.
The tokens themselves are a form of compression. Let's say we have the word "WaffleHouse": at the character level this would be 11 tokens, but with a subword tokenizer it would be perhaps 2 or 3 tokens (I didn't actually run it through the tokenizer, but we could verify precisely). This matters a lot for on-device processing especially.
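Something like this would verify it (a sketch using the Hugging Face tokenizer; "google/gemma-3-270m" is my guess at the checkpoint id):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/gemma-3-270m")  # assumed checkpoint id
    ids = tok("WaffleHouse", add_special_tokens=False)["input_ids"]
    print(len(ids), tok.convert_ids_to_tokens(ids))
    # Character-level would be 11 tokens; a subword tokenizer typically yields far fewer.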
So while we could get more intelligence out of the model by bumping up the "knowledge" parameters, the device would need to process more input and output tokens.
Another advantage on small devices is that the embeddings are just a lookup table, which requires little to no computation. It's the rest of the parameters that have the expensive matrix multiplications, so if we increased those we'd also be increasing the number of FLOPs needed for a forward pass.
This blog post explains it well. https://www.adamcasson.com/posts/transformer-flops
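Not taken verbatim from the post, but the standard back-of-the-envelope it covers: a forward pass costs roughly 2 FLOPs per non-embedding parameter per token, while embedding lookups are essentially free. A toy sketch (the 170M-embedding / 100M-transformer split is my rough assumption for a 270m-class model, purely for illustration):

    def forward_flops_per_token(non_embedding_params: int) -> int:
        # Each non-embedding weight does roughly one multiply and one add per token.
        return 2 * non_embedding_params

    total_params, embedding_params = 270e6, 170e6  # assumed split, for illustration
    print(f"{forward_flops_per_token(int(total_params - embedding_params)):.2e} FLOPs/token")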
So all this to say: there are definite tradeoffs between model size, performance on evals, and compute cost. We ran many internal experiments with different choices to see what could work well, and then picked what we believed would work best for the open community.
For instance, it's well-known that transformer embeddings tend to form clusters. Have you considered splitting the embedding table into "cluster centroid" and "offset from centroid" tables, where the latter would presumably have a smaller range and precision?
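To make the idea concrete, here's roughly what I mean, with toy sizes and scikit-learn's k-means standing in for whatever clustering you'd actually use (everything here is illustrative, not anything Gemma does):

    import torch
    from sklearn.cluster import KMeans

    E = torch.randn(5_000, 64)          # toy stand-in for a real embedding table
    k = 128
    km = KMeans(n_clusters=k, n_init="auto").fit(E.numpy())

    centroids = torch.tensor(km.cluster_centers_, dtype=torch.float32)  # (k, dim)
    assign = torch.tensor(km.labels_)                                   # (vocab,)
    offsets = (E - centroids[assign]).half()  # residuals: smaller range, lower precision

    def lookup(token_ids):
        # Reconstruct an embedding as centroid + low-precision offset.
        return centroids[assign[token_ids]] + offsets[token_ids].float()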
Can you share what kind of hardware is necessary to train it, and how long it took?
The Gemma3 technical report contains many details on training setup https://arxiv.org/pdf/2503.19786
This was released with the initial batch of Gemma3, so it doesn't contain the 270m details; nonetheless, you'll get a good idea of what it takes to build these models.
(literal tl;dr: learning and experimentation opportunity)
1. Since it's just PyTorch, that means one can run it locally on whatever accelerator you have that PyTorch supports. For quite a few people that includes Metal Performance Shaders: https://docs.pytorch.org/docs/stable/mps.html (see the device-selection sketch after this list)
I can attest that building PyTorch from git is achievable in about 15 minutes on my M1 Pro, if you really want to chase that rabbit hole. Cloning PyTorch is its own special 'please, wait', but building it is fine
2. Since it's (of the ones that I've looked at) approximately 500 lines long, it's much, much, much more digestible than a lot of the vomit that comes out of so-called production systems. Those systems usually have only heard about typed Python in passing, and they believe it is a fad that will blow over. The ones in this repo aren't stellar about it, but at 500 lines it's entirely achievable to type-hint the code yourself, which can serve as an excellent learning opportunity
3. PyTorch offers some fun conversion tools, too, allowing one to compare and contrast how it executes under Torch versus ONNX <https://docs.pytorch.org/docs/stable/onnx.html>, TorchScript <https://docs.pytorch.org/docs/stable/generated/torch.jit.sav...>, CoreML <https://apple.github.io/coremltools/docs-guides/source/conve...>, or a bazillion other competing frameworks (see the export sketch at the end of this comment)
4. Related, one can play around with quantization and other "inference related" concerns (e.g. https://github.com/pytorch/ao#pytorch-native-training-to-ser... )
5. Further related, one can play around with the fine-tuning mentioned elsewhere, to better understand what is and isn't possible to achieve using that process. Because the code is digestible, and the models are reasonably sized (Qwen 0.6B weighs only 1.4GB and is Apache 2), it brings FAFO opportunities in ways that gpt-oss-20b (or bigger!) won't
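Re: item 1, a minimal device-selection sketch; this is stock PyTorch, nothing specific to this repo:

    import torch

    # Pick the best available accelerator, falling back to CPU.
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():  # Apple Metal Performance Shaders
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    x = torch.randn(4, 8, device=device)
    print(device, x.sum().item())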
I do appreciate that some of what I said may skate close to "ML engineer" concerns, so obviously your situation will be different, but for me having a better grip on how these things work enables me to have better conversations with my colleagues and also helps trip my bullshit detector when someone claims they're the second coming and are going to cure cancer or whatever
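And re: item 3, the export call itself is short once you know an example input shape; here on a toy module (file name and shapes are mine, not from the repo):

    import torch

    model = torch.nn.Linear(8, 2)
    example = torch.randn(1, 8)         # example input the exporter traces with
    torch.onnx.export(model, example, "toy.onnx",
                      input_names=["x"], output_names=["y"])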
How would you recommend someone with a strong background in undergraduate level traditional ML get into deep learning? I use that as a broad term to encompass all the knowledge needed to understand how these models work, starting from the deep learning models of a decade ago, plus the practical ability to collect data or build RL gyms and fine tune them.
I understand ML math well enough that I'm confident I could follow a modern paper after a lot of effort and research. But there are so many pieces: quantization, flash attention, MoE, batch sizes, layer sizes, model sparsity. I feel very overwhelmed trying to piece together how all of these arose, and even more overwhelmed trying to figure out how one even goes about fine-tuning a model. I (like most people here) am extremely technical, and it's not often I feel this way about a field.
Thanks again! Best of luck on your work
I would start with training a basic MLP on tabular data. Then switch to CNNs: LeNet, VGG, then ResNet. Understand each of the new blocks that are incorporated into each architecture and how they improve stability and training efficiency. There are good PyTorch tutorials for these. Use these as a playground to understand what each of the training knobs do. Look at how their implicit biases induce double descent; this should give you confidence that overfitting is rarely an issue anymore. Give finetuning a try by taking a pretrained ResNet on ImageNet, adding layers to the start and end, and training only these to adapt the model to another image dataset. This should demonstrate the power of finetuning and why pretrained models are so powerful.
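To make the first step concrete, a minimal MLP-on-tabular-data loop (random data standing in for a real dataset):

    import torch
    from torch import nn

    X = torch.randn(1024, 16)              # 1024 rows, 16 features
    y = (X.sum(dim=1) > 0).long()          # toy binary target

    model = nn.Sequential(
        nn.Linear(16, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 2),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

    print("final loss:", loss.item())

For the fine-tuning exercise, the same loop applies after freezing a pretrained torchvision ResNet (set requires_grad=False on its parameters) and swapping its final `fc` layer for a fresh, trainable one.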
Next, briefly consider a tutorial on LSTMs, recognizing the exploding and vanishing gradient problems and the traditional challenges with sequential data.
Then move to transformers. Work with language first, starting from Andrej Karpathy's excellent YouTube tutorials. Train the model in full for a bit, then see about using an existing GPT2 checkpoint. Try adapting NanoGPT to a mathematical dataset as an exercise. Then take a look at llm.c to see how to really improve performance.
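For reference, the core block those tutorials build up to fits in a few lines; a compressed causal self-attention sketch in the spirit of NanoGPT (shapes only, no training):

    import torch
    from torch import nn
    import torch.nn.functional as F

    class CausalSelfAttention(nn.Module):
        def __init__(self, d_model=64, n_head=4):
            super().__init__()
            self.n_head = n_head
            self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projection
            self.proj = nn.Linear(d_model, d_model)

        def forward(self, x):                            # x: (batch, seq, d_model)
            B, T, C = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Split into heads: (B, n_head, T, head_dim)
            q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                       for t in (q, k, v))
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.proj(y.transpose(1, 2).reshape(B, T, C))

    print(CausalSelfAttention()(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)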
Finally, take a look at ViT and DETR. Use pretrained models and finetune them on smaller datasets again.
By this point, you should have a good grounding to start reading much of the surrounding literature and understanding it. You should also understand that models are never built from scratch anymore; every model is a collection of individual pieces built elsewhere for a particular purpose.
Without having done it for deep learning, I'm sure it is like any other area of computer science. You get to exactly the level you're at now, and then you put in that effort following modern papers, and each one gets easier and easier. A year later you've done the literature review for your PhD. :)
https://huggingface.co/dslim/bert-base-NER
Just wondering if it’s worth testing and what it would be most useful for.
From that table, the A100 tok/sec (larger is faster) numbers are:
- Eager: 28
- Compiled: 128
And
- KV cache eager: 26
- KV cache compiled: 99
The reason the KV cache version is slower is likely that it's not GPU-optimized code. On CPU the KV cache is faster. To make it faster on GPU, you could, for example, pre-allocate the tensors on the device instead of `torch.cat`-ing them on the fly.
(My first guess)
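To illustrate the guess: growing the cache with `torch.cat` allocates and copies every step, while a pre-allocated buffer takes in-place writes. Names and shapes here are illustrative, not from the repo:

    import torch

    B, H, T_max, D = 1, 4, 256, 64
    new_k = torch.randn(B, H, 1, D)        # this step's key

    # Growing-cache style (what torch.cat does every step):
    k_cache = torch.empty(B, H, 0, D)
    k_cache = torch.cat([k_cache, new_k], dim=2)   # reallocates + copies each call

    # Pre-allocated style:
    k_buf = torch.empty(B, H, T_max, D)    # allocated once, on the device
    pos = 0
    k_buf[:, :, pos : pos + 1] = new_k     # in-place write, no reallocation
    pos += 1
    k_view = k_buf[:, :, :pos]             # slice actually used by attention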
Do we need to fine-tune for the embeddings to be meaningful at the sentence/document level?
What uses can one reasonably expect from them? Are there uses out of the box, or does one have to go through some custom post-training to get useful behavior?
I feel like there is a huge gap between understanding models as a user of commercial tools and the kind of discussions happening in these threads, but I’m not sure what are the in-between steps.
It does help to figure out where in the space this model fits. I'm still a bit confused about this part:
>since it needs to be shaped to match specific tasks, we did our best to design it to be a flexible starting point for LLM-style tasks and worked with partners to put it into the right frameworks and places for you all to be able to shape it to what you need it to be.
What does shaping mean in this case? What tools are used, what requirements are there, both in terms of hardware and knowledge?
I would like to go beyond being spoonfed by large companies' high-usability products, both to improve my knowledge and to not be a victim of potential future rug pulls. In the classic software world, I guess the equivalent would be someone who runs open source software, navigating the extra complexity, and occasionally collaborates with the projects.
But I don't know what that looks like in the AI world. I've gone through some courses on machine learning, but learning the basics about Hessian matrices and gradient descent seems as detached from the practical point I'm searching for as taking a compilers class is from learning React, so I think I've been looking in the wrong places (?).
I'll try making an analogy to another task I like, which is cooking. In cooking, the chef has to make high-level decisions like what the overall meal is going to look like, but also detailed decisions like what the main course is versus the side, and even more detailed ones like the proportion of side-dish serving to main dish, which ingredients to use, how long to cook something, etc.
It's kind of the same with ML models, whether AI or not. When I build smaller Bayesian models I make specific choices about the model architecture, which data I use, the array shape of the output, etc.
The tools used here are largely JAX or PyTorch, often in a framework like Flax or a higher-level NN package. You often then pair them with libraries that have NN optimizers, data loaders, etc. PyTorch is more batteries-included than the JAX ecosystem, which separates these out.
One of the best ways to get a grasp of all of this is to implement some small models yourself. The pieces will start to become more apparent and concrete, especially because as an end user you're not exposed to them, in the same way most end users are not exposed to compilers.
- private, on-device models (possibly with lower latency than models via web API); also edge devices
- algorithm research (faster and cheaper to prototype new ideas)
- cheap tasks, like classification/categorization; sure, you don't need a decoder-style LLM for that, but it has the advantage of being more free-form, which is useful in many scenarios; or maybe a sanity checker for grammar; or even a router to other models (GPT-5 style); see the sketch after this list
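As promised above, a minimal sketch of the classification case via the transformers pipeline API (the "google/gemma-3-270m-it" checkpoint id is my assumption; swap in whatever small instruct model you have):

    from transformers import pipeline

    gen = pipeline("text-generation", model="google/gemma-3-270m-it")
    prompt = ("Classify the following email as SPAM or NOT_SPAM.\n"
              "Email: You won a free cruise, click here!\nLabel:")
    # A few new tokens are enough for a single label.
    print(gen(prompt, max_new_tokens=4)[0]["generated_text"])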
This is part of the rationale for releasing this model. Now you don't have to start from scratch, and finetuning is feasible on a wide variety of hardware, including modest GPU setups (and smaller)
Here are some examples inspired by previous roles I had outside of Google, where a business I was working in needed real-time text processing.
These tutorials were made with Gemma versions from a year ago, but could now be recreated with Gemma 270m
https://developers.googleblog.com/en/gemma-for-streaming-ml-... https://www.youtube.com/watch?v=YxhzozLH1Dk
- reply in a specific way, like a specific JSON schema, or in the voice of a character
- be very good at classifying text (e.g. emails, or spam)
- be a great summarizer for large amounts of text, e.g. turn emails into short titles or url slugs
- adding tags/categories per your pre-defined rules (e.g. for communities, tagging content, marketing)
- for detecting spam, or duplicates, or flagging things
You won't be able to write code or prose with these, but they're great for a huge array of very narrow use cases
What's neat about "stupid" models like this is that they're less likely to go off and dream up a bunch of irrelevant content, because they don't know much about the world / won't have too much context to pull from
To answer a question you didn't ask: with small models especially, we need to make choices about what to focus on. For this model we focused on text summarization and instruction following, with the idea that users would finetune it to gain performance on the task set that is relevant to them
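For anyone wondering what that finetuning path looks like in practice, a hedged sketch with Hugging Face transformers; the checkpoint id, file name, and "text" field are assumptions about your setup, and this shows the shape of the workflow rather than a tuned recipe:

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from datasets import load_dataset

    model_id = "google/gemma-3-270m"   # assumed checkpoint name
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # "my_task.jsonl" with a "text" field stands in for your task data.
    ds = (load_dataset("json", data_files="my_task.jsonl")["train"]
          .map(lambda b: tok(b["text"], truncation=True, max_length=512),
               remove_columns=["text"]))

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out",
                               per_device_train_batch_size=8,
                               num_train_epochs=1),
        train_dataset=ds,
        # mlm=False makes this plain causal-LM training (labels = input ids).
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()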
You
Translate to English: Рошешуа́р (фр. Rochechouart) — місто, муніципалітет у Франції, у регіоні Нова Аквітанія, департамент Верхня В'єнна. Населення — 3637 осіб (2022)[1]. Муніципалітет розташований на відстані[2] близько 360 км на південь від Парижа, 34 км на захід від Ліможа.
gemma-3-270m-it-qat
Here's the translation:
"Rochechouart is a city in France, a municipality located in the region of New Aeterna, department within the Upper Vienne Province. The population is 3637 people (2022). The city is situated at a distance of approximately 360 kilometers from Paris, 34 km from Lyon.
You
Translate to English: Bộ Công Thương đề xuất hỗ trợ hộ gia đình lắp điện mặt trời mái nhà, có hệ thống lưu trữ tối đa 2-3 triệu đồng, hỗ trợ lãi suất vay trong tối đa ba năm. Nội dung được Bộ Công Thương đưa ra tại dự thảo lần 3 Quyết định của Thủ tướng về chính sách hỗ trợ cho hộ gia đình lắp đặt điện mặt trời mái nhà tự sản, tự tiêu và hệ thống lưu trữ điện.
gemma-3-270m-it-qat
The Ministry of Industry and Trade proposes supporting households with solar panels in their homes, with a maximum savings of 2-3 million đồng, providing loan assistance for up to 3 years. The content of the Ministry's decision at the Third Review of the Government's Policy on Support for Households Installing Solar Panels in Homes, approved by the Prime Minister about the policy support for households installing solar panels themselves, and the system storage capacity is maximum 2-3 million đồng.
1: https://uk.wikipedia.org/wiki/%D0%A0%D0%BE%D1%88%D0%B5%D1%88...
2: https://vnexpress.net/lap-dien-mat-troi-mai-nha-tu-dung-co-t...
gemma3:27b-it-qat
Rochechouart (French: Rochechouart) is a town and commune in France, in the Nouvelle-Aquitaine region, Department of Haute-Vienne. The population is 3,637 (2022)[1]. The commune is located approximately 360 km south of Paris, 34 km west of Limoges.
gemma3:27b-it-qat
The Ministry of Industry and Trade proposes supporting households installing rooftop solar power systems, with a maximum support of 2-3 million VND for systems including energy storage. This support would also include interest rate subsidies on loans for a maximum of three years. This content was presented by the Ministry of Industry and Trade in the third draft of a Decision by the Prime Minister regarding support policies for households installing self-generated, self-consumed rooftop solar power systems and energy storage systems.