Google released Gemma 4 yesterday — April 2, 2026 — and the coverage has mostly focused on the benchmarks and the fact that it outperforms previous versions. Both things are true and both things are, honestly, the least interesting part of the story.
The interesting part is this: Google just released a model built on the same research as Gemini 3 — their best proprietary AI — under an Apache 2.0 licence. That licence means you can use it commercially, modify it, deploy it in a product, and Google cannot come back later and change the terms. No subscription. No per-token costs. No data sent to a server somewhere. You download the model. You run it. That's it.
For developers and companies who have been watching the open-source AI space with one eyebrow permanently raised — waiting for the inevitable "we're changing our terms" moment — this matters. Apache 2.0 is as permissive as it gets.
That's the real headline. The benchmark numbers will be debated and contextualised over the coming weeks. The licence is a fact.
What Gemma 4 Actually Is
Gemma 4 is a family of open-weight large language models from Google DeepMind. It's not one model — it's three, designed for different hardware and use cases.
E2B — the smallest option at 2.3 billion effective parameters. Don't let the size fool you. It's three times faster than its predecessor and runs on hardware that would have laughed at running AI two years ago: phones, Raspberry Pi, NVIDIA Jetson embedded devices. It supports text and image input.
E4B — 4.5 billion effective parameters, still multimodal (text and images), with better reasoning capability than the E2B. This is the sweet spot for most developers: capable enough for serious tasks, small enough to run on a laptop without melting it.
27B — the flagship option, a 26-billion parameter mixture-of-experts architecture with 128 experts and 3.8 billion parameters active during any single inference. In plain English: it's a large model that's engineered to behave like a much smaller one when running, which makes it practical on consumer hardware. A single RTX 4090 can run it in 4-bit quantised form. The full unquantised version needs an H100, which puts it firmly in server/cloud territory.
All three support a 128,000 token context window — that's roughly 100,000 words of input, enough to feed an entire book and ask questions about it. All three support 140-plus languages. And all three are up to four times faster than Gemma 3, with up to 60 percent less battery consumption on mobile hardware.
The multimodal capability — text, images, and audio input on E2B and E4B — is new and significant. Previous Gemma versions were text-only. This brings the smaller models much closer to what you'd expect from a cloud API.
What's Actually New vs Gemma 3
Speed and efficiency are the headline improvements, and they're real. Four times faster inference is not a marginal gain — it's the difference between a model that feels like a tool and one that feels like waiting.
The multimodal upgrade is the other meaningful change. Gemma 3 was text-in, text-out. Gemma 4's E2B and E4B models accept images and audio alongside text, which opens up a much wider range of applications — document analysis, image description, audio transcription — all running locally, without any external API.
The architectural change in the 27B model is also worth noting. Mixture-of-experts is an approach where the model routes each input through a subset of its total parameters rather than running everything every time. The result is a model with 26 billion parameters that behaves more like a 4-billion parameter model in terms of compute during inference. This is how you get a large capable model that still runs on a single consumer GPU.
Function calling, chain-of-thought reasoning, tool use, structured output, and system prompts are all natively supported across the family. For developers, this matters: you don't need to hack these capabilities in through prompt engineering. They're built in.
The Android AICore Developer Preview is also worth mentioning for anyone building mobile applications. It's forward compatible with Gemini Nano 4, which means apps built on Gemma 4 now have a cleaner path to upgrading without architectural rewrites.
What hasn't changed: this is still Google's open-weight offering, not Google's best model. Gemini 3 Ultra remains the proprietary flagship. Gemma 4 is built on the same research, but the weights are different. Think of it as the best Google is willing to give away, not everything Google has.
Who This Is For
Developers Who Don't Want Cloud API Costs or Data Exposure
If you're building an application that processes user data — documents, emails, health records, financial information — sending that data to a cloud API creates real problems. Legal exposure, compliance overhead, user trust issues. Running inference locally eliminates all of that.
Gemma 4 makes this practical at a scale that wasn't possible before. The E4B model running on a laptop-class GPU can handle most business document tasks — summarisation, classification, extraction — fast enough to feel responsive in a production application. The cost per query after hardware is essentially zero.
For startups and indie developers, the economics are also worth thinking about. A cloud LLM API at scale is a meaningful cost centre. A model you run yourself, after the initial hardware investment, has a different cost curve entirely.
Companies That Need On-Device AI Without a Google Dependency
The Apache 2.0 licence is the specific thing that makes Gemma 4 interesting for enterprise use. It means you can deploy it in a product, modify the weights (within the terms of the licence), and there is no mechanism by which Google can later restrict your use or change the pricing on you.
This is not a hypothetical concern. The open-source AI space has seen enough "we're adjusting our terms" announcements that legal and procurement teams at larger organisations have started asking hard questions about model licences. Apache 2.0 is a known quantity. It answers those questions.
For companies with genuinely sensitive data — defence adjacent work, healthcare, legal — on-device inference with a commercially permissive licence is the only viable path. Gemma 4 is now a serious option in that conversation.
Hobbyists and Tinkerers
The Raspberry Pi angle is not marketing. The E2B model genuinely runs on ARM hardware at inference speeds that are actually useful. If you've been following the local AI movement — running models on consumer hardware, building home automation pipelines, experimenting with private assistants — Gemma 4 is a significant step forward in what's achievable without spending serious money.
The Ollama installation path (more on that below) makes this accessible without deep ML engineering knowledge. If you can follow terminal instructions, you can run Gemma 4 locally within fifteen minutes.
How It Compares to the Alternatives
This section needs a caveat upfront: Gemma 4 was released yesterday. Real-world testing is still emerging. The benchmarks Google published were designed by Google. Independent evaluation takes time. What follows is based on specs, architectural comparisons, and early community reports — not definitive verdicts.
vs Llama 4 (Meta)
Llama 4 is Meta's most capable open release and Gemma 4's most direct competitor. Both use mixture-of-experts architecture at the larger sizes. Both have competitive licences. Based on what's public so far, Llama 4 Scout and Maverick are the relevant comparisons to Gemma 4's 27B — broadly similar capability tiers, though real-world performance varies significantly by task.
Where Gemma 4 has a clear edge in the specs: the smaller models. Llama 4's smallest useful option is less efficient on constrained hardware than Gemma 4's E2B, which is specifically engineered for mobile and edge deployment. If you're targeting a phone or a Raspberry Pi, Gemma 4 currently looks like the better option. If you're running on server hardware and want maximum capability, both are worth testing.
vs Mistral
Mistral has been the default recommendation for self-hosted open-source AI for a couple of years now. It has a mature ecosystem, solid documentation, and a community that's had time to develop tooling and best practices.
Gemma 4 is newer and larger at the equivalent tier. Whether it performs better on your specific tasks is something you'll need to test — general benchmarks don't always translate to your use case. What Gemma 4 adds that Mistral's smaller models don't currently match is the multimodal capability and the tighter integration with mobile hardware.
vs Phi-4 (Microsoft)
Phi-4 is Microsoft's small model, competing in the same efficiency-focused tier as Gemma 4's E2B and E4B. Phi-4 has performed well on reasoning benchmarks relative to its size. Early comparisons suggest Gemma 4 E4B is competitive on reasoning tasks, but again — one day of real-world data is not enough to draw firm conclusions.
The practical difference is hardware optimisation. Gemma 4 has specific optimisations for NVIDIA RTX hardware and Android devices. If your deployment target is NVIDIA consumer GPUs or Android phones specifically, Gemma 4 currently has an edge in that pipeline.
What we genuinely don't know yet: how these models perform over weeks of real production use, what edge cases and failure modes emerge, and how the community tooling develops around Gemma 4. These things matter, and they take time to discover.
How to Actually Get Started
There are three sensible paths depending on what you want to do.
Ollama (Recommended for Local Use)
Ollama is the simplest way to run Gemma 4 locally. If you don't have it installed:
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Then pull and run Gemma 4
ollama run gemma4
For the specific size variants:
ollama run gemma4:2b # E2B — fastest, runs on almost anything
ollama run gemma4:4b # E4B — better reasoning, still efficient
ollama run gemma4:27b # 27B — needs a capable GPU
Ollama handles quantisation and hardware detection automatically. On most consumer hardware it will select an appropriate quantisation level without you having to manage it. The first run downloads the model weights, which are several gigabytes depending on the variant — plan accordingly.
Google AI Studio (No Setup Required)
If you want to test Gemma 4 before committing to a local setup, Google AI Studio has it available now. It's free to use with a Google account, runs in the browser, and gives you access to the full capability without any local hardware requirements.
This is the right starting point if you want to evaluate whether Gemma 4 fits your use case before investing time in a local deployment. The experience is roughly analogous to using a cloud API — fast, convenient, but your data goes to Google's servers.
Hugging Face
For developers who want to integrate Gemma 4 into a Python pipeline, Hugging Face is the standard path:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-4b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-4-4b")
You'll need to accept the model terms on Hugging Face (a formality, not a licence restriction) and have enough GPU VRAM for the model size you choose. The transformers library handles the rest. Kaggle also hosts the weights if you prefer that ecosystem.
The Bigger Picture
Google releasing Gemma 4 on Apache 2.0 is not charity. It's strategy.
The context matters: Chinese AI labs — Moonshot AI, Alibaba, Z.AI and others — have been releasing competitive open-weight models at an accelerating pace. They're targeting international developers and companies who would otherwise use American models. Google's response, at least in part, is to make its open-source offering compelling enough that developers default to Gemma rather than Qwen or Kimi.
There's also the developer ecosystem argument. The more developers build on Gemma, the more the broader AI ecosystem orbits Google's research direction. Open-sourcing a capable model is a way of setting standards — training data expectations, API conventions, architectural patterns — without having to convince anyone to adopt them through mandate.
None of this means Gemma 4 isn't genuinely useful. It clearly is. But understanding why Google is giving this away helps you think about the risks. The Apache 2.0 licence substantially mitigates the licence-change risk. What remains is the question of continued investment: if Google's open-source AI strategy changes, will Gemma 5 come? The track record so far suggests yes — the Gemma series has been consistent — but it's worth acknowledging that you're building on something Google controls.
The deeper shift Gemma 4 represents is the continued movement of meaningful AI capability toward the edge. The performance gap between a model you can run on your phone and a model that requires a data centre is closing. Not closed — the best proprietary models are still substantially more capable than anything you can run locally. But the gap is narrowing, and the tasks that can be handled locally are expanding.
For a lot of real-world applications — document processing, summarisation, classification, question answering over private data — local models are already capable enough. Gemma 4 moves that line further.
Verdict
Gemma 4 is a serious open-source release that warrants attention, particularly the E4B model and its Apache 2.0 licence.
The practical case is strongest for three groups: developers who need local inference for privacy or cost reasons, companies with regulatory constraints that make cloud APIs awkward, and hobbyists who want capable AI running on consumer hardware. For all three, Gemma 4 is now a top-tier option.
The caveats are real. This is one day old. Real-world performance across diverse tasks will take weeks to properly evaluate. The benchmark numbers are impressive, but benchmarks are designed to be impressive — what matters is how the model performs on your specific workload.
What's encouraging is what doesn't need further evaluation: the licence, the hardware support, the architecture, and the clear engineering focus on efficiency. Those are facts, not claims to be tested.
The things to watch over the coming weeks: independent benchmark comparisons with Llama 4 and Mistral, community reports on specific use cases, and whether the Android AICore integration delivers on its promise for mobile developers.
For now: if you've been waiting for an open-source model that's genuinely capable, genuinely free to use commercially, and genuinely runs on the hardware you already have — this is worth your time.