Large Language Models (LLMs) are dominating the Artificial Intelligence landscape. But despite their impressive performance, these giant models usually require powerful GPUs and/or expensive cloud infrastructure.
The good news is that there are already several smaller, efficient, open-source models that can be run on a regular laptop or on desktops with modest GPUs.
Generally, models ranging from 100M to 5B parameters are considered SLMs (Small Language Models): they are lighter models, but still very capable.
For developers, researchers, and companies seeking more control, privacy, and above all, better cost-effectiveness, running models locally is no longer just an alternative, but a smart strategy.
Running a language model on your own hardware brings several advantages, such as:
No per-request API costs, since inference runs on your own hardware
Enhanced data privacy
Flexibility to customize and integrate AI directly into your workflows
The big question is: which models actually offer good performance without requiring super powerful machines?
Below, we present nine models you can run locally, most of which are available for commercial use.
1. Phi-4 Mini (Microsoft)
The Phi-4 family of models was developed to deliver high performance in natural language tasks while maintaining a compact and efficient architecture. By focusing on data quality instead of data quantity during training, these models compete with much larger ones despite their smaller size.
Parameters: 3.8B (Phi-4 Mini), 14.7B (Phi-4)
Ideal for: Applications that require efficient, low-cost natural language processing with strong logical reasoning and math capabilities; long-context processing; and resource-constrained environments such as mobile devices and edge computing.
Context window: 128k tokens
Where to find: Phi-4 Mini and Phi-4 Mini Reasoning
License: MIT
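To give a sense of how simple local inference can be, here is a minimal sketch that loads Phi-4 Mini with Hugging Face Transformers. The model id is an assumption based on the Hub’s naming convention; confirm it on the official model page before downloading.

```python
# Minimal local inference sketch with Hugging Face Transformers.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",  # assumed checkpoint name; verify on the Hub
    device_map="auto",  # uses a GPU if available, otherwise CPU (needs the accelerate package)
)

messages = [{"role": "user", "content": "Explain what a context window is in one sentence."}]
result = generator(messages, max_new_tokens=100)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```

The same pattern works for most of the models below: swap in the model id, and Transformers applies the right chat template for you.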
2. Gemma 3 (Google DeepMind)
Gemma 3 is a family of advanced and lightweight language models, designed to deliver state-of-the-art performance with computational efficiency. Built with the same technology that powers the Gemini 2.0 models, Gemma 3 models are ideal for running on resource-constrained devices such as laptops, desktops, and even smartphones (see here how to run Gemma 3 270M on a mobile phone).
Parameters: 270M, 1B, 4B, 12B, 27B
Ideal for: Tasks involving text and image processing, such as multimodal content analysis, projects seeking to integrate advanced AI into mobile devices and edge computing environments, long-context processing, and applications requiring multilingual support.
Context window: 128K tokens (4B, 12B, and 27B), 32K tokens (1B and 270M)
Where to find: Hugging Face
License: The models are released with open weights under the Gemma license. It is important to note the Gemma Terms of Use, which include specific requirements for distribution and modification.
3. Mistral 7B (Mistral AI)
Mistral-7B is a 7-billion-parameter language model developed by Mistral AI. It offers a balance between performance and computational requirements, making it a viable option for local execution on systems with adequate resources.
Parameters: 7B
Ideal for: AI applications that require efficient execution on resource-constrained devices, tasks such as text analysis, summarization, and translation, and applications requiring multilingual support.
Context window: 32k tokens
Where to find: Hugging Face
License: Apache 2.0
4. Llama 3.2 (Meta)
Llama 3.2 is a family of language models developed by Meta. Designed to be lightweight and efficient, they are ideal for local execution on devices with adequate resources, such as laptops and desktops (see my post about Llama 3.2 here).
Parameters: 1B and 3B
Ideal for: AI applications that require efficient execution on resource-constrained devices, tasks such as text analysis, summarization, and translation, applications needing multilingual support, and long-context processing.
Context window: 128k tokens
Where to find: Llama 3.2 1B and Llama 3.2 3B
License: Llama 3.2 Community License (Meta’s custom commercial license). It is important to note the requirements for distribution and modification. For example, if the number of monthly active users exceeds 700 million, an additional license must be requested from Meta.
5. Qwen3 (Alibaba Cloud)
Qwen3 is a family of language models developed by the Qwen team at Alibaba Cloud. The models balance strong language understanding, reasoning, and generation performance with efficient deployment at smaller scales. Trained on a large amount of multilingual data, Qwen3 supports a wide range of tasks and can be fine-tuned for specialized domains.
Parameters: 0.6B, 1.7B, 4B, 8B, 14B, 32B, 235B
Ideal for: AI applications that require efficient execution on resource-constrained devices, tasks such as text analysis, summarization, and translation, AI integration into mobile devices and edge computing environments, and applications requiring multilingual support, covering more than 100 languages and dialects.
Context window: 32k tokens
Where to find: Hugging Face
License: Apache 2.0
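A distinctive feature of Qwen3 is that its step-by-step “thinking” mode can be switched on or off at inference time. Below is a minimal sketch following the pattern documented on the Qwen3 model cards; the model id and the enable_thinking flag come from those cards, so verify them for the checkpoint you choose.

```python
# Sketch: toggling Qwen3's thinking mode through the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-4B"  # one of the smaller Qwen3 checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many liters are in 3.5 gallons?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # set True to let the model reason step by step first
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```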
6. DeepSeek-R1-Distill-Qwen (DeepSeek AI)
DeepSeek-R1-Distill-Qwen is a series of compact and efficient language models distilled from the powerful reasoning model DeepSeek-R1. Based on the Qwen 2.5 architecture, they were refined with data generated by DeepSeek-R1 to optimize reasoning abilities, especially for mathematical tasks, coding, and general problem-solving (see more about DeepSeek here).
Parameters: 1.5B, 7B, 14B, 32B
Ideal for: AI applications that require efficient execution on resource-constrained devices, such as advanced mathematics, coding, logic, or other challenging tasks.
Context window: 32k tokens
Where to find: Hugging Face
License: MIT
7. SmolLM3 (Hugging Face)
SmolLM3 is a series of language models developed by the Smol Models team at Hugging Face. Designed to deliver high performance with efficiency in compact models, SmolLM3 stands out for its hybrid reasoning capability, support for 6 languages (including Portuguese), and long-context processing. It is a robust option for applications that require strong performance in a compact format.
Parameters: 3B
Ideal for: AI applications requiring hybrid reasoning, such as virtual assistants, text analysis, summarization, and multilingual translation.
Context window: 64k tokens
Where to find: Hugging Face
License: Apache 2.0
8. Command R7B (Cohere Labs)
Command R7B is a 7-billion-parameter autoregressive language model developed by Cohere and Cohere Labs, focused on reasoning, RAG (retrieval-augmented generation), and tool use.
Parameters: 7B
Ideal for: Applications that require long-context processing, interactive chatbots and instruction-following assistants, RAG systems, and agents that combine multi-step reasoning with data retrieval.
Context window: 128k tokens
Where to find: Hugging Face
License: CC-BY-NC-4.0 (non-commercial use only)
9. gpt-oss (OpenAI)
gpt-oss is a series of open-weight language models from OpenAI, released under the Apache 2.0 license. Built on a Mixture-of-Experts (MoE) architecture, the models are designed to optimize performance and computational efficiency (see more about the model here).
Although larger than what is usually considered an SLM, gpt-oss is worth trying if you have a bit more hardware!
Parameters: 20B and 120B
Ideal for: Advanced reasoning, agentic workflows, and tool use, as well as tasks such as text analysis, summarization, and coding; the 20B model is designed to run on consumer hardware with around 16 GB of memory.
Context window: 128k tokens
Where to find: Hugging Face
License: Apache 2.0
How to run models locally?
Running models on your own computer is simpler than it seems. You can use libraries such as Hugging Face’s Transformers or tools like Ollama and LM Studio, which make it easier to download and run LLMs without depending on the cloud.
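For example, once Ollama is installed and running, you can pull a model from the command line (ollama pull llama3.2:3b) and query it through its local REST API. Here is a minimal sketch in Python; the model tag is just an example, so substitute whatever you have pulled:

```python
# Sketch: querying a model served by a local Ollama instance.
# Assumes Ollama is running on its default port and the model has been pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "llama3.2:3b",  # example tag; run `ollama list` to see yours
        "prompt": "Summarize the benefits of running language models locally.",
        "stream": False,  # return the complete answer instead of streaming tokens
    },
    timeout=120,
)
print(response.json()["response"])
```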
Each of these tools provides practical guides for running models directly on your computer, simplifying installation and execution.
To achieve the best performance, check whether your machine has enough memory and, when available, a compatible GPU. In addition, techniques such as quantization can reduce resource consumption. This way, you can test different models at low cost and with greater privacy, adapting them to your workflow.
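If memory is tight, 4-bit quantization is one of the most effective options. Here is a minimal sketch using Transformers with bitsandbytes, assuming an NVIDIA GPU and the bitsandbytes and accelerate packages installed; the model id is only an example, and some checkpoints require accepting a license on the Hub first.

```python
# Sketch: loading a model in 4-bit to cut memory use to roughly a quarter of fp16.
# Requires a CUDA GPU plus `pip install bitsandbytes accelerate`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example; any causal LM works

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a solid default
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed and stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```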
Conclusion
In today’s post, we saw that it is possible to run language models locally with great cost-effectiveness, without relying exclusively on the cloud. Even though not all of the models listed are classified as SLMs, each offers a viable alternative for different use cases, balancing quality, resource consumption, and privacy.
The choice of the ideal model will always depend on context: in some cases, the priority may be running on more modest machines; in others, ensuring higher performance even at a greater hardware cost. What matters most is recognizing that the ecosystem is expanding rapidly and already offers practical solutions for research, development, and personal use.
For those who want to follow the latest releases and compare performance, the LM Arena Leaderboard is an excellent reference, providing updated evaluations of models across different metrics.
👉 Comment below which of these models you are already using!
Hi Elisa,
Great post! This is a very good list of models.
Personally, I tend to use the 3B/4B models more often on CPU — specifically Gemma3, Llama3.2, and Qwen3. They provide decent responses.
I use Ollama (with Docker) for apps.
I really like LM Studio, but since it has some restrictions at work, I often use Jan for easily testing models in a ChatGPT-like interface. It also offers an OpenAI-like API, which can be handy if needed. Anyway, they are both easy to use.