Over the past few months, the editorial team at Computer Weekly's French sister title, LeMagIT, has been evaluating different versions of several free, downloadable large language models (LLMs) on personal machines. These LLMs currently include Google's Gemma 3, Meta's Llama 3.3, several versions of Mistral (Mistral, Mistral Small 3.1, Mistral Nemo, Mixtral), IBM's Granite 3.2, Alibaba's Qwen 2.5, and DeepSeek R1, which is primarily a reasoning overlay on top of "distilled" versions of Qwen or Llama.
The test protocol consists of trying to transform interviews recorded by journalists during their reporting into articles that can be published directly on LeMagIT. What follows is the LeMagIT team's experience:
We are assessing the technical feasibility of doing this on a personal machine and the quality of the output with the resources available. Let's make it clear from the outset that we have not yet managed to get an AI to work properly for us. The only point of this exercise is to understand the real possibilities of AI based on a concrete case.
Our test protocol is a prompt of around 1,500 tokens (6,000 characters, or two magazine pages) that explains to the AI how to write an article, plus an average of 11,000 tokens for the transcription of an interview lasting around 45 minutes. Such a prompt is generally too heavy to fit into the free window of an online AI. That's why it's a good idea to download an AI onto a personal machine, since processing remains free, whatever its size.
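As a rough feasibility check, the ratio implied above (about four characters per token in French) can be used to estimate whether a given prompt will fit into a model's context window. A real tokenizer would give the exact count; the file names here are illustrative.

```python
# Rough token estimate based on the ratio above (1,500 tokens ≈ 6,000
# characters, i.e. about 4 characters per token in French).
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    return int(len(text) / chars_per_token)

# Illustrative file names, not part of the original protocol.
instructions = open("prompt_instructions.txt", encoding="utf-8").read()
transcript = open("interview_transcript.txt", encoding="utf-8").read()

total = estimate_tokens(instructions) + estimate_tokens(transcript)
print(f"≈ {total} prompt tokens")
# Leave a couple of thousand tokens for the model's reply.
print("fits" if total + 2_000 <= 32_000 else "too big for a 32K window")
```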
The protocol is launched from the LM Studio community software, which mimics the online chatbot interface on the personal computer. LM Studio has a function for downloading LLMs directly. However, all the LLMs that can be downloaded free of charge are also available on the Hugging Face website.
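For anyone who prefers to script the protocol rather than paste it into the chat window, LM Studio also exposes an OpenAI-compatible local server. A minimal sketch, assuming that server is enabled on its default port, and that the model name and file names (which are assumptions here) match what is actually loaded:

```python
# Send the article-writing prompt to a model served locally by LM Studio.
# Assumes LM Studio's local server is enabled at http://localhost:1234/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is a placeholder

instructions = open("prompt_instructions.txt", encoding="utf-8").read()   # ~1,500 tokens of guidelines
transcript = open("interview_transcript.txt", encoding="utf-8").read()    # ~11,000 tokens of interview

response = client.chat.completions.create(
    model="gemma-3-27b-it",   # illustrative name: use whatever model LM Studio has loaded
    messages=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": transcript},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```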
What are the technical limitations?
Technically, the quality of the result depends on the amount of memory used by the AI. At the time of writing, the best result is achieved with an LLM of 27 billion parameters encoded on 8 bits (Google's Gemma, in the "27B Q8_0" version), with a context window of 32,000 tokens and a prompt length of 15,000 tokens, on a Mac with an Apple M-series Max SoC and 64 GB of RAM, 48 GB of which is shared between the processor cores (orchestration), the GPU cores (vector acceleration for searching for answers) and the NPU cores (matrix acceleration for understanding input data).
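As a rough illustration of that memory budget, here is a back-of-envelope sketch. The weight figure follows directly from the parameter count and bit width; the KV-cache figures use illustrative layer and width values rather than Gemma's exact architecture, so treat them as an order of magnitude only.

```python
# Back-of-envelope memory estimate for a 27B model at 8 bits with a 32K context.
def weights_gb(params_billion: float, bits_per_param: int) -> float:
    return params_billion * bits_per_param / 8          # 1e9 params x bytes per param ≈ GB

def kv_cache_gb(context_tokens: int, layers: int, kv_width: int,
                bytes_per_value: int = 2) -> float:
    # one key vector and one value vector of size kv_width, per layer, per token
    return 2 * layers * kv_width * context_tokens * bytes_per_value / 1e9

print(weights_gb(27, 8))                               # ≈ 27 GB of weights at Q8_0
print(kv_cache_gb(32_000, layers=48, kv_width=2048))   # ≈ 12.6 GB for the context (illustrative values)
# Add runtime buffers, the OS and other applications, and the 48 GB of
# shared memory in the test configuration is about the practical floor.
```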
In this configuration, the processing speed is 6.82 tokens per second. The only way to speed up processing without damaging the result is to opt for an SoC with a higher clock frequency, or with more processing cores.
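For perspective, a quick calculation of what that speed means in practice, with an assumed article length:

```python
# At the measured 6.82 tokens/second, generating a 1,000-token article
# (an assumed output length, not a figure from the tests) takes roughly
# two and a half minutes; ingesting the full prompt beforehand adds to that.
speed = 6.82
article_tokens = 1_000
print(f"{article_tokens / speed / 60:.1f} min")   # ≈ 2.4 min of generation
```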
In this configuration, LLMs with more parameters (32 billion, 70 billion, etc) exceed memory capacity and either don't even load, or generate truncated results (a single-paragraph article). With fewer parameters, they use less memory, but the quality of writing falls dramatically, with repetitions and unclear information. Using parameters encoded on fewer bits (3, 4, 5 or 6) significantly speeds up processing, but also reduces the quality of writing, with grammatical errors and even invented words.
Finally, the size of the context window in tokens depends on the size of the data to be supplied to the AI. It is non-negotiable. If this size saturates memory, then you should opt for an LLM with fewer parameters, which will free up RAM to the detriment of the quality of the final result.
What quality can we expect?
Our tests have resulted in articles that are well written. They have an angle, a coherent chronology of several thematic sections, quotations in the right place, a dynamic headline and a concluding sentence.
However, we have never managed to obtain a publishable article. Regardless of the LLM used, including DeepSeek R1 and its supposed reasoning abilities, the AI is systematically incapable of correctly prioritising the various points discussed during the interview. It always misses the point and often generates pretty but uninteresting articles. Occasionally, it will write an entire, well-argued speech to tell its readers that the company interviewed … has competitors.
LLMs are not all equal in the vocabulary and writing style they choose. At the time of writing, Meta's Llama 3.x produces sentences that are difficult to read and, to a lesser extent, sentences padded with adjectives but devoid of concrete information.
Surprisingly, the LLM that writes the most beautifully in French within the limits of the test configuration is the Chinese Qwen. Initially, the most competent LLM on our test platform was Mixtral 8x7B (with an x instead of an s), which mixes eight thematic LLMs, each with just 7 billion parameters.
However, the best options for fitting Qwen and Mixtral into the 48 GB of our test configuration are, for the former, a version with only 14 billion parameters and, for the latter, a version with parameters encoded on fewer bits. The former writes unclear and uninteresting information, even when mixed with DeepSeek R1; the latter is riddled with syntax errors.
The version of Mixtral with parameters encoded on 4 bits offered an interesting compromise, but recent developments in LM Studio, with a larger memory footprint, prevent the AI from working properly. Mixtral "8x7B Q4_K_M" now produces truncated results.
An interesting alternative to Mixtral is the very recent Mistral Small 3.1, with 24 billion parameters encoded on 8 bits, which, according to our tests, produces a quality fairly similar to Gemma 3. What's more, it is slightly faster, with a speed of 8.65 tokens per second.
What are the possible hardware optimisations?
According to the specialists interviewed by LeMagIT, the hardware architecture best suited to running LLMs is one where all the computing cores share the same RAM at the same time. In practice, this means using a machine based on a system-on-chip (SoC) processor where the CPU, GPU and NPU cores all have the same physical and logical access to the RAM, with data located at the same addresses for all the circuits.
When this is not the case – that is, when the personal machine has an external GPU with its own memory, or when the processor is indeed an SoC that integrates the CPU, GPU and NPU cores, but where each has access only to a dedicated part of the common RAM – then the LLMs need more memory to function. This is because the same data needs to be replicated in each part dedicated to the circuits.
So, while it is indeed possible to run an LLM with 27 billion parameters encoded on 8 bits on an Apple Silicon M-series Mac with 48 GB of shared RAM, using the same evaluation criteria, we would have to make do with 13 billion parameters on a PC where a total of 48 GB of RAM is divided between 24 GB for the processor and 24 GB for the graphics card.
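A simple sketch of that arithmetic, with an assumed overhead reserved for the context window, runtime buffers and the rest of the system:

```python
# Weights-only comparison of unified vs. split memory. The 8 GB overhead
# is an assumption, not a measured figure.
def max_params_billion(memory_gb: float, bits_per_param: int,
                       overhead_gb: float = 8.0) -> float:
    return (memory_gb - overhead_gb) * 8 / bits_per_param

print(max_params_billion(48, 8))   # unified 48 GB -> ~40B ceiling, so 27B fits comfortably
print(max_params_billion(24, 8))   # 24 GB GPU partition -> ~16B ceiling, hence ~13B models
```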
This explains the initial success of Apple Silicon M-based Macs for running LLMs locally, as this chip is an SoC where all the circuits benefit from UMA (unified memory architecture) access. In early 2025, AMD imitated this architecture in its Ryzen AI Max SoC range. At the time of writing, Intel's Core Ultra SoCs, which combine CPU, GPU and NPU, do not have such unified memory access.
How do you write a good prompt?
Writing the prompt that explains how to write a particular type of article is an engineering job. The trick to getting off to a good start is to give the AI a piece of work that has already been done by a human – in our case, a final article accompanied by the transcript of the interview it was written from – and to ask it what prompt it should have been given to do the same job. Around five very different examples are enough to determine the essential points of the prompt to be written for a particular type of article.
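A minimal sketch of this reverse-prompting step, again assuming LM Studio's local server is running and using illustrative file and model names:

```python
# Ask the local model what prompt it should have been given to turn this
# transcript into this published article. Names below are examples only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

published_article = open("published_article.txt", encoding="utf-8").read()
transcript = open("interview_transcript.txt", encoding="utf-8").read()

question = (
    "Here is an interview transcript, followed by the article a journalist "
    "wrote from it. What prompt should the journalist have given you so that "
    "you would produce this article from this transcript? "
    "Answer with the prompt only.\n\n"
    f"TRANSCRIPT:\n{transcript}\n\nARTICLE:\n{published_article}"
)

draft_prompt = client.chat.completions.create(
    model="gemma-3-27b-it",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

print(draft_prompt)   # a starting point, to be fleshed out by hand (see below)
```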
However, the AI systematically produces prompts that are too short, which will never be enough to write a full article. So the job is to use the leads it gives us and back them up with all the business knowledge we can muster.
Note that the more pleasantly the prompt is written, the less precisely the AI understands what is being said in certain sentences. To avoid this bias, avoid pronouns as much as possible ("he", "this", "that", etc) and repeat the subject each time ("the article", "the article", "the article" …). This will make the prompt harder to read for a human, but more effective for the AI.
Ensuring that the AI has sufficient latitude to produce varied content each time is a matter of trial and error. Despite our best efforts, all the articles produced by our test protocol have a family resemblance. It would be unrealistic to try to synthesise the full range of human creativity in the form of different competing prompts.
The usefulness of AI needs to be put into perspective
Within the framework of our test protocol and in the context of AI capability at the time of writing, it is illusory to think that an AI could determine on its own which comments made during an interview are relevant. Trying to get it to write a relevant article therefore necessarily involves a preliminary stage of stripping down the transcript of the interview.
In practice, stripping the transcript of an interview of all the elements that are unnecessary for the final article, without however eliminating elements of context that have no place in the final article but which guide the AI towards better results, requires the transcript to be rewritten. This rewriting costs human time, to the benefit of the AI's work, but not to the benefit of the journalist's work.
This is a very important point – from that point onwards, AI stops saving the user time. As it stands, using AI means shifting working time from an existing task (writing the first draft of an article) to a new task (preparing data before delivering it to an AI).
Secondly, the description in 1,500 tokens of the outline to follow when writing an article only works for a particular type of article. In other words, you need to write one outline for articles about a startup proposing an innovation, a completely different outline for another type of article, yet another for a player setting out a new strategic direction, and so on. The more use cases there are, the longer the upstream engineering work will take.
Worse still, to date our experiments have only involved writing articles based on a single interview, usually at press conferences, so in a context where the interviewee has already prepared his or her comments before delivering them. In other words, after more than six months of experimentation, we are still only at the simplest stage. We have not yet been able to invest time in more complex scenarios, which are nevertheless the daily lot of LeMagIT's production, starting with articles written on the basis of several interviews.
The paradox is as follows – for AI to relieve a user of some of their workload, that user has to work more. On the other hand, on these issues, AI on a personal machine is on a par with paid AI online.