Until now, IT leaders have needed to consider the cyber security risks of allowing users to access large language models (LLMs) like ChatGPT directly via the public cloud. The alternative has been to use open source LLMs that can be hosted on-premise or accessed via a private cloud.

The artificial intelligence (AI) model needs to run in-memory and, when using graphics processing units (GPUs) for AI acceleration, this means IT leaders need to consider the costs associated with purchasing banks of GPUs to build up enough memory to hold the entire model.

Nvidia's high-end AI acceleration GPU, the H100, is configured with 80Gbytes of random access memory (RAM), and its specification shows it is rated at 350W in terms of energy use.

China's DeepSeek has been able to demonstrate that its R1 LLM can rival US artificial intelligence without the need to resort to the latest GPU hardware. It does, however, benefit from GPU-based AI acceleration.

Nevertheless, deploying a private version of DeepSeek still requires significant hardware investment. To run the entire DeepSeek-R1 model, which has 671 billion parameters, in-memory requires 768Gbytes of memory. With Nvidia H100 GPUs, which are configured with 80Gbytes of video memory each, 10 would be required to ensure the entire DeepSeek-R1 model can run in-memory.
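As a rough illustration of that sizing, the arithmetic can be sketched in a few lines of Python, using the 768Gbyte model footprint and 80Gbytes-per-H100 figures quoted above:

```python
import math

model_memory_gb = 768   # memory needed to hold DeepSeek-R1 entirely in-memory, per the figures above
h100_memory_gb = 80     # memory per Nvidia H100 GPU

gpus_needed = math.ceil(model_memory_gb / h100_memory_gb)
print(f"H100 GPUs needed: {gpus_needed}")   # -> 10
```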

IT leaders may well be able to negotiate volume discounts, but the cost of just the AI acceleration hardware to run DeepSeek is around $250,000.

Less powerful GPUs can be used, which may help to reduce this figure. But given current GPU prices, a server capable of running the complete 671 billion-parameter DeepSeek-R1 model in-memory is going to cost over $100,000.

The server could be run on public cloud infrastructure. Azure, for instance, offers access to the Nvidia H100 with 900Gbytes of memory for $27.167 per hour, which, on paper, should easily be able to run the 671 billion-parameter DeepSeek-R1 model entirely in-memory.

If this model is used every working day, and assuming a 35-hour week and four weeks a year of holidays and downtime, the annual Azure bill would be almost $46,000. Again, this figure could be reduced significantly to $16.63 per hour ($23,000 per year) if there is a three-year commitment.
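A minimal sketch of how the headline annual figure can be derived, assuming 48 working weeks of 35 hours at the on-demand rate quoted above:

```python
hours_per_week = 35
working_weeks = 52 - 4          # four weeks a year of holidays and downtime
hourly_rate_usd = 27.167        # Azure on-demand rate quoted above

annual_hours = hours_per_week * working_weeks       # 1,680 hours
annual_bill = annual_hours * hourly_rate_usd
print(f"{annual_hours} hours/year -> ${annual_bill:,.0f}")   # -> ~$45,641, i.e. almost $46,000
```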

Less powerful GPUs will clearly cost less, but it's the memory costs that make these prohibitive. For instance, looking at current Google Cloud pricing, the Nvidia T4 GPU is priced at $0.35 per GPU per hour and is available in configurations of up to four GPUs, giving a total of 64Gbytes of memory for $1.40 per hour. Twelve such configurations would be needed to fit the 671 billion-parameter DeepSeek-R1 model entirely in-memory, which works out at $16.80 per hour. With a three-year commitment, this figure comes down to $7.68 per hour, which works out at just under $13,000 per year.
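The same back-of-the-envelope arithmetic, using the T4 figures quoted above and the same 1,680 usable hours a year:

```python
import math

t4_gpu_hourly_usd = 0.35
gpus_per_config = 4
t4_memory_per_gpu_gb = 16        # 4 x 16Gbytes = 64Gbytes per four-GPU configuration
model_memory_gb = 768

config_hourly_usd = t4_gpu_hourly_usd * gpus_per_config                       # $1.40
configs_needed = math.ceil(model_memory_gb / (gpus_per_config * t4_memory_per_gpu_gb))  # 12
on_demand_hourly = configs_needed * config_hourly_usd                         # $16.80

annual_hours = 35 * (52 - 4)     # 1,680 hours
committed_hourly = 7.68          # three-year commitment rate quoted above
print(f"{configs_needed} x 4-GPU T4 configurations: ${on_demand_hourly:.2f}/hour on demand")
print(f"Three-year commitment: ${committed_hourly * annual_hours:,.0f}/year")  # -> ~$12,902
```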

A cheaper approach

IT leaders can reduce costs further by avoiding expensive GPUs altogether and relying entirely on general-purpose central processing units (CPUs). This setup is really only suitable when DeepSeek-R1 is used purely for AI inference.

A recent tweet from Matthew Carrigan, a machine learning engineer at Hugging Face, suggests such a system could be built using two AMD Epyc server processors and 768Gbytes of fast memory. The system he presented in a series of tweets could be put together for about $6,000.

Responding to comments on the setup, Carrigan said he is able to achieve a processing rate of six to eight tokens per second, depending on the specific processor and memory speed installed. It also depends on the length of the natural language query, but his tweet includes a video showing near-real-time querying of DeepSeek-R1 on the hardware he built, based on the dual AMD Epyc setup and 768Gbytes of memory.
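To see why a CPU-only box can get into that range at all, a rough throughput ceiling can be estimated from memory bandwidth. The sketch below assumes a dual-socket Epyc system with 24 channels of DDR5-4800, 8-bit weights, and that DeepSeek-R1 activates roughly 37 billion of its 671 billion parameters per generated token; these are illustrative assumptions rather than figures taken from Carrigan's posts.

```python
# Back-of-the-envelope CPU inference throughput estimate (illustrative assumptions only).
channels = 24                   # assumed: 12 DDR5 memory channels per socket, two sockets
channel_bandwidth_gbs = 38.4    # assumed: DDR5-4800, 38.4 GB/s per channel
peak_bandwidth_gbs = channels * channel_bandwidth_gbs        # ~922 GB/s

active_params_billion = 37      # DeepSeek-R1 activates ~37bn of its 671bn parameters per token
bytes_per_param = 1             # assumed 8-bit quantised weights
bytes_read_per_token_gb = active_params_billion * bytes_per_param   # ~37 GB read per token

theoretical_tokens_per_sec = peak_bandwidth_gbs / bytes_read_per_token_gb
print(f"Theoretical ceiling: ~{theoretical_tokens_per_sec:.0f} tokens/sec")   # ~25
```

The gap between that theoretical ceiling and the observed six to eight tokens per second reflects the fact that real systems rarely sustain peak memory bandwidth, and that compute and caching overheads also play a part.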

Carrigan acknowledges that GPUs will win on speed, but they are expensive. In his series of tweets, he points out that the amount of memory installed has a direct impact on performance. This is due to the way DeepSeek “remembers” previous queries to get to answers quicker. The technique is called key-value (KV) caching.

“In testing with longer contexts, the KV cache is larger than I realised,” he said, suggesting that the hardware configuration would require 1Tbyte of memory rather than 768Gbytes if a large amount of text or context is pasted into the DeepSeek-R1 query prompt.
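For a sense of why the KV cache grows with context length, the generic estimate below shows how the cache scales linearly with the number of tokens in the prompt; the layer, head and precision values are placeholder assumptions for illustration, not DeepSeek-R1's actual architecture.

```python
def kv_cache_bytes(context_tokens, layers=60, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Generic KV cache estimate: each layer stores a key and a value vector per token,
    so the cache grows linearly with context length. All parameters here are
    illustrative placeholders, not DeepSeek-R1's real configuration."""
    return context_tokens * layers * kv_heads * head_dim * 2 * bytes_per_value

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 1e9:.1f} GB")
```

Pasting long documents into the prompt multiplies the token count, which is why Carrigan suggests headroom beyond the 768Gbytes needed for the model weights alone.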

Buying a prebuilt Dell, HPE or Lenovo server to do something similar is likely to be considerably more expensive, depending on the processor and memory configurations specified.

A different way to address memory costs

Among the approaches that can be taken to reduce memory costs is using multiple tiers of memory controlled by a custom chip. This is what California startup SambaNova has done using its SN40L reconfigurable dataflow unit (RDU) and a proprietary dataflow architecture with three tiers of memory.

“DeepSeek-R1 is one of the most advanced frontier AI models available, but its full potential has been limited by the inefficiency of GPUs,” said Rodrigo Liang, CEO of SambaNova.

The company, which was founded in 2017 by a group of ex-Sun/Oracle engineers and has an ongoing collaboration with Stanford University's electrical engineering department, claims to have collapsed the hardware needed to run DeepSeek-R1 efficiently from 40 racks down to one rack configured with 16 RDUs.

Earlier this month at the Leap 2025 conference in Riyadh, SambaNova signed a deal to introduce Saudi Arabia's first sovereign LLM-as-a-service cloud platform. Saud Alseraihi, vice-president of digital solutions at Saudi Telecom Company, said: “This collaboration with SambaNova marks a significant milestone in our journey to deliver sovereign AI capabilities. By offering a secure and scalable inferencing-as-a-service platform, we are enabling organisations to unlock the full potential of their data while maintaining complete control.”

This deal with the Saudi Arabian telco provider illustrates how governments need to consider all the options when building out sovereign AI capacity. DeepSeek has demonstrated that there are alternative approaches that can be just as effective as the tried-and-tested method of deploying immense and costly arrays of GPUs.

And while DeepSeek-R1 does indeed run better when GPU-accelerated AI hardware is present, what SambaNova is claiming is that there is also an alternative way to achieve the same performance running the model on-premise, in-memory, without the cost of having to acquire GPUs fitted with the memory the model needs.
