At Presidio, no technology solution is ever deemed "final." Instead, we focus on a continuous cycle of refinement that extends cost-effective performance at every turn. That's especially true in the rapidly shifting field of artificial intelligence (AI), where advancements are pushing the limits of today's most robust hardware and software.
As part of that commitment to constant improvement, Presidio has teamed with its longtime technology ally Intel to test its emerging solutions in a variety of generative AI-focused use cases. Recently, we tested the performance of 4th Generation Intel® Xeon® Scalable processors compared with GPUs for machine-learning-based facial recognition, and the results were promising.
Next, we wanted to evaluate the feasibility of running certain large language models (LLMs), in the range of 7 billion parameters, against private data in a retrieval-augmented generation (RAG) architecture on 4th Gen Intel Xeon Scalable CPUs, and to see how they compare with GPUs on price/performance.
In addition, we wanted to determine the impact of using Intel® Advanced Matrix Extensions (Intel® AMX), which is a built-in AI accelerator within Intel Xeon Scalable processors. Test results showed that Intel AMX delivered a clear performance boost.
Finding an efficient solution for inferencing on LLMs using private data
Enterprises often consider running LLMs against their private data in a RAG architecture so they can avoid uploading that data to hosted, managed services on cloud platforms and thereby preserve confidentiality. Organizations increasingly recognize the potential of applying GenAI to their private data through RAG for their own benefit, such as running internal applications that boost productivity.
Doing so depends on being able to inference on LLMs against private data in a cost-effective and efficient way.
Traditionally, LLM training and inference have run on GPUs, owing to their robust parallel-processing capabilities. Inferencing LLMs on GPUs is expensive, however, so the goal of our test was to determine whether the Intel Xeon Scalable processor with Intel AMX could prove to be a practical, performant and cost-effective alternative.
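For readers unfamiliar with the pattern, the sketch below illustrates the basic RAG flow described above: embed private documents, retrieve the passage most relevant to a question, and generate an answer locally with that passage as context. It is a minimal outline under assumptions; the embedding model (all-MiniLM-L6-v2), the prompt format and the sample documents are illustrative choices rather than Presidio's production stack.

```python
# Minimal RAG sketch: retrieve a relevant private passage, then generate locally.
# Libraries and models here are illustrative assumptions, not Presidio's exact stack.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

private_docs = [
    "Our 2023 travel policy caps domestic airfare at $600 per round trip.",
    "Expense reports must be submitted within 30 days of travel.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")          # assumed embedding model
doc_vectors = embedder.encode(private_docs, convert_to_tensor=True)

model_id = "Intel/neural-chat-7b-v3-3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def answer(question: str) -> str:
    # Retrieve the most relevant private passage for this question.
    q_vector = embedder.encode(question, convert_to_tensor=True)
    best = int(util.cos_sim(q_vector, doc_vectors).argmax())
    context = private_docs[best]

    # Generate an answer grounded in the retrieved context, entirely on local hardware.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(answer("What is the airfare cap for domestic trips?"))
```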
Why test Intel CPUs with Intel AMX and INT8 quantization for inferencing using LLMs
Presidio has long seen the potential of the 4th Gen Intel Xeon Scalable processor. Launched in early 2023, it is designed with 14 built-in acceleration features to offload specific demanding workloads, freeing up CPU compute cycles for other tasks.
One of these accelerators, Intel AMX, is designed to improve processor performance on deep-learning training and inference. By handling the matrix multiplication that AI models require in dedicated hardware, it frees the rest of the core for other work. That makes it well suited to natural-language processing (NLP), recommendation systems, image recognition and similar workloads.
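Before leaning on Intel AMX, it is worth confirming that the host CPU actually exposes it. On Linux, the kernel reports the AMX instruction-set flags (amx_tile, amx_bf16, amx_int8) in /proc/cpuinfo; the quick check below is a generic sketch, not an Intel-supplied tool.

```python
# Quick Linux check: does this CPU report the Intel AMX instruction-set flags?
AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}

cpu_flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            break

missing = AMX_FLAGS - cpu_flags
print("Intel AMX available" if not missing else f"Missing AMX flags: {sorted(missing)}")
```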
Intel AMX accelerates matrix-multiplication operations for the INT8 and BF16 data types. Presidio therefore also explored the benefit of using Intel AMX while quantizing the model weights to INT8. For those unfamiliar, INT8 is an 8-bit integer data type that can be used for inference when single or double precision isn't required; it is particularly effective for language and computer-vision models where speed and efficiency are prioritized over numeric precision.
Weight-only quantization of the model is achieved with Intel® Extension for PyTorch in a few lines of code.
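A rough sketch of that approach is shown below, using Intel Extension for PyTorch's LLM optimization path. The specific function and enum names (ipex.llm.optimize, get_weight_only_quant_qconfig_mapping, WoqWeightDtype) follow the IPEX documentation of this period and can vary between releases, so treat the snippet as an assumption-laden outline rather than the exact code Presidio ran.

```python
# Sketch: weight-only INT8 quantization with Intel Extension for PyTorch (IPEX).
# API names follow IPEX's LLM docs circa 2024 and may differ in other versions.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/neural-chat-7b-v3-3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Quantize only the weights to INT8; activations stay in higher precision.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT8
)
model = ipex.llm.optimize(model, quantization_config=qconfig, inplace=True)

inputs = tokenizer("Summarize our travel policy.", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```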
Our process in testing CPUs versus GPU for LLMs
The model used is Intel/neural-chat-7b-v3-3 from Hugging Face. Using an AI chatbot built for the purpose, Presidio posed two types of questions: a generic question answered directly by the model, and a RAG question grounded in private data.
Presidio then measured two industry-standard metrics: throughput in tokens per second, and time to first token.
Presidio ran the test using three separate configurations: a 4th Gen Intel Xeon Scalable CPU running Neural Chat without Intel AMX, an AWS p3.2xlarge GPU instance running Neural Chat, and a 4th Gen Intel Xeon Scalable CPU running Neural Chat quantized to INT8 for Intel AMX.
The test was conducted using complex prompts with larger input sizes of 600 to 1,000 tokens.
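As context for how those two metrics can be captured, the sketch below times time to first token and tokens per second for a single prompt using Hugging Face's streaming generation. It is a simplified stand-in for the harness Presidio used; the prompt placeholder and generation settings are assumptions.

```python
# Sketch: measure time to first token and tokens/sec with a streaming generate call.
import threading
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "Intel/neural-chat-7b-v3-3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "..."  # placeholder for a complex RAG-style prompt of 600-1,000 input tokens
inputs = tokenizer(prompt, return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

start = time.perf_counter()
worker = threading.Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256),
)
worker.start()

first_token_time, chunks = None, []
for chunk in streamer:                      # yields decoded text as tokens arrive
    if first_token_time is None:
        first_token_time = time.perf_counter() - start
    chunks.append(chunk)
elapsed = time.perf_counter() - start
worker.join()

generated_tokens = len(tokenizer("".join(chunks), add_special_tokens=False).input_ids)
print(f"Time to first token: {first_token_time:.3f} s")
print(f"Tokens per second:   {generated_tokens / elapsed:.1f}")
```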
Test results and key findings
| | Intel Xeon CPU with Neural Chat - No AMX | GPU instance p3.2xlarge with Neural Chat | Intel Xeon CPU with Neural Chat - INT8 quantized for AMX |
| --- | --- | --- | --- |
| Generic question | 5-6 tokens/sec; time to first token ~750 ms | 100 tokens/sec; time to first token <50 ms | 25-27 tokens/sec; time to first token <200 ms |
| RAG question (final input 600-1,000 tokens) | 3-4 tokens/sec; time to first token ~6 s | 120 tokens/sec; time to first token <500 ms | 35-40 tokens/sec; time to first token ~2-4 s |
Overall, the choice comes down to your requirements for delivering an AI-powered experience to end users. Depending on the use case and the requirements for speed, accuracy and cost, a price/performance analysis is worth doing before selecting the hardware to run LLM inferencing. The same holds true when weighing continuous inferencing as a dedicated AI workload against sparse inferencing on infrastructure that also serves other general-purpose workloads.
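To make that price/performance trade-off concrete, a back-of-the-envelope comparison can look like the sketch below. The hourly rates are hypothetical placeholders (substitute current cloud or hardware pricing); the throughput figures are taken from the RAG row of the results table above.

```python
# Back-of-the-envelope cost per million generated tokens.
# Hourly rates are hypothetical placeholders, NOT quoted prices.
configs = {
    "Xeon + AMX, INT8 (RAG)": {"usd_per_hour": 2.00, "tokens_per_sec": 37},
    "GPU p3.2xlarge (RAG)":   {"usd_per_hour": 3.00, "tokens_per_sec": 120},
}

for name, cfg in configs.items():
    tokens_per_hour = cfg["tokens_per_sec"] * 3600
    usd_per_million_tokens = cfg["usd_per_hour"] / tokens_per_hour * 1_000_000
    print(f"{name}: ~${usd_per_million_tokens:.2f} per 1M output tokens")
```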
Presidio envisions the potential of Intel CPUs with Intel AMX for LLMs
Given the test results, Presidio envisions many use case scenarios where the Intel CPU with Intel AMX may be a better option for running LLMs. For instance:
These examples are just the beginning. For companies looking to future-proof their IT capabilities for the age of AI, 4th Gen Intel Xeon Scalable processors are fully capable of inferencing on LLMs privately, without a costly infrastructure investment in GPUs. What's more, these latest processors come with Intel AMX acceleration built in.
And this is just the beginning: already the Intel® Gaudi® 2 AI accelerator is driving improved deep learning price-performance and operational efficiency for training and running state-of-the-art models, from the largest LLMs to basic computer vision and NLP models. In keeping with our continuous cycle of refinement and innovation, Presidio plans to evaluate Intel Gaudi 2 to determine how it can accelerate our customers' Gen AI transformation journey.
Consider the advantages you can achieve by inferencing on LLMs privately on CPUs. Start the conversation by bringing us your use case; we'll help you envision a viable plan for LLM performance and a richer engagement for everyone.