04/30/2024 | Press release | Distributed by Public on 04/30/2024 14:14
With the huge worldwide demand for generative AI, planning for the required compute capacity is crucial. While NVIDIA A100 and H100 Tensor Core GPUs offer great performance for large-scale LLM deployments, they can be complemented with mainstream GPUs, such as the T4, P100, and A10, for smaller-scale deployments.
With the well-engineered Oracle Generative AI services, Oracle Cloud Infrastructure (OCI) also allows customers to bring their own models (open source or custom) for inferencing on highly efficient OCI servers. When running bring-your-own models purely on OCI, you might need to benchmark and optimize by running the LLMs on mainstream NVIDIA-accelerated OCI servers. This blog details how mainstream GPU-accelerated OCI servers (both bare metal and virtual machine) can be used to run a wide range of inferencing scenarios with open source LLMs.
The following sets of parameters influence the inferencing test scenarios and results:
The following server configurations are used for benchmarking:
The following LLM models (quantized and unquantized versions) are used for this benchmarking exercise:
The following table shows the results of tests run on fin-llama models using llama.cpp on a single OCI bare metal server.
Table 1: Single-server, single-user inferencing tests with fin-llama
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q2_K.gguf |  | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_L.gguf |  | BM with 4 A10s | A10 | 4 | 28.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 29 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_S.gguf |  | BM with 4 A10s | A10 | 4 | 28.4 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 30.9 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 28.5 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_S.gguf |  | BM with 4 A10s | A10 | 4 | 28.6 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_0.gguf |  | BM with 4 A10s | A10 | 4 | 27.7 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 27.6 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_S.gguf |  | BM with 4 A10s | A10 | 4 | 28 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q6_K.gguf |  | BM with 4 A10s | A10 | 4 | 25.1 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q8_0.gguf |  | BM with 4 A10s | A10 | 4 | 23.5 |
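Tokens-per-second figures such as those in Table 1 are typically read off the timing summary that llama.cpp prints at the end of a run. As a rough illustration only (the log line shown is an assumption and its format changes between llama.cpp versions), a small parser can derive throughput from that summary:

```python
import re

# Matches the eval-time summary llama.cpp prints after generation, e.g.:
#   "llama_print_timings: eval time = 10000.00 ms / 309 runs"
# NOTE: this format is an assumption; verify it against your llama.cpp build.
EVAL_RE = re.compile(r"eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*runs")

def tokens_per_second(timing_line: str) -> float:
    """Derive generation throughput from a llama.cpp timing line."""
    match = EVAL_RE.search(timing_line)
    if match is None:
        raise ValueError("no eval timing found in line")
    elapsed_ms, n_tokens = float(match.group(1)), int(match.group(2))
    return n_tokens / (elapsed_ms / 1000.0)

line = "llama_print_timings: eval time = 10000.00 ms / 309 runs"
print(round(tokens_per_second(line), 1))  # 30.9
```

Averaging this figure over several runs with a fixed prompt and generation length gives numbers comparable to the tables in this post.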
The following table shows the results of tests run on Llama 2 models using llama.cpp on a single Oracle Roving Edge Device (RED).
Table 2: Single-server, single-user inferencing of Llama 2 on RED
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | Llama-cpp, ggml-model-q4_0.gguf | llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co) | RED | T4 | 1 | 51.9 |
| Llama-2-13b | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | RED | T4 | 1 | 28.6 |
| Llama-2-70b | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | RED | T4 | 1 | 1.6 |
The following table shows the results of tests run on quantized Llama 2 70B models using llama.cpp on a single OCI bare metal server.
Table 3: Single-server, single-user inferencing of quantized Llama 2 70B models on OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ |  | BM with 4 A10s | A10 | 4 | 11.2 |
| Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ | TheBloke/Llama-2-70B-Chat-GPTQ at gptq-3bit--1g-actorder_True (huggingface.co) | BM with 4 A10s | A10 | 4 | 10.5 |
| Llama-2-70B-Chat-AWQ | llama.cpp, AWQ |  | BM with 4 A10s | A10 | 4 | 13.6 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q3_K_L.gguf |  | BM with 4 A10s | A10 | 4 | 17.5 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 19.2 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 17.9 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q5_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 16.8 |
The following table shows the results of tests run on Llama 2 models using llama.cpp with concurrent users on a single OCI bare metal server.
Table 4: Single-server, multiuser inferencing of Llama 2 on OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Concurrent users | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 5 | 10.3 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 10 | 8.7 |
| fin-llama-33b.Q4_0.gguf | llama.cpp, GGUF, llama-2-33b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 5 | 22 |
| fin-llama-33b.Q4_0.gguf | llama.cpp, GGUF, llama-2-33b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 10 | 10.2 |
The following table shows the results of tests run on quantized Llama 2 models using llama.cpp across four OCI RED servers with the Message Passing Interface (MPI).
Table 5: Distributed inferencing of quantized Llama2 on multiple OCI RED servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b MPI Run | Llama-cpp, ggml-model-q4_0.gguf | llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co) | 4 REDs | T4 | 4 | 52.2 |
| Llama-2-13b MPI Run | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | 4 REDs | T4 | 4 | 28.7 |
| Llama-2-70b MPI Run | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | 4 REDs | T4 | 4 | 1.6 |
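Comparing Table 5 with Table 2 shows that distributing Llama-2-7b across four REDs barely changes aggregate throughput (52.2 versus 51.9 tokens/second); the value of the MPI setup here is fitting models that exceed a single device's memory, not raw speed. A throwaway helper (illustrative only, using the numbers from those tables) makes the comparison explicit:

```python
def scaling_efficiency(multi_node_tps: float, single_node_tps: float, n_nodes: int) -> float:
    """Measured aggregate throughput as a fraction of ideal linear scaling."""
    return multi_node_tps / (single_node_tps * n_nodes)

# Llama-2-7b: 51.9 tokens/s on one RED (Table 2) vs. 52.2 tokens/s on four REDs (Table 5)
print(round(scaling_efficiency(52.2, 51.9, 4), 2))  # 0.25
```

An efficiency near 1/n_nodes indicates the run is effectively serialized across devices, which is expected when layers are split across nodes and tokens are generated sequentially.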
For running an unquantized Llama transformer model on A10s, the following memory calculation is used:
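The standard back-of-the-envelope estimate is parameters multiplied by bytes per parameter, plus runtime overhead for the KV cache and activations. The sketch below is an illustration of that arithmetic, not the authors' exact worksheet; the 20% overhead figure is a rough assumption:

```python
import math

def weights_gb(n_params_billion: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone; FP16/BF16 uses 2 bytes per parameter."""
    return n_params_billion * bytes_per_param

def gpus_needed(n_params_billion: int, gpu_mem_gb: int = 24) -> int:
    """Minimum GPUs once a ~20% overhead (assumption) is added for the
    KV cache and activations. An A10 has 24 GB of memory."""
    total_gb = weights_gb(n_params_billion) + weights_gb(n_params_billion) // 5
    return math.ceil(total_gb / gpu_mem_gb)

print(weights_gb(70))   # 140 GB of FP16 weights for a 70B model
print(gpus_needed(70))  # 7 x 24 GB minimum; provisioned as 8 A10s in practice
```

With eight A10s across two bare metal servers (8 x 24 GB = 192 GB), the 70B model fits with headroom, which matches the configuration used in Table 6.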
Based on this calculation, the unquantized Llama 70B model can run on two OCI bare metal servers with eight A10 GPUs using any distributed inferencing framework, such as torchrun, Ray, or MPI.
Table 6: Distributed inferencing of unquantized Llama2 70B model on 2 OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70b | Llama2, 70B model, torchrun |  | 2 BM Servers | A10s | 8 | 8.8 |
Figure 1: Inference run of unquantized Llama 70B model on two bare metal servers with eight A10s
The following table shows the test results of the unquantized Llama 70B model using four VM servers with two A10s each:
Table 7: Distributed inferencing of unquantized Llama 70B model on 4 OCI VM Servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70b | Llama2, 70B model, torchrun |  | 4 VM Servers | A10s | 8 | 4 |
The following table shows the test results of Llama 2 models using vLLM (with PagedAttention) on two bare metal servers with four A10s each:
Table 8: Distributed inferencing of Llama using vLLM on 2 OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 30.1 |
| Llama-2-13b | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 27.3 |
| Llama-2-70b | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 12.9 |
The following are the results of Llama 3 runs on two bare metal servers with eight A10 GPUs using a distributed inferencing framework such as torchrun, Ray, or MPI.
Table 9: Distributed inferencing of Llama3 using Transformer model on multiple OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Meta-Llama-3-70B | llama 3, 70B model, torchrun |  | 2 BM Servers | A10s | 8 | 12.44 |
| Meta-Llama-3-70B-Instruct | llama 3, 70B model, torchrun |  | 2 BM Servers | A10s | 8 | 12.24 |
| Meta-Llama-3-8B | llama 3, 8B model, torchrun |  | 1 BM Server | A10 | 1 | 27.10 |
| Meta-Llama-3-8B-Instruct | llama 3, 8B model, torchrun |  | 1 BM Server | A10 | 1 | 27.04 |
The following table shows the test results of Llama 3 models using vLLM (with PagedAttention) on two bare metal servers with four A10s each:
Table 10: Distributed inferencing of Llama3 using vLLM on 2 OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Meta-Llama-3-8B | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 24.61 |
| Meta-Llama-3-70B | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 11.23 |
The following chart summarizes the inferencing performance of unquantized Llama 2 and Llama 3 models with the Transformer and vLLM stacks on A10-accelerated OCI VM and BM servers.
Figure 2: Inferencing of unquantized Llama 70B model on OCI BM and VM servers
The above benchmarking exercises show that mainstream GPU-accelerated OCI servers (such as those with A10s) can be used for inferencing with open source large language models (LLMs) of different sizes. When the best performance is needed for larger-scale deployments, OCI offers advanced NVIDIA GPUs deployed with NVIDIA TensorRT-LLM, which delivers great results, as shown in the recent MLPerf Inference v4.0 benchmarks. Depending on the requirements and the scale of the solution, one can start with smaller LLMs, such as 7B and 13B models, on mainstream GPU-accelerated servers and then migrate to larger clusters with advanced GPUs (A100s, H100s, and so on) as demand and model size increase. This scaling path helps customers adopt generative AI solutions more quickly.
The author wants to thank Mohan Srinivasan, Sreedhara Narayanaswamy, Ram Sivaram, and Hiten Goradia for their guidance, leadership, and support in this endeavour. The author also wants to thank James George for his expertise in setting up the MPI cluster on Oracle Roving Edge Devices (RED).
The benchmarking results published in this post are for general guidance only. Individual test results can vary based on model size, testing parameters, performance techniques, and the hardware and software stack used.
For further information, please visit the following links:
Oracle Generative AI Solutions: https://www.oracle.com/artificial-intelligence/generative-ai/
Oracle GPU accelerated Bare Metal Servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu
Oracle GPU accelerated VM Servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu
Oracle Roving Edge Servers: https://www.oracle.com/a/ocom/docs/data-sheet-roving-edge-device.pdf
NVIDIA A10 GPUs: https://www.nvidia.com/en-au/data-center/products/a10-gpu/
LLaMA CPP Source Code: https://github.com/ggerganov/llama.cpp
Meta LLaMA2 : meta-llama/llama: Inference code for Llama models (github.com)
Meta LLaMA3: meta-llama/llama3: The official Meta Llama 3 GitHub site