04/30/2024 | Press release | Distributed by Public on 04/30/2024 14:14
With the huge worldwide demand for generative AI, planning for the required compute capacity is crucial. While NVIDIA A100 and H100 Tensor Core GPUs offer great performance for large-scale LLM deployments, they can be complemented with mainstream GPUs, such as the T4, P100, and A10, for smaller-scale deployments.
With the well-engineered Oracle Generative AI services, Oracle Cloud Infrastructure (OCI) also allows customers to bring their own models (open source or custom) for inferencing on highly efficient OCI servers. When running bring-your-own models purely on OCI, you might need to benchmark and optimize by running the LLMs on mainstream NVIDIA-accelerated OCI servers. This blog details how mainstream GPU-accelerated OCI servers (both bare metal and virtual machine) can be used to run a wide range of inferencing scenarios with open source LLMs.
The following sets of parameters influence the inferencing test scenarios and results:
The following server configurations are used for benchmarking:
The following LLM models (quantized and unquantized versions) are used for this benchmarking exercise:
The following table shows the results of tests run on fin-llama models using llama.cpp on a single OCI bare metal server.
Table 1: Single-server, single-user inferencing tests with fin-llama
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q2_K.gguf |  | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_L.gguf |  | BM with 4 A10s | A10 | 4 | 28.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 29 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_S.gguf |  | BM with 4 A10s | A10 | 4 | 28.4 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 30.9 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 28.5 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_S.gguf |  | BM with 4 A10s | A10 | 4 | 28.6 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_0.gguf |  | BM with 4 A10s | A10 | 4 | 27.7 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 27.6 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_S.gguf |  | BM with 4 A10s | A10 | 4 | 28 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q6_K.gguf |  | BM with 4 A10s | A10 | 4 | 25.1 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q8_0.gguf |  | BM with 4 A10s | A10 | 4 | 23.5 |
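Tokens-per-second figures such as those in Table 1 are typically read off the timing summary that llama.cpp prints at the end of a run. As a rough illustration only (the log line shown is an assumption and its format changes between llama.cpp versions), a small parser can derive throughput from that summary:

```python
import re

# Matches the eval-time summary llama.cpp prints after generation, e.g.:
#   "llama_print_timings: eval time = 10000.00 ms / 309 runs"
# NOTE: this format is an assumption; verify it against your llama.cpp build.
EVAL_RE = re.compile(r"eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*runs")

def tokens_per_second(timing_line: str) -> float:
    """Derive generation throughput from a llama.cpp timing line."""
    match = EVAL_RE.search(timing_line)
    if match is None:
        raise ValueError("no eval timing found in line")
    elapsed_ms, n_tokens = float(match.group(1)), int(match.group(2))
    return n_tokens / (elapsed_ms / 1000.0)

line = "llama_print_timings: eval time = 10000.00 ms / 309 runs"
print(round(tokens_per_second(line), 1))  # 30.9
```

Averaging this figure over several runs with a fixed prompt and generation length gives numbers comparable to the tables in this post.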
The following table shows the results of tests run on Llama 2 models using llama.cpp on a single Oracle Roving Edge Device (RED).
Table 2: Single-server, single-user inferencing of Llama 2 on RED
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | Llama-cpp, ggml-model-q4_0.gguf | llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co) | RED | T4 | 1 | 51.9 |
| Llama-2-13b | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | RED | T4 | 1 | 28.6 |
| Llama-2-70b | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | RED | T4 | 1 | 1.6 |
The following table shows the results of tests run on quantized Llama 2 70B models using llama.cpp on a single OCI bare metal server.
Table 3: Single-server, single-user inferencing of quantized Llama 2 70B models on OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ |  | BM with 4 A10s | A10 | 4 | 11.2 |
| Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ | TheBloke/Llama-2-70B-Chat-GPTQ at gptq-3bit--1g-actorder_True (huggingface.co) | BM with 4 A10s | A10 | 4 | 10.5 |
| Llama-2-70B-Chat-AWQ | llama.cpp, AWQ |  | BM with 4 A10s | A10 | 4 | 13.6 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q3_K_L.gguf |  | BM with 4 A10s | A10 | 4 | 17.5 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 19.2 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 17.9 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q5_K_M.gguf |  | BM with 4 A10s | A10 | 4 | 16.8 |
The following table shows the results of tests run on Llama 2 models using llama.cpp with concurrent users on a single OCI bare metal server.
Table 4: Single-server, multiuser inferencing of Llama 2 on OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Concurrent users | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 5 | 10.3 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 10 | 8.7 |
| fin-llama-33b.Q4_0.gguf | llama.cpp, GGUF, llama-2-33b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 5 | 22 |
| fin-llama-33b.Q4_0.gguf | llama.cpp, GGUF, llama-2-33b-chat.Q4_0.gguf |  | BM with 4 A10s | A10 | 4 | 10 | 10.2 |
The following table shows the results of tests run on quantized Llama 2 models using llama.cpp across four OCI RED servers with the Message Passing Interface (MPI).
Table 5: Distributed inferencing of quantized Llama2 on multiple OCI RED servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b MPI Run | Llama-cpp, ggml-model-q4_0.gguf | llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co) | 4 REDs | T4 | 4 | 52.2 |
| Llama-2-13b MPI Run | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | 4 REDs | T4 | 4 | 28.7 |
| Llama-2-70b MPI Run | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | 4 REDs | T4 | 4 | 1.6 |
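Comparing Table 5 with Table 2 shows that distributing Llama-2-7b across four REDs barely changes aggregate throughput (52.2 versus 51.9 tokens/second); the value of the MPI setup here is fitting models that exceed a single device's memory, not raw speed. A throwaway helper (illustrative only, using the numbers from those tables) makes the comparison explicit:

```python
def scaling_efficiency(multi_node_tps: float, single_node_tps: float, n_nodes: int) -> float:
    """Measured aggregate throughput as a fraction of ideal linear scaling."""
    return multi_node_tps / (single_node_tps * n_nodes)

# Llama-2-7b: 51.9 tokens/s on one RED (Table 2) vs. 52.2 tokens/s on four REDs (Table 5)
print(round(scaling_efficiency(52.2, 51.9, 4), 2))  # 0.25
```

An efficiency near 1/n_nodes indicates the run is effectively serialized across devices, which is expected when layers are split across nodes and tokens are generated sequentially.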
For running an unquantized Llama transformer model on A10s, the following memory calculation is used:
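The standard back-of-the-envelope estimate is parameters multiplied by bytes per parameter, plus runtime overhead for the KV cache and activations. The sketch below is an illustration of that arithmetic, not the authors' exact worksheet; the 20% overhead figure is a rough assumption:

```python
import math

def weights_gb(n_params_billion: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone; FP16/BF16 uses 2 bytes per parameter."""
    return n_params_billion * bytes_per_param

def gpus_needed(n_params_billion: int, gpu_mem_gb: int = 24) -> int:
    """Minimum GPUs once a ~20% overhead (assumption) is added for the
    KV cache and activations. An A10 has 24 GB of memory."""
    total_gb = weights_gb(n_params_billion) + weights_gb(n_params_billion) // 5
    return math.ceil(total_gb / gpu_mem_gb)

print(weights_gb(70))   # 140 GB of FP16 weights for a 70B model
print(gpus_needed(70))  # 7 x 24 GB minimum; provisioned as 8 A10s in practice
```

With eight A10s across two bare metal servers (8 x 24 GB = 192 GB), the 70B model fits with headroom, which matches the configuration used in Table 6.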
Based on this calculation, the unquantized Llama 70B model can run on two OCI bare metal servers with eight A10 GPUs using any distributed inferencing framework, such as torchrun, Ray, or MPI.
Table 6: Distributed inferencing of unquantized Llama2 70B model on 2 OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70b | Llama2, 70B model, torchrun |  | 2 BM Servers | A10s | 8 | 8.8 |
Figure 1: Inference run of unquantized Llama 70B model on two bare metal servers with eight A10s
The following table shows the test results of the unquantized Llama 70B model using four VM servers with two A10s each:
Table 7: Distributed inferencing of unquantized Llama 70B model on 4 OCI VM Servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70b | Llama2, 70B model, torchrun |  | 4 VM Servers | A10s | 8 | 4 |
The following table shows the test results of Llama 2 models using vLLM (with PagedAttention) on two bare metal servers with four A10s each:
Table 8: Distributed inferencing of Llama using vLLM on 2 OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 30.1 |
| Llama-2-13b | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 27.3 |
| Llama-2-70b | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 12.9 |
The following are the results of Llama 3 runs on two bare metal servers with eight A10 GPUs using a distributed inferencing framework such as torchrun, Ray, or MPI.
Table 9: Distributed inferencing of Llama3 using Transformer model on multiple OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Meta-Llama-3-70B | llama 3, 70B model, torchrun |  | 2 BM Servers | A10s | 8 | 12.44 |
| Meta-Llama-3-70B-Instruct | llama 3, 70B model, torchrun |  | 2 BM Servers | A10s | 8 | 12.24 |
| Meta-Llama-3-8B | llama 3, 8B model, torchrun |  | 1 BM Server | A10 | 1 | 27.10 |
| Meta-Llama-3-8B-Instruct | llama 3, 8B model, torchrun |  | 1 BM Server | A10 | 1 | 27.04 |
The following table shows the test results of Llama 3 models using vLLM (with PagedAttention) on two bare metal servers with four A10s each:
Table 10: Distributed inferencing of Llama3 using vLLM on 2 OCI bare metal servers
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Meta-Llama-3-8B | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 24.61 |
| Meta-Llama-3-70B | vLLM/PagedAttention/Ray |  | 2 BM Servers | A10s | 8 | 11.23 |
The following chart summarizes the inferencing performance of unquantized Llama 2 and Llama 3 models with the Transformer and vLLM stacks on A10-accelerated OCI VM and BM servers.
Figure 2: Inferencing of unquantized Llama 70B model on OCI BM and VM servers
The above benchmarking exercises show that mainstream GPU-accelerated OCI servers (such as those with A10s) can be used for inferencing with open source large language models (LLMs) of different sizes. When the best performance is needed for larger-scale deployments, OCI offers advanced NVIDIA GPUs deployed with NVIDIA TensorRT-LLM, which delivers great results, as shown in the recent MLPerf Inference v4.0 benchmarks. Depending on the requirements and the scale of the solution, one can start with smaller LLMs, such as 7B and 13B models, on mainstream GPU-accelerated servers and then migrate to larger clusters with advanced GPUs (A100s, H100s, and so on) as demand and model size increase. This scaling path helps customers adopt generative AI solutions more quickly.
The author wants to thank Mohan Srinivasan, Sreedhara Narayanaswamy, Ram Sivaram, and Hiten Goradia for their guidance, leadership, and support in this endeavour. The author also wants to thank James George for his expertise in setting up the MPI cluster on Oracle Roving Edge Devices (RED).
The benchmarking results published in this post are for general guidance only. Individual test results can vary based on model size, testing parameters, performance techniques, and the hardware and software stack used.
For further information, please visit the following links:
Oracle Generative AI Solutions: https://www.oracle.com/artificial-intelligence/generative-ai/
Oracle GPU accelerated Bare Metal Servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu
Oracle GPU accelerated VM Servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu
Oracle Roving Edge Servers: https://www.oracle.com/a/ocom/docs/data-sheet-roving-edge-device.pdf
NVIDIA A10 GPUs: https://www.nvidia.com/en-au/data-center/products/a10-gpu/
LLaMA CPP Source Code: https://github.com/ggerganov/llama.cpp
Meta LLaMA2 : meta-llama/llama: Inference code for Llama models (github.com)
Meta LLaMA3: meta-llama/llama3: The official Meta Llama 3 GitHub site