Oracle Corporation


Practical inferencing of open source models on mainstream ...

With the huge demand for generative AI worldwide, planning for the required compute capacity is crucial. While NVIDIA A100 and H100 Tensor Core GPUs offer great performance for large-scale LLM deployments, they can be complemented with mainstream GPUs, such as the T4, P100, and A10, for smaller-scale deployments.

In addition to the well-engineered Oracle Generative AI services, Oracle Cloud Infrastructure (OCI) allows customers to bring their own models (open source or custom) for inferencing on highly efficient OCI servers. When running bring-your-own models purely on OCI, customers might need to benchmark and optimize by running the LLMs on mainstream NVIDIA-accelerated OCI servers. This blog details how mainstream GPU-accelerated OCI servers (both bare metal and virtual machine) can be used to run a wide range of inferencing scenarios with open source LLMs.

Benchmarking parameters

The following parameters influence the inferencing test scenarios and results:

  • Generative AI model specifications: Model type and size
  • GPU specifications: Model and number of GPUs
  • CPU specifications: CPU type and number of CPUs
  • Maximum context window
  • Performance optimizations
    • Quantized and unquantized models
    • Different inference implementations, such as a plain transformer, a transformer with KV cache optimization and paged attention, and a transformer with flash attention
  • Performance measured in terms of tokens per second (see the sketch after this list)
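All throughput figures in the tables below use this tokens-per-second metric. As a minimal sketch of how the metric is computed (the helper name and example numbers are illustrative, not the exact harness used for these tests):

```python
def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Throughput metric used in the tables below: generated tokens
    divided by wall-clock generation time."""
    return completion_tokens / elapsed_seconds

# Example: 512 tokens generated in 17.5 seconds -> ~29.3 tokens/second
print(tokens_per_second(512, 17.5))
```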

Testing environment

The following server configurations were used for benchmarking:

  • OCI server types and specifications
    • GPU accelerated bare metal
      • Intel Xeon Platinum 8358 CPU @ 2.60GHz (128 cores)
      • Four NVIDIA A10 Tensor Core GPUs, each with 24GB GDDR6 memory
      • 1TB RAM
    • GPU accelerated VM
      • Intel Xeon Platinum 8358 CPU @ 2.60GHz (60 cores)
      • Two NVIDIA A10 GPUs, each with 24GB GDDR6 memory
      • 480GB RAM
    • GPU accelerated Roving Edge Device (RED)
      • Intel(R) Xeon(R) Gold 6230T CPU @ 2.10GHz (32 cores)
      • One NVIDIA T4 GPU with 16GB GDDR6 memory
      • 512 GB RAM

The following LLM models (quantized and unquantized versions) were used for this benchmarking exercise:

  • Llama 2 models (7B, 13B, and 70B)
  • Llama 2 HF models (7B, 13B, and 70B)
  • Llama 3 models (8B and 70B)
  • Fin-llama-33B

<_w3a_sdt id="1625579471" sdttag="goog_rdk_49">Single-server, single-user inferencing tests

The following table shows the results of tests run on fin-llama models using llama.cpp on a single OCI bare metal server.

Table 1: Single-server, single-user inferencing tests with fin-llama

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q2_K.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 29.2
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_L.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28.2
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 29
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_S.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28.4
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 30.9
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 29.2
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28.5
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_S.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28.6
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 29.2
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 27.7
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 27.6
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_S.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q6_K.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 25.1
fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q8_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 23.5
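For context, the following is a minimal sketch of how a single-user throughput run like those above could be reproduced. It uses the llama-cpp-python bindings (which wrap the same llama.cpp engine) rather than the llama.cpp CLI used for these tests, and the model path, prompt, and generation length are illustrative assumptions.

```python
from time import perf_counter
from llama_cpp import Llama

# Illustrative: any of the GGUF files from Table 1, downloaded locally.
llm = Llama(
    model_path="./fin-llama-33b.Q4_0.gguf",
    n_gpu_layers=-1,   # offload all layers to the available GPUs
    n_ctx=2048,
)

prompt = "Explain the difference between stocks and bonds."
start = perf_counter()
result = llm(prompt, max_tokens=256)
elapsed = perf_counter() - start

completion_tokens = result["usage"]["completion_tokens"]
print(f"{completion_tokens / elapsed:.1f} tokens/second")
```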

The following table shows the results of tests run on Llama 2 models using llama.cpp on a single Oracle Roving Edge Device (RED) server.

Table 2: Single-server, single-user inferencing of Llama 2 on RED

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
Llama-2-7b | Llama-cpp, ggml-model-q4_0.gguf | llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co) | RED | T4 | 1 | 51.9
Llama-2-13b | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | RED | T4 | 1 | 28.6
Llama-2-70b | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | RED | T4 | 1 | 1.6

The following table shows the results of tests run on quantized Llama 2 70B models using llama.cpp on a single OCI bare metal server.

Table 3: Single-server, single-user inferencing of quantized Llama 2 70B models on OCI bare metal servers

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ | TheBloke/Llama-2-70B-Chat-GPTQ at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 11.2
Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ | TheBloke/Llama-2-70B-Chat-GPTQ at gptq-3bit--1g-actorder_True (huggingface.co) | BM with 4 A10s | A10 | 4 | 10.5
Llama-2-70B-Chat-AWQ | llama.cpp, AWQ | TheBloke/Llama-2-70B-Chat-AWQ · Hugging Face | BM with 4 A10s | A10 | 4 | 13.6
Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q3_K_L.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 17.5
Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 19.2
Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_K_M.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 17.9
Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q5_K_M.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 16.8

Single-server, multiuser concurrency inferencing tests

The following table shows the results of tests run on Llama 2 models using llama.cpp with concurrent users on a single OCI bare metal server.

Table 4: Single-server, multiuser inferencing of Llama models on OCI bare metal servers

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Concurrent users | Throughput across all GPUs (tokens/second)
Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 5 | 10.3
Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 10 | 8.7
fin-llama-33b.Q4_0.gguf | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 5 | 22
fin-llama-33b.Q4_0.gguf | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 10 | 10.2
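A minimal sketch of how such a multiuser test can be driven is shown below. It assumes the model is already being served behind an OpenAI-compatible HTTP endpoint (for example, llama.cpp's built-in server); the URL, payload fields, and prompt are illustrative assumptions rather than the exact harness used here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/completions"   # illustrative endpoint

def one_user(prompt: str) -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = requests.post(URL, json={"prompt": prompt, "max_tokens": 128})
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

concurrent_users = 10
prompts = ["Summarize the benefits of GPU inferencing."] * concurrent_users

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
    total_tokens = sum(pool.map(one_user, prompts))
elapsed = time.perf_counter() - start

print(f"Aggregate throughput: {total_tokens / elapsed:.1f} tokens/second")
```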

Distributed inferencing results on multiple servers

The following table shows the results of tests run on quantized Llama 2 models using llama.cpp across four OCI RED servers with the Message Passing Interface (MPI).

Table 5: Distributed inferencing of quantized Llama2 on multiple OCI RED servers

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
Llama-2-7b MPI run | Llama-cpp, ggml-model-q4_0.gguf | llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co) | 4 REDs | T4 | 4 | 52.2
Llama-2-13b MPI run | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | 4 REDs | T4 | 4 | 28.7
Llama-2-70b MPI run | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | 4 REDs | T4 | 4 | 1.6
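The sketch below illustrates one way to coordinate a multi-node run from Python with mpi4py: each RED node handles its own shard of prompts and rank 0 aggregates the throughput. It is only an illustration of the data-parallel coordination pattern, not the llama.cpp MPI integration used for Table 5, and run_local_inference is a hypothetical placeholder for the per-node model call.

```python
import time

from mpi4py import MPI

def run_local_inference(prompts):
    # Hypothetical placeholder: invoke the local llama.cpp model here and
    # return the total number of tokens it generated for these prompts.
    return 128 * len(prompts)

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

all_prompts = [f"Question {i}" for i in range(16)]
shard = all_prompts[rank::size]          # round-robin split across the REDs

start = time.perf_counter()
tokens = run_local_inference(shard)
elapsed = time.perf_counter() - start

results = comm.gather((tokens, elapsed), root=0)
if rank == 0:
    total_tokens = sum(t for t, _ in results)
    slowest = max(e for _, e in results)
    print(f"Aggregate throughput: {total_tokens / slowest:.1f} tokens/second")
```

Such a script would be launched with something like mpirun -np 4 --hostfile hosts python benchmark.py, where the hosts file lists the four RED nodes.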

Memory calculation for unquantized LLaMA 70B Models

To run an unquantized Llama transformer model on A10s, the following memory calculation is used:

  • Model type: Llama
  • Model size: 70B parameters
  • Total memory requirement: 70B parameters × 2 bytes (16-bit) = 140 GB
  • Memory of one A10 GPU: 24 GB
  • Total memory of eight A10 GPUs: 192 GB, of which roughly 160 GB is usable after the GPU memory overheads on each A10

Based on this calculation, the unquantized Llama 2 70B model can run on two OCI bare metal servers with eight A10 GPUs using any distributed inferencing framework, such as torchrun, Ray, or MPI.
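The same arithmetic can be expressed as a small helper. This is a rough sketch that accounts for model weights only (no KV cache or activation memory), and the per-GPU overhead value is an assumption rather than a measured figure.

```python
import math

def a10s_needed(params_billion: float,
                bytes_per_param: float = 2.0,   # FP16/BF16 weights
                gpu_memory_gb: float = 24.0,    # one A10
                overhead_gb: float = 4.0) -> int:
    """Rough number of A10 GPUs needed to hold the model weights alone."""
    weights_gb = params_billion * bytes_per_param        # 70 * 2 = 140 GB
    usable_gb = gpu_memory_gb - overhead_gb              # assumed usable memory per GPU
    return math.ceil(weights_gb / usable_gb)

print(a10s_needed(70))   # 7, so two 4-GPU bare metal servers (8 A10s) fit the weights
```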

Table 6: Distributed inferencing of unquantized Llama2 70B model on 2 OCI bare metal servers

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
Llama-2-70b | Llama 2, 70B model, torchrun | GitHub - meta-llama/llama: Inference code for Llama models | 2 BM servers | A10s | 8 | 8.8

<_w3a_sdt id="-1674799883" sdttag="goog_rdk_102">

Figure 1: Inference run of unquantized Llama 70B model on two bare metal servers with eight A10s

The following table shows the test results of the Llama 70B unquantized model using four VM servers with two A10s each:

Table 7: Distributed inferencing of unquantized Llama 70B model on 4 OCI VM Servers

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
Llama-2-70b | Llama 2, 70B model, torchrun | GitHub - meta-llama/llama: Inference code for Llama models | 4 VM servers | A10s | 8 | 4

The following table shows the test results of Llama 2 models using the vLLM inference engine (with PagedAttention) on two bare metal servers with four A10s each:

Table 8: Distributed inferencing of Llama using vLLM on 2 OCI bare metal servers

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
Llama-2-7b | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM servers | A10s | 8 | 30.1
Llama-2-13b | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM servers | A10s | 8 | 27.3
Llama-2-70b | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM servers | A10s | 8 | 12.9
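A minimal sketch of a vLLM run like those in Table 8 is shown below. It assumes vLLM is installed and, for the two-server case, that the nodes have already been joined into a Ray cluster so that tensor parallelism can span all eight A10s; the model name and sampling settings are illustrative.

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=8 spans the eight A10s across the Ray cluster.
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=8)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What are the main drivers of cloud adoption?"], params)

for output in outputs:
    generated = output.outputs[0]
    print(f"{len(generated.token_ids)} tokens generated")
    print(generated.text)
```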

The following are the results of Llama 3 runs on two bare metal servers with eight A10 GPUs using a distributed inferencing framework such as torchrun, Ray, or MPI.

Table 9: Distributed inferencing of Llama3 using Transformer model on multiple OCI bare metal servers

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
Meta-Llama-3-70B | Llama 3, 70B model, torchrun | https://github.com/meta-llama/llama3/tree/main | 2 BM servers | A10s | 8 | 12.44
Meta-Llama-3-70B-Instruct | Llama 3, 70B model, torchrun | https://github.com/meta-llama/llama3/tree/main | 2 BM servers | A10s | 8 | 12.24
Meta-Llama-3-8B | Llama 3, 8B model, torchrun | https://github.com/meta-llama/llama3/tree/main | 1 BM server | A10 | 1 | 27.10
Meta-Llama-3-8B-Instruct | Llama 3, 8B model, torchrun | https://github.com/meta-llama/llama3/tree/main | 1 BM server | A10 | 1 | 27.04
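The runs above use Meta's reference inference code launched with torchrun. A sketch of the two-server, eight-GPU launch is shown below (wrapped in Python for consistency with the other examples); the script name and checkpoint arguments follow the meta-llama/llama3 repository, while the rendezvous address and directory paths are illustrative assumptions.

```python
import subprocess

# Run this on the first server with node_rank 0 and on the second with node_rank 1.
subprocess.run([
    "torchrun",
    "--nnodes", "2",                       # two bare metal servers
    "--nproc_per_node", "4",               # four A10s per server
    "--node_rank", "0",                    # 0 on the head node, 1 on the other
    "--rdzv_backend", "c10d",
    "--rdzv_endpoint", "10.0.0.1:29500",   # illustrative head-node address
    "example_chat_completion.py",
    "--ckpt_dir", "Meta-Llama-3-70B-Instruct/",
    "--tokenizer_path", "Meta-Llama-3-70B-Instruct/tokenizer.model",
    "--max_seq_len", "512",
    "--max_batch_size", "4",
], check=True)
```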

The following table shows the test results of Llama 3 models using the vLLM inference engine (with PagedAttention) on two bare metal servers with four A10s each:

Table 10: Distributed inferencing of Llama3 using vLLM on 2 OCI bare metal servers

Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second)
Meta-Llama-3-8B | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM servers | A10s | 8 | 24.61
Meta-Llama-3-70B | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM servers | A10s | 8 | 11.23

The following chart summarizes the inferencing performance of unquantized Llama 2 and Llama 3 models using the plain transformer and vLLM implementations on A10-accelerated OCI VM and bare metal servers.

Figure 2: Inferencing of unquantized Llama 70B models on OCI BM and VM servers

Conclusion

<_w3a_sdt id="-906527838" sdttag="goog_rdk_108"><_w3a_sdt id="1559667233" sdttag="goog_rdk_109">The above benchmarking exercises show<_w3a_sdt id="1051203805" sdttag="goog_rdk_110" showingplchdr="t"> that mainstream GPU accelerated OCI servers (like A10s) can be used for inferencing activities of different sizes of Opensource large language models (LLMs) . <_w3a_sdt id="1178919687" sdttag="goog_rdk_116">When the best performance is needed for larger scale deployments, OCI offers advanced NVIDIA GPUs deployed with NVIDIA TensorRT-LLM, which delivers great results as shown in the recent MLPerf Inference v4.0 benchmarks. Depending on the requirements and the scale of the solution,one can start working with smaller LLMs, such as 7B<_w3a_sdt id="-422875893" sdttag="goog_rdk_117"> and<_w3a_sdt id="-1417090637" sdttag="goog_rdk_118" showingplchdr="t"> 13B<_w3a_sdt id="698126012" sdttag="goog_rdk_119"><_w3a_sdt id="-615602605" sdttag="goog_rdk_120"> models <_w3a_sdt id="-926724245" sdttag="goog_rdk_121" showingplchdr="t"> on <_w3a_sdt id="94988755" sdttag="goog_rdk_122">mainstream GPU-accelerated servers, and then migrate to larger clusters with <_w3a_sdt id="1714536906" sdttag="goog_rdk_124">advanced<_w3a_sdt id="-875384823" sdttag="goog_rdk_125" showingplchdr="t"> GPUs ( A100s, H100s etc) as demand<_w3a_sdt id="-1709628596" sdttag="goog_rdk_126"> and model size increases. This scaling helps in quicker adoption of generative AI solutions for the customer. <_w3a_sdt id="1522206780" sdttag="goog_rdk_127">

Acknowledgments

The author wants to thank Mohan Srinivasan, Sreedhara Narayanaswamy, Ram Sivaram, and Hiten Goradia for their guidance, leadership, and support in this endeavour. The author also wants to thank James George for his expertise in setting up the MPI cluster on Oracle Roving Edge Devices (RED).

Disclaimer

The benchmarking exercises published in this post are for general guidance only. Individual test results can vary based on model size, testing parameters, performance techniques, and the hardware and software stack used.

References

For further information, please visit the following links:

Oracle Generative AI Solutions: https://www.oracle.com/artificial-intelligence/generative-ai/

Oracle GPU accelerated Bare Metal Servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu

Oracle GPU accelerated VM Servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu

Oracle Roving Edge Servers: https://www.oracle.com/a/ocom/docs/data-sheet-roving-edge-device.pdf

NVIDIA A10 GPUs: https://www.nvidia.com/en-au/data-center/products/a10-gpu/

LLaMA CPP Source Code: https://github.com/ggerganov/llama.cpp

Meta LLaMA2: meta-llama/llama: Inference code for Llama models (github.com)

Meta LLaMA3: meta-llama/llama3: The official Meta Llama 3 GitHub site