H200 DeepSeek Performance Report: The Results Are In!!!

When it comes to performance testing for AI systems, time and expertise are critical resources. Exabits takes that burden off your shoulders by running the rigorous tests so you don't have to. In this report, Exabits evaluates an H200 8-GPU cluster serving the DeepSeek-R1 671B model, comparing the vLLM and SGLang inference engines on three key metrics: throughput, latency, and first token latency (TTFT). The benchmarks below deliver actionable insights, so businesses can deploy reliable, high-performance AI solutions without running extensive tests themselves.

H200 × 8 DeepSeek-R1 671B Performance Test Report

Test Environment

  1. Device: H200 × 8

  2. Model: DeepSeek-R1 671B

  3. Inference Engines: vLLM and SGLang

  4. Test Tool: Custom Load Testing Tool

  5. Performance Metrics (each measured per request, as sketched below):

  • Throughput (tokens/s): tokens generated per second; higher means better computational efficiency.

  • Latency (s): time to complete a full response; lower is better.

  • First Token Latency (TTFT, s): time from sending a request to receiving the first generated token; lower is better for responsive AI.
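
Exabits' load-testing tool itself isn't published, but all three metrics can be measured against the OpenAI-compatible streaming API that both vLLM and SGLang expose. Below is a minimal per-request sketch; the endpoint URL, port, and served model name are assumptions for illustration, not values from this report.

```python
"""Minimal sketch of per-request metric measurement.

Assumptions (not from the report): the engine exposes an
OpenAI-compatible streaming endpoint at http://localhost:8000 and the
served model name is "deepseek-r1". Adjust both for your setup.
"""
import json
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint

def measure_request(prompt: str, max_tokens: int = 256) -> dict:
    payload = {
        "model": "deepseek-r1",  # assumed served model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,          # streaming lets us time the first token
    }
    start = time.perf_counter()
    ttft = None
    tokens = 0
    with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0].get("delta", {})
            if delta.get("content"):
                if ttft is None:
                    ttft = time.perf_counter() - start  # first token arrived
                tokens += 1  # approximately one token per SSE chunk
    latency = time.perf_counter() - start
    return {
        "ttft_s": ttft,
        "latency_s": latency,
        "throughput_tok_s": tokens / latency if latency else 0.0,
    }

if __name__ == "__main__":
    print(measure_request("Explain H200 HBM3e memory in one paragraph."))
```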

Throughput Test

  • Findings:

    • Low to medium concurrency (16-128): SGLang outperforms vLLM.

    • High concurrency (256+): vLLM excels, overtaking SGLang at 256 and widening the gap at 512.

| Concurrency Level | vLLM Throughput (tokens/s) | SGLang Throughput (tokens/s) |
|---|---|---|
| 16 | 146.976 | 212.751 |
| 32 | 221.053 | 289.175 |
| 64 | 361.312 | 399.016 |
| 128 | 621.929 | 690.207 |
| 256 | 835.662 | 821.490 |
| 512 | 1045.945 | 876.842 |
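
To fill in one row of a table like this, the per-request measurement can be driven at a fixed concurrency level and the token counts summed over wall-clock time. A minimal asyncio sketch follows, under the same endpoint and model-name assumptions as above; Exabits' actual tool may differ.

```python
"""Sketch of a fixed-concurrency throughput sweep. The endpoint,
model name, prompts, and aiohttp are illustrative choices, not
details taken from Exabits' custom tool."""
import asyncio
import json
import time

import aiohttp

URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint

async def one_request(session: aiohttp.ClientSession, prompt: str) -> int:
    payload = {
        "model": "deepseek-r1",  # assumed served model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,
    }
    tokens = 0
    async with session.post(URL, json=payload) as resp:
        async for raw in resp.content:  # iterate SSE lines
            line = raw.strip()
            if not line.startswith(b"data: ") or line.endswith(b"[DONE]"):
                continue
            chunk = json.loads(line[len(b"data: "):])
            if chunk["choices"][0].get("delta", {}).get("content"):
                tokens += 1  # approximately one token per chunk
    return tokens

async def sweep(concurrency: int) -> float:
    """Return aggregate throughput (tokens/s) at a given concurrency."""
    timeout = aiohttp.ClientTimeout(total=3600)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        start = time.perf_counter()
        counts = await asyncio.gather(
            *(one_request(session, f"Question #{i}: summarize attention.")
              for i in range(concurrency))
        )
        elapsed = time.perf_counter() - start
    return sum(counts) / elapsed

if __name__ == "__main__":
    for level in (16, 32, 64, 128, 256, 512):
        print(level, round(asyncio.run(sweep(level)), 3))
```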

First Token Latency (TTFT)

  • Key Metric: Measures responsiveness: how quickly the first token arrives after a request is sent.

  • Results:

    • SGLang performs better at low to medium concurrency (16-128).

    • TTFT climbs sharply for both engines at 256+; vLLM holds the edge at 256, while SGLang edges back ahead at 512.

| Concurrency Level | vLLM TTFT (s) | SGLang TTFT (s) |
|---|---|---|
| 16 | 3.289 | 1.876 |
| 32 | 4.101 | 1.439 |
| 64 | 3.966 | 1.937 |
| 128 | 2.141 | 2.111 |
| 256 | 15.418 | 18.539 |
| 512 | 46.406 | 42.023 |

Latency Test

  • Observations:

    • SGLang offers lower latency at low to medium concurrency levels (16-128), meaning faster full responses.

    • vLLM's latency grows more gradually, giving it the lower end-to-end latency under high concurrency (256+).

| Concurrency Level | vLLM Latency (s) | SGLang Latency (s) |
|---|---|---|
| 16 | 29.241 | 20.521 |
| 32 | 22.384 | 18.561 |
| 64 | 25.810 | 21.728 |
| 128 | 28.965 | 28.720 |
| 256 | 51.362 | 61.668 |
| 512 | 84.354 | 91.417 |
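
The report lists a single figure per concurrency level and does not state how per-request numbers were aggregated. A common convention, shown here purely as an assumption, is to report the mean alongside tail percentiles such as p95, since tail latency often matters more to users than the average:

```python
"""Illustrative aggregation of per-request latencies into summary
statistics. The mean/p50/p95 choice is an assumption; the report does
not document Exabits' aggregation method."""
import statistics

def summarize(latencies_s: list[float]) -> dict:
    ordered = sorted(latencies_s)
    p95_index = max(0, round(0.95 * len(ordered)) - 1)  # nearest-rank p95
    return {
        "mean_s": statistics.fmean(ordered),
        "p50_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
    }

# Example with made-up per-request latencies:
print(summarize([20.1, 21.4, 19.8, 22.7, 25.3, 20.9]))
```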

Analysis and Advantages

  • Low to Medium Concurrency (16-128):

    • H200 paired with SGLang delivers the lowest TTFT and latency alongside the highest throughput.

  • High Concurrency (256+):

    • H200 paired with vLLM maintains stability, scaling to the highest throughput under heavy workloads.

With its exceptional computational power, the H200 system delivers stable inference performance under high-concurrency workloads. Combined with vLLM optimizations, it effectively supports high-performance, low-latency AI inference, meeting customer demands for reliable and efficient AI model deployment.
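
One practical way to read these results is as a routing rule: across all three tables, the crossover between the two engines sits around 256 concurrent requests. The helper below is hypothetical and simply encodes that takeaway; it is not part of Exabits' stack.

```python
"""Hypothetical helper encoding the report's takeaway: SGLang wins at
low-to-medium concurrency, vLLM at high concurrency. The 256 threshold
is read off the crossover in the tables above."""

def pick_engine(expected_concurrency: int) -> str:
    # Below ~256 concurrent requests, SGLang gave lower TTFT/latency
    # and higher throughput; at 256+, vLLM scaled better.
    return "sglang" if expected_concurrency < 256 else "vllm"

assert pick_engine(64) == "sglang"
assert pick_engine(512) == "vllm"
```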
