H200 DeepSeek Performance Report: The Results Are In!
When it comes to performance testing for AI systems, time and expertise are critical resources. Exabits takes the burden off your shoulders by conducting rigorous testing, so you don't have to. In this report, Exabits evaluates the H200 8-GPU cluster paired with the DeepSeek-R1 671B model, focusing on key metrics such as throughput, latency, and first token latency (TTFT). Our comprehensive analysis and detailed benchmarking deliver actionable insights, ensuring businesses can focus on deploying reliable, high-performance AI solutions without the hassle of conducting extensive tests themselves.
Device: H200 × 8
Model: DeepSeek-R1 671B
Test Tool: Custom Load Testing Tool
Performance Metrics:
Throughput (tokens/s): Measures computational efficiency; higher is better.
Latency (s): Time to complete a response; lower is better.
First Token Latency (TTFT, s): Time to generate the first token; lower is better for responsive AI.
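The three metrics above can all be derived from per-request timestamps. A minimal sketch in Python (the class name and timestamp convention are illustrative, not part of the custom load testing tool):

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    start: float        # moment the request was sent (seconds)
    first_token: float  # moment the first streamed token arrived
    end: float          # moment the final token arrived
    tokens: int         # completion tokens generated

    @property
    def ttft(self) -> float:
        """First token latency (s); lower is better."""
        return self.first_token - self.start

    @property
    def latency(self) -> float:
        """End-to-end response time (s); lower is better."""
        return self.end - self.start

    @property
    def throughput(self) -> float:
        """Tokens per second for this request; higher is better."""
        return self.tokens / (self.end - self.start)

m = RequestMetrics(start=0.0, first_token=1.9, end=21.9, tokens=512)
print(f"TTFT={m.ttft:.1f}s latency={m.latency:.1f}s throughput={m.throughput:.1f} tok/s")
```

In a real harness, `start` is recorded just before the POST, `first_token` on the first streamed chunk, and `end` when the stream closes.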
Findings:
Low concurrency (16-128): SGLang outperforms vLLM.
High concurrency (256+): vLLM excels with superior scalability.
Throughput (tokens/s) by concurrency:

| Concurrency | vLLM (tokens/s) | SGLang (tokens/s) |
|---|---|---|
| 16 | 146.976 | 212.751 |
| 32 | 221.053 | 289.175 |
| 64 | 361.312 | 399.016 |
| 128 | 621.929 | 690.207 |
| 256 | 835.662 | 821.490 |
| 512 | 1045.945 | 876.842 |
First Token Latency (TTFT): Measures responsiveness.
Results:
SGLang performs better at low concurrency (16-128).
vLLM demonstrates stability at higher concurrency levels (256+).
| Concurrency | vLLM TTFT (s) | SGLang TTFT (s) |
|---|---|---|
| 16 | 3.289 | 1.876 |
| 32 | 4.101 | 1.439 |
| 64 | 3.966 | 1.937 |
| 128 | 2.141 | 2.111 |
| 256 | 15.418 | 18.539 |
| 512 | 46.406 | 42.023 |
Observations:
SGLang offers lower latency at low concurrency levels (faster response times).
vLLM shows gradual latency increases, ensuring stability under high concurrency.
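Each figure in the tables summarizes many individual requests at one concurrency level. A short sketch of how per-request latencies might be aggregated (the sample values below are made up for illustration, not the report's raw data):

```python
import statistics

# Hypothetical per-request latencies (s) collected at one concurrency level.
latencies = [18.2, 19.0, 20.5, 21.1, 22.4, 25.8, 28.7, 31.0]

mean = statistics.mean(latencies)
# quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
p99 = statistics.quantiles(latencies, n=100)[98]

print(f"mean={mean:.2f}s p99={p99:.2f}s")
```

Reporting a tail percentile alongside the mean matters at high concurrency, where a few slow requests can hide behind a healthy-looking average.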
| Concurrency | vLLM Latency (s) | SGLang Latency (s) |
|---|---|---|
| 16 | 29.241 | 20.521 |
| 32 | 22.384 | 18.561 |
| 64 | 25.810 | 21.728 |
| 128 | 28.965 | 28.720 |
| 256 | 51.362 | 61.668 |
| 512 | 84.354 | 91.417 |
Low to Medium Concurrency (16-128):
H200 paired with SGLang delivers lower TTFT and latency, with vLLM close behind.
High Concurrency (256+):
H200 paired with vLLM maintains stability, with scalable throughput under heavy workloads.
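The concurrency sweep behind the tables above can be sketched with asyncio: fire N simultaneous streaming requests and divide total tokens by wall-clock time. The request here is a stub so the harness itself is runnable; in the real tool it would POST to the inference server (both vLLM and SGLang expose OpenAI-compatible endpoints) and count streamed chunks:

```python
import asyncio
import time

async def fake_request(i: int) -> int:
    """Stand-in for one streaming completion; returns tokens generated.
    The real harness would stream from the server and count chunks."""
    await asyncio.sleep(0.01)  # simulate decode time
    return 128

async def run_sweep(concurrency: int) -> float:
    """Run `concurrency` simultaneous requests; return cluster tokens/s."""
    start = time.perf_counter()
    tokens = await asyncio.gather(*(fake_request(i) for i in range(concurrency)))
    wall = time.perf_counter() - start
    # Cluster throughput: total completion tokens over wall-clock seconds.
    return sum(tokens) / wall

tp = asyncio.run(run_sweep(16))
```

Repeating `run_sweep` for each level (16, 32, ..., 512) yields a throughput curve like the one reported above.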
With its exceptional computational power, the H200 system delivers stable inference performance under high-concurrency workloads. Combined with vLLM optimizations, it effectively supports high-performance, low-latency AI inference, meeting customer demands for reliable and efficient AI model deployment.