H200 DeepSeek Performance Report: The Results Are In!!!

When it comes to performance testing for AI systems, time and expertise are critical resources. Exabits takes that burden off your shoulders by running the rigorous tests so you don't have to. In this report, Exabits evaluates an H200 8-GPU cluster serving the DeepSeek-R1 671B model, comparing the vLLM and SGLang inference engines on three key metrics: throughput, latency, and first token latency (TTFT). The benchmarks below deliver actionable insights, so businesses can deploy reliable, high-performance AI solutions without running extensive tests themselves.

H200 × 8 DeepSeek-R1 671B Performance Test Report

Test Environment

  1. Device: H200 × 8

  2. Model: DeepSeek-R1 671B

  3. Inference Engines: vLLM and SGLang

  4. Test Tool: Custom Load Testing Tool

  5. Performance Metrics (each measured per request, as sketched below):

  • Throughput (tokens/s): tokens generated per second; higher means better computational efficiency.

  • Latency (s): time to complete a full response; lower is better.

  • First Token Latency (TTFT, s): time from sending a request to receiving the first generated token; lower is better for responsive AI.
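
Exabits' load-testing tool itself isn't published, but all three metrics can be measured against the OpenAI-compatible streaming API that both vLLM and SGLang expose. Below is a minimal per-request sketch; the endpoint URL, port, and served model name are assumptions for illustration, not values from this report.

```python
"""Minimal sketch of per-request metric measurement.

Assumptions (not from the report): the engine exposes an
OpenAI-compatible streaming endpoint at http://localhost:8000 and the
served model name is "deepseek-r1". Adjust both for your setup.
"""
import json
import time

import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint

def measure_request(prompt: str, max_tokens: int = 256) -> dict:
    payload = {
        "model": "deepseek-r1",  # assumed served model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,          # streaming lets us time the first token
    }
    start = time.perf_counter()
    ttft = None
    tokens = 0
    with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0].get("delta", {})
            if delta.get("content"):
                if ttft is None:
                    ttft = time.perf_counter() - start  # first token arrived
                tokens += 1  # approximately one token per SSE chunk
    latency = time.perf_counter() - start
    return {
        "ttft_s": ttft,
        "latency_s": latency,
        "throughput_tok_s": tokens / latency if latency else 0.0,
    }

if __name__ == "__main__":
    print(measure_request("Explain H200 HBM3e memory in one paragraph."))
```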

Throughput Test

  • Findings:

    • Low to medium concurrency (16-128): SGLang outperforms vLLM.

    • High concurrency (256+): vLLM excels, overtaking SGLang at 256 and widening the gap at 512.

| Concurrency Level | vLLM Throughput (tokens/s) | SGLang Throughput (tokens/s) |
|---|---|---|
| 16 | 146.976 | 212.751 |
| 32 | 221.053 | 289.175 |
| 64 | 361.312 | 399.016 |
| 128 | 621.929 | 690.207 |
| 256 | 835.662 | 821.490 |
| 512 | 1045.945 | 876.842 |
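
To fill in one row of a table like this, the per-request measurement can be driven at a fixed concurrency level and the token counts summed over wall-clock time. A minimal asyncio sketch follows, under the same endpoint and model-name assumptions as above; Exabits' actual tool may differ.

```python
"""Sketch of a fixed-concurrency throughput sweep. The endpoint,
model name, prompts, and aiohttp are illustrative choices, not
details taken from Exabits' custom tool."""
import asyncio
import json
import time

import aiohttp

URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint

async def one_request(session: aiohttp.ClientSession, prompt: str) -> int:
    payload = {
        "model": "deepseek-r1",  # assumed served model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,
    }
    tokens = 0
    async with session.post(URL, json=payload) as resp:
        async for raw in resp.content:  # iterate SSE lines
            line = raw.strip()
            if not line.startswith(b"data: ") or line.endswith(b"[DONE]"):
                continue
            chunk = json.loads(line[len(b"data: "):])
            if chunk["choices"][0].get("delta", {}).get("content"):
                tokens += 1  # approximately one token per chunk
    return tokens

async def sweep(concurrency: int) -> float:
    """Return aggregate throughput (tokens/s) at a given concurrency."""
    timeout = aiohttp.ClientTimeout(total=3600)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        start = time.perf_counter()
        counts = await asyncio.gather(
            *(one_request(session, f"Question #{i}: summarize attention.")
              for i in range(concurrency))
        )
        elapsed = time.perf_counter() - start
    return sum(counts) / elapsed

if __name__ == "__main__":
    for level in (16, 32, 64, 128, 256, 512):
        print(level, round(asyncio.run(sweep(level)), 3))
```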

First Token Latency (TTFT)

  • Key Metric: Measures responsiveness: how quickly the first token arrives after a request is sent.

  • Results:

    • SGLang performs better at low to medium concurrency (16-128).

    • TTFT climbs sharply for both engines at 256+; vLLM holds the edge at 256, while SGLang edges back ahead at 512.

| Concurrency Level | vLLM TTFT (s) | SGLang TTFT (s) |
|---|---|---|
| 16 | 3.289 | 1.876 |
| 32 | 4.101 | 1.439 |
| 64 | 3.966 | 1.937 |
| 128 | 2.141 | 2.111 |
| 256 | 15.418 | 18.539 |
| 512 | 46.406 | 42.023 |

Latency Test

  • Observations:

    • SGLang offers lower latency at low to medium concurrency levels (16-128), meaning faster full responses.

    • vLLM's latency grows more gradually, giving it the lower end-to-end latency under high concurrency (256+).

| Concurrency Level | vLLM Latency (s) | SGLang Latency (s) |
|---|---|---|
| 16 | 29.241 | 20.521 |
| 32 | 22.384 | 18.561 |
| 64 | 25.810 | 21.728 |
| 128 | 28.965 | 28.720 |
| 256 | 51.362 | 61.668 |
| 512 | 84.354 | 91.417 |
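
The report lists a single figure per concurrency level and does not state how per-request numbers were aggregated. A common convention, shown here purely as an assumption, is to report the mean alongside tail percentiles such as p95, since tail latency often matters more to users than the average:

```python
"""Illustrative aggregation of per-request latencies into summary
statistics. The mean/p50/p95 choice is an assumption; the report does
not document Exabits' aggregation method."""
import statistics

def summarize(latencies_s: list[float]) -> dict:
    ordered = sorted(latencies_s)
    p95_index = max(0, round(0.95 * len(ordered)) - 1)  # nearest-rank p95
    return {
        "mean_s": statistics.fmean(ordered),
        "p50_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
    }

# Example with made-up per-request latencies:
print(summarize([20.1, 21.4, 19.8, 22.7, 25.3, 20.9]))
```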

Analysis and Advantages

  • Low to Medium Concurrency (16-128):

    • H200 paired with SGLang delivers the lowest TTFT and latency alongside the highest throughput.

  • High Concurrency (256+):

    • H200 paired with vLLM maintains stability, scaling to the highest throughput under heavy workloads.

With its exceptional computational power, the H200 system delivers stable inference performance under high-concurrency workloads. Combined with vLLM optimizations, it effectively supports high-performance, low-latency AI inference, meeting customer demands for reliable and efficient AI model deployment.
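
One practical way to read these results is as a routing rule: across all three tables, the crossover between the two engines sits around 256 concurrent requests. The helper below is hypothetical and simply encodes that takeaway; it is not part of Exabits' stack.

```python
"""Hypothetical helper encoding the report's takeaway: SGLang wins at
low-to-medium concurrency, vLLM at high concurrency. The 256 threshold
is read off the crossover in the tables above."""

def pick_engine(expected_concurrency: int) -> str:
    # Below ~256 concurrent requests, SGLang gave lower TTFT/latency
    # and higher throughput; at 256+, vLLM scaled better.
    return "sglang" if expected_concurrency < 256 else "vllm"

assert pick_engine(64) == "sglang"
assert pick_engine(512) == "vllm"
```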
