- Validates the performance of AI infrastructure by emulating real-world workloads
- Evaluates how new algorithms, components, and protocols improve the performance of AI training
- Adjusts and optimizes the parameters of both AI workloads and system infrastructure without investing in expensive large-scale deployments
SANTA ROSA, Calif., April 01, 2025--(BUSINESS WIRE)--Keysight Technologies, Inc. (NYSE: KEYS) introduces Keysight AI (KAI) Data Center Builder, an advanced software suite that emulates real-world workloads to evaluate how new algorithms, components, and protocols affect the performance of AI training. KAI Data Center Builder's workload emulation capability integrates large language model (LLM) and other artificial intelligence (AI) model training workloads into the design and validation of AI infrastructure components: networks, hosts, and accelerators. This solution enables closer alignment among hardware design, protocols, architectures, and AI training algorithms, which improves system performance.
AI operators use various parallel processing strategies, also known as model partitioning, to accelerate AI model training. Aligning model partitioning with the AI cluster's topology and configuration improves training performance. During the AI cluster design phase, critical questions are best answered through experimentation. Many of these questions concern how efficiently data moves between graphics processing units (GPUs). Key considerations include:
- Scale-up design of GPU interconnects inside an AI host or rack
- Scale-out network design, including bandwidth per GPU and topology
- Configuration of network load balancing and congestion control
- Tuning of the training framework parameters
The KAI Data Center Builder workload emulation solution reproduces the network communication patterns of real-world AI training jobs. This accelerates experimentation, shortens the learning curve, and provides deeper insight into the causes of performance degradation, which are difficult to isolate using real AI training jobs alone. Keysight customers can access a library of LLM workloads like GPT and Llama, with a selection of popular model partitioning schemes like Data Parallel (DP), Fully Sharded Data Parallel (FSDP), and three-dimensional (3D) parallelism.
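To give a rough sense of the kind of communication pattern involved, the sketch below estimates the traffic a Data Parallel job generates when gradients are synchronized with a ring all-reduce, a common collective algorithm. It is an illustration only and does not represent how KAI Data Center Builder works; the function names and example figures are assumptions, and the timing model ignores latency, congestion, and overlap with computation.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends during one ring all-reduce of `payload_bytes`.

    A ring all-reduce runs a reduce-scatter followed by an all-gather.
    Each phase sends (N - 1) / N of the payload per GPU, so the total
    is 2 * (N - 1) / N * payload.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        return 0.0  # a single GPU has nothing to exchange
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def transfer_time_seconds(bytes_sent: float, bandwidth_gbps: float) -> float:
    """Idealized transfer time at a given per-GPU bandwidth in Gbit/s."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth_gbps must be positive")
    return bytes_sent * 8 / (bandwidth_gbps * 1e9)


if __name__ == "__main__":
    # Hypothetical example: 1 GB of gradients, 8 GPUs, 400 Gbit/s per GPU.
    sent = ring_allreduce_bytes_per_gpu(1e9, 8)
    print(f"bytes sent per GPU: {sent:.3e}")
    print(f"idealized time: {transfer_time_seconds(sent, 400):.4f} s")
```

Real jobs deviate from this idealized figure because of topology, load balancing, and congestion control, which is why such questions are usually settled by experiment rather than arithmetic.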
The workload emulation application in KAI Data Center Builder enables AI operators to:
- Experiment with parallelism parameters, including partition sizes and their distribution over the available AI infrastructure (scheduling)
- Understand the impact of communications within and among partitions on overall job completion time (JCT)
- Identify low-performing collective operations and drill down to identify bottlenecks
- Analyze network utilization, tail latency, and congestion to understand the impact they have on JCT
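Two of the ideas in the list above can be sketched in a few lines: enumerating candidate partition sizes for 3D parallelism, and computing a tail-latency percentile from measured samples. This is a minimal illustration under stated assumptions, not the product's method. The function names, the `max_tp` cap on tensor-parallel size, and the sample values are all hypothetical, and the percentile uses the simple nearest-rank definition.

```python
import math


def parallelism_configs(num_gpus: int, max_tp: int = 8) -> list[tuple[int, int, int]]:
    """List (data, tensor, pipeline) parallel sizes whose product is num_gpus.

    `max_tp` caps the tensor-parallel size, an assumption reflecting the
    common practice of keeping tensor parallelism inside one host.
    """
    configs = []
    for tp in range(1, max_tp + 1):
        if num_gpus % tp:
            continue  # tensor-parallel size must divide the GPU count
        remaining = num_gpus // tp
        for pp in range(1, remaining + 1):
            if remaining % pp == 0:
                configs.append((remaining // pp, tp, pp))
    return configs


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=99 for p99 tail latency."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    for dp, tp, pp in parallelism_configs(16, max_tp=4):
        print(f"DP={dp} TP={tp} PP={pp}")
    # Hypothetical collective completion times in milliseconds.
    latencies_ms = [1.1, 1.0, 1.2, 1.1, 9.8, 1.0, 1.3, 1.1]
    print("p99 latency (ms):", percentile_nearest_rank(latencies_ms, 99))
```

In a synchronous training step, every participant waits for the slowest collective to finish, so a single outlier like the one in the sample data can dominate JCT even when the median looks healthy.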