Published on 7 Oct 2024

Is Kubernetes or Slurm the best orchestrator for 512+ GPU jobs?

Choosing the right orchestrator for training large LLMs.

Contributors

Gary Wu

Co-founder



Introduction

More parameters and larger training sets mean more GPUs. But more GPUs mean more overhead in managing, scaling, and orchestrating your workloads. Consider:

  1. Resource allocation: You must efficiently distribute compute across jobs and tasks. This is not just about ensuring your tasks run; it is about keeping your GPUs fully utilized without overcommitting resources, while preventing contention between different jobs. Then there is scheduling: which jobs run where, and when? You need to keep resources busy to reduce idle time; handle job priorities, preemption, and fairness; and manage job dependencies and workflows.
  2. Fault tolerance: GPUs fail. Orchestrators help workloads keep running when components fail, which means job recovery, handling node failures, and ensuring high availability.
  3. Scalability: We want to go from single-digit GPU clusters to hundreds or thousands. Orchestrators handle this increased workload by managing nodes and GPUs and dynamically scaling resources up or down as needed. Well-designed systems also minimize latency and overhead in these large-scale operations.

But what is the best option for orchestration? Two main platforms are always at the top of the list. Slurm is how HPC jobs have traditionally been orchestrated, and Kubernetes is the cloud-native containerization platform that underpins the modern web. Here, we want to show how you can set up large-scale GPU jobs with both platforms and why we think Kubernetes is the platform on which AI is built.

Slurm: old-school, but still high-power

If you have worked with high-performance computing, you’ll have come across Slurm. Used by the majority of the world’s supercomputers to schedule workloads, it is the core workload manager for Linux-based high-performance computing clusters.

But all that makes it sound like it’s for researchers in an 80s computer science department. It’s not. Slurm is more than capable of running extensive twenty-first-century AI workloads, and it excels at scalability: the acronym even got retconned from Simple Linux Utility for Resource Management to Scalable Linux Utility for Resource Management.

So what makes us so sure Slurm can handle AI/ML training? Over their years in the HPC world, Slurm’s developers have tailored the utility to exactly what large workloads need: scalability, resource management, and job scheduling.

At the core of Slurm is the Slurm controller (slurmctld), the central management daemon that coordinates resource allocation and scheduling. Every compute node then runs its own daemon, slurmd, which is responsible for job execution.

(Source: Slurm overview)

You then control these through a combination of configuration files, scripts, and client commands. Let’s say we’re trying to build our 512 GPU cluster. What do we need to do?

First, we need to define our nodes within slurm.conf. Assuming each node has 8 GPUs, we’ll require 64 nodes to reach 512 GPUs.

# slurm.conf

# Basic Slurm configuration
ClusterName=ai_cluster
ControlMachine=slurm-controller
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818

# Node definitions
NodeName=gpu-node[001-064] Gres=gpu:8 CPUs=32 RealMemory=128000 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN

# Partition definitions
PartitionName=gpu_partition Nodes=gpu-node[001-064] Default=YES MaxTime=INFINITE State=UP

Here, we define nodes gpu-node001 to gpu-node064, each with 8 GPUs, 32 CPUs, and 128 GB of memory. We also create a partition named gpu_partition that includes all of the GPU nodes.

The secret sauce here is Gres=gpu:8. Gres is Slurm’s way of handling generic resources, such as GPUs. We can configure these in our gres.conf:

# /etc/slurm/gres.conf

Name=gpu Type=h100 File=/dev/nvidia[0-7]

This tells Slurm where to find our GPUs and that they are H100s. With this configuration in place, we can set up the run. To do that, we put together a batch script that we’ll submit with sbatch. In the script, #SBATCH directives request resources and define job parameters, and srun launches the training tasks:

#!/bin/bash
#SBATCH --job-name=AI_Training
#SBATCH --partition=gpu_partition
#SBATCH --nodes=64                # 64 nodes x 8 GPUs = 512 GPUs
#SBATCH --ntasks-per-node=8       # One task per GPU
#SBATCH --gres=gpu:8              # Request 8 GPUs per node
#SBATCH --cpus-per-task=4         # CPUs per task (adjust as needed)
#SBATCH --mem=120G                # Memory per node
#SBATCH --time=48:00:00           # Maximum job runtime
#SBATCH --output=logs/output_%j.txt
#SBATCH --error=logs/error_%j.txt

# Load modules, environments, variables
...

# Run the training script using srun
srun --mpi=pmix_v3 python train.py

This script requests the necessary resources and configures the job parameters for the training task. Let's break down what each part does:

  1. Resource allocation: The script requests 64 nodes (--nodes=64), each with 8 GPUs (--gres=gpu:8), totaling 512 GPUs. It allocates eight tasks per node (--ntasks-per-node=8), one for each GPU, with 4 CPUs per task (--cpus-per-task=4). Each node is given 120GB of memory (--mem=120G).
  2. Job configuration: It sets a job name (--job-name=AI_Training), specifies the partition to use (--partition=gpu_partition), and sets a maximum runtime of 48 hours (--time=48:00:00).
  3. Job execution: The srun command is used to launch the training script (train.py) across all allocated nodes.

In this final part, the --mpi=pmix_v3 option specifies the use of PMIx for inter-process communication. This is particularly important when training deep learning models that require synchronization of parameters across multiple GPUs and nodes.
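
The batch script above launches train.py with srun but doesn't show the training code itself. As a rough, minimal sketch of what its distributed setup could look like, here is a PyTorch DDP version that reads Slurm's per-task environment variables (it assumes MASTER_ADDR and MASTER_PORT are exported in the batch script, and the model is a stand-in):

# train.py (sketch): one process per GPU, launched by srun
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    # srun sets these for every task it launches
    rank = int(os.environ["SLURM_PROCID"])         # global rank: 0..511
    world_size = int(os.environ["SLURM_NTASKS"])   # 512 tasks in total
    local_rank = int(os.environ["SLURM_LOCALID"])  # 0..7 within each node

    # MASTER_ADDR / MASTER_PORT are assumed to be exported by the batch script
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, local_rank

def main():
    rank, local_rank = setup_distributed()
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])
    if rank == 0:
        print("initialized", dist.get_world_size(), "processes")
    # ... DistributedSampler, training loop, periodic checkpointing ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()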

With this configuration, Slurm will:

  1. Reserve the requested resources across the cluster.
  2. Distribute the workload evenly across all 512 GPUs.
  3. Manage inter-node communication for synchronization during distributed training.
  4. Handle any node failures or job interruptions, potentially restarting the job from the last checkpoint if configured (see the sketch after this list).
  5. Provide detailed accounting and usage statistics after job completion.
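
On point 4: Slurm itself doesn't checkpoint application state; the usual pattern is to make the job requeueable and have the training script save and reload its own checkpoints. A minimal sketch of the extra directives (the signal timing is illustrative):

#SBATCH --requeue               # let Slurm requeue the job after a node failure
#SBATCH --signal=B:USR1@300     # signal the batch script 5 minutes before the time limit

# The training script should trap the signal, write a checkpoint,
# and resume from the newest checkpoint when the job restarts.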

Nice. All you now need to do is submit it:

sbatch submit_job.sh

Slurm will then queue and execute the job when the requested resources become available.

With Slurm, there are a massive number of possible configuration and fine-tuning options that can be leveraged to optimize cluster performance. You can adjust parameters like job priorities, fair share settings, and preemption policies to ensure efficient resource utilization across different user groups and workloads. 

For instance, to ensure efficient use of the 512 GPUs, you might enable backfill to utilize idle resources for smaller jobs. All you need to do is modify your slurm.conf:

# slurm.conf continued

SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_max_job_test=1000,bf_interval=30

Where:

  • bf_continue: Continues backfilling after a higher-priority job starts.
  • bf_max_job_test: Number of jobs considered for backfill.
  • bf_interval: The number of seconds between backfill scheduling attempts.
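
Once backfill is enabled, you can sanity-check how the scheduler is treating the queue with the standard Slurm client tools (sprio assumes the multifactor priority plugin is in use):

# Estimated start times for queued jobs, useful for checking backfill behaviour
squeue --start

# Per-job priority factors
sprio -l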

This should show you that Slurm is powerful. But it has its issues.

  1. Learning curve: If you aren’t used to being neck-deep in the Linux command line, then Slurm's extensive configuration options and CLI can be overwhelming. There is a lot of scope for configuration errors, and if you are scaling a team alongside scaling your workloads, onboarding them to Slurm is going to be a problem.
  2. Limited support for containers: Slurm can work with containers, but it doesn't have the same level of native container support as platforms like Kubernetes. This is less of an issue when you are only working on local models. But if you are productizing an AI, you need a certain degree of portability and reproducibility that comes as standard with containerized workloads (see the sketch after this list).
  3. Lack of auto-scaling: Those conf files above do all of the heavy lifting within Slurm. What you set in them is what Slurm uses. Slurm doesn't provide out-of-the-box support for auto-scaling resources based on demand, which is particularly useful for fluctuating AI workloads. This can lead to inefficient resource utilization during periods of low demand or potential resource shortages during peak times.
  4. Integration with modern tooling: If you are building for production, the tools and platforms you will be working with are designed for cloud-native environments. Integrating these tools with Slurm-based HPC environments requires effort or custom solutions, slowing down your dev cycle.
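
On the container point: Slurm clusters typically bolt on container support through add-ons such as NVIDIA's enroot runtime and the pyxis SPANK plugin rather than anything native. If those are installed on the compute nodes, a containerized run looks roughly like this (the image tag and mount are illustrative):

# Requires the enroot runtime and the pyxis plugin on the compute nodes
srun --container-image=nvcr.io/nvidia/pytorch:24.05-py3 \
     --container-mounts=/data:/data \
     python train.py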

If we were still in the world of local ML models being built on supercomputers in labs, Slurm would be enough. But AI has escaped the laboratory and hit the web, so we need a web-scale orchestration platform.

Kubernetes: the future and where support is growing

Kubernetes is about web scale. Within AI, perhaps the best justification for Kubernetes is the fact that OpenAI has scaled its operations on Kubernetes to 7,500 nodes in a single cluster (at least; the case study is from 2021, so the number has likely grown).

If we go back to our “why orchestrate” question, we can see the answers for Kubernetes.

  1. Resource allocation. Like Slurm’s conf files, Kubernetes uses resource requests and limits in pod specifications to declare the CPU, memory, and GPU resources a workload needs. You can also use namespaces and resource quotas to divide resources between different teams or projects (a minimal sketch follows this list).
  2. Job scheduling. Probably the one part where Kubernetes is behind HPC frameworks like Slurm. Kubernetes can be configured for batch jobs but requires custom schedulers or extensions. You can use Kubernetes Jobs and CronJobs for batch processing.
  3. Fault tolerance. This is a huge benefit of Kubernetes. It detects node failures and restarts failed pods to maintain the desired state. Container checkpointing has also been available (in beta) since v1.30.
  4. Scalability. The other core asset of Kubernetes. With horizontal pod autoscaling, you can automatically adjust the number of pod replicas based on resource utilization. With cluster autoscaling, you can add or remove nodes based on workload demands. As shown by OpenAI, Kubernetes also supports clusters with thousands of nodes.
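
As a concrete illustration of the resource-allocation point, here is a minimal sketch of a namespace quota plus a pod that counts against it (the namespace, names, and image tag are illustrative):

# Cap the GPUs one team can request in its namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "64"
---
# A pod that counts against that quota
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: team-a
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # extended resources like GPUs are requested via limits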

It’s not that Kubernetes already has everything required to run the perfect high-GPU-count job out of the box; rather, the Kubernetes ecosystem is actively working towards this aim. With Kubernetes now embedded in the AI ecosystem, teams are developing tool after tool to help organizations build better AI workflows.

NVIDIA is a case in point. Over the last few years, NVIDIA has seen how AI developers have been using Kubernetes and the struggles they’ve had to overcome: 

  • Heterogeneous node software stack: Managing different GPU types, driver versions, and OS versions across a cluster was challenging.
  • Driver configuration and management: Installing, configuring, and updating GPU drivers on each node was a manual and error-prone process.
  • GPU partitioning and sharing: Configuring GPUs for different sharing strategies (like MIG, time-slicing, or MPS) was complex and required node-level configuration.
  • Container runtime configuration: Setting up container runtimes to work with GPUs required manual configuration and hooks.
  • GPU health monitoring: There was no robust solution for monitoring GPU health and taking action when issues occurred.
  • Interoperability issues: As the cluster grew and components were updated, interoperability issues between different parts of the GPU software stack became more common.

So NVIDIA has developed the GPU operator to help with these problems. Let’s go through how we might use this to run a 512+ GPU job. First, we can install the NVIDIA GPU Operator using Helm:

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
kubectl create namespace gpu-operator
helm install gpu-operator nvidia/gpu-operator -n gpu-operator
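
Once the operator's components (driver container, device plugin, DCGM exporter, and so on) are running, each node should advertise its GPUs as allocatable resources. A quick check (the node name is illustrative):

kubectl get pods -n gpu-operator
kubectl describe node gpu-node-001 | grep nvidia.com/gpu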

The default values for the gpu-operator will suffice in most instances. For customization, we can use a custom-values.yaml file to optimize performance and resource management.

For instance, you might want to take advantage of MIG, or Multi-Instance GPU, a feature introduced by NVIDIA in their A100 and newer GPUs. It allows a single physical GPU to be partitioned into multiple isolated GPU instances, each with its dedicated memory, cache, and compute cores. This means that a single GPU can serve multiple users or processes simultaneously, effectively increasing the utilization and efficiency of your GPU resources.

Defining MIG profiles allows us to specify how each GPU should be partitioned. For example, we can partition each GPU into seven instances of equal size:

devicePlugin:
  enabled: true
  migStrategy: mixed

migManager:
  enabled: true
  config:
    create: true
    name: "custom-mig-config"
    data: |-
      config.yaml: |-
        version: v1
        mig-configs:
          all-disabled:
            - devices: all
              mig-enabled: false
          custom-mig:
            - devices: all
              mig-enabled: true
              mig-devices:
                "1g.5gb": 7

By partitioning our GPUs using MIG, we effectively increase the number of available GPU resources. We can run more processes in parallel, which can significantly speed up training for models that benefit from distributed computing. We can also use resource matching, allocating just enough GPU resources to each workload, preventing the underutilization of full GPUs. 

We can then apply these values:

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --values custom-values.yaml
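
With migStrategy: mixed, each MIG profile is exposed as its own extended resource (for example nvidia.com/mig-1g.5gb), so a workload can request a single slice rather than a whole GPU. A minimal sketch (pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-test
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice, not a full GPU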

(You can also use MIG with Slurm.)

Then we can use the Kubeflow Training Operator to run our distributed job as 64 pods, each requesting 8 GPUs (this example uses PyTorch, but you can also use TensorFlow):

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: gpu-training-job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: your-registry/your-gpu-app:latest
            resources:
              limits:
                nvidia.com/gpu: 8
    Worker:
      replicas: 63
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: your-registry/your-gpu-app:latest
            resources:
              limits:
                nvidia.com/gpu: 8
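
Assuming the Training Operator is installed and the manifest above is saved as pytorch-job.yaml (the filename is illustrative), submitting and monitoring the run looks roughly like this (label and pod names can vary between operator versions):

kubectl apply -f pytorch-job.yaml

# Watch the 64 pods come up, then follow the master's logs
kubectl get pods -l training.kubeflow.org/job-name=gpu-training-job -w
kubectl logs -f gpu-training-job-master-0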

This may look like more setup than Slurm, and in some ways, it is. But with this pattern, you have flexibility, scalability, and portability.

By modifying the container specifications and resource requests, this setup can be easily adapted to different GPU types or even heterogeneous clusters. The declarative nature of Kubernetes configurations allows for easy replication of the entire infrastructure. You can move this setup between cloud providers and easily share and work on the code as a team.

The Future of AI Workloads on Kubernetes

At FluidStack, we see Kubernetes as the ideal orchestration platform for large-scale AI workloads. The declarative nature of Kubernetes gives you the flexibility to move between providers and to scale your workloads up and down with demand. It also means you can run your training and your inference on the same infrastructure.

If you are already using Slurm, great. Its foundation is solid enough that it will keep delivering performant jobs for the foreseeable future. But it isn’t the future. Kubernetes is the platform best suited to scaling AI workloads, especially when you need GPU counts in the hundreds or thousands.

As AI grows, Kubernetes will grow with it, providing the foundation for cloud-native AI applications that Slurm has for HPC applications. Organizations like NVIDIA, OpenAI, and FluidStack will grow the ecosystem and tooling to help provide AI developers with exactly what they need to build the best models.

Reserve your GPU Cluster.

Fully managed, deployed on Kubernetes or Slurm with 24/7 support, 15 min response time and 99% uptime.

FluidStack GPU clusters are powered by NVIDIA H100s, H200s, and soon GB200s.
