Pipeline parallelism is key to efficient distributed training for large-scale models, but its performance is often hindered by pipeline bubbles, which are gaps in computation that limit throughput. A recent paper introduces a breakthrough zero-bubble scheduling strategy, achieving up to a 30% throughput improvement over the widely used 1F1B schedule. In this post, we demystify the scheduling process with detailed, step-by-step illustrations, providing clarity and context that complement the original work. Whether you're new to ML systems or a seasoned researcher, this post bridges the gap between high-level concepts and practical understanding with fresh and accessible perspectives.
Distributed training has become indispensable to deep learning, enabling researchers to scale models to billions and even trillions of parameters. Among the key strategies for distributing workloads across multiple GPUs is pipeline parallelism, a technique that splits a model into stages, with each stage assigned to a different device. Like an assembly line in manufacturing, pipeline parallelism allows multiple stages of a model to be processed simultaneously, improving throughput and making large-scale training feasible.
However, pipeline parallelism is not without challenges. A major inefficiency stems from pipeline bubbles, which occur when devices are idle due to sequential dependencies between computation stages. These idle periods limit throughput and waste computational resources, particularly during the warm-up and flush phases of a pipeline. Over the years, several scheduling strategies, such as 1F1B (one-forward-one-backward), have been proposed to reduce these bubbles, but none of them eliminates bubbles entirely.
A recent paper, “Zero Bubble (Almost) Pipeline Parallelism”, introduces scheduling strategies that reduce these bubbles to (almost) zero while preserving synchronous training semantics.
In this post, we demystify the scheduling process, starting with the commonly used 1F1B approach and progressing to the zero-bubble schedules. Through step-by-step derivations and visualizations, we show how these schedules are constructed and highlight their key advantages. Finally, we explore how the insights from these schedules are generalized into an automatic scheduling algorithm. Whether you’re a seasoned ML systems researcher or new to distributed training, this post will provide clarity and context to help you understand the exciting potential of zero-bubble pipeline parallelism.
As deep learning models grow larger and more complex, training them on a single GPU becomes infeasible due to memory and compute constraints. Distributed training solves this problem by splitting the workload across multiple devices. The two most widely used parallelism strategies are data parallelism and model parallelism.
In data parallelism, the dataset is divided into smaller mini-batches, and identical copies of the model are deployed across multiple devices. Each device processes a different mini-batch, computes gradients, and synchronizes the results through an all-reduce operation.
While simple and effective for small-to-medium-sized models, data parallelism struggles with memory-intensive models because it requires each device to store a complete copy of the model. In addition, the cost of communicating gradients grows with the number of devices, creating a bottleneck that can slow down training.
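To make the mechanics concrete, here is a minimal sketch of one data-parallel training step with a manual gradient all-reduce. It assumes `torch.distributed` has already been initialized (for example via `torchrun`), and `model`, `loss_fn`, `batch`, and `optimizer` are placeholders for your own objects:

```python
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch, optimizer):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()                          # gradients from this replica's mini-batch

    world_size = dist.get_world_size()
    for p in model.parameters():             # synchronize gradients across replicas
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size             # average so every replica applies the same update

    optimizer.step()
    optimizer.zero_grad()
```

In practice one would use `torch.nn.parallel.DistributedDataParallel`, which overlaps this communication with the backward pass, but the explicit loop shows exactly where the gradient-communication cost discussed above comes from.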
In model parallelism, the model itself is split across devices, with each device responsible for a subset of the computations. Two common forms of model parallelism are:

- Tensor parallelism, which splits individual layers (for example, large matrix multiplications) across devices, so that each device computes part of every layer's output.
- Pipeline parallelism, which splits the model into sequential stages, with each stage placed on a different device.
Pipeline parallelism is particularly useful for training large models that exceed the memory capacity of a single GPU. By splitting the model into stages and the data into micro-batches, it enables simultaneous execution of forward and backward passes on different micro-batches and model layers. For example, while the first stage processes the forward pass for one micro-batch, the second stage can work on the backward pass for another micro-batch.
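The sketch below illustrates this splitting on a single process, purely to show the data flow; the model, stage count, and micro-batch count are arbitrary examples, and a real pipeline would place each stage on its own GPU and exchange activations with point-to-point communication:

```python
import torch
import torch.nn as nn

def split_into_stages(model: nn.Sequential, num_stages: int):
    """Partition a sequential model into contiguous groups of layers (stages)."""
    layers = list(model)
    per_stage = (len(layers) + num_stages - 1) // num_stages
    return [nn.Sequential(*layers[i:i + per_stage])
            for i in range(0, len(layers), per_stage)]

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 10))
stages = split_into_stages(model, num_stages=2)

batch = torch.randn(32, 64)
micro_batches = batch.chunk(4)      # 4 micro-batches of 8 samples each

outputs = []
for mb in micro_batches:            # each micro-batch flows through the stages in order
    act = mb
    for stage in stages:            # on real hardware, each stage runs on a different GPU
        act = stage(act)
    outputs.append(act)
logits = torch.cat(outputs)
```

With several micro-batches in flight, the second stage can already process micro-batch 0 while the first stage works on micro-batch 1; maximizing exactly this overlap is what pipeline schedules are about.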
However, pipeline parallelism introduces pipeline bubbles, which are idle periods that occur because:

- At the start of each iteration (the warm-up phase), later stages must wait for activations to arrive from earlier stages before they can begin.
- At the end of each iteration (the flush, or cool-down, phase), earlier stages must wait for gradients to arrive from later stages before they can finish.
- In between, the strict ordering between a micro-batch's forward and backward passes limits how tightly operations can be packed.
To mitigate bubbles, practitioners have employed different scheduling strategies. One widely adopted strategy is 1F1B (one-forward-one-backward), which alternates between forward and backward passes to balance workloads across devices.
Before we dive into 1F1B, let’s consider a typical multi-layer perceptron (MLP) and how it is executed in a pipeline parallel setting:
$$ \begin{align*} \pmb{z} &= \mathbf{W}\pmb{x}\\ \pmb{y} &= \sigma(\pmb{z})\\ \end{align*} $$
where $\pmb{x}$ is the input, $\mathbf{W}$ is the weight matrix, $\sigma(\cdot)$ is the activation function, and $\pmb{y}$ is the final output. We use the following notation:

- $F_{i,k}$: the forward pass of micro-batch $i$ on stage $k$.
- $B_{i,k}$: the backward pass of micro-batch $i$ on stage $k$ that computes gradients with respect to the stage's inputs (activations).
- $W_{i,k}$: the backward pass of micro-batch $i$ on stage $k$ that computes gradients with respect to the stage's weights.
The computation graph for this MLP is shown below:
We can see that the backward operation (both $B_{i,k}$ and $W_{i,k}$) must wait for the corresponding forward operation $F_{i,k}$ to complete.
The 1F1B strategy combines the $B_{i,k}$ and $W_{i,k}$ into a single operation, resulting in the following dependency graph:
The dependency graph reveals a wavefront pattern, and 1F1B scheduling follows it in three phases:

- Warm-up: each stage performs forward passes only, until the pipeline is filled.
- Steady state: each stage alternates between one forward pass and one backward pass.
- Cool-down (flush): the remaining backward passes drain out of the pipeline.
The figure below illustrates a 1F1B schedule with four stages (devices) and eight micro-batches. Note the wavefront pattern and the presence of bubbles:
1F1B is popular because it achieves a good balance between memory usage and throughput:

- Each stage holds activations for at most as many micro-batches as there are pipeline stages, so peak memory stays bounded regardless of how many micro-batches are used.
- Once the pipeline is full, every stage alternates forward and backward work with essentially no idle time until the flush begins.
Despite its strengths, 1F1B has inherent limitations:

- Bubbles remain in the warm-up and flush phases, and their relative cost grows as the number of stages increases or the number of micro-batches shrinks.
- Because each backward pass is treated as a single, indivisible operation, there is no flexibility left to rearrange work and fill those bubbles.
In the next section, we will explore how splitting the backward pass into finer-grained components enables new schedules that significantly reduce or eliminate pipeline bubbles.
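Before moving on, the 1F1B timeline can be made concrete with a small simulation. The sketch below is not a training framework: it lays out the classic per-stage 1F1B order (warm-up forwards, alternating forward/backward, cool-down backwards), resolves the cross-stage dependencies, and prints the schedule under the simplifying assumption that every forward and every combined backward takes one time unit:

```python
def one_f_one_b_order(stage, num_stages, num_microbatches):
    """Per-stage op order for 1F1B: warm-up F's, then alternate F/B, then drain B's."""
    warmup = num_stages - stage - 1
    order = [("F", i) for i in range(warmup)]
    f_next, b_next = warmup, 0
    while b_next < num_microbatches:
        if f_next < num_microbatches:
            order.append(("F", f_next)); f_next += 1
        order.append(("B", b_next)); b_next += 1
    return order

def simulate_1f1b(num_stages=4, num_microbatches=8):
    finish = {}                                  # (kind, microbatch, stage) -> finish time
    free = [0] * num_stages                      # next free time slot on each stage
    orders = [one_f_one_b_order(s, num_stages, num_microbatches) for s in range(num_stages)]
    idx = [0] * num_stages
    grid = [dict() for _ in range(num_stages)]   # stage -> {time slot: label}
    remaining = sum(len(o) for o in orders)
    while remaining:
        progressed = False
        for s in range(num_stages):
            if idx[s] == len(orders[s]):
                continue
            kind, mb = orders[s][idx[s]]
            # An F on stage s waits for the F on stage s-1; a B waits for the B on
            # stage s+1 (its own F is earlier in this stage's order, so it is done).
            dep = ("F", mb, s - 1) if kind == "F" else ("B", mb, s + 1)
            if 0 <= dep[2] < num_stages and dep not in finish:
                continue                         # dependency not ready yet
            start = max(free[s], finish.get(dep, 0))
            finish[(kind, mb, s)] = start + 1
            free[s] = start + 1
            grid[s][start] = f"{kind}{mb}"
            idx[s] += 1
            remaining -= 1
            progressed = True
        assert progressed, "schedule deadlocked"
    for s in range(num_stages):
        print(" ".join(grid[s].get(t, "..").rjust(2) for t in range(max(free))))

simulate_1f1b()
```

Running it prints the wavefront from the figure: later stages idle at the beginning while they wait for activations, and earlier stages idle toward the end while they wait for gradients, which is exactly where the bubbles live.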
The 1F1B schedule reduces bubbles to some extent, but it still leaves inefficiencies in the warm-up and flush phases. Remember that the backward pass consists of two components:

- $B$: computing gradients with respect to the inputs (activations), which must be passed back to the previous stage.
- $W$: computing gradients with respect to the weights, which are consumed only by the local optimizer step.
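As a concrete example, take the single-layer MLP from earlier. Given the upstream gradient $\partial L / \partial \pmb{y}$ arriving from the next stage, the two components are:

$$ \begin{align*} \frac{\partial L}{\partial \pmb{z}} &= \frac{\partial L}{\partial \pmb{y}} \odot \sigma'(\pmb{z})\\ B:\quad \frac{\partial L}{\partial \pmb{x}} &= \mathbf{W}^\top \frac{\partial L}{\partial \pmb{z}}\\ W:\quad \frac{\partial L}{\partial \mathbf{W}} &= \frac{\partial L}{\partial \pmb{z}}\, \pmb{x}^\top \end{align*} $$

Only $\partial L / \partial \pmb{x}$ must be produced promptly, because the previous stage is waiting for it; $\partial L / \partial \mathbf{W}$ is consumed only by the local optimizer step and can safely be deferred.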
However, in 1F1B, $B$ and $W$ are grouped into a single operation, which creates an artificial ordering constraint: $W_{i-1,k}$ must finish before $B_{i,k}$ can start, even though nothing downstream needs $W_{i-1,k}$ that early.
The key idea behind zero-bubble pipeline is to use finer-grained scheduling: splitting the backward pass into $B$ and $W$, which can be scheduled independently. This results in a refined dependency graph with more flexibility in operation placement:
Unlike $F$ and $B$, which must respect sequential dependencies across stages, $W$ can be flexibly scheduled to fill pipeline bubbles, as long as it runs after the corresponding $B$ on the same stage.
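This independence is easy to see even outside a pipeline. The toy snippet below (an illustration, not the paper's implementation) uses `torch.autograd.grad` to compute the input gradient and the weight gradient in two separate calls, mirroring the $B$/$W$ split:

```python
import torch

W = torch.randn(4, 4, requires_grad=True)
x = torch.randn(4, requires_grad=True)

y = torch.sigmoid(W @ x)   # the MLP stage from earlier: z = Wx, y = sigma(z)
loss = y.sum()             # stand-in for whatever the downstream stages compute

# B: gradient w.r.t. the input, which the previous stage is waiting for.
(grad_x,) = torch.autograd.grad(loss, x, retain_graph=True)

# W: gradient w.r.t. the weights, needed only by the local optimizer step,
# so it can be computed later to fill an otherwise idle slot.
(grad_W,) = torch.autograd.grad(loss, W)
```

In a real pipeline stage, the gradient arriving from the next stage plays the role of the toy `loss` here, but the separation into two calls is the same idea.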
Now, starting from the 1F1B schedule, let’s split the backward passes into $B$ and $W$. For simplicity, we assume that the execution times for $F$, $B$, and $W$ are identical.
Thanks to the finer granularity, we can shift all operations on every stage except the last one to the left by one time step.
We will see how the authors build on the above schedule by strategically placing operations, leading to two handcrafted schedules, ZB-H1 and ZB-H2, which respectively reduce and eliminate bubbles.
ZB-H1 adjusts the starting points of $W$ passes, filling the tail-end bubbles with $W$ passes without exceeding the memory usage of 1F1B. As a result, the bubble size is reduced to approximately one-third of that in 1F1B.
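A rough accounting makes the one-third figure concrete. With $p$ stages, the idle time per device in 1F1B is approximately $(p-1)(T_F + T_B + T_W)$, while ZB-H1 fills the tail with $W$ passes and leaves roughly $(p-1)(T_F + T_B - T_W)$. Under the simplifying assumption $T_F = T_B = T_W = T$:

$$ \begin{align*} \text{1F1B:}\quad &(p-1)(T_F + T_B + T_W) = 3(p-1)T\\ \text{ZB-H1:}\quad &(p-1)(T_F + T_B - T_W) = (p-1)T \end{align*} $$

so ZB-H1 keeps roughly one-third of 1F1B's bubble.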
To better understand ZB-H1 in action, we will derive it step by step. First, let’s remove all $W$ operations from our starting point:
In order to eliminate bubbles in the flush phase, we shift the last three stages to the left by one time step:
To ensure the design of ZB-H1 meets its goals, we must carefully balance memory usage and computational dependencies while satisfying the following principles:

- Every operation starts only after all of its dependencies have completed.
- Each stage holds activations for no more micro-batches than it would under 1F1B, so peak memory does not increase.
- $W$ passes are deferred just enough to occupy slots that would otherwise be bubbles.
Next, we shift all operations further to the left, as far as the dependencies and the memory constraint allow:
We are almost there! Now, let’s reintroduce the $W$ operations and label their micro-batches correctly. We also extend the schedule to show the start of the next steady state phase:
ZB-H1 significantly reduces bubbles without increasing memory usage beyond 1F1B. However, it does not entirely eliminate bubbles, leaving some inefficiencies in the warm-up phase.
If we are willing to slightly relax the memory constraint, we can achieve zero bubbles by adding extra $F$ passes during the warm-up phase and reordering $W$ passes in the flush phase. This adjustment results in ZB-H2, which adopts a “parallelogram-shaped” layout that completely eliminates idle periods.
As before, we begin by removing all $W$ operations to allow flexibility in rearranging the other operations:
We fill the warm-up phase with forward operations shifted from later time steps and move the remaining forward operations to the left accordingly:
Now comes the crucial part: we carefully shift the steady-state operations to the left while respecting the sequential dependencies between $F$ and $B$ operations. Moreover, we occasionally reserve slots at earlier time steps for $W$ operations to keep memory usage balanced across stages.
Finally, let’s put $W$ operations back in place. Note how the pipeline achieves zero bubbles by transitioning from a trapezoidal layout (1F1B and ZB-H1) to a parallelogram layout:
ZB-H2 eliminates pipeline bubbles entirely and maximizes throughput. However, it requires additional memory due to:

- The extra $F$ passes in the warm-up phase, which keep the activations of more micro-batches in memory at the same time.
- The deferred $W$ passes, which keep the state needed for weight-gradient computation alive for longer.
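As a back-of-the-envelope comparison (consistent with the memory numbers reported in the experiments below): if the activations of one micro-batch occupy roughly $M_B$ on a stage, 1F1B keeps at most about $p\,M_B$ in flight on the busiest stage, whereas the parallelogram layout of ZB-H2 keeps roughly twice as many micro-batches in flight:

$$ \frac{M_{\text{ZB-H2}}}{M_{\text{1F1B}}} \approx \frac{2p\,M_B}{p\,M_B} = 2 $$

This is the memory-for-throughput trade-off that reappears in the ZB-2p results later.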
While the handcrafted schedules ZB-H1 and ZB-H2 demonstrate the power of finer-grained scheduling, they rely on idealized assumptions, such as identical execution times for $F$, $B$, and $W$. Real-world models, however, often face variable execution times, heterogeneous hardware, and different memory constraints. To address this, the authors propose an automatic zero-bubble scheduling algorithm that generalizes the handcrafted designs: given measured execution times and a memory limit, it produces a schedule tailored to the actual workload.
Denote the running times of $F$, $B$, and $W$ as $T_F$, $T_B$, and $T_W$, respectively. Drawing from the derivation of ZB-H2, the automatic scheduler adjusts the placement of operations based on the following principles:

- During warm-up, schedule as many $F$ passes as the memory limit allows, so that each stage can begin its first $B$ as early as possible.
- After warm-up, interleave $F$ and $B$ in a 1F1B-like pattern so that every stage stays supplied with backward work.
- Use $W$ passes to fill slots that would otherwise be bubbles, or to release memory when the limit prevents another $F$ from being scheduled.
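The sketch below is a toy greedy scheduler in this spirit; it is not the paper's algorithm. At every unit time step, each idle stage runs the most urgent ready operation: a $B$ whose upstream gradient has arrived, otherwise the next $F$ if memory allows, otherwise a deferred $W$ to fill the slot (`mem_limit` caps how many micro-batches' activations a stage may hold):

```python
def greedy_schedule(num_stages=4, num_microbatches=8, mem_limit=4):
    done = set()                                    # finished ops: (kind, microbatch, stage)
    in_flight = [0] * num_stages                    # micro-batches whose activations are held
    next_f = [0] * num_stages                       # next micro-batch to forward on each stage
    pending_b = [set() for _ in range(num_stages)]  # F done, B not yet run
    pending_w = [set() for _ in range(num_stages)]  # B done, W not yet run
    grid = [[] for _ in range(num_stages)]

    def pick(s):
        for mb in sorted(pending_b[s]):             # 1) a B whose upstream gradient arrived
            if s == num_stages - 1 or ("B", mb, s + 1) in done:
                return ("B", mb)
        mb = next_f[s]                              # 2) the next F, if its input and memory allow
        if (mb < num_microbatches and in_flight[s] < mem_limit
                and (s == 0 or ("F", mb, s - 1) in done)):
            return ("F", mb)
        if pending_w[s]:                            # 3) fill the slot with a deferred W
            return ("W", min(pending_w[s]))
        return None                                 # a genuine bubble

    while len(done) < 3 * num_stages * num_microbatches:
        chosen = [pick(s) for s in range(num_stages)]   # decide first, then commit,
        for s, op in enumerate(chosen):                 # so all stages act simultaneously
            if op is None:
                grid[s].append(" .")
                continue
            kind, mb = op
            done.add((kind, mb, s))
            grid[s].append(f"{kind}{mb}")
            if kind == "F":
                next_f[s] += 1; in_flight[s] += 1; pending_b[s].add(mb)
            elif kind == "B":
                pending_b[s].remove(mb); pending_w[s].add(mb)
            else:
                pending_w[s].remove(mb); in_flight[s] -= 1
        assert any(chosen), "scheduler stalled"
    for row in grid:
        print(" ".join(cell.rjust(2) for cell in row))

greedy_schedule()              # memory budget comparable to 1F1B (mem_limit = num_stages)
greedy_schedule(mem_limit=8)   # a larger budget lets the warm-up pack in more F passes
```

The paper's automatic scheduler works with the actual, unequal execution times and per-stage memory limits; the toy version only shows how deferred $W$ passes absorb slots that would otherwise be bubbles.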
Moreover, the previous analysis ignored the communication time needed to transfer activations and gradients between stages; the automatic scheduler accounts for this cost when placing operations. The authors also address optimizer synchronization while maintaining synchronous semantics through an optimistic approach. It assumes that most training iterations proceed without numerical issues and that most global synchronization steps (such as gradient-norm or NaN checks) have no effect on the update. Instead of synchronizing before every optimizer step, the authors propose a post-hoc update validation: each stage applies its optimizer step without waiting, the globally reduced state is validated afterwards, and in the rare case the validation fails, the update is rolled back and redone.
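As a rough illustration (a sketch under our own simplifying assumptions, not the paper's implementation), the step below applies the optimizer update immediately, checks the globally reduced gradient norm afterwards, and rolls back if the check fails so the caller can redo the step, for example with properly clipped gradients:

```python
import torch
import torch.distributed as dist

def optimistic_step(model, optimizer, max_grad_norm=1.0):
    # Snapshot parameters so the update can be undone. (Cloning is simple but
    # memory-hungry; a real implementation would reverse the update instead.)
    snapshot = [p.detach().clone() for p in model.parameters()]

    local_sq = sum((p.grad.detach() ** 2).sum()
                   for p in model.parameters() if p.grad is not None)

    optimizer.step()                             # apply the update without waiting

    # Validate the global gradient state after the fact.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM)
    global_norm = local_sq.sqrt()
    valid = bool(torch.isfinite(global_norm)) and float(global_norm) <= max_grad_norm
    if not valid:
        with torch.no_grad():                    # roll back; the caller redoes the step
            for p, old in zip(model.parameters(), snapshot):
                p.copy_(old)
    return valid
```

Because gradient clipping and numerical issues are rare in a healthy run, the rollback path is almost never taken, so the synchronization drops out of the critical path in the common case.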
The authors conducted comprehensive experiments to evaluate the performance of their zero-bubble scheduling strategies, comparing the handcrafted and automatically generated schedules against baseline methods. The results demonstrate that zero-bubble scheduling consistently outperforms traditional approaches in both throughput and efficiency, with trade-offs between memory usage and performance.
The experiments were implemented in the open-source Megatron-LM framework, trained GPT-3-like models, and ran on up to 32 NVIDIA A100 GPUs distributed across 4 nodes.
The authors compared the following scheduling strategies:

- 1F1B: the standard one-forward-one-backward baseline.
- 1F1B-I: the interleaved variant of 1F1B.
- ZB-1p: the automatically generated zero-bubble schedule with an activation-memory budget matching that of 1F1B.
- ZB-2p: the automatically generated schedule with roughly twice the activation-memory budget of 1F1B.
Metrics evaluated include:

- Throughput: training samples processed per unit time.
- Bubble rate: the fraction of pipeline time spent idle.
- Peak memory usage per device.
ZB-2p consistently outperformed all other methods across various configurations, achieving throughput improvements of up to 30% compared to 1F1B, even when using fewer micro-batches.
ZB-1p performed comparably to 1F1B-I in single-node setups but outperformed it in multi-node setups where communication bandwidth was a bottleneck. Its ability to reduce pipeline bubbles without communication overhead was a key advantage.
ZB-2p achieved a bubble rate of less than 1% in most setups. ZB-2p’s bubble rate was consistently lower than ZB-H2, showing the effectiveness of the automatic scheduling algorithm.
ZB-1p's bubble rate was comparable to that of ZB-H1; under the 1F1B memory budget, memory rather than scheduling becomes the dominant factor limiting further improvement.
ZB-2p achieved the best throughput but required roughly twice the memory of 1F1B, making it best suited to memory-rich setups.
ZB-1p matched the memory usage of 1F1B while achieving significant throughput gains, making it the more practical option when memory is limited.
Although zero-bubble scheduling shows promising results, we have identified several limitations.
The handcrafted schedules ZB-H1 and ZB-H2 assume that the forward pass ($F$), backward pass for inputs ($B$), and backward pass for weights ($W$) have identical execution times. In practice, these times can vary significantly across layers and stages and can introduce additional bubbles.
The automatic scheduling algorithm can struggle with highly heterogeneous device latencies or bandwidths. For example, devices with slower interconnects (e.g., PCIe instead of NVLink) or highly distributed setups (e.g., across multiple servers) can cause bottlenecks.
The zero-bubble scheduling strategies assume a synchronous training setup, where all pipeline stages must remain in sync. This design ensures exact optimization semantics but limits applicability in asynchronous environments.
Pipeline bubbles have long been a limiting factor in distributed training, reducing throughput and leaving computational resources under-utilized. The zero-bubble pipeline scheduling strategies presented in the paper mark a significant step forward, achieving higher throughput while maintaining synchronous training semantics.
We summarize the key contributions of zero-bubble scheduling:

- Splitting the backward pass into input-gradient ($B$) and weight-gradient ($W$) computations, exposing new scheduling freedom.
- Two handcrafted schedules, ZB-H1 and ZB-H2, which respectively reduce bubbles within 1F1B's memory budget and eliminate them at the cost of extra memory.
- An automatic scheduling algorithm that adapts to real execution times, communication costs, and memory limits.
- A post-hoc update validation that removes optimizer-step synchronizations while preserving synchronous training semantics.
- Throughput improvements of up to 30% over 1F1B in large-scale experiments.
While zero-bubble scheduling has demonstrated significant potential, we have identified several avenues for future research.
Adapting the zero-bubble approach to asynchronous training settings could further improve scalability by eliminating synchronization requirements. This would require addressing challenges in managing dependencies and ensuring consistent optimization semantics. In addition, future work could focus on extending the automatic scheduler to handle highly heterogeneous environments, such as clusters with varying device speeds, memory capacities, and interconnect bandwidths.
Investigating dynamic scheduling techniques that adapt in real time to changing workloads or hardware conditions could further improve training efficiency. As hybrid parallelism strategies become common, integrating zero-bubble scheduling more tightly with data and tensor parallelism could lead to greater performance benefits.
Zero-bubble pipeline scheduling represents a significant advance in distributed training, demonstrating how finer-grained scheduling can boost throughput and resource utilization. This blog post builds on these ideas by providing detailed visualizations, contextual insights, and step-by-step clarity, making this complex topic more accessible. We hope these contributions help others better understand and apply zero-bubble scheduling, sparking more innovation in scalable deep learning systems.