Correlating Branch Prediction with Two-Level Predictors

Introduction to Branch Prediction

In modern pipelined processors, branch instructions pose a significant challenge to maintaining high instruction throughput. When a branch instruction is encountered, the processor must decide which instruction to fetch next before the branch condition is resolved. Branch Prediction is a critical technique that attempts to guess the outcome of branch instructions before they are actually executed, enabling the processor to continue fetching and executing instructions speculatively.

The impact of branch prediction on performance is substantial:

  • Modern processors can have pipelines with 10-20+ stages
  • Branch mispredictions can cause pipeline flushes, wasting 10-20+ cycles
  • Branches occur frequently in programs (every 4-7 instructions on average)
  • Accurate branch prediction is essential for maintaining high IPC (Instructions Per Cycle)

Evolution of Branch Prediction

Simple Predictors

  1. Static Prediction

    • Always predicts taken or always predicts not-taken
    • Compiler hints can improve accuracy
    • Limited effectiveness (~60-70% accuracy)
  2. 1-bit Predictor

    • Remembers the last outcome of each branch
    • Suffers from the ping-pong effect in loops (two mispredictions per loop execution: one at the exit, one on re-entry)
    • Accuracy limited by simple state representation
  3. 2-bit Saturating Counter (Bimodal)

    • Uses a 2-bit saturating counter for each branch
    • Provides hysteresis to avoid ping-pong effect
    • Better accuracy (~80-85%) but still limited for complex patterns

Correlating Branch Predictors

Correlating Branch Predictors (also known as Two-Level Predictors) represent a significant advancement in branch prediction technology. They are based on the key observation that the behavior of one branch is often correlated with the behavior of other recently executed branches.

Key Insights

  1. Branch Correlation: The outcome of a branch often depends on the outcomes of previous branches
  2. Pattern Recognition: Complex control flow patterns can be captured by considering branch history
  3. Context Sensitivity: The same branch may behave differently in different execution contexts

Architecture of Correlating Predictors

A correlating predictor consists of two main components:

  1. Global History Register (GHR)

    • Tracks the outcomes of the most recently executed branches
    • Typically 4-16 bits wide
    • Shifted left on each branch, with new outcome inserted at LSB
    • Provides context for prediction
  2. Pattern History Table (PHT)

    • Array of 2-bit saturating counters
    • Indexed by the global history register
    • Each entry predicts taken/not-taken based on the specific history pattern
    • Size = 2^(history_bits) entries
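The two components above can be sketched as plain data structures. A minimal Python sketch, assuming a 4-bit history (the width and all names here are illustrative choices, not fixed by the technique):

```python
# Minimal sketch of the two structures: a 4-bit GHR (width is our choice)
# and a PHT with one 2-bit counter per possible history pattern.

GHR_BITS = 4                  # history register width (illustrative)
PHT_SIZE = 2 ** GHR_BITS      # Size = 2^(history_bits) entries

ghr = 0                       # global history register, starts empty
pht = [1] * PHT_SIZE          # 2-bit counters, init weakly not-taken (01)
```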

Two-Level Predictor Operation

Prediction Process

  1. Index Calculation: Use current global history as index into PHT
  2. Counter Reading: Read 2-bit counter from PHT[GHR]
  3. Prediction: Predict taken if counter ≥ 2, not-taken if counter < 2
  4. Speculation: Continue fetching based on prediction

Update Process

  1. Outcome Resolution: Determine actual branch outcome
  2. Counter Update: Update PHT[old_GHR] based on actual outcome
    • Increment (max 3) if taken
    • Decrement (min 0) if not-taken
  3. History Update: Shift GHR and insert actual outcome
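The prediction and update steps above fit together into a short model. The following Python sketch is a behavioral illustration rather than a hardware description; the 4-bit history width and all names are our own choices:

```python
# Sketch of the two-level predict/update cycle described above.
# Assumptions: 4-bit global history, counters initialized weakly not-taken.

GHR_BITS = 4
GHR_MASK = (1 << GHR_BITS) - 1

class TwoLevelPredictor:
    def __init__(self):
        self.ghr = 0                          # global history register
        self.pht = [1] * (1 << GHR_BITS)      # pattern history table

    def predict(self):
        # Steps 1-3: index the PHT with the GHR; taken iff counter >= 2
        return self.pht[self.ghr] >= 2

    def update(self, taken):
        # Step 2: saturating increment (max 3) / decrement (min 0)
        c = self.pht[self.ghr]
        self.pht[self.ghr] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Step 3: shift the GHR left, insert the actual outcome at the LSB
        self.ghr = ((self.ghr << 1) | int(taken)) & GHR_MASK
```

Note that the counter update targets PHT[old GHR], i.e. the history that was in effect when the prediction was made, before the new outcome is shifted in.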

Saturating Counter States

The 2-bit saturating counters used in the PHT have four states:

  • 00 (Strongly Not-Taken): Predicts not-taken, requires two consecutive taken branches to change prediction
  • 01 (Weakly Not-Taken): Predicts not-taken, one taken branch changes to weakly taken
  • 10 (Weakly Taken): Predicts taken, one not-taken branch changes to weakly not-taken
  • 11 (Strongly Taken): Predicts taken, requires two consecutive not-taken branches to change prediction

This hysteresis prevents prediction oscillation in the presence of occasional mispredictions.
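The four-state walk can be traced directly. A small Python check of the transitions listed above (the `step` helper is ours):

```python
# Walk one 2-bit saturating counter through the states listed above.

def step(counter, taken):
    # saturating increment on taken, saturating decrement on not-taken
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

c = 0                 # 00: strongly not-taken
c = step(c, True)     # one taken branch -> 01, still predicts not-taken
assert c == 1
c = step(c, True)     # second taken branch -> 10, prediction flips to taken
assert c == 2
c = step(c, False)    # a single not-taken -> 01, back to weakly not-taken
assert c == 1
```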

Design Considerations

History Length Selection

Short History (1-4 bits):

  • Advantages: Small PHT, fast access, low cost
  • Disadvantages: Limited pattern recognition, higher aliasing

Long History (8-16 bits):

  • Advantages: Complex pattern recognition, lower aliasing
  • Disadvantages: Large PHT, slower access, higher cost
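The exponential growth behind this tradeoff is easy to quantify. A quick Python check of pure PHT storage at 2 bits per entry (the helper name is ours; real predictors add further structures on top):

```python
# Storage cost of a history-indexed PHT with 2-bit counters.

def pht_bytes(history_bits):
    entries = 2 ** history_bits    # Size = 2^(history_bits)
    return entries * 2 // 8        # 2 bits per entry

sizes = {bits: pht_bytes(bits) for bits in (4, 8, 16)}
print(sizes)   # {4: 4, 8: 64, 16: 16384} -> 16 bits of history needs 16 KiB
```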

Aliasing Effects

Aliasing occurs when different branch patterns map to the same PHT entry, causing interference:

  • Constructive Aliasing: Different patterns with similar behavior share entries efficiently
  • Destructive Aliasing: Conflicting patterns reduce prediction accuracy
  • Mitigation: Larger PHT, better hash functions, or tagged entries
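Destructive aliasing is easy to reproduce in miniature. In the deliberately contrived Python sketch below, two opposite-biased branches share a single 2-bit counter, and every single prediction misses:

```python
# Two branches that alias to the same PHT counter: A is always taken,
# B is never taken. The shared counter oscillates between 01 and 10,
# so neither branch is ever predicted correctly.

counter = 1          # shared entry, weakly not-taken
mispredicts = accesses = 0
for _ in range(100):
    for taken in (True, False):          # branch A, then branch B
        accesses += 1
        if (counter >= 2) != taken:
            mispredicts += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)

print(mispredicts, accesses)   # 200 200: every access mispredicts
```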

Performance Analysis

Accuracy Metrics

  • Prediction Accuracy: Percentage of correct predictions
  • Misprediction Rate: Percentage of incorrect predictions
  • MPKI: Mispredictions per 1000 instructions
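The three metrics relate as follows; a quick Python illustration with invented numbers (none of these figures come from a real measurement):

```python
# Relating the three metrics for a hypothetical run.

instructions   = 1_000_000   # invented
branches       = 200_000     # invented: one branch per 5 instructions
mispredictions = 8_000       # invented

accuracy  = 1 - mispredictions / branches          # fraction correct
miss_rate = mispredictions / branches              # fraction incorrect
mpki      = mispredictions / instructions * 1000   # per 1000 instructions

print(accuracy, miss_rate, mpki)   # ~0.96, ~0.04, ~8.0
```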

Factors Affecting Performance

  1. Program Characteristics

    • Branch frequency and distribution
    • Control flow complexity
    • Loop structures and nesting
  2. Predictor Configuration

    • History register length
    • PHT size and organization
    • Counter initialization
  3. Workload Behavior

    • Predictable vs. unpredictable branches
    • Regular vs. irregular patterns
    • Static vs. dynamic behavior

Advanced Correlating Predictors

Local vs. Global History

  • Global History: Uses outcomes of all recent branches
  • Local History: Uses outcomes of the same branch
  • Hybrid: Combines both global and local information

Gshare Predictor

An efficient implementation that XORs the program counter with global history:

  • Index = PC ⊕ GHR
  • Provides better distribution and reduces aliasing
  • Widely used in practice
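The index computation is a one-liner. A sketch with a 1024-entry table (the table size and the 4-byte-alignment assumption are our choices):

```python
# Gshare index: XOR of PC bits with the global history, masked to the table.

INDEX_BITS = 10
INDEX_MASK = (1 << INDEX_BITS) - 1

def gshare_index(pc, ghr):
    # drop the low 2 PC bits, assuming 4-byte-aligned instructions
    return ((pc >> 2) ^ ghr) & INDEX_MASK

# Two nearby branches seen with the same history land in different
# entries, unlike a predictor indexed by history alone:
print(gshare_index(0x1000, 42), gshare_index(0x1004, 42))   # 42 43
```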

Tournament Predictors

Combine multiple prediction mechanisms:

  • Use a meta-predictor to choose between different predictors
  • Adapt to different program phases
  • Example: the Alpha 21264, whose chooser selects between local and global two-level predictors

Real-World Implementation

Hardware Considerations

  1. Access Time: Prediction must complete in one cycle
  2. Power Consumption: Large tables consume significant power
  3. Area Cost: PHT size affects chip area
  4. Update Bandwidth: Multiple branches per cycle require parallel updates

Modern Processors

  • Intel Core: Uses tournament predictors with local and global components
  • AMD Zen: Implements perceptron-based predictors
  • ARM Cortex: Uses hybrid predictors with multiple components
  • RISC-V: Various implementations from simple to sophisticated

Applications and Examples

Loop Prediction

Consider a simple counting loop:

    MOV R1, 0
    MOV R2, 100
    LOOP:
        ; loop body
        ADD R1, R1, 1
        BNE R1, R2, LOOP   ; branch taken 99 times, not-taken once

A correlating predictor can learn this pattern and, when the history register is long enough to span the loop's iteration pattern, predict accurately even for the final iteration. For large trip counts such as 100, a global history cannot count that far, so real designs handle such loops with longer histories, local-history counters, or dedicated loop predictors.
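This is easy to check in simulation. The Python sketch below shrinks the loop to 4 iterations so that a 4-bit global history can span one full trip through it (the sizes and names are our choices); after warm-up the predictor is correct on every iteration, including the exit:

```python
# Two-level predictor on a repeating short-loop pattern: BNE taken three
# times, then not-taken once at loop exit. A 4-bit GHR uniquely identifies
# each position in the pattern, so trained counters never miss.

class TwoLevelPredictor:
    def __init__(self, bits=4):
        self.mask = (1 << bits) - 1
        self.ghr = 0
        self.pht = [1] * (1 << bits)

    def predict(self):
        return self.pht[self.ghr] >= 2

    def update(self, taken):
        c = self.pht[self.ghr]
        self.pht[self.ghr] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

pattern = [True, True, True, False]    # one trip through a 4-iteration loop

p = TwoLevelPredictor()
for outcome in pattern * 8:            # warm-up: learn the pattern
    p.update(outcome)

misses = 0
for outcome in pattern * 8:            # measure after training
    if p.predict() != outcome:
        misses += 1
    p.update(outcome)

print(misses)   # 0: even the exit branch is predicted correctly
```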

Conditional Patterns

Complex conditional structures benefit from correlation:

    if (a > 0) {          // Branch B1
        if (b > 0) {      // Branch B2
            // code
        }
    }

The outcome of B2 is correlated with B1: if B1 is not-taken, B2 is never executed.
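A small simulation makes the benefit concrete. Below, B2's outcome simply repeats B1's (a contrived stand-in for the code above): a predictor with one bit of global history learns B2 almost perfectly, while a history-blind 2-bit counter hovers near 50% misses (all names and numbers are ours):

```python
import random

random.seed(0)   # fixed seed so the run is repeatable

pht = [1, 1]     # correlating: counters indexed by the last outcome (1-bit GHR)
bimodal = 1      # history-blind: a single counter for B2, no context

hist_misses = bimodal_misses = 0
for _ in range(1000):
    b1 = random.random() < 0.5   # B1: effectively a coin flip
    b2 = b1                      # B2: perfectly correlated with B1
    ghr = int(b1)                # history after B1 resolves

    # correlating prediction for B2, indexed by B1's outcome
    if (pht[ghr] >= 2) != b2:
        hist_misses += 1
    pht[ghr] = min(pht[ghr] + 1, 3) if b2 else max(pht[ghr] - 1, 0)

    # history-blind prediction for B2
    if (bimodal >= 2) != b2:
        bimodal_misses += 1
    bimodal = min(bimodal + 1, 3) if b2 else max(bimodal - 1, 0)

print(hist_misses, bimodal_misses)   # history-based misses stay in single digits
```

Each history-indexed counter sees only one outcome (pht[1] only taken, pht[0] only not-taken), so after a single training event per entry the correlating side never misses again; the bimodal counter keeps chasing a 50/50 sequence.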

Limitations and Challenges

  1. Cold Start: No history available for new branches
  2. Capacity Limits: Finite PHT size limits pattern storage
  3. Interference: Aliasing reduces effectiveness
  4. Training Time: Predictors need time to learn patterns
  5. Context Switches: History may become irrelevant after context changes

Conclusion

Correlating branch predictors represent a significant advancement in processor design, enabling high-performance instruction pipelines by accurately predicting complex branch behavior. Understanding their operation, benefits, and limitations is crucial for computer architects and performance engineers working on modern processor designs.

The key insight that branch outcomes are often correlated has led to a family of sophisticated predictors that form the foundation of branch prediction in contemporary processors, directly impacting the performance of virtually all modern computing systems.