Dynamic Scheduling with Tomasulo Algorithm
Theory
Introduction to Dynamic Scheduling
In modern processors, instruction throughput can be significantly improved by executing instructions out of their original program order. Dynamic Scheduling is a technique that allows the processor to rearrange instruction execution at runtime to minimize pipeline stalls and maximize resource utilization.
The main challenges in out-of-order execution are:
- True Dependencies (RAW): A later instruction needs the result of an earlier instruction
- Anti-dependencies (WAR): A later instruction writes to a register that an earlier instruction reads
- Output Dependencies (WAW): Two instructions write to the same register
The Tomasulo Algorithm
The Tomasulo Algorithm, developed by Robert Tomasulo at IBM in 1967 for the IBM 360/91, was one of the first dynamic scheduling algorithms. It enables out-of-order execution while maintaining program correctness through several key innovations:
Key Components
Reservation Stations
- Buffer instructions waiting for execution
- Store operand values or tags indicating where values will come from
- Enable register renaming to eliminate false dependencies
- Different types for different functional units (e.g., Add/Sub, Multiply/Divide, Load/Store)
Common Data Bus (CDB)
- Broadcasts results to all reservation stations simultaneously
- Updates waiting instructions with computed values
- Maintains data consistency across the system
Register Alias Table (RAT)
- Maps architectural registers to reservation station tags
- Implements register renaming
- Tracks which reservation station will produce each register's value
Instruction Processing Stages
Issue Stage
- Decode instruction and check for available reservation station
- If structural hazard exists (no free reservation station), stall
- Allocate reservation station and update register alias table
- Read available operands or record tags for unavailable ones
Execute Stage
- Monitor common data bus for operand values
- When all operands are available, begin execution
- Execute instruction in the appropriate functional unit
- Handle execution latencies (different for different operations)
Write-back Stage
- Broadcast result on common data bus
- Update all waiting instructions with the result
- Free the reservation station
- Update register file if this instruction produces the current value
Register Renaming Process
The Tomasulo algorithm implements register renaming through reservation stations:
- Eliminate WAR Hazards: Instructions read operand values early (during issue) or get them from CDB
- Eliminate WAW Hazards: Only the most recent instruction to write a register updates the register alias table
- Preserve RAW Dependencies: Instructions wait for their operands to be produced by earlier instructions
Functional Units
Typical implementation includes multiple functional units:
Integer Units
- ADD/SUB operations
- Fast execution (1-2 cycles)
- Multiple units for parallel execution
Floating-Point Units
- FADD/FSUB operations
- Moderate execution latency (3-4 cycles)
- Pipelined for high throughput
Multiply/Divide Units
- FMUL/FDIV operations
- High execution latency (10-20+ cycles)
- May be non-pipelined (especially divide)
Load/Store Units
- Memory access operations
- Variable latency depending on cache hits/misses
- Address calculation and memory hierarchy interaction
Instruction Dispatch and Execution
Reservation Station States
Each reservation station can be in one of several states:
- Free: Available for new instruction
- Waiting: Instruction issued but waiting for operands
- Ready: All operands available, ready for execution
- Executing: Currently executing in functional unit
- Completed: Execution finished, waiting to write back
Operand Handling
Operands in reservation stations can be:
- Value: Actual data value available immediately
- Tag: Reference to reservation station that will produce the value
- Register: Direct register reference (for architectural state)
Hazard Resolution
Structural Hazards: Resolved by stalling issue when no reservation stations are available
Data Hazards:
- RAW: Resolved by waiting for operand values on CDB
- WAR: Eliminated through early operand reading and register renaming
- WAW: Resolved through register renaming (latest write wins)
Performance Implications
Advantages:
- Enables out-of-order execution without complex compiler analysis
- Dynamically adapts to runtime dependencies
- High instruction throughput when sufficient resources available
- Elegant handling of variable-latency operations
Disadvantages:
- Complex hardware implementation
- Requires significant silicon area for reservation stations
- Broadcast bus (CDB) can become bandwidth bottleneck
- Limited by number of reservation stations
Modern Implementations
The Tomasulo algorithm forms the foundation for modern superscalar processors:
- Register Renaming: Extended with physical register files
- Multiple Issue: Combined with multiple instruction issue per cycle
- Speculative Execution: Enhanced with branch prediction and speculation
- Memory Disambiguation: Advanced load/store queue management
Understanding Tomasulo is essential for:
- Computer architecture design
- Performance optimization
- Compiler development
- Parallel programming concepts