Directory-Based Cache Coherence Protocol Simulator

Introduction to Directory-Based Cache Coherence

In large-scale multiprocessor systems, maintaining cache coherence becomes increasingly challenging as the number of processors grows. Traditional bus-based snooping protocols such as MSI do not scale well beyond roughly 16-32 processors due to bandwidth limitations and bus arbitration overhead. Directory-based cache coherence protocols address this scalability problem by maintaining coherence information in a distributed manner.

The Scalability Problem

Bus-Based Limitations

  • Bus contention: All processors compete for a shared bus
  • Broadcast overhead: Every cache miss generates broadcasts to all processors
  • Bandwidth bottleneck: Bus bandwidth doesn't scale with processor count
  • Electrical limitations: Physical bus length limits the number of processors

Directory-Based Solution

Directory-based protocols maintain coherence state information for each memory block in a centralized or distributed directory. Instead of broadcasting to all processors, requests are sent only to processors that actually cache the data.

Directory Structure

Basic Directory Entry

Each memory block has an associated directory entry containing:

          Directory Entry = {
              State: [Uncached | Shared | Exclusive]
              Sharer Vector: [Bit vector indicating which processors have copies]
              Owner: [Processor ID for exclusive blocks]
          }
          
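As a concrete illustration, a directory entry can be modeled in Python (a minimal sketch; the class and method names are hypothetical, and a full bit-vector sharer list is assumed):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DirState(Enum):
    UNCACHED = "U"
    SHARED = "S"
    EXCLUSIVE = "E"

@dataclass
class DirectoryEntry:
    """Per-block coherence record using a full bit-vector sharer list."""
    state: DirState = DirState.UNCACHED
    sharers: int = 0              # bit i set => processor i holds a copy
    owner: Optional[int] = None   # meaningful only in the Exclusive state

    def add_sharer(self, pid: int) -> None:
        self.sharers |= 1 << pid

    def remove_sharer(self, pid: int) -> None:
        self.sharers &= ~(1 << pid)

entry = DirectoryEntry(state=DirState.SHARED)
entry.add_sharer(0)
entry.add_sharer(1)
print(f"{entry.sharers:04b}")  # 0011: processors 0 and 1 share the block
```

A bit vector keeps sharer lookup and update O(1), at the cost of one bit per processor per block; alternatives are discussed under Design Considerations below.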

Directory States

  1. Uncached (U)

    • No processor currently caches this memory block
    • Memory holds the valid data
    • No coherence actions needed
  2. Shared (S)

    • One or more processors have read-only copies
    • Memory holds the valid data
    • Sharer vector indicates which processors have copies
  3. Exclusive (E)

    • Exactly one processor has the block and may have modified it
    • That processor (owner) holds the most recent data
    • Memory may be stale

Processor Cache States

The per-processor cache states mirror the MSI protocol, with directory-specific semantics:

  1. Invalid (I)

    • Cache line is not present or invalid
    • Any access requires directory lookup
  2. Shared (S)

    • Cache line is valid and unmodified
    • Other processors may also have shared copies
    • Memory is up-to-date
  3. Modified (M)

    • Cache line is valid and may have been written (dirty)
    • This processor is the exclusive owner
    • Memory may be stale; this processor must supply data on requests

Directory-Based Protocol Operations

Read Miss Handling

When processor P wants to read block X:

  1. Check Local Cache: If cache hit, return data
  2. Directory Lookup: Send read request to directory
  3. Directory Actions:
    • If Uncached: Send data from memory, set state to Shared, set P's bit in sharer vector
    • If Shared: Send data from memory, add P to sharer vector
    • If Exclusive: Forward the request to the owner, which supplies the data to P (and writes it back to memory); the directory changes the state to Shared and records both the owner and P in the sharer vector
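
The three directory actions above can be sketched as a single handler (a simplified model in which plain dictionaries stand in for caches and memory; all names are hypothetical):

```python
from enum import Enum

class DirState(Enum):
    UNCACHED = "U"
    SHARED = "S"
    EXCLUSIVE = "E"

def handle_read_miss(entry, caches, memory, pid, block):
    """Directory's response to Read-Request(block) from processor pid."""
    if entry["state"] == DirState.UNCACHED:
        entry["state"] = DirState.SHARED          # first reader
        entry["sharers"] = {pid}
    elif entry["state"] == DirState.SHARED:
        entry["sharers"].add(pid)                 # just another reader
    else:  # EXCLUSIVE: the owner may hold dirty data
        owner = entry["owner"]
        memory[block] = caches[owner][block]      # owner writes back
        entry["state"] = DirState.SHARED
        entry["sharers"] = {owner, pid}
        entry["owner"] = None
    caches[pid][block] = memory[block]            # Data-Reply to pid
    return memory[block]

entry = {"state": DirState.UNCACHED, "sharers": set(), "owner": None}
memory = {"X": 7}
caches = {0: {}, 1: {}}
print(handle_read_miss(entry, caches, memory, 0, "X"))  # 7
```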

Write Miss Handling

When processor P wants to write block X:

  1. Check Local Cache: If the block is already cached in Modified state, write directly
  2. Directory Lookup: Send write request to directory
  3. Directory Actions:
    • If Uncached: Send data from memory, set state to Exclusive, set owner to P
    • If Shared: Send invalidations to all sharers (collecting acknowledgments), send data to P, set state to Exclusive, set owner to P
    • If Exclusive: Contact the current owner, request data forwarding to P and invalidation of the owner's copy, set the new owner to P
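
Likewise, the write-miss actions can be sketched in the same simplified model (invalidations are modeled as direct cache-dictionary updates rather than real messages; names are hypothetical):

```python
from enum import Enum

class DirState(Enum):
    UNCACHED = "U"
    SHARED = "S"
    EXCLUSIVE = "E"

def handle_write_miss(entry, caches, memory, pid, block):
    """Directory's response to Write-Request(block) from processor pid."""
    if entry["state"] == DirState.SHARED:
        for p in entry["sharers"]:                # Invalidate each sharer
            if p != pid:
                caches[p].pop(block, None)
    elif entry["state"] == DirState.EXCLUSIVE:
        old = entry["owner"]                      # forward + invalidate owner
        memory[block] = caches[old].pop(block)
    entry["state"] = DirState.EXCLUSIVE
    entry["sharers"] = set()
    entry["owner"] = pid
    caches[pid][block] = memory[block]            # Data-Reply to new owner
```

In a real protocol the directory would wait for invalidation acknowledgments before granting exclusive access; that handshake is omitted here for brevity.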

Message Types

Processor to Directory

  • Read-Request: Request data for reading
  • Write-Request: Request exclusive access for writing

Directory to Processor

  • Data-Reply: Send requested data
  • Invalidate: Invalidate cached copy
  • Forward-Request: Forward request to current owner

Processor to Processor

  • Data-Forward: Owner sends data directly to requestor
  • Writeback: Send modified data back to directory/memory
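
These message types could be encoded as follows (a hypothetical schema for a simulator's interconnect messages):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class MsgType(Enum):
    READ_REQUEST = auto()     # processor -> directory
    WRITE_REQUEST = auto()    # processor -> directory
    DATA_REPLY = auto()       # directory -> processor
    INVALIDATE = auto()       # directory -> processor
    FORWARD_REQUEST = auto()  # directory -> current owner
    DATA_FORWARD = auto()     # owner -> requestor
    WRITEBACK = auto()        # owner -> directory/memory

@dataclass
class Message:
    mtype: MsgType
    src: int                  # sender id (processor or directory node)
    dst: int                  # receiver id
    block: str                # memory block address/name
    data: Optional[int] = None  # payload, for data-carrying messages

msg = Message(MsgType.READ_REQUEST, src=1, dst=0, block="A")
print(msg.mtype.name)  # READ_REQUEST
```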

Protocol Example: Read-Shared Scenario

          Initial State:
          - Memory Block A: Directory = [Uncached, 0000, None]
          - P1 Cache: Block A = Invalid
          - P2 Cache: Block A = Invalid
          
          Step 1: P1 reads Block A
          - P1 → Directory: Read-Request(A)
          - Directory → P1: Data-Reply(A, data)
          - Directory State: [Shared, 0001, None] (P1 bit set)
          - P1 Cache: Block A = Shared
          
          Step 2: P2 reads Block A
          - P2 → Directory: Read-Request(A)
          - Directory → P2: Data-Reply(A, data)
          - Directory State: [Shared, 0011, None] (P1, P2 bits set)
          - P2 Cache: Block A = Shared
          
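The trace above can be reproduced with a few lines of simulation (a toy model covering only the Uncached and Shared paths of a read; all names are hypothetical):

```python
directory = {"A": {"state": "Uncached", "sharers": set()}}
memory = {"A": 42}
caches = {1: {}, 2: {}}

def read(pid, block):
    """Read-Request followed by Data-Reply, as in the trace above."""
    entry = directory[block]
    entry["state"] = "Shared"        # Uncached -> Shared on first read
    entry["sharers"].add(pid)
    caches[pid][block] = ("Shared", memory[block])

read(1, "A")
print(directory["A"])  # {'state': 'Shared', 'sharers': {1}}
read(2, "A")
print(directory["A"])  # {'state': 'Shared', 'sharers': {1, 2}}
```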

Performance Characteristics

Advantages

  • Scalability: No broadcast bottleneck; point-to-point communication
  • Flexibility: Works with various network topologies
  • Reduced Traffic: Only interested processors receive messages
  • Memory Bandwidth: Better utilization of memory system bandwidth

Challenges

  • Directory Overhead: Storage overhead for directory information
  • Indirection: Extra directory lookup adds latency
  • Hot Spots: Directory can become a bottleneck for popular data
  • Complexity: More complex than bus-based protocols

Directory Organization Strategies

Centralized Directory

  • Single directory node manages all memory blocks
  • Simple implementation but potential bottleneck
  • Used in smaller systems (8-64 processors)

Distributed Directory

  • Directory information distributed across memory modules
  • Better scalability but more complex routing
  • Each node manages directory for its local memory

Hierarchical Directory

  • Multiple levels of directories
  • Reduces directory traffic for local clusters
  • Complex implementation but excellent scalability

Comparison with Bus-Based Protocols

Aspect               Bus-Based (MSI)           Directory-Based
-----------------    ----------------------    --------------------------
Scalability          Limited (16-32 CPUs)      High (100s-1000s of CPUs)
Latency              Low (direct broadcast)    Higher (directory lookup)
Bandwidth            Bus bottleneck            Point-to-point efficiency
Implementation       Simpler                   More complex
Storage Overhead     Minimal                   Directory storage required

Real-World Applications

Commercial Systems

  • SGI Origin: Distributed directory in NUMA systems
  • AMD Opteron: HyperTransport interconnect; later generations add the HT Assist probe filter, a directory-like structure
  • Intel Xeon: Directory-based coherence in multi-socket systems

Research Systems

  • Stanford DASH: Early distributed directory system
  • MIT Alewife: LimitLESS directories (a few hardware pointers, with software handling of overflow)
  • Wisconsin Typhoon: Scalable directory protocols

Design Considerations

Sharer Vector Size

  • Full Bit Vector: One bit per processor (exact but expensive)
  • Coarse Vector: Grouped processors (efficient but less precise)
  • Limited Pointers: Store only N processor IDs (hybrid approach)
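
The trade-off between full bit vectors and limited pointers can be illustrated with a sketch of a limited-pointer entry that falls back to broadcast on overflow (hypothetical names; real designs may instead use coarse vectors or, as in Alewife's LimitLESS scheme, software overflow handling):

```python
class LimitedPointerEntry:
    """Track at most n_pointers sharer IDs; on overflow, conservatively
    assume every processor may hold a copy (broadcast invalidation)."""

    def __init__(self, n_pointers: int = 4, n_procs: int = 64):
        self.n_pointers = n_pointers
        self.n_procs = n_procs
        self.pointers: set = set()
        self.overflow = False

    def add_sharer(self, pid: int) -> None:
        if self.overflow:
            return                       # already tracking "everyone"
        self.pointers.add(pid)
        if len(self.pointers) > self.n_pointers:
            self.overflow = True         # precise tracking is lost
            self.pointers.clear()

    def invalidation_targets(self) -> set:
        # Overflow forces a broadcast; otherwise invalidate exact sharers.
        return set(range(self.n_procs)) if self.overflow else set(self.pointers)

e = LimitedPointerEntry(n_pointers=2, n_procs=8)
for pid in (0, 3):
    e.add_sharer(pid)
print(sorted(e.invalidation_targets()))  # [0, 3]
e.add_sharer(5)                          # third sharer overflows 2 pointers
print(len(e.invalidation_targets()))     # 8 (broadcast)
```

Storage drops from one bit per processor to N small IDs per block, at the cost of extra invalidation traffic for widely shared blocks.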

Directory Placement

  • Memory-based: Directory co-located with memory controllers
  • Cache-based: Directory integrated with processor caches
  • Network-based: Directory nodes in interconnection network

Protocol Optimizations

Performance Enhancements

  • Intervention: Direct cache-to-cache transfers
  • Forwarding: Reduce directory involvement in common cases
  • Prediction: Anticipate sharing patterns to reduce latency

Storage Optimizations

  • Sparse Directories: Only track actively shared blocks
  • Compressed Sharing: Encode common sharing patterns efficiently
  • Hierarchical Encoding: Multi-level sharing representation

The directory-based approach represents a fundamental shift from broadcast-based coherence to point-to-point communication, enabling the construction of large-scale multiprocessor systems that can scale to hundreds or thousands of processors while maintaining cache coherence efficiently.