Virtual Labs

Roofline Performance Model Analysis

Follow these step-by-step instructions to understand and explore the Roofline Performance Model using the interactive simulator.

Observe the Initial Setup
- Notice the main roofline chart area with logarithmic axes
- The X-axis represents Operational Intensity (FLOP/Byte)
- The Y-axis represents Performance (GFLOP/s)
- Pre-configured architecture profiles are available in the dropdown
Familiarize Yourself with the Components
- Architecture Selector: Choose between Apple Silicon, Intel Xeon, NVIDIA GPU, or Custom
- Roofline Chart: Interactive visualization with moveable points
- Performance Controls: Sliders for memory bandwidth and compute capability
- Application Plotter: Interface to add computational workloads to the chart
- Analysis Panel: Shows bottleneck identification and optimization suggestions

Start with Apple Silicon M1
- Select Apple Silicon M1 from the architecture dropdown
- Observe the pre-configured roofline with:
  - Memory bandwidth: 68.25 GB/s
  - Peak performance: 2.6 TFLOP/s
  - Ridge point automatically calculated
Understand the Roofline Shape
- Notice the diagonal line representing memory bandwidth limitation
- Observe the horizontal line representing compute capability ceiling
- Identify the ridge point where these two lines intersect
- Ridge point (FLOP/Byte) = Peak Performance (GFLOP/s) / Memory Bandwidth (GB/s)
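The ridge-point formula can be checked by hand against the M1 profile listed above. The following is a minimal sketch, not the simulator's own code; the function names are illustrative, and the only inputs are the bandwidth and peak-performance figures quoted in this lab.

```python
def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Operational intensity (FLOP/Byte) where the diagonal meets the ceiling."""
    return peak_gflops / bandwidth_gbs


def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: the lower of the bandwidth diagonal and the compute ceiling."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Apple Silicon M1 profile from this lab: 2.6 TFLOP/s = 2600 GFLOP/s, 68.25 GB/s.
M1_PEAK_GFLOPS = 2600.0
M1_BANDWIDTH_GBS = 68.25

m1_ridge = ridge_point(M1_PEAK_GFLOPS, M1_BANDWIDTH_GBS)
print(f"M1 ridge point: {m1_ridge:.1f} FLOP/Byte")  # about 38.1
```

A workload with an operational intensity below roughly 38 FLOP/Byte sits under the diagonal on this profile; above that value the horizontal ceiling applies.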
Explore Different Architectures
- Switch to Intel Xeon and observe different characteristics
- Try NVIDIA GPU to see high-performance computing capabilities
- Notice how different architectures have different performance profiles

Use Custom Configuration
- Select Custom from the architecture dropdown
- Adjust the Memory Bandwidth slider (1-1000 GB/s range)
- Modify the Compute Capability slider (0.1-100 TFLOP/s range)
- Watch the roofline shape change in real-time
Analyze Ridge Point Changes
- Experiment with different bandwidth and compute ratios
- Notice how increasing bandwidth shifts the ridge point left
- Observe how increasing compute capability shifts the ridge point right
- Understand the implications for different application types
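The direction of the ridge-point shift follows directly from the formula. The sketch below uses arbitrary values chosen from within the slider ranges above; they are illustrative only and do not correspond to any preset.

```python
def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Ridge point in FLOP/Byte; TFLOP/s is converted to GFLOP/s first."""
    return peak_tflops * 1000.0 / bandwidth_gbs


baseline = ridge_point(peak_tflops=10.0, bandwidth_gbs=100.0)        # 100 FLOP/Byte
more_bandwidth = ridge_point(peak_tflops=10.0, bandwidth_gbs=200.0)  # 50 FLOP/Byte
more_compute = ridge_point(peak_tflops=20.0, bandwidth_gbs=100.0)    # 200 FLOP/Byte

# Doubling bandwidth halves the ridge point (shift left);
# doubling compute doubles it (shift right).
print(baseline, more_bandwidth, more_compute)
```

A lower ridge point means more workloads can reach the compute ceiling; a higher one means a workload needs more FLOPs per byte before it stops being limited by memory.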

Plot Basic Applications
- Click anywhere on the roofline chart to add an application point
- Try plotting points in different regions:
  - Under the diagonal, left of the ridge point (memory-bound region)
  - Above the diagonal, left of the ridge point (unattainable region)
  - Under the horizontal ceiling, right of the ridge point (compute-bound region)
Analyze Predefined Workloads
- Use the application selector to add common workloads:
  - Vector Addition: Low operational intensity, memory-bound
  - Dense Matrix Multiplication: High operational intensity, potentially compute-bound
  - Sparse Matrix Operations: Medium operational intensity
  - FFT Operations: Variable intensity based on size
Interpret Results
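- Applications to the left of the ridge point are memory-bound: the diagonal caps their attainable performance
- Applications to the right of the ridge point are compute-bound: the horizontal ceiling caps their attainable performance
- Points above the roofline are theoretically unattainable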
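To see why vector addition and dense matrix multiplication land in different regions, their operational intensities can be estimated by counting FLOPs and bytes. This is a rough sketch under stated assumptions: 4-byte (single-precision) elements, and for matrix multiplication an idealized traffic model in which each matrix moves between memory and the processor exactly once. Real kernels usually move more data than this.

```python
def vector_add_intensity(bytes_per_element: int = 4) -> float:
    """c[i] = a[i] + b[i]: 1 FLOP per element, two loads and one store."""
    return 1.0 / (3 * bytes_per_element)


def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    """n x n matmul: 2*n^3 FLOPs; idealized traffic of 3*n^2 elements (assumption)."""
    return (2.0 * n**3) / (3.0 * n**2 * bytes_per_element)


def classify(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> str:
    """Compare operational intensity with the ridge point of the given machine."""
    ridge = peak_gflops / bandwidth_gbs
    return "memory-bound" if intensity < ridge else "compute-bound"


# M1 profile from this lab: 2600 GFLOP/s peak, 68.25 GB/s bandwidth.
print(classify(vector_add_intensity(), 2600.0, 68.25))    # memory-bound
print(classify(matmul_intensity(1024), 2600.0, 68.25))    # compute-bound
```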

Memory-Bound Analysis
- Plot or select a memory-bound application (low operational intensity)
- Read the analysis panel recommendations:
  - Cache optimization strategies
  - Data layout improvements
  - Algorithmic restructuring suggestions
Compute-Bound Analysis
- Plot or select a compute-bound application (high operational intensity)
- Observe optimization suggestions:
  - Vectorization opportunities
  - Parallelization strategies
  - Algorithmic complexity reduction
Ridge Point Applications
- Plot applications near the ridge point
- Understand that these applications sit at the transition between memory-bound and compute-bound behavior
- Note that such workloads can benefit from both memory and compute optimizations

Side-by-Side Analysis
- Plot the same application on different architectures
- Switch between Apple Silicon, Intel Xeon, and NVIDIA GPU
- Compare where the same workload falls on different rooflines
Optimization Strategy Differences
- Memory-bound applications benefit more from high-bandwidth architectures
- Compute-bound applications benefit from high peak performance systems
- Understand architecture selection criteria for different workloads
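The comparison can also be done numerically by evaluating the same operational intensity against several rooflines. In the sketch below, only the M1 figures come from this lab; the other two profiles are hypothetical placeholders and are not the simulator's Intel Xeon or NVIDIA GPU presets.

```python
# Only the M1 numbers are taken from this lab. The other two entries are
# hypothetical values chosen for illustration.
PROFILES = {
    "Apple Silicon M1": {"peak_gflops": 2600.0, "bandwidth_gbs": 68.25},
    "Hypothetical CPU": {"peak_gflops": 3000.0, "bandwidth_gbs": 200.0},
    "Hypothetical GPU": {"peak_gflops": 20000.0, "bandwidth_gbs": 900.0},
}


def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound for one machine."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def compare(intensity: float) -> dict:
    """Attainable GFLOP/s for the same workload on every profile."""
    return {name: attainable_gflops(intensity, **p) for name, p in PROFILES.items()}


low = compare(0.25)   # low intensity: results scale with bandwidth
high = compare(64.0)  # high intensity: results equal each peak
for name in PROFILES:
    print(f"{name}: {low[name]:.1f} GFLOP/s at 0.25 FLOP/Byte, "
          f"{high[name]:.1f} GFLOP/s at 64 FLOP/Byte")
```

At 0.25 FLOP/Byte the ranking tracks memory bandwidth, while at 64 FLOP/Byte all three placeholder machines are at their compute ceiling and the ranking tracks peak performance.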

Multi-Level Memory Hierarchy
- Understand that real systems have multiple rooflines for different memory levels
- L1 cache provides highest bandwidth but lowest capacity
- Main memory provides highest capacity but lowest bandwidth per core
- Each level adds its own bandwidth diagonal, so the attainable performance depends on which level supplies the data
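A multi-level roofline can be sketched as one bandwidth diagonal per memory level under a shared compute ceiling. All numbers below are hypothetical and do not describe any specific processor. The sketch also assumes, for simplicity, that the operational intensity is the same at every level, which is not generally true in practice.

```python
# Hypothetical machine: every value here is an illustrative assumption.
PEAK_GFLOPS = 1000.0
LEVEL_BANDWIDTH_GBS = {"L1": 2000.0, "L2": 800.0, "DRAM": 100.0}


def ceilings(intensity: float) -> dict:
    """Attainable GFLOP/s if all data were served from each level in turn."""
    return {
        level: min(PEAK_GFLOPS, bandwidth * intensity)
        for level, bandwidth in LEVEL_BANDWIDTH_GBS.items()
    }


for level, gflops in ceilings(0.5).items():
    print(f"{level}: {gflops:.0f} GFLOP/s")
```

With these placeholder values, a 0.5 FLOP/Byte kernel could reach the compute ceiling when its data stays in L1, but only a small fraction of it when streaming from DRAM.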
Scaling Analysis
- Consider how applications scale with problem size
- Small problems are often memory-bound (low cache reuse)
- Large problems may become compute-bound (high cache reuse)
- Very large problems may become memory-bound again (working set exceeds cache capacity)
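One way to reason about the last transition is to compare a problem's working set with the cache capacity. The sketch below does this for an n x n single-precision matrix multiplication; the 8 MiB cache size is a hypothetical value, and counting three full matrices as the working set is a simplifying assumption.

```python
CACHE_BYTES = 8 * 1024 * 1024  # hypothetical last-level cache size (8 MiB)


def working_set_bytes(n: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to hold A, B and C for an n x n matrix multiplication."""
    return 3 * n * n * bytes_per_element


def fits_in_cache(n: int, cache_bytes: int = CACHE_BYTES) -> bool:
    """True when all three matrices fit in the assumed cache at once."""
    return working_set_bytes(n) <= cache_bytes


for n in (64, 512, 4096):
    print(n, working_set_bytes(n), fits_in_cache(n))
```

Once the working set no longer fits, data must be re-fetched from main memory unless the algorithm is restructured, for example with cache blocking.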

Baseline Measurement
- Start by plotting your application's current performance
- Identify whether it's memory-bound or compute-bound
- Note the distance from the roofline (optimization potential)
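The distance from the roofline can be expressed as a fraction of the attainable performance. The sketch below uses the M1 profile from this lab; the measured value and operational intensity are made-up example inputs, not benchmark results.

```python
def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound at the given operational intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def roofline_efficiency(measured_gflops: float, intensity: float,
                        peak_gflops: float, bandwidth_gbs: float) -> float:
    """Fraction of the roofline bound that the measured run achieves (0 to 1)."""
    return measured_gflops / attainable_gflops(intensity, peak_gflops, bandwidth_gbs)


# Hypothetical measurement: 90 GFLOP/s at 2.0 FLOP/Byte on the M1 profile.
bound = attainable_gflops(2.0, 2600.0, 68.25)              # 136.5 GFLOP/s
efficiency = roofline_efficiency(90.0, 2.0, 2600.0, 68.25)
print(f"bound = {bound:.1f} GFLOP/s, efficiency = {efficiency:.0%}")
```

A low efficiency suggests headroom within the current bound, while a value close to 1 suggests that further gains require raising the operational intensity or moving to different hardware.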
Strategy Selection
- For memory-bound applications:
  - Focus on cache optimization and data locality
  - Consider data structure reorganization
  - Implement cache blocking techniques
- For compute-bound applications:
  - Focus on vectorization and parallelization
  - Consider algorithmic improvements
  - Optimize for specific instruction sets
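As an illustration of cache blocking, the sketch below tiles a matrix multiplication so that each small block of the operands is reused before moving on. It is written in pure Python to show the loop structure only; interpreted Python will not show the speedup that a compiled implementation might, and the block size here is arbitrary.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_blocked(a, b, block=2):
    """Same product, iterating over block x block tiles to improve data reuse."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):            # tile row of C
        for kk in range(0, m, block):        # tile along the shared dimension
            for jj in range(0, p, block):    # tile column of C
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b) == matmul_naive(a, b))
```

In a compiled language the block size would typically be tuned so that the tiles in use fit in the targeted cache level.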
Implementation and Verification
- Apply selected optimization strategies
- Re-measure and re-plot performance
- Verify movement toward the roofline
- Iterate until satisfactory performance is achieved

Scientific Computing
- Climate modeling (typically memory-bound)
- Molecular dynamics (mixed characteristics)
- Finite element analysis (varies with mesh size)
Machine Learning
- Training (often memory-bound due to large models)
- Inference (can be compute-bound with optimization)
- Data preprocessing (typically memory-bound)
Graphics and Gaming
- Rasterization (memory-bound for high resolutions)
- Ray tracing (compute-intensive)
- Physics simulation (mixed characteristics)

Document Your Findings
- Record baseline performance measurements
- Note optimization strategies applied
- Document performance improvements achieved
- Analyze cost-benefit of different approaches
Compare Across Architectures
- Evaluate the same workload on different systems
- Consider price-performance ratios
- Factor in power consumption and efficiency
- Make informed hardware selection decisions

After completing this procedure, you should understand: