Reorder Buffer and Out-of-Order Execution

Books and Textbooks

  1. Hennessy, J. L., & Patterson, D. A. (2019). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann.

    • Chapter 3: Instruction-Level Parallelism and Its Exploitation
    • Section 3.4: Dynamic Scheduling: Examples and the Algorithm
    • Section 3.6: Hardware-Based Speculation
  2. Shen, J. P., & Lipasti, M. H. (2013). Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press.

    • Chapter 5: Instruction Flow Techniques
    • Chapter 6: Register Data Flow Techniques
    • Chapter 7: Memory Data Flow Techniques
  3. Stallings, W. (2018). Computer Organization and Architecture: Designing for Performance (11th ed.). Pearson.

    • Chapter 14: Instruction-Level Parallelism and Superscalar Processors
    • Section 14.3: Out-of-Order Execution
  4. Sohi, G. S. (1990). Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers. IEEE Transactions on Computers, 39(3), 349-359.

Foundational Research Papers

  1. Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. IEEE Transactions on Computers, 37(5), 562-573.

    • Seminal paper on precise interrupt implementation
  2. Hwu, W. M. W., & Patt, Y. N. (1987). Checkpoint repair for high-performance out-of-order execution machines. IEEE Transactions on Computers, C-36(12), 1496-1514.

    • Early work on speculative execution and recovery
  3. Sohi, G. S., & Vajapeyam, S. (1987). Instruction issue logic for high-performance, interruptible pipelined processors. Proceedings of the 14th Annual International Symposium on Computer Architecture, 27-34.

  4. Johnson, M. (1991). Superscalar Microprocessor Design. Prentice Hall.

    • Comprehensive treatment of superscalar design principles

Advanced Research

  1. Kessler, R. E. (1999). The Alpha 21264 microprocessor. IEEE Micro, 19(2), 24-36.

    • Detailed implementation of ROB in Alpha 21264
  2. Yeager, K. C. (1996). The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2), 28-40.

    • R10000 implementation with active list (ROB variant)
  3. Papworth, D. B. (1996). Tuning the Pentium Pro microarchitecture. IEEE Micro, 16(2), 8-15.

    • First mainstream x86 processor with ROB
  4. Gwennap, L. (1995). Digital 21164 sets new standard. Microprocessor Report, 9(3), 11-16.

    • Analysis of the Alpha 21164, an in-order superscalar predecessor of the out-of-order 21264

Modern Processor Implementations

  1. Intel Corporation. (2019). Intel 64 and IA-32 Architectures Optimization Reference Manual.

    • Chapter 2: Intel Microarchitecture
    • Section on Out-of-Order Execution Engine
  2. Advanced Micro Devices, Inc. (2020). Software Optimization Guide for AMD Family 17h Processors.

    • Chapter 2: Processor Architecture and Optimization
    • Section on Reorder Buffer and Retirement
  3. Boggs, D., Baktha, A., Hawkins, J., Marr, D. T., Miller, J. A., Roussel, P., ... & Nallapati, G. (2004). The microarchitecture of the Intel Pentium 4 processor on 90nm technology. Intel Technology Journal, 8(1), 1-17.

Theoretical Analysis

  1. Lam, M. S., & Wilson, R. P. (1992). Limits of control flow on parallelism. Proceedings of the 19th Annual International Symposium on Computer Architecture, 46-57.

  2. Wall, D. W. (1991). Limits of instruction-level parallelism. ACM SIGARCH Computer Architecture News, 19(2), 176-188.

    • Fundamental analysis of ILP limits
  3. Rau, B. R. (1994). Iterative modulo scheduling: An algorithm for software pipelining loops. Proceedings of the 27th Annual International Symposium on Microarchitecture, 63-74.

  4. Fisher, J. A. (1983). Very long instruction word architectures and the ELI-512. ACM SIGARCH Computer Architecture News, 11(3), 140-150.

Implementation Studies

  1. Palacharla, S., Jouppi, N. P., & Smith, J. E. (1997). Complexity-effective superscalar processors. Proceedings of the 24th Annual International Symposium on Computer Architecture, 206-218.

    • Analysis of complexity vs. performance trade-offs
  2. Farkas, K. I., Chow, P., Jouppi, N. P., & Vranesic, Z. (1997). The multicluster architecture: Reducing cycle time through partitioning. Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 149-159.

  3. Patt, Y. N., Hwu, W. M. W., & Shebanow, M. (1985). HPS, a new microarchitecture: Rationale and introduction. Proceedings of the 18th Annual Workshop on Microprogramming, 103-108.

  4. Butler, M., Yeh, T.-Y., Patt, Y., Alsup, M., Scales, H., & Shebanow, M. (1991). Single instruction stream parallelism is greater than two. ACM SIGARCH Computer Architecture News, 19(3), 276-286.

Performance Analysis

  1. Riseman, E. M., & Foster, C. C. (1972). The inhibition of potential parallelism by conditional jumps. IEEE Transactions on Computers, C-21(12), 1405-1411.

  2. Tjaden, G. S., & Flynn, M. J. (1970). Detection and parallel execution of independent instructions. IEEE Transactions on Computers, C-19(10), 889-895.

  3. Thornton, J. E. (1964). Parallel operation in the Control Data 6600. Proceedings of the October 27-29, 1964, Fall Joint Computer Conference, Part II: Very High Speed Computer Systems, 33-40.

Modern Research Directions

  1. Sanchez, D., & Kozyrakis, C. (2013). ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Computer Architecture News, 41(3), 475-486.

  2. Carlson, T. E., Heirman, W., & Eeckhout, L. (2011). Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 1-12.

  3. Miller, J. E., Kasture, H., Kurian, G., Gruenwald, C., Beckmann, N., Celio, C., ... & Agarwal, A. (2010). Graphite: A distributed parallel simulator for multicores. Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA), 1-12.

Online Resources and Standards

  1. Intel Corporation. (2021). Intel Architecture Instruction Set Extensions and Future Features Programming Reference.

    • Latest developments in x86 microarchitecture
  2. ARM Limited. (2020). ARM Cortex-A Series Programmer's Guide for ARMv8-A.

    • Chapter 4: Out-of-Order Execution in ARM Processors
  3. RISC-V International. (2019). The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA.

    • Considerations for RISC-V implementations with out-of-order execution
  4. IEEE Computer Society. (2019). IEEE Standard for Information Technology - Portable Operating System Interface (POSIX). IEEE Std 1003.1-2017.

Survey Papers and Books

  1. Mittal, S. (2016). A survey of techniques for improving energy efficiency in embedded computing systems. Renewable and Sustainable Energy Reviews, 54, 629-660.

  2. Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., & Burger, D. (2011). Dark silicon and the end of multicore scaling. ACM SIGARCH Computer Architecture News, 39(3), 365-376.

  3. Koomey, J., Berard, S., Sanchez, M., & Wong, H. (2011). Implications of historical trends in the electrical efficiency of computing. IEEE Annals of the History of Computing, 33(3), 46-54.

Academic Course Materials

  1. University of California, Berkeley. CS152 Computer Architecture Course Materials.

  2. Carnegie Mellon University. 18-447 Introduction to Computer Architecture.

  3. MIT OpenCourseWare. 6.823 Computer System Architecture.

Technical Reports

  1. Conte, T. M., Banerjia, S., Loh, S. Y., Menezes, K. N., & Sathaye, S. S. (1995). Instruction fetch mechanisms for superscalar microprocessors. Technical Report, Department of Electrical and Computer Engineering, North Carolina State University.

  2. Austin, T. M., Larson, E., & Ernst, D. (2002). SimpleScalar: An infrastructure for computer system modeling. Computer, 35(2), 59-67.

  3. Magnusson, P. S., Christensson, M., Eskilson, J., Forsgren, D., Hållberg, G., Högberg, J., ... & Werner, B. (2002). Simics: A full system simulation platform. Computer, 35(2), 50-58.

Conference Proceedings

  1. International Symposium on Computer Architecture (ISCA) - Various years

    • Premier venue for computer architecture research including out-of-order execution
  2. International Symposium on Microarchitecture (MICRO) - Various years

    • Key conference for microarchitectural innovations and ROB research
  3. International Symposium on High-Performance Computer Architecture (HPCA) - Various years

    • Important venue for performance-oriented architectural research

Industrial Whitepapers

  1. IBM Corporation. (1995). IBM POWER2 Architecture. IBM Corporation.

    • Early implementation of sophisticated out-of-order execution
  2. Sun Microsystems. (1995). UltraSPARC-I User's Manual. Sun Microsystems, Inc.

    • In-order superscalar SPARC implementation, a point of contrast with out-of-order designs
  3. Digital Equipment Corporation. (1996). Alpha 21264 Microprocessor Hardware Reference Manual. Digital Equipment Corporation.

    • Detailed hardware documentation of Alpha 21264 ROB implementation