Intel ARCHITECTURE IA-32 Computer Accessories manual

Page 1

IA-32 Intel® Architecture Optimization Reference Manual. Order Number: 248966-013US, April 2006.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.

Page 3

Contents
Introduction
Chapter 1 IA-32 Intel® Architecture Processor Family Overview
SIMD Technology
Summary of SIMD Technologies

Page 4

Out-of-Order Core
In-Order Retirement

Page 5

Branch Prediction
Eliminating Branches

Page 6

Floating-Point Stalls
x87 Floating-point Operations with Integer Operands
x87 Floating-point Comparison Instructions

Page 7

Considerations for Code Conversion to SIMD Programming
Identifying Hot Spots
Determine If Code Benefits by Conversion to SIMD Execution

Page 8

Packed Shuffle Word for 64-bit Registers
Packed Shuffle Word for 128-bit Registers
Unpacking/interleaving 64-bit Data in 128-bit Registers

Page 9

Data Alignment
Data Arrangement

Page 10

Hardware Prefetch
Example of Effective Latency Reduction with H/W Prefetch
Example of Latency Hiding with S/W Prefetch Instruction

Page 11

Key Practices of System Bus Optimization
Key Practices of Memory Optimization
Key Practices of Front-end Optimization

Page 12

Sign Extension to Full 64-Bits
Alternate Coding Rules for 64-Bit Mode

Page 13

Time-based Sampling
Event-based Sampling

Page 14

Using Performance Metrics with Hyper-Threading Technology
Using Performance Events of Intel Core Solo and Intel Core Duo processors
Understanding the Results in a Performance Counter

Page 15

Examples
Example 2-1 Assembly Code with an Unpredictable Branch
Example 2-2 Code Optimization to Eliminate Branches
Example 2-3 Eliminating Branch with CMOV Instruction

Page 16

Example 3-4 Identification of SSE2 with cpuid
Example 3-5 Identification of SSE2 by the OS
Example 3-6 Identification of SSE3 with cpuid

Page 17

Example 4-20 Clipping to an Arbitrary Signed Range [high, low]
Example 4-21 Simplified Clipping to an Arbitrary Signed Range
Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low]

Page 18

Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation
Example 7-1 Serial Execution of Producer and Consumer Work Items
Example 7-2 Basic Structure of Implementing Producer Consumer Threads
Example 7-3 Thread Function for an Interlaced Producer Consumer Model

Page 19

Figures
Figure 1-1 Typical SIMD Operations
Figure 1-2 SIMD Instruction Register Usage
Figure 1-3 The Intel NetBurst Microarchitecture

Page 20

Figure 6-2 Memory Access Latency and Execution Without Prefetch
Figure 6-3 Memory Access Latency and Execution With Prefetch
Figure 6-4 Prefetch and Loop Unrolling

Page 21

Tables
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters
Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo and Intel® Core™ Duo Processors

Page 22

Table C-5 Streaming SIMD Extension 64-bit Integer Instructions
Table C-7 IA-32 x87 Floating-point Instructions
Table C-8 IA-32 General Purpose Instructions

Page 23

Introduction
The IA-32 Intel® Architecture Optimization Reference Manual describes how to optimize software to take advantage of the performance characteristics of the current generation of IA-32 Intel architecture family of processors.

Page 24

IA-32 Intel® Architecture Optimization
target the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture.
Tuning Your Application
Tuning an application for high per.

Page 25

The manual consists of the following parts:
Introduction. Defines the purpose and outlines the contents of this manual.
Chapter 1: IA-32 Intel® Architecture Processor Family Overview.

Page 26

Chapter 7: Multiprocessor and Hyper-Threading Technology. Describes guidelines and techniques for optimizing multithreaded applications to achieve optimal performance scaling.

Page 27

Related Documentation
For more information on the Intel architecture, specific techniques, and processor architecture terminology referenced in this manual, see the following doc.

Page 28

Notational Conventions
This manual uses the following conventions:
This type style: Indicates an element of syntax, a reserved word, a keyword, a filename, instruction, computer output, or part of a program example.

Page 29

Chapter 1: IA-32 Intel® Architecture Processor Family Overview
This chapter gives an overview of the features relevant to software optimization for the current generations of IA-32 processors, i.

Page 30

Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and power efficiency that are in addition to those introduced in the Pentium M processor.

Page 31

each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are stored as a set of four packed data elements.

Page 32

SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and applications that have the following cha.

Page 33

SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can improve cache usage and application performance.

Page 34

SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.

Page 35

Intel® Extended Memory 64 Technology (Intel® EM64T)
Intel EM64T is an extension of the IA-32 Intel architecture. Intel EM64T increases the linear address space for software to 64 bits and supports physical address space up to 40 bits.

Page 36

Intel NetBurst® Microarchitecture
The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hyper-Threading Technology, Pentium D processor, Pentium processor Extreme Edition and the Intel Xeon processor implement the Intel NetBurst microarchitecture.

Page 37

• to operate at high clock rates and to scale to higher performance and clock rates in the future
Design advances of the Intel NetBurst mic.

Page 38

The out-of-order core aggressively reorders µops so that µops whose inputs are ready (and have execution resources available) can execute as soon as possible. The core can issue multiple µops per cycle.

Page 39

The Front End
The front end of the Intel NetBurst microarchitecture consists of two parts:
• fetch/decode unit
• execution trace cache
I.

Page 40

The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible.

Page 41

correct execution, the results of IA-32 instructions must be committed in original program order before they are retired. Exceptions may be raised as instructions are retired. For this reason, exceptions cannot occur speculatively.

Page 42

• a mechanism fetches data only and includes two distinct components: (1) a hardware mechanism to fetch the adjacent cache line within a 128-byte sect.

Page 43

Branch Prediction
Branch prediction is important to the performance of a deeply pipelined processor. It enables the processor to begin executing instructions long before the branch outcome is certain.

Page 44

To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches (see also: “Branch Prediction” in Chapter 2).

Page 45

Some parts of the core may speculate that a common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains to store-to-load forwarding (see “Store Forwarding” in this chapter).

Page 46

execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle). The number of µops associated with each instruction provides a basis for selecting instructions to generate.

Page 47

Caches
The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At least two levels of on-chip cache are implemented in processors based on the Intel NetBurst microarchitecture.

Page 48

Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does not imply that it is also in level i+1. All caches use a pseudo-LRU (least recently used) replacement algorithm.

Page 49

back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion. Each bus cycle equals several processor cycles. The ratio of processor clock speed to the scalable bus clock speed is referred to as bus ratio.

Page 50

• avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal load-miss, which returns data to all cache.

Page 51

Hardware prefetching for Pentium 4 processor has the following characteristics:
• works with existing applications
• does not require ext.

Page 52

Thus, software optimization of a data access pattern should emphasize tuning for hardware prefetch first, to favor greater proportions of smaller-stride data accesses in the workload, before attempting to provide hints to the processor by employing software prefetch instructions.

Page 53

Reordering loads with respect to each other can prevent a load miss from stalling later loads. Reordering loads with respect to other loads and stores to different addresses can enable more parallelism, allowing the machine to execute operations as soon as their inputs are ready.

Page 54

Intel® Pentium® M Processor Microarchitecture
Like the Intel NetBurst microarchitecture, the pipeline of the Intel Pentium M processor microarchit.

Page 55

The Intel Pentium M processor microarchitecture is designed for lower power consumption. There are other specific areas of the Pentium M processor microarchitecture that differ from the Intel NetBurst microarchitecture.

Page 56

The fetch and decode unit includes a hardware instruction prefetcher and three decoders that enable parallelism.

Page 57

• Micro-ops (µops) fusion. Some of the most frequent pairs of µops derived from the same instruction can be fused into a single µop.

Page 58

Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entries. See Table 1-3 for processor cache parameters.
Out-of-Order Core
The processor core dynamically executes µops independent of program order.

Page 59

In-Order Retirement
The retirement unit in the Pentium M processor that buffers completed µops is the reorder buffer (ROB). The ROB updates the architectural state in order. Up to three µops may be retired per cycle.

Page 60

• Power-optimized bus
The system bus is optimized for power efficiency; increased bus speed supports 667 MHz.
• Data Prefetch
Intel Core Solo and Intel Core Duo processors implement improved hardware prefetch mechanisms: one mechanism can look ahead and prefetch data into L1 from L2.

Page 61

Data Prefetching
Intel Core Solo and Intel Core Duo processors provide hardware mechanisms to prefetch data from memory to the second-level cache.

Page 62

The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources.

Page 63

In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor.

Page 64

Processor Resources and Hyper-Threading Technology
The majority of microarchitecture resources in a physical processor are shared between the logical processors. Only a few small data structures were replicated for each logical processor.

Page 65

For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from blocking forward progress.

Page 66

Microarchitecture Pipeline and Hyper-Threading Technology
This section describes the HT Technology microarchitecture and how instructions from the two logical processors are handled between the front end and the back end of the pipeline.

Page 67

Execution Core
The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops are placed in the queues waiting for execution, there is no distinction between instructions from the two logical processors.

Page 68

Pentium Processor Extreme Edition provide four logical processors in a physical package that has two execution cores. Each core provides two logical processors sharing an execution core and a cache hierarchy.

Page 69

Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor

Page 70

Microarchitecture Pipeline and Multi-Core Processors
In general, each core in a multi-core processor resembles a single-core processor implementation of the underlying microarchitecture.

Page 71

that the cache line that contains the memory location is owned by the first-level data cache of the initiating core (that is, the line is in exclusive or modified state). Then the processor looks for the cache line in the cache and memory sub-systems.

Page 72

when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple cache misses that require the eviction of modified lines and are within a short time, there is an overall degradation in response time of these cache misses.

Page 73

Chapter 2: General Optimization Guidelines
This chapter discusses general optimization techniques that can improve the performance of applications running on the Intel Pentium 4, Intel Xeon, Pentium M processors, as well as on dual-core processors.

Page 74

The following sections describe practices, tools, coding rules and recommendations associated with these factors that will aid in optimizing the performance on IA-32 processors.

Page 75

* Streaming SIMD Extensions (SSE)
** Streaming SIMD Extensions 2 (SSE2)
General Practices and Coding Guidelines
This section discusses guidelines derived from the performance factors listed in the “Tuning to Achieve Optimum Performance” section.

Page 76

Use Available Performance Tools
• Current-generation compiler, such as the Intel C++ Compiler:
  - Set this compiler to produce code for the target processor implementation
  - Use the compiler switches for optimization and/or profile-guided optimization.

Page 77

Optimize Branch Predictability
• Improve branch predictability and optimize instruction prefetching by arranging code to be consistent with the static branch prediction assumption: backward taken and forward not taken.

Page 78

• Minimize use of global variables and pointers.
• Use the const modifier; use the static modifier for global variables.

Page 79

• Avoid longer latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e.g., use shifts instead of multiplies).
• Use the lea instruction and the full range of addressing modes to do address calculation.

Page 80

• Avoid the use of conditionals.
• Keep induction (loop) variable expressions simple.
• Avoid using pointers; try to replace pointers with arrays and indices.
Coding Rules, Suggestions and Tuning Hints
This chapter includes rules, suggestions and hints.

Page 81

Performance Tools
Intel offers several tools that can facilitate optimizing your application’s performance.
Intel® C++ Compiler
Use the Intel C++ Compiler following the recommendations described here.

Page 82

General Compiler Recommendations
A compiler that has been extensively tuned for the target microarchitecture can be expected to match or outperform hand-coding in a general case.

Page 83

The VTune Performance Analyzer also enables engineers to use these counters to measure a number of workload characteristics, including:
• retirement throughp.

Page 84

Intel Core Solo and Intel Core Duo processors have an enhanced front end that is less sensitive to the 4-1-1 template. The practice has no real impact on processors based on the Intel NetBurst microarchitecture.

Page 85

• On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cache.

Page 86

Transparent Cache-Parameter Strategy
If CPUID instruction supports function leaf 4, also known as deterministic cache parameter leaf, this function le.

Page 87

Branch Prediction
Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving the predictability of branches, you can increase the speed of code significantly.

Page 88

Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches.

Page 89

See Example 2-2. The optimized code sets ebx to zero, then compares A and B. If A is greater than or equal to B, ebx is set to one. Then ebx is decreased and “and-ed” with the difference of the constant values.

Page 90

The cmov and fcmov instructions are available on the Pentium II and subsequent processors, but not on Pentium processors and earlier 32-bit Intel architecture processors. Be sure to check whether a processor supports these instructions with the cpuid instruction.

Page 91

Static Prediction
Branches that do not have a history in the BTB (see the “Branch Prediction” section) are predicted using a static prediction algorithm.

Page 92

Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-t.

Page 93

Example 2-6 and Example 2-7 provide basic rules for a static prediction algorithm. In Example 2-6, the backward branch (JC Begin) is not in the BTB the first time through; therefore, the BTB does not issue a prediction.

Page 94

Inlining, Calls and Returns
The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs.

Page 95

Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.

Page 96

Placing data immediately following an indirect branch can cause a performance problem. If the data consist of all zeros, it looks like a long stream of adds to memory destinations, which can cause resource conflicts and slow down branch recovery.

Page 97

indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history.

Page 98

best performance from a coding effort. An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 2-9.

Page 99

• The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop.

Page 100

In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element. By unrolling the loop you can make both assignments each iteration, removing one branch in the loop body.

Page 101

Memory Accesses
This section discusses guidelines for optimizing code and data memory accesses. The most important recommendations are:
• align data, paying a.

Page 102

Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16 byte boundaries.

Page 103

Alignment of code is less of an issue for the Pentium 4 processor. Alignment of branch targets to maximize bandwidth of fetching cached instructions is an issue only when not executing out of the trace cache.

Page 104

Store Forwarding
The processor’s memory system only sends stores to memory (including cache) after store retirement. However, store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency.

Page 105

If a variable is known not to change between when it is stored and when it is used again, the register that was stored can be copied or used directly. If register pressure is too high, or an unseen function is called before the store and the second load, it may not be possible to eliminate the second load.

Page 106

The size and alignment restrictions for store forwarding are illustrated in Figure 2-2. Coding rules to help programmers satisfy size and alignment restrictions for store forwarding follow.
Assembly/Compiler Coding Rule 18.

Page 107

A load that forwards from a store must wait for the store’s data to be written to the store buffer before proceeding, but other, unrelated loads need not wait.

Page 108

Example 2-14 illustrates a stalled store-forwarding situation that may appear in compiler generated code. Sometimes a compiler generates code similar to that shown in Example 2-14 to handle a byte spilled to the stack and convert the byte to an integer value.

Page 109

When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are more efficient (if aligned) and can be used to avoid unaligned loads.

Page 110

Store-forwarding Restriction on Data Availability
The value to be stored must be available before the load operation can be completed. If this restriction is violated, the execution of the load will be delayed until the data is available.

Page 111

An example of a loop-carried dependence chain is shown in Example 2-17.
Data Layout Optimizations
User/Source Coding Rule 2. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.

Page 112

Cache line size for Pentium 4 and Pentium M processors can impact streaming applications (for example, multimedia).

Page 113

However, if the access pattern of the array exhibits locality, such as if the array index is being swept through, then the Pentium 4 processor prefetches data from struct_of_array, even if the elements of the structure are accessed together.

Page 114

non-sequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can recognize up to eight concurrent streams. See Chapter 6 for more information on the hardware prefetcher.

Page 115

If for some reason it is not possible to align the stack for 64-bits, the routine should access the parameter and save it into a register or known aligned storage, thus incurring the penalty only once.

Page 116

Capacity Limits in Set-Associative Caches
Capacity limits may occur if the number of outstanding memory references that are mapped to the same set in each way of a given cache exceeds the number of ways of that cache.

Page 117

Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors
Aliasing conditions that are specific to the Pentium 4 processor and Intel Xeon processor are:
• 16K for code – there can only be one of these in the trace cache at a time.

Page 118

Aliasing Cases in the Pentium M Processor
Pentium M, Intel Core Solo and Intel Core Duo processors have the following aliasing case:
• Store forw.

Page 119

Mixing Code and Data
The Pentium 4 processor’s aggressive prefetching and pre-decoding of instructions has two related effects:
• Self-modifying code works correctly, according to the Intel architecture processor requirements, but incurs a significant performance penalty.

Page 120

and cross-modifying code (when more than one processor in a multi-processor system are writing to a code page) should be avoided when high performance is desired.

Page 121

write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.
Assembly/Compiler Coding Rule 28.

Page 122

be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see the Intel Architecture Software Developer’s Manual.

Page 123

Locality enhancement to the last level cache can be accomplished by sequencing the data access pattern to take advantage of hardware prefetching.

Page 124

Minimizing Bus Latency

The system bus on Intel Xeon and Pentium 4 processors provides up to 6.4 GB/sec of throughput at a 200 MHz scalable bus clock rate. (See MSR_EBC_FREQUENCY_ID register.) The peak bus bandwidth is even higher with higher bus clock rates.

Page 125

User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should pay attention to favor data access pa.

Page 126

Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions

Example 2-22 Non-temporal Stores and Partial Bus Write Transactions

#define STRID.

Page 127

Prefetching

The Pentium 4 processor has three prefetching mechanisms:
• hardware instruction prefetcher
• software prefetch for data
• hardware prefetch for cache lines of data or instructions.

Page 128

access patterns to suit the hardware prefetcher is highly recommended, and should be a higher-priority consideration than using software prefetch instructions. The hardware prefetcher is best for small-stride data access patterns in either direction with cache-miss stride not far from 64 bytes.

Page 129

• new cache line flush instruction
• new memory fencing instructions

For a detailed description of using cacheability instructions, see Chapter 6.

Page 130

Guidelines for Optimizing Floating-point Code

User/Source Coding Rule 10. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches.

Page 131

to early out). However, be careful of introducing more than a total of two values for the floating point control word, or there will be a large performance penalty. See “Floating-point Modes”.

Page 132

desired numeric precision, the size of the look-up table, and taking advantage of the parallelism of the Streaming SIMD Extensions and the Streaming SIMD Extensions 2 instructions.

Page 133

executing SSE/SSE2/SSE3 instructions and when speed is more important than complying with the IEEE standard. The following paragraphs give recommendations on how to optimize your code to reduce performance degradations related to floating-point exceptions.

Page 134

Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification.

Page 135

FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo and Intel Core Duo processors, FLDCW is improved over previous generations. Specifically, the optimization for FLDCW allows programmers to alternate between two constant values efficiently.

Page 136

Assembly/Compiler Coding Rule 31. (H impact, M generality) Minimize changes to bits 8-12 of the floating point control word.

Page 137

If there is more than one change to the rounding, precision and infinity bits and the rounding mode is not important to the result, use the algorithm in Example 2-23 to avoid synchronization issues, the overhead of the fldcw instruction and having to change the rounding mode.

Page 138

Example 2-23 Algorithm to Avoid Changing the Rounding Mode

_fto132proc
    lea ecx,[esp-8]
    sub esp,16      ; allocate frame
    and ecx,-8      ; align pointer on boundar.

Page 139

Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits.
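As an illustration of the rule above, floor and ceiling can be derived from a truncating conversion plus a one-step correction, so that no extra rounding-mode value is needed. This is a minimal sketch, not the manual's Example 2-23: the function names are hypothetical, and it assumes the compiler lowers the cast to a truncating conversion instruction (for example cvttsd2si under SSE2 code generation). With pure x87 code generation the cast itself may toggle the control word, which is the case Example 2-23 addresses.

```c
/* Floor of x as an int, using only a truncating conversion.
   Assumes x is finite and the result fits in an int. */
int floor_to_int(double x)
{
    int t = (int)x;        /* C conversion truncates toward zero */
    return t - (t > x);    /* step down by one when truncation rounded up */
}

/* Ceiling by the mirror-image correction. */
int ceil_to_int(double x)
{
    int t = (int)x;
    return t + (t < x);    /* step up by one when truncation rounded down */
}
```

The correction term is a comparison result (0 or 1), so both functions are branch-free at the source level.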

Page 140

Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the number of changes to the precision mode.

Improving Parallelism and the Use of FXCH

The x87 instruction set relies on the floating point stack for one of its operands.

Page 141

This in turn allows instructions to be reordered to make instructions available to be executed in parallel. Out-of-order execution precludes the need for using fxch to move instructions for very short distances.

x87 vs.

Page 142

• Scalar floating-point registers may be accessed directly, avoiding fxch and top-of-stack restrictions. On the Pentium 4 processor, the floating-point register stack may be used simultaneously with XMM registers.

Page 143

Recommendation: Use the compiler switch to generate SSE2 scalar floating-point code over x87 code. When working with scalar SSE/SSE2 code, pay attention to the need for clearing the content of unused slots in an xmm register and the associated performance impact.

Page 144

Floating-Point Stalls

Floating-point instructions have a latency of at least two cycles. But, because of the out-of-order nature of Pentium II and the subsequent processors, stalls will not necessarily occur on an instruction or µop basis.

Page 145

Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or Streaming SIMD Extensions 2.

Instruction Selection

This section explains how to generate optimal assembly code.

Page 146

Complex Instructions

Assembly/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four µops and require multiple cycles to decode.

Page 147

Use of the inc and dec Instructions

The inc and dec instructions modify only a subset of the bits in the flag register.

Page 148

CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly with a partial flag register stall and alternative code without the stall is shown in Table 2-2.

Integer Divide

Typically, an integer divide is preceded by a cwd or cdq instruction.

Page 149

(model 9) does incur a penalty. This is because every operation on a partial register updates the whole register. However, this does mean that there may be false dependencies between any references to partial registers.

Page 150

Table 2-3 illustrates using movzx to avoid a partial register stall when packing three byte values into a register.

Assembly/Compiler Coding Rule 44. (ML impact, L generality) Use simple instructions that are less than eight bytes in length.

Page 151

less delay than the partial register update problem mentioned above, but the performance gain may vary. If the additional μop is a critical problem, movsx can sometimes be used as an alternative. Sometimes sign-extended semantics can be maintained by zero-extending operands.

Page 152

Prefixes and Instruction Decoding

An IA-32 instruction can be up to 15 bytes in length. Prefixes can change the length of an instruction that the decoder must recognize. In some situations, using a length-changing prefix (LCP) causes extra delay in decoding the instruction.

Page 153

• Processing an instruction with the 0x66 prefix that (i) has a modr/m byte in its encoding and (ii) the opcode byte of the instruction happens to be aligned on byte 14 of an instruction fetch line. The performance delay in this case is approximately twice that of the other two situations.

Page 154

String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable.

Page 155

• Cache eviction: If the amount of data to be processed by a memory routine approaches half the size of the last level on-die cache, temporal locality of the cache may suffer. Using streaming store instructions (for example: movntq, movntdq) can minimize the effect of flushing the cache.

Page 156

improve address alignment, a small piece of prolog code using movsb/stosb with count less than 4 can be used to peel off the non-aligned data moves before starting to use movsd/stosd.
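The peeling idea can be sketched in portable C as follows. This is only an illustration of the prolog/body/epilog structure, not the manual's code: fill_bytes is a hypothetical name, the byte loops stand in for stosb, and the 4-byte copies stand in for stosd.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n bytes at dst with value c: a byte prolog until dst is 4-byte
   aligned, then 4-byte stores, then a byte epilog for the tail. */
void fill_bytes(unsigned char *dst, unsigned char c, size_t n)
{
    /* Prolog: at most 3 byte stores (the stosb part). */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = c;
        n--;
    }
    /* Main body: aligned doubleword stores (the stosd part). */
    uint32_t pattern = 0x01010101u * c;
    while (n >= 4) {
        memcpy(dst, &pattern, 4);  /* fixed-size copy of one doubleword */
        dst += 4;
        n -= 4;
    }
    /* Epilog: remaining 0..3 bytes. */
    while (n > 0) {
        *dst++ = c;
        n--;
    }
}
```

The prolog count is always below 4 because it stops as soon as the destination address is a multiple of 4.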

Page 157

Memory routines in the runtime library generated by Intel Compilers are optimized across a wide range of address alignments, counter values, and microarchitectures. In most cases, applications should take advantage of the default memory routines provided by Intel Compilers.

Page 158

In some situations, the byte count of the data to operate on is known by the context (versus from a parameter passed from a call). One can take a simpler approach than those required for a general-purpose library routine.

Page 159

Clearing Registers

The Pentium 4 processor provides special support for xor, sub, or pxor operations when executed within the same register. This recognizes that clearing a register does not depend on the old value of the register.

Page 160

Using a test instruction between the instruction that may modify part of the flag register and the instruction that uses the flag register can also help prevent a partial flag register stall. Assembly/Compiler Coding Rule 52.

Page 161

Use movapd as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the μops for movapd use a different execution port and this port is more likely to be free. The change can impact performance.

Page 162

Prolog Sequences

Assembly/Compiler Coding Rule 57. (M impact, MH generality) In routines that do not need a frame pointer and that do not have called routines that modify ESP, use ESP as the base register to free up EBP.

Page 163

Using memory as a destination operand may further reduce register pressure at the slight risk of making trace cache packing more difficult.

Page 164

Spill Scheduling

The spill scheduling algorithm used by a code generator will be impacted by the Pentium 4 processor memory subsystem. A spill scheduling algorithm is an algorithm that selects what values to spill to memory when there are too many live values to fit in registers.

Page 165

Because micro-ops are delivered from the trace cache in the common cases, decoding rules are not required.

Scheduling Rules for the Pentium M Processor Decode.

Page 166

Data elements in parallel. The number of elements which can be operated on in parallel ranges from four single-precision floating point data elements in S.

Page 167

User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches.

User/Source Coding Rule 20. (M impact, ML generality) Keep induction (loop) variable expressions simple.

Page 168

The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware.

Page 169

User/Source Coding Rules

User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets, and at least one of.

Page 170

User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should.

Page 171

look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing .

Page 172

order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. Consistently mispredicted branches are rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.

Page 173

Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in 16-byte chunks. 2-22

Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk.

Page 174

Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data. 2-34

Assembly/Compiler Coding Rule 19.

Page 175

first-level cache working set. Avoid having more than 8 cache lines that are some multiple of 64 KB apart in the same second-level cache working set. Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KB.

Page 176

Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode.

Page 177

Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependencies on earlier instructions that set the flags.

Page 178

instead of a cmp of the register to zero, this saves the need to encode the zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register.

Page 179

Assembly/Compiler Coding Rule 56. (M impact, ML generality) For arithmetic or logical operations that have their source operand in memory and the destinatio.

Page 180

Tuning Suggestions

Tuning Suggestion 1. Rarely, a performance problem may be noted due to executing data on a code page as instructions. The only condition where this is likely to happen is following an indirect branch that is not resident in the trace cache.

Page 181

3 Coding for SIMD Architectures

Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions technology (SSE), and MMX technology.

Page 182

Checking for Processor Support of SIMD Technologies

This section shows how to check whether a processor supports MMX technology, SSE, SSE2, or SSE3. SIMD technology can be included in your application in three ways: 1.

Page 183

For more information on cpuid, see Intel® Processor Identification with CPUID Instruction, order number 241618.

Checking for Streaming SIMD Extensions Support

Checking for support of Streaming SIMD Extensions (SSE) on your processor is like checking for MMX technology.

Page 184

To find out whether the operating system supports SSE, execute an SSE instruction and trap for an exception if one occurs.

Page 185

Checking for Streaming SIMD Extensions 2 Support

Checking for support of SSE2 is like checking for SSE support. You must also check whether your operating system (OS) supports SSE. The OS requirements for SSE2 support are the same as the requirements for SSE.

Page 186

Checking for Streaming SIMD Extensions 3 Support

SSE3 includes 13 instructions; 11 of those are suited for SIMD or x87 style programming. Checking for support of these SSE3 instructions is similar to checking for SSE support.

Page 187

Example 3-6 Identification of SSE3 with cpuid

SSE3 requires the same support from the operating system as SSE. To find out whether the operating system supports SSE3 (FISTTP and 10 of the SIMD instructions in SSE3), execute an SSE3 instruction and trap for an exception if one occurs.

Page 188

Example 3-7 Identification of SSE3 by the OS

Considerations for Code Conversion to SIMD Programming

The VTune Performance Enhancement Environment CD provides tools to aid in the evaluation and tuning. But before implementing them, you need answers to the following questions: 1.

Page 189

Figure 3-1 Converting to Streaming SIMD Extensions Chart

[Flowchart; recoverable box labels: Identify Hot Spots in Code; Code benefits from SIMD; Integer or floating-point?; STOP]

Page 190

To use any of the SIMD technologies optimally, you must evaluate the following situations in your code:
• fragments that are computationally intensiv.

Page 191

specific optimizations. Where appropriate, the coach displays pseudo-code to suggest the use of highly optimized intrinsics and functions in the Intel® Performance Library Suite.

Page 192

costly application processing time. However, these routines have potential for increased performance when you convert them to use one of the SIMD technologies.

Page 193

Coding Methodologies

Software developers need to compare the performance improvement that can be obtained from assembly code versus the cost of those improvements.

Page 194

The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the SSE. The same techniques may be used for single-precision floating-point, double-precision floating-point, and integer data under SSE2, SSE, and MMX technology.

Page 195

Assembly

Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) in C/C++ code. The Intel compiler or assembler recognize the new instructions and registers, then directly generate the corresponding code.

Page 196

SIMD Extensions 2 integer SIMD and __m128d is used for double precision floating-point SIMD. These types enable the programmer to choose the implementation of an algorithm directly, while allowing the compiler to perform register allocation and instruction scheduling where possible.

Page 197

The intrinsic data types, however, are not a basic ANSI C data type, and therefore you must observe the following usage restrictions:
• Use intrinsic data types only on the left-hand side of an assignment as a return value or as a parameter.

Page 198

Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The “+” and “=” operators are overloaded so that the actual Streaming SIMD Extensions implementation in the previous example is abstracted out, or hidden, from the developer.

Page 199

The caveat to this is that only certain types of loops can be automatically vectorized, and in most cases user interaction with the compiler is needed to fully enable this. Example 3-12 shows the code for automatic vectorization for the simple four-iteration loop (from Example 3-8).

Page 200

Stack and Data Alignment

To get the most performance out of code written for SIMD technologies, data should be formatted in memory according to the guidelines described in this section. Assembly code with unaligned accesses is a lot slower than code with aligned accesses.

Page 201

By adding the padding variable pad, the structure is now 8 bytes, and if the first element is aligned to 8 bytes (64 bits), all following elements will also be aligned.
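A small sketch of the padding idea follows. The structure and field names are hypothetical (the manual's own declaration is not reproduced in this excerpt), and the sizes assume a 2-byte short, as on IA-32.

```c
#include <stddef.h>

/* Hypothetical 6-byte record: elements of an array of these start at
   offsets 0, 6, 12, ... and drift off 8-byte boundaries. */
struct point3_unpadded { short x, y, z; };

/* Same fields plus an explicit pad, giving an 8-byte element. */
struct point3_padded { short x, y, z; short pad; };

/* Byte offset of element i in an array of padded records. */
size_t padded_offset(size_t i)
{
    return i * sizeof(struct point3_padded);
}
```

With the padded layout every element offset is a multiple of 8, so aligning the first element aligns all of them.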

Page 202

Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation on the first data element will be fully aligned. For the second data element, however, access to the data vector will be misaligned.

Page 203

• Functions that use Streaming SIMD Extensions or Streaming SIMD Extensions 2 data need to provide a 16-byte aligned stack frame.
• The __m128* parameters need to be aligned to 16-byte boundaries, possibly creating “holes” (due to padding) in the argument block.

Page 204

Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries. When the data is accessed frequently, this can provide a significant performance improvement.

Page 205

The __declspec(align(16)) specifications can be placed before data declarations to force 16-byte alignment. This is particularly useful for local or global data declarations that are assigned to 128-bit data types.
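For compilers that do not accept __declspec(align(16)), the same effect can be sketched with the standard C11 alignas specifier. This substitution is an assumption made for portability, not something the manual describes, and the names coeffs, get_coeffs and is_aligned16 are illustrative.

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 spelling of a 16-byte aligned array; __declspec(align(16)) is the
   compiler-specific spelling named in the text. */
static alignas(16) float coeffs[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Accessor so callers can check or use the aligned array. */
const float *get_coeffs(void)
{
    return coeffs;
}
```

A check such as is_aligned16 can be used in debug builds before data is handed to code that requires aligned 128-bit loads.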

Page 206

In C++ (but not in C) it is also possible to force the alignment of a class/struct/union type, as in the code that follows: struct __declspec(align(.

Page 207

Improving Memory Utilization

Memory performance can be improved by rearranging data and algorithms for SSE2, SSE, and MMX technology intrinsics.

Page 208

There are two options for computing data in AoS format: perform the operation on the data as it stands in AoS format, or re-arrange it (swizzle it) into SoA format dynamically. See Example 3-16 for code samples of each option based on a dot-product computation.

Page 209

Performing SIMD operations on the original AoS format can require more calculations and some of the operations do not take advantage of all of the SIMD elements available. Therefore, this option is generally less efficient.

Page 210

but is somewhat inefficient as there is the overhead of extra instructions during computation. Performing the swizzle statically, when the data structures are being laid out, is best as there is no runtime overhead.
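The AoS and SoA layouts and a dynamic swizzle between them can be sketched in plain C as below. This is not the manual's Example 3-16: the type names, the vertex count N and the function names are hypothetical, and the scalar loops only show the data layout that a SIMD version would exploit.

```c
#include <stddef.h>

#define N 8   /* hypothetical vertex count */

/* Array of structures: x, y, z of one vertex are adjacent in memory. */
struct vertex { float x, y, z; };

/* Structure of arrays: all x together, all y together, all z together. */
struct vertices_soa { float x[N], y[N], z[N]; };

/* Dynamic swizzle from AoS to SoA. */
void aos_to_soa(const struct vertex *in, struct vertices_soa *out)
{
    for (size_t i = 0; i < N; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Dot product of every vertex with (fx, fy, fz). Each iteration is
   independent and reads unit-stride data from three streams. */
void dot_soa(const struct vertices_soa *v, float fx, float fy, float fz,
             float *result)
{
    for (size_t i = 0; i < N; i++)
        result[i] = v->x[i] * fx + v->y[i] * fy + v->z[i] * fz;
}
```

If the data is declared in the SoA form from the start, the aos_to_soa step disappears, which corresponds to the static swizzle described above.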

Page 211

Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation that uses arrays x, y, and z in Example 3-15 would require three separate data streams.

Page 212

Strip Mining

Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance.

Page 213

The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data.

Page 214

In Example 3-19, the computation has been strip-mined to a size strip_size. The value strip_size is chosen such that strip_size elements of array v[Num] fit into the cache hierarchy.
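A rough sketch of the transformation, under stated assumptions: the transform and lighting bodies below are trivial stand-ins for the real routines, STRIP_SIZE is an arbitrary illustrative value, and none of this is the manual's Example 3-19.

```c
#include <stddef.h>

#define STRIP_SIZE 4   /* hypothetical; chosen so one strip stays in cache */

/* Stand-in stages; the real routines would do much more work. */
static void transform(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) v[i] = v[i] * 2.0f;
}

static void lighting(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) v[i] = v[i] + 1.0f;
}

/* Two full passes: for a large array, data touched by transform has left
   the cache by the time lighting reads it again. */
void process_two_pass(float *v, size_t n)
{
    transform(v, n);
    lighting(v, n);
}

/* Strip-mined: both stages run on one strip while it is still cached. */
void process_strip_mined(float *v, size_t n)
{
    for (size_t start = 0; start < n; start += STRIP_SIZE) {
        size_t len = (n - start < STRIP_SIZE) ? (n - start) : STRIP_SIZE;
        transform(v + start, len);
        lighting(v + start, len);
    }
}
```

Both versions compute the same result; only the order in which memory is touched changes.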

Page 215

For the first iteration of the inner loop, each access to array B will generate a cache miss. If the size of one row of array A, that is, A[2, 0:MAX-1], is large enough, by the time the second iteration starts, each access to array B will always generate a cache miss.

Page 216

This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 3-3, a block_size is selected as the loop blocking factor. Suppose that block_size is 8, then the blocked chunk of each array will be eight cache lines (32 bytes each).

Page 217

As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. If MAX is huge, loop blocking can also help reduce the penalty from DTLB (data translation look-aside buffer) misses.
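Loop blocking can be sketched as follows. The loop body (adding the transpose of B into A) is an assumed workload chosen because it walks A by rows and B by columns; the array dimension is hypothetical, BLOCK_SIZE follows the value 8 used in the text, and MAX is assumed to be a multiple of BLOCK_SIZE.

```c
#define MAX 16          /* hypothetical array dimension */
#define BLOCK_SIZE 8    /* blocking factor, as in the text's example */

/* Unblocked: B is walked column-wise, so each access lands on a new row. */
void add_transpose_naive(float A[MAX][MAX], float B[MAX][MAX])
{
    for (int i = 0; i < MAX; i++)
        for (int j = 0; j < MAX; j++)
            A[i][j] = A[i][j] + B[j][i];
}

/* Blocked: work proceeds in BLOCK_SIZE x BLOCK_SIZE tiles so the lines of
   A and B touched by one tile are reused before they are evicted. */
void add_transpose_blocked(float A[MAX][MAX], float B[MAX][MAX])
{
    for (int i = 0; i < MAX; i += BLOCK_SIZE)
        for (int j = 0; j < MAX; j += BLOCK_SIZE)
            for (int ii = i; ii < i + BLOCK_SIZE; ii++)
                for (int jj = j; jj < j + BLOCK_SIZE; jj++)
                    A[ii][jj] = A[ii][jj] + B[jj][ii];
}
```

The two functions produce identical results; the benefit of the blocked form only shows when MAX is large relative to the cache.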

Page 218

Note that this can be applied to both SIMD integer and SIMD floating-point code. If there are multiple consumers of an instance of a register, group the consumers together as closely as possible. However, the consumers should not be scheduled near the producer.

Page 219

Recommendation: When targeting code generation for Intel Core Solo and Intel Core Duo processors, favor instructions consisting of two micro-ops over those with more than two micro-ops.

Page 220


Page 221

4 Optimizing for SIMD Integer Applications

The SIMD integer instructions provide performance improvements in applications that are integer-intensive and can take advantage of the SIMD architecture of Pentium 4, Intel Xeon, and Pentium M processors.

Page 222

For planning considerations of using the new SIMD integer instructions, refer to “Checking for Streaming SIMD Extensions 2 Support” in Chapter 3.

Page 223

Using SIMD Integer with x87 Floating-point

All 64-bit SIMD integer instructions use the MMX registers, which share register state with the x87 floating-point stack. Because of this sharing, certain rules and considerations apply.

Page 224

Using emms clears all of the valid bits, effectively emptying the x87 floating-point stack and making it ready for new x87 floating-point operations. The emms instruction ensures a clean transition between using operations on the MMX registers and using operations on the x87 floating-point stack.

Page 225

• Don’t empty when already empty: If the next instruction uses an MMX register, _mm_empty() incurs a cost with no benefit.
• Group Instructions: Try to partition regions that use x87 FP instructions from those that use 64-bit SIMD integer instructions.

Page 226

Data Alignment

Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned. Referencing unaligned 64-bit SIMD integer data can incur a performance penalty due to accesses that span 2 cache lines.

Page 227

Signed Unpack

Signed numbers should be sign-extended when unpacking the values. This is similar to the zero-extend shown above except that the psrad instruction (packed shift right arithmetic) is used to effectively sign extend the values.

Page 228

Interleaved Pack with Saturation

The pack instructions pack two values into the destination register in a predetermined order.

Page 229

Figure 4-2 illustrates two values interleaved in the destination register, and Example 4-4 shows code that uses the operation. The two signed doublewords are used as source operands and the result is interleaved signed words.

Page 230

The pack instructions always assume that the source operands are signed numbers. The result in the destination register is always defined by the pack instruction that performs the operation.

Page 231

Non-Interleaved Unpack

The unpack instructions perform an interleave merge of the data elements of the destination and source operands into the destination register. The following example merges the two operands into the destination registers without interleaving.

Page 232

The other destination register will contain the opposite combination illustrated in Figure 4-4. The code in Example 4-6 unpacks two packed-word sources in a non-interleaved way.

Page 233

Extract Word

The pextrw instruction takes the word in the designated MMX register selected by the two least significant bits of the immediate value and moves it to the lower half of a 32-bit integer register; see Figure 4-5 and Example 4-7.

Page 234

Insert Word

The pinsrw instruction loads a word from the lower half of a 32-bit integer register or from memory and inserts it in the MMX technology destination register at a position defined by the two least significant bits of the immediate constant.
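The word selection performed by pextrw and pinsrw on a 64-bit MMX register can be modeled in scalar C as below. This is a behavioral sketch for reasoning about the immediate operand, with hypothetical function names; it is not how one would call the instructions in practice.

```c
#include <stdint.h>

/* Model of pextrw: pick word (imm & 3) of the 64-bit value and return it
   zero-extended in a 32-bit result. */
uint32_t extract_word(uint64_t mm, unsigned imm)
{
    return (uint32_t)((mm >> (16 * (imm & 3u))) & 0xFFFFu);
}

/* Model of pinsrw: replace word (imm & 3) of mm with the low 16 bits of
   r32, leaving the other three words unchanged. */
uint64_t insert_word(uint64_t mm, uint32_t r32, unsigned imm)
{
    unsigned shift = 16 * (imm & 3u);
    uint64_t mask = (uint64_t)0xFFFFu << shift;
    return (mm & ~mask) | ((uint64_t)(r32 & 0xFFFFu) << shift);
}
```

Only the two least significant bits of the immediate matter, so an immediate of 6 selects the same word as an immediate of 2.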

Page 235

If all of the operands in a register are being replaced by a series of pinsrw instructions, it can be useful to clear the content and break the dependence chain by either using the pxor instruction or loading the register.

Page 236

Move Byte Mask to Integer

The pmovmskb instruction returns a bit mask formed from the most significant bits of each byte of its source operand. When used with the 64-bit MMX registers, this produces an 8-bit mask, zeroing out the upper 24 bits in the destination register.

Page 237

Figure 4-7 pmovmskb Instruction Example

Example 4-10 pmovmskb Instruction Code

; Input:
;   source value
; Output:
;   32-bit register containing the byte mask in the lower
;   eight bits
;
movq mm0, [edi]
pmovmskb eax, mm0
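The result that pmovmskb leaves in the 32-bit register for a 64-bit source can be modeled in scalar C. The function name byte_mask is hypothetical and the loop is only a behavioral sketch of the instruction.

```c
#include <stdint.h>

/* Model of pmovmskb on a 64-bit source: bit i of the result is the most
   significant bit of byte i; bits 8..31 of the result are zero. */
uint32_t byte_mask(uint64_t mm)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t msb = (uint32_t)((mm >> (8 * i + 7)) & 1u);
        mask |= msb << i;
    }
    return mask;
}
```

A typical use is to run a packed compare first, then turn the per-byte all-ones or all-zeros results into a compact mask that ordinary integer code can test.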

Page 238

Packed Shuffle Word for 64-bit Registers

The pshufw instruction (see Figure 4-8, Example 4-11) uses the immediate (imm8) operand to select between the four words in either two MMX registers or one MMX register and a 64-bit memory location.
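How the immediate drives the 64-bit word shuffle can be modeled in scalar C: each 2-bit field of imm8 names the source word that lands in the corresponding destination word. The function name is hypothetical and this is a behavioral sketch only.

```c
#include <stdint.h>

/* Model of a 4-word shuffle: destination word i is source word
   ((imm8 >> 2*i) & 3). */
uint64_t shuffle_words(uint64_t src, unsigned imm8)
{
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 3u;
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu;
        dst |= word << (16 * i);
    }
    return dst;
}
```

For instance, an immediate of 0x1B reverses the four words, 0xE4 leaves them in place, and 0x00 broadcasts word 0 to all four positions.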

Page 239

Packed Shuffle Word for 128-bit Registers

The pshuflw/pshufhw instruction performs a full shuffle of any source word field within the low/high 64 .

Page 240

Unpacking/interleaving 64-bit Data in 128-bit Registers

The punpcklqdq/punpckhqdq instructions interleave the low/high-order 64-bits of the source operand and the low/high-order 64-bits of the destination operand and write them to the destination register.

Page 241

Data Movement
There are two additional instructions to enable data movement from the 64-bit SIMD integer registers to the 128-bit SIMD registers. The movq2dq instruction moves the 64-bit integer data from an MMX register (source) to a 128-bit destination register.

Page 242

    pxor MM0, MM0
    pcmpeq MM1, MM1
    psubb MM0, MM1 [psubw MM0, MM1] (psubd MM0, MM1)
    ; three instructions above generate
    ; the constant 1 in every
    ; packed-by.

Page 243

Building Blocks
This section describes instructions and algorithms which implement common code building blocks efficiently.
Absolute Difference of Unsigned Numbers
Example 4-16 computes the absolute difference of two unsigned numbers.
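One branch-free way to obtain an unsigned absolute difference, sketched here per element in plain C, is to subtract with unsigned saturation in both directions and OR the results: one of the two differences is always zero. This is an illustrative scalar model under that assumption, not a copy of Example 4-16; the helper names are hypothetical.

```c
#include <stdint.h>

/* Unsigned saturating subtract: results below zero clamp to 0. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* |x - y| for unsigned bytes: one of the two saturating
   differences is zero, so OR selects the nonzero one. */
uint8_t abs_diff_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)(sub_sat_u8(x, y) | sub_sat_u8(y, x));
}
```

The same structure applies to unsigned words; only the element width changes.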

Page 244

Absolute Difference of Signed Numbers
Example 4-17 computes the absolute difference of two signed numbers. The technique used here is to first sort the corresponding elements of the input operands into packed words of the maximum values, and packed words of the minimum values.

Page 245

Absolute Value
Use Example 4-18 to compute |x|, where x is signed. This example assumes signed words to be the operands.
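A branch-free formulation of |x| for signed words is max(x, -x). The scalar C sketch below illustrates that identity only; it is an assumption about a suitable technique rather than a transcription of Example 4-18, and abs_word is a hypothetical name.

```c
#include <stdint.h>

/* |x| as max(x, -x) for a signed 16-bit word.
   x = -32768 is excluded: its absolute value is not representable. */
int16_t abs_word(int16_t x)
{
    int neg = -(int)x;              /* negate in int to avoid overflow */
    return (int16_t)(x > neg ? x : neg);
}
```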

Page 246

Clipping to an Arbitrary Range [high, low]
This section explains how to clip a value to a range [high, low]. Specifically, if the value is less than low or greater than high, then clip to low or high, respectively.

Page 247

Highly Efficient Clipping
For clipping signed words to an arbitrary range, the pmaxsw and pminsw instructions may be used. For clipping unsigned bytes to an arbitrary range, the pmaxub and pminub instructions may be used.
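Clipping with a maximum followed by a minimum can be written per element as min(max(x, low), high). The sketch below is a scalar C model of that pattern for signed words; the function names are hypothetical, and a SIMD version would apply the same two operations to packed data.

```c
#include <stdint.h>

static int16_t max_s16(int16_t a, int16_t b) { return a > b ? a : b; }
static int16_t min_s16(int16_t a, int16_t b) { return a < b ? a : b; }

/* Clip each signed word of v[0..n-1] into [low, high] (low <= high). */
void clip_words(int16_t *v, int n, int16_t low, int16_t high)
{
    for (int i = 0; i < n; i++)
        v[i] = min_s16(max_s16(v[i], low), high);
}
```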

Page 248

The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last instruction converts the data back to signed data and places the data within the signed range.

Page 249

packed-subtract instructions with unsigned saturation, thus this technique can only be used on packed-bytes and packed-words data types.

Page 250

Unsigned Byte
The pmaxub instruction returns the maximum between the eight unsigned bytes in either two SIMD registers, or one SIMD register and a memory location.

Page 251

The subtraction operation presented above is an absolute difference, that is, t = abs(x-y). The byte values are stored in temporary space, all values are summed together, and the result is written into the lower word of the destination register.
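The sum-of-absolute-differences operation described above can be modeled in scalar C as follows. This is a reference sketch for eight unsigned bytes; sad8 is a hypothetical name, and the 16-bit return type mirrors the statement that the result lands in the lower word.

```c
#include <stdint.h>

/* Sum of |x[i] - y[i]| over eight unsigned bytes.
   The maximum is 8 * 255 = 2040, which fits in 16 bits. */
uint16_t sad8(const uint8_t x[8], const uint8_t y[8])
{
    uint16_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum = (uint16_t)(sum + (x[i] > y[i] ? x[i] - y[i] : y[i] - x[i]));
    return sum;
}
```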

Page 252

The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on packed unsigned words.
Complex Multiply by a Constant
Complex multiplication is an operation which requires four multiplications and two additions.

Page 253

Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input).
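The four multiplications and two additions can be seen in a scalar C sketch of (a + ib)(c + id) = (ac - bd) + i(ad + bc). The types and names below are hypothetical; 16-bit inputs and 32-bit outputs are chosen to match the note that the result is a doubleword.

```c
#include <stdint.h>

typedef struct { int16_t re, im; } cplx16;   /* 16-bit input format  */
typedef struct { int32_t re, im; } cplx32;   /* 32-bit result format */

/* Multiply x by the constant k: four multiplies, two adds/subtracts. */
cplx32 cmul_const(cplx16 x, cplx16 k)
{
    cplx32 r;
    r.re = (int32_t)x.re * k.re - (int32_t)x.im * k.im;
    r.im = (int32_t)x.re * k.im + (int32_t)x.im * k.re;
    return r;
}
```

Each output component is a sum of two products, which is the shape a multiply-add instruction on packed words computes when the constant is pre-arranged with the appropriate signs.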

Page 254

Memory Optimizations
You can improve memory accesses using the following techniques:
• Avoiding partial memory accesses
• Increasing the bandwidth of memory fills and video fills
• Prefetching data with Streaming SIMD Extensions (see Chapter 6, “Optimizing Cache Usage”).

Page 255

Partial Memory Accesses
Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address mem). The large load will stall in this case as shown in Example 4-24.

Page 256

Let us now consider a case with a series of small loads after a large store to the same area of memory (beginning at memory address mem) as shown in Example 4-26. Most of the small loads will stall because they are not aligned with the store; see “Store Forwarding” in Chapter 2 for more details.

Page 257

These transformations, in general, increase the number of instructions required to perform the desired operation.

Page 258

SSE3 provides an instruction, LDDQU, for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested.

Page 259

Increasing Bandwidth of Memory Fills and Video Fills
It is beneficial to understand how memory is accessed and filled.

Page 260

same DRAM page have shorter latencies than sequential accesses to different DRAM pages. In many systems the latency for a page miss (that is, an acces.

Page 261

aligned versions; this can reduce the performance gains when using the 128-bit SIMD integer extensions. The general guidelines on the alignment of memory operands are:
  - The greatest performance gains can be achieved when all memory streams are 16-byte aligned.

Page 262

Packed SSE2 Integer versus MMX Instructions
In general, 128-bit SIMD integer instructions should be favored over 64-bit MMX instructions on Intel Core Solo and Intel Core Duo processors.

Page 263

5-1 5 Optimizing for SIMD Floating-point Applications
This chapter discusses general rules of optimizing for the single-instruction, multiple-data (SIMD) floating-point instructions available in Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3).

Page 264

• Use MMX technology instructions and registers for copying data that is not used later in SIMD floating-point computations.
• Use the reciprocal instructions followed by iteration for increased accuracy.

Page 265

• Is the data arranged for efficient utilization of the SIMD floating-point registers?
• Is this application targeted for processors without SIMD floating-point instructions?
For more details, see the section on “Considerations for Code Conversion to SIMD Programming” in Chapter 3.

Page 266

When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector form. However, all of the optimizations regarding alignment, scheduling, instruction selection, and other optimizations covered in Chapter 2 and Chapter 3 should be observed.

Page 267

For some applications, e.g., 3D geometry, the traditional data arrangement requires some changes to fully utilize the SIMD registers and parallel techniques. Traditionally, the data layout has been an array of structures (AoS).

Page 268

simultaneously referred to as an xyz data representation, see the diagram below) are computed in parallel, and the array is updated one vertex at a time.

Page 269

To utilize all 4 computation slots, the vertex data can be reorganized to allow computation on each component of 4 separate vertices, that is, processing multiple vectors simultaneously. This can also be referred to as an SoA form of representing vertices data shown in Table 5-1.
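The difference between the two layouts can be sketched with C types. The type and function names below are hypothetical and chosen for illustration; four vertices are used so that each SoA component array would fill one 128-bit register of single-precision values.

```c
#define VERTS 4   /* four vertices, one per SIMD computation slot */

/* Array of structures (AoS): x, y, z of one vertex are adjacent. */
typedef struct { float x, y, z; } VertexAoS;

/* Structure of arrays (SoA): the same component of four vertices is adjacent. */
typedef struct { float x[VERTS], y[VERTS], z[VERTS]; } VerticesSoA;

/* Rearrange four AoS vertices into SoA form (xxxx, yyyy, zzzz). */
void aos_to_soa(const VertexAoS in[VERTS], VerticesSoA *out)
{
    for (int i = 0; i < VERTS; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

With the SoA form, one packed operation on out->x processes the x component of all four vertices at once.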

Page 270

Figure 5-2 shows how 1 result would be computed for 7 instructions if the data were organized as AoS and using SSE alone: 4 results would require 28 instructions.

Page 271

Now consider the case when the data is organized as SoA. Example 5-2 demonstrates how 4 results are computed for 5 instructions.

Page 272

To gather data from 4 different memory locations on the fly, follow these steps:
1. Identify the first half of the 128-bit memory location.
2. Group the different halves together using the movlps and movhps to form an xyxy layout in two registers.

Page 273

y1 x1
    movhps xmm7, [ecx+16] // xmm7 = y2 x2 y1 x1
    movlps xmm0, [ecx+32] // xmm0 = -- -- y3 x3
    movhps xmm0, [ecx+48] // xmm0 = y4 x4 y3 x3
    movaps.

Page 274

Example 5-4 shows the same data-swizzling algorithm encoded using the Intel C++ Compiler’s intrinsics for SSE.

Page 275

Although the generated result of all zeros does not depend on the specific data contained in the source operand (that is, XOR of a register with itself always produces all zeros), the instruction cannot execute until the instruction that generates xmm0 has completed.

Page 276

Data Deswizzling
In the deswizzle operation, we want to arrange the SoA format back into AoS format so the xxxx, yyyy, zzzz are rearranged and stored in memory as xyz.

Page 277

You may have to swizzle data in the registers, but not in memory. This occurs when two different functions need to process the data in different layouts. In lighting, for example, data comes as rrrr gggg bbbb aaaa, and you must deswizzle them into rgba before converting into integers.

Page 278

    // Start deswizzling here
    movaps xmm7, xmm4 // xmm7= a1 a2 a3 a4
    movhlps xmm7, xmm3 // xmm7= b3 b4 a3 a4
    movaps xmm6, xmm2 // xmm6= g1 g2 g3 g4
    movlhps x.

Page 279

Using MMX Technology Code for Copy or Shuffling Functions
If there are some parts in the code that are mainly copying, shuffling, or doing logical manipulations that do not require use of SSE code, consider performing these actions with MMX technology code.

Page 280

Example 5-8 illustrates how to use MMX technology code for copying or shuffling.
Horizontal ADD Using SSE
Although vertical computations use the SIMD performance better than horizontal computations do, in some cases, the code must use a horizontal operation.

Page 281

Figure 5-3 Horizontal Add Using movhlps/movlhps
Example 5-9 Horizontal Add Using movhlps/movlhps
    void horiz_add(Vertex_soa *in, float *out) {
    .

Page 282

    // START HORIZONTAL ADD
    movaps xmm5, xmm0 // xmm5= A1,A2,A3,A4
    movlhps xmm5, xmm1 // xmm5= A1,A2,B1,B2
    movhlps xmm1, xmm0 // xmm1= A3,A4,B3,B4
    addps xmm5.

Page 283

Use of cvttps2pi/cvttss2si Instructions
The cvttps2pi and cvttss2si instructions encode the truncate/chop rounding mode implicitly in the instruction, thereby taking precedence over the rounding mode specified in the MXCSR register.

Page 284

avoided since there is a penalty associated with writing this register; typically, through the use of the cvttps2pi and cvttss2si instructions, the rounding control in MXCSR can always be set to round-nearest.
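In C, a cast from float to int always discards the fractional part, independent of the current floating-point rounding mode, which is the same truncate/chop behavior described above. The sketch below shows that behavior; that a compiler targeting SSE implements the cast with a truncating conversion instruction is an assumption about code generation, and truncate_to_int is a hypothetical name.

```c
/* Truncating float-to-int conversion: the fractional part is discarded
   (rounding toward zero), whatever rounding mode is currently set.
   The value must be representable in int. */
int truncate_to_int(float f)
{
    return (int)f;
}
```

Because truncation is encoded in the conversion itself, code written this way has no need to switch the rounding control back and forth.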

Page 285

SSE3 and Complex Arithmetics
The flexibility of SSE3 in dealing with AOS-type of data structure can be demonstrated by the example of multiplication and division of complex numbers. For example, a complex number can be stored in a structure consisting of its real and imaginary part.

Page 286

instructions to perform multiplications of single-precision complex numbers. Example 5-12 demonstrates using SSE3 instructions to perform division of complex numbers. In both of these examples, the complex numbers are stored in arrays of structures.

Page 287

Example 5-12 Division of Two Pairs of Single-precision Complex Numbers
    // Division of (ak + i bk) / (ck + i dk)
    movshdup xmm0, Src1 ; load imaginary parts into the
                        ; destination, b1, b1, b0, b0
    movaps xmm1, src2   ; load the 2nd pair of complex values,
                        ; i.

Page 288

SSE3 and Horizontal Computation
Sometimes the AOS type of data organization is more natural in many algebraic formulas. SSE3 enhances the flexibility of SIMD programming for applications that rely on the horizontal computation model.

Page 289

SIMD Optimizations and Microarchitectures
Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel NetBurst® microarchitecture. The following sub-section discusses optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.

Page 290

When targeting complex arithmetics on Intel Core Solo and Intel Core Duo processors, using single-precision SSE3 instructions can deliver higher performance than alternatives.

Page 291

6-1 6 Optimizing Cache Usage
Over the past decade, processor speed has increased more than ten times. Memory access speed has increased at a slower pace.

Page 292

• Memory Optimization Using Hardware Prefetching, Software Prefetch and Cacheability Instructions: discusses techniques for implementing memory optimizations using the above instructions.
• Using deterministic cache parameters to manage cache hierarchy.

Page 293

• Facilitate compiler optimization:
  - Minimize use of global variables and pointers
  - Minimize use of complex control flow
  - Use the const modifier, avoid register modifier
  - Choose data types carefully (see below) and avoid type casting.

Page 294

• Optimize software prefetch scheduling distance:
  - Far ahead enough to allow interim computation to overlap memory access time.
  - Near enough that the prefetched data is not replaced from the data cache.

Page 295

3. Follows only one stream per 4K page (load or store)
4. Can prefetch up to 8 simultaneous independent streams from eight different 4K regions
5. Does not prefetch across 4K boundary; note that this is independent of paging modes
6.

Page 296

Data reference patterns can be classified as follows:
Temporal: data will be used again soon
Spatial: data will be used in adjacent locations, for example,.

Page 297

The prefetch instruction is implementation-specific; applications need to be tuned to each implementation to maximize performance.

Page 298

The Prefetch Instructions – Pentium 4 Processor Implementation
Streaming SIMD Extensions include four flavors of prefetch instructions, one non-temporal, and three temporal. They correspond to two types of operations, temporal and non-temporal.

Page 299

Currently, the prefetch instruction provides a greater performance gain than preloading because it:
• has no destination register, it only updates cache lines.
• does not stall the normal instruction retirement.
• does not affect the functional behavior of the program.

Page 300

The Non-temporal Store Instructions
This section describes the behavior of streaming stores and reiterates some of the information presented in the previous section.

Page 301

• Reduce disturbance of frequently used cached (temporal) data, since they write around the processor caches.
Streaming stores allow cross-aliasing of memory types for a given memory region.

Page 302

evicting data from all processor caches). The Pentium M processor implements a combination of both approaches. If the streaming store hits a line that is present in the first-level cache, the store data is combined in place within the first-level cache.

Page 303

possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility.
Streaming Store Usage Models
The two primary usage domains for streaming store are coherent requests and non-coherent requests.

Page 304

In case the region is not mapped as WC, the streaming store might update in-place in the cache and a subsequent sfence would not result in the data being written to system memory.

Page 305

The maskmovq/maskmovdqu (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions register) instructions store data from a register to the location specified by the edi register.

Page 306

The degree to which a consumer of data knows that the data is weakly-ordered can vary for these cases. As a result, the sfence instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume this data.

Page 307

The clflush Instruction
The cache line associated with the linear address specified by the value of byte address is invalidated from all levels of the processor cache hierarchy (data and instruction). The invalidation is broadcast throughout the coherence domain.

Page 308

Memory Optimization Using Prefetch
The Pentium 4 processor has two mechanisms for data prefetch: software-controlled prefetch and an automatic hardware prefetch.

Page 309

Hardware Prefetch
The automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. The automatic hardware prefetcher will attempt to prefetch two cache lines ahead of the prefetch stream.

Page 310

• May consume extra system bandwidth if the application’s memory traffic has significant portions with strides of cache misses greater than the trigger distance threshold of hardware prefetch (large-stride memory traffic).

Page 311

Example 6-2 Populating an Array for Circular Pointer Chasing with Constant Stride
    register char ** p;
    char *next;
    // Populating pArray for circular pointer
    // chasing .

Page 312

Example of Latency Hiding with S/W Prefetch Instruction
Achieving the highest level of memory optimization using prefetch instructions requires an understanding of the microarchitecture and system architecture of a given machine.

Page 313

execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution units are processing vertices. This scenario severely decreases the advantage of having a decoupled architecture.

Page 314

The performance loss caused by poor utilization of resources can be completely eliminated by correctly scheduling the prefetch instructions. As shown in Figure 6-3, prefetch instructions are issued two vertex iterations ahead.
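Issuing a prefetch a fixed number of iterations ahead can be sketched in C as below. This is a minimal illustration under stated assumptions: it relies on the GCC/Clang __builtin_prefetch extension (compiled out elsewhere), the distance of two iterations mirrors the text, and the Vertex type, PSD macro, and sum_x function are hypothetical. The right distance on a real machine has to be measured.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p), 0, 3)  /* read, keep in cache */
#else
#define PREFETCH(p) ((void)0)                      /* no-op elsewhere */
#endif

#define PSD 2  /* hypothetical prefetch scheduling distance, in iterations */

typedef struct { float x, y, z, w; } Vertex;

/* Sum the x components while prefetching PSD iterations ahead, so the
   memory access for a later vertex overlaps the current computation. */
float sum_x(const Vertex *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PSD < n)
            PREFETCH(&v[i + PSD]);
        acc += v[i].x;
    }
    return acc;
}
```

The prefetch is only a hint: removing it changes timing, not the result.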

Page 315

• Balance single-pass versus multi-pass execution
• Resolve memory bank conflict issues
• Resolve cache management issues
The subsequent sections discuss all the above items.

Page 316

lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines are used per iteration.
Software Prefetch Concatenation
Maximum performance can be achieved when execution pipeline is at maximum throughput, without incurring any memory latency penalties.

Page 317

This memory de-pipelining creates inefficiency in both the memory pipeline and execution pipeline. This de-pipelining effect can be removed by applying a technique called prefetch concatenation. With this technique, the memory access and execution can be fully pipelined and fully utilized.

Page 318

Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its associated outer loop.

Page 319

Minimize Number of Software Prefetches
Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources, even though they require minimal clocks and memory bandwidth.

Page 320

Figure 6-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis indicates the number of computation clocks per loop (each iteration is independent). The Y axis indicates the execution time measured in clocks per loop.

Page 321

Figure 6-5 Memory Access Latency and Execution With Prefetch
[Plot omitted: 2 load streams, 1 store stream; X axis: computations per loop; Y axis: effective loop latency.]

Page 322

Mix Software Prefetch with Computation Instructions
It may seem convenient to cluster all of the prefetch instructions at the beginning of a loop body or before a loop, but this can lead to severe performance degradation.

Page 323

Example 6-6 Spread Prefetch Instructions
NOTE. To avoid instruction execution stalls due to the over-utilization of the resource, prefetch instructions must be interspersed with computational instructions.

Page 324

Software Prefetch and Cache Blocking Techniques
Cache blocking techniques, such as strip-mining, are used to improve temporal locality, and thereby cache hit rate. Strip-mining is a one-dimensional temporal locality optimization for memory.

Page 325

In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch issues aside, this is the preferred situation.

Page 326

Figure 6-7 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these scenarios.

Page 327

In the scenario to the right, in Figure 6-7, keeping the data in one way of the second-level cache does not improve cache locality.

Page 328

Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the second pass, that is, the lighting loop. This causes under-utilization of cache lines fetched during the transformation loop as well as bandwidth wasted in the lighting loop.
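The effect of strip-mining on a two-pass computation can be sketched in C. In this illustration the two passes are placeholder arithmetic standing in for the transformation and lighting loops; STRIP and the function name are hypothetical, and a real strip size would be chosen so one strip fits in the target cache level.

```c
#include <stddef.h>

#define STRIP 256  /* hypothetical strip size, in elements */

/* Run both passes on one strip before moving to the next, so the
   second pass finds the strip's data still in cache. */
void two_pass_strip_mined(float *data, size_t n)
{
    for (size_t base = 0; base < n; base += STRIP) {
        size_t end = (base + STRIP < n) ? base + STRIP : n;

        for (size_t i = base; i < end; i++)   /* pass 1 (stand-in for transform) */
            data[i] = data[i] * 2.0f;

        for (size_t i = base; i < end; i++)   /* pass 2 (stand-in for lighting) */
            data[i] = data[i] + 1.0f;
    }
}
```

Without the outer strip loop, pass 2 would start only after pass 1 had streamed through the entire array, by which time the early elements may have been evicted.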

Page 329

Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are:
• Do strip-mining: partition loops so that the dataset fits into second-level cache.

Page 330

happen to be powers of 2, aliasing condition due to finite number of way-associativity (see “Capacity Limits and Aliasing in Caches” in Chapter 2) will exacerbate the likelihood of cache evictions.

Page 331

references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually references the linear addresses.

Page 332

selected to ensure that the batch stays within the processor caches through all passes. An intermediate cached buffer is used to pass the batch of vertices from one stage or pass to the next one.

Page 333

The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipeline, stages that are limited by bandwidth (either input or output) will reflect more of this performance limitation in overall execution time.

Page 334

a line burst transaction. To achieve the best possible performance, it is recommended to align data along the cache line boundary and write them consecutively in a cache line size while using non-temporal stores.

Page 335

The following examples of using prefetching instructions in the operation of a video encoder and decoder, as well as in a simple 8-byte memory copy, illustrate the performance gain from using the prefetching instructions for efficient cache management.

Page 336

Later, the processor re-reads the data using prefetchnta, which ensures maximum bandwidth, yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of prefetch.

Page 337

The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:
• alignment of data
• proper layout of pages in memory
• cache size
• interaction of the translation lookaside buffer (TLB) with memory accesses
• combining prefetch and streaming-store instructions.

Page 338

Using the 8-byte Streaming Stores and Software Prefetch
Example 6-11 presents the copy algorithm that uses second level cache.

Page 339

In Example 6-11, eight _mm_load_ps and _mm_stream_ps intrinsics are used so that all of the data prefetched (a 128-byte cache line) is written back. The prefetch and streaming-stores are executed in separate loops to minimize the number of transitions between reading and writing data.

Page 340

The instruction, temp = a[kk+CACHESIZE], is used to ensure the page table entry for array a is entered in the TLB prior to prefetching. This is essentially a prefetch itself, as a cache line is filled from that memory location with this instruction.

Page 341

prefetch_loop:
    movaps xmm0, [esi+ecx]
    movaps xmm0, [esi+ecx+64]
    add ecx,128
    cmp ecx,BLOCK_SIZE
    jne prefetch_loop
    xor ecx,ecx
    align 16
cpy_loop:
    movdqa xmm0,[esi+ecx]
    movd.

Page 342

Performance Comparisons of Memory Copy Routines
The throughput of a large-region, memory copy routine depends on several factors:
• coding techniques.

Page 343

The baseline for performance comparison is the throughput (bytes/sec) of 8-MByte region memory copy on a first-generation Pentium M processor (CPUID signature 0x69n) with a 400-MHz system bus using byte-sequential technique similar to that shown in Example 6-10.

Page 344

query each level of the cache hierarchy. Enumeration of each cache level is by specifying an index value (starting from 0) in the ECX register.
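Once the register values for one cache level have been read, the cache size follows from four fields. The sketch below assumes the deterministic cache parameter leaf is CPUID leaf 4 with the field layout given in the comments; it takes the raw EBX and ECX values as arguments rather than executing CPUID, and cache_size_bytes is a hypothetical name.

```c
#include <stdint.h>

/* Cache size in bytes from deterministic cache parameter values.
   Assumed layout: EBX[11:0] = line size - 1, EBX[21:12] = partitions - 1,
   EBX[31:22] = ways - 1, ECX = number of sets - 1. */
uint32_t cache_size_bytes(uint32_t ebx, uint32_t ecx)
{
    uint32_t line_size  = (ebx & 0xFFFu) + 1;
    uint32_t partitions = ((ebx >> 12) & 0x3FFu) + 1;
    uint32_t ways       = ((ebx >> 22) & 0x3FFu) + 1;
    uint32_t sets       = ecx + 1;
    return ways * partitions * line_size * sets;
}
```

For example, an 8-way cache with 64-byte lines, one partition and 64 sets yields 32768 bytes.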

Page 345

• Determine multi-threading resource topology in an MP system (See Section 7.10 of IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A).
• Determine cache hierarchy topology in a platform using multi-core processors (See Example 7-13).

Page 346

platform, software can extract information on the number and the identities of each logical processor sharing that cache level and is made available to application by the OS. This is discussed in detail in “Using Shared Execution Resources in a Processor Core” in Chapter 7 and Example 7-13.

Page 347

7-1 7 Multi-Core and Hyper-Threading Technology
This chapter describes software optimization techniques for multithreaded applications running in an environment using either multiprocessor (MP) systems or processors with hardware-based multi-threading support.

Page 348

cores but shared by two logical processors in the same core if Hyper-Threading Technology is enabled. This chapter covers guidelines that apply to either situation.

Page 349

Figure 7-1 illustrates how performance gains can be realized for any workload according to Amdahl’s law. The bar in Figure 7-1 represents an individual task unit or the collective workload of an entire application.
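Amdahl's law in its textbook form can be evaluated with a one-line function. The sketch below is that standard formula, speedup = 1 / ((1 - P) + P / N), and deliberately ignores any threading overhead; the function and parameter names are hypothetical.

```c
/* Textbook Amdahl's law: p is the fraction of the work that can be
   parallelized (0..1), n is the number of processors (n >= 1).
   Threading overhead is not modeled here. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

With half of the work parallelizable, two processors give a speedup of only about 1.33, which shows how the serial fraction limits scaling.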

Page 350

When optimizing application performance in a multithreaded environment, control flow parallelism is likely to have the largest impact on performance scaling with respect to the number of physical processors and to the number of logical processors per physical processor.

Page 351

terms of time of completion relative to the same task when in a single-threaded environment) will vary, depending on how much shared execution resources and memory are utilized.

Page 352

When two applications are employed as part of a multi-tasking workload, there is little synchronization overhead between these two processes. It is also important to ensure each application has minimal synchronization overhead within itself.

Page 353

Parallel Programming Models
Two common programming models for transforming independent task requirements into application threads are:
• domain .

Page 354

Functional Decomposition
Applications usually process a wide variety of tasks with diverse functions and many unrelated data sets. For example, a video codec needs several different processing functions. These include DCT, motion estimation and color conversion.

Page 355

overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with the number of cores, the synchronization overhead must be kept low.

Page 356

Producer-Consumer Threading Models
Figure 7-3 illustrates the basic scheme of interaction between a pair of producer and consumer threads. The horizontal direction represents time. Each block represents a task unit, processing the buffer assigned to a thread.

Page 357

It is possible to structure the producer-consumer model in an interlaced manner such that it can minimize bus traffic and be effective on multi-core processors without shared second-level cache.

Page 358

corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel in two threads. As long as the data generated by the producer reside in either the first or second level cache of the same core, the consumer can access them without incurring bus traffic.

Page 359

Example 7-3 Thread Function for an Interlaced Producer Consumer Model

// master thread starts the first iteration, the other thread must wait
// .

Page 360

Tools for Creating Multithreaded Applications

Programming directly to a multithreading application programming interface (API) is not the only method for creating multithreaded applications.

Page 361

Automatic Parallelization of Code. While OpenMP directives allow programmers to quickly transform serial applications into parallel applications, programmers must identify specific portions of the application code that contain parallelism and add compiler directives.

Page 362

Optimization Guidelines

This section summarizes optimization guidelines for tuning multithreaded applications.

Page 363

• Place each synchronization variable alone, separated by 128 bytes or in a separate cache line.

Page 364

• Adjust the private stack of each thread in an application so the spacing between these stacks is not offset by multiples of 64 KB or 1 MB (prevents unnecessary cache line evictions) when targeting IA-32 processors supporting Hyper-Threading Technology.

Page 365

• For each processor supporting Hyper-Threading Technology, consider adding functionally uncorrelated threads to increase the hardware resource utilization of each physical processor package.

Page 366

The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s requirements for synchronization.

Page 367

the white paper “Developing Multi-threaded Applications: A Platform Consistent Approach” (referenced in the Introduction chapter).

Page 368

Synchronization for Short Periods

The frequency and duration that a thread needs to synchronize with other threads depends on application characteristics. When a synchronization loop needs very fast response, applications may use a spin-wait loop.

Page 369

the processor must guarantee no violations of memory order occur. The necessity of maintaining the order of outstanding memory operations inevitably costs the processor a severe penalty that impacts all threads.

Page 370

Example 7-4 Spin-wait Loop and PAUSE Instructions

(a) An un-optimized spin-wait loop experiences performance penalty when exiting the loop. It consumes execution resources without contributing computational work.

Page 371

User/Source Coding Rule 21. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance.
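
As a minimal sketch of the rule above, the following C function spins on a flag with a PAUSE hint in the loop body and a bounded repetition count. The names `spin_until_set` and `CPU_RELAX`, and the `max_spins` bound, are illustrative assumptions and are not taken from the manual's Example 7-4; `_mm_pause()` is the compiler intrinsic that emits PAUSE on x86 targets.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()   /* emits the PAUSE instruction */
#else
#define CPU_RELAX() ((void)0)     /* assumption: no-op on non-x86 targets */
#endif

/* Hypothetical helper: spin until *flag becomes non-zero or max_spins
 * iterations have elapsed. Returns the number of PAUSE iterations used,
 * or -1 if the bound was reached, so the caller can fall back to a
 * blocking wait instead of spinning indefinitely. */
static long spin_until_set(atomic_int *flag, long max_spins)
{
    assert(flag != NULL && max_spins >= 0);
    long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        if (spins == max_spins)
            return -1;
        CPU_RELAX();
        spins++;
    }
    return spins;
}
```

A caller would typically treat a return value of -1 as the signal to switch to an OS blocking primitive; the right value for `max_spins` depends on the workload and is not specified here.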

Page 372

To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acquire the same lock. Apply a software pipelining technique to handle data that must be shared between multiple threads.

Page 373

If an application thread must remain idle for a long time, the application should use a thread blocking API or other method to release the idle processor.

Page 374

Avoid Coding Pitfalls in Thread Synchronization

Synchronization between multiple threads must be designed and implemented with care to achieve good performance scaling with respect to the number of discrete processors and the number of logical processors per physical processor.

Page 375

In general, OS function calls should be used with care when synchronizing threads. When using OS-supported thread synchronization objects (critica.

Page 376

Prevent Sharing of Modified Data and False-Sharing

On an Intel Core Duo processor, sharing of modified data incurs a performance penalty when a thread running on one core tries to read or write data that is currently present in modified state in the first level cache of the other core.

Page 377

User/Source Coding Rule 24. (H impact, M generality) Beware of false sharing within a cache line (64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, Intel Core Duo processors), and within a sector (128 bytes on Pentium 4 and Intel Xeon processors).
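
One common way to act on this rule is to align and pad per-thread data so that each item fills its own 128-byte region. The sketch below assumes a C11 compiler; the type `padded_counter_t`, the array `g_counters`, and the constant `SHARING_GRANULE` are hypothetical names chosen for illustration, and 128 bytes is picked only because it covers both sizes quoted in the rule.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumption: 128 bytes covers both the 64-byte line and the 128-byte
 * sector sizes quoted in the rule. */
#define SHARING_GRANULE 128

/* Hypothetical per-thread counter, aligned and padded to one granule. */
typedef struct {
    alignas(SHARING_GRANULE) volatile long value;
} padded_counter_t;

static_assert(sizeof(padded_counter_t) == SHARING_GRANULE,
              "each counter must fill exactly one granule");

/* One slot per thread; neighbouring slots never share a line or sector. */
static padded_counter_t g_counters[4];

/* Byte distance between the slots of two thread indices (a <= b). */
static size_t slot_distance(size_t a, size_t b)
{
    assert(a < 4 && b < 4 && a <= b);
    return (size_t)((const volatile char *)&g_counters[b] -
                    (const volatile char *)&g_counters[a]);
}
```

The cost of this layout is memory: each `long` now occupies 128 bytes, so it is best reserved for data that is written frequently by different threads.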

Page 378

• Objects allocated dynamically by different threads may share cache lines. Make sure that the variables used locally by one thread are allocated in a manner to prevent sharing the cache line with other threads.

Page 379

• In managed environments that provide automatic object allocation, the object allocators and garbage collectors are responsible for layout of the objects in memory so that false sharing through two objects does not happen.
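
For the dynamically allocated case in the list above, one possible approach in unmanaged code is to request cache-line-aligned blocks whose size is rounded up to whole lines. This is a sketch under stated assumptions: it relies on C11 `aligned_alloc`, uses the 64-byte line size quoted earlier, and the wrapper name `alloc_thread_private` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* line size quoted in the text; an assumption for the target */

/* Hypothetical allocator wrapper: returns a block that starts on a
 * cache-line boundary and whose size is rounded up to whole lines, so the
 * caller's data does not share a line with data from another allocation.
 * Release the block with free(). */
static void *alloc_thread_private(size_t bytes)
{
    assert(bytes > 0);
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);
}
```

On processors where the 128-byte sector matters, the same wrapper could be used with `CACHE_LINE` set to 128.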

Page 380

Conserve Bus Bandwidth

In a multi-threading environment, bus bandwidth may be shared by memory traffic originated from multiple bus agents (these agents can be several logical processors and/or several processor cores).

Page 381

reads. An approximate working guideline for software to operate below bus saturation is to check if bus read queue depth is significantly below 5. Some MP platforms may have a chipset that provides two buses, with each bus servicing one or more physical processors.

Page 382

Avoid Excessive Software Prefetches

Pentium 4 and Intel Xeon processors have an automatic hardware prefetcher. It can bring data and instructions into the unified second-level cache based on prior reference patterns.

Page 383

latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to overlap multiple outstanding memory read transactions.

Page 384

Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software write-combining technique to separate WC store operations from competing with WB store traffic.

Page 385

block size for loop blocking should be determined by dividing the target cache size by the number of logical processors available in a physical processor package.
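
The division described above can be sketched as a pair of small C helpers. The function names and the example values in the usage note are assumptions for illustration; the manual does not prescribe this interface.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread blocking size in bytes: target cache size divided by the
 * number of logical processors available in the physical package. */
static size_t blocking_size_bytes(size_t target_cache_bytes,
                                  unsigned logical_processors)
{
    assert(logical_processors > 0);
    return target_cache_bytes / logical_processors;
}

/* Number of elements of a given size that fit in one block. */
static size_t blocking_elements(size_t target_cache_bytes,
                                unsigned logical_processors,
                                size_t element_bytes)
{
    assert(element_bytes > 0);
    return blocking_size_bytes(target_cache_bytes, logical_processors)
           / element_bytes;
}
```

For instance, with a hypothetical 512 KB target cache and two logical processors, each thread would block on 256 KB, or 65536 four-byte elements.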

Page 386

User/Source Coding Rule 33. (H impact, M generality) Minimize the sharing of data between threads that execute on different bus agents sharing a common bus.

Page 387

Example 7-8 shows the batched implementation of the producer and consumer thread functions.

Example 7-8 Batched Implementation of the Producer Con.

Page 388

Eliminate 64-KByte Aliased Data Accesses

The 64 KB aliasing condition is discussed in Chapter 2. Memory accesses that satisfy the 64 KB aliasing condition can cause excessive evictions of the first-level data cache.

Page 389

Preventing Excessive Evictions in First-Level Data Cache

Cached data in a first-level data cache are indexed to linear addresses but physically tagged. Data in second-level and third-level caches are tagged and indexed to physical addresses.

Page 390

Per-thread Stack Offset

To prevent private stack accesses in concurrent threads from thrashing the first-level data cache, an application can use a per-thread stack offset for each of its threads. The size of these offsets should be multiples of a common base offset.
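
The offset arithmetic described above can be sketched as follows. This shows only the calculation, not how the stack pointer is actually adjusted; the function names are hypothetical, and any concrete base offset (such as 1024 bytes in the usage note) is an assumption rather than a value given in this excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Offset for a given thread: a multiple of the common base offset, so
 * thread 0 gets 0, thread 1 gets base, thread 2 gets 2 * base, and so on. */
static size_t per_thread_stack_offset(unsigned thread_index, size_t base_offset)
{
    assert(base_offset > 0);
    return (size_t)thread_index * base_offset;
}

/* Guard based on the stack-spacing guideline: an offset that is a non-zero
 * multiple of 64 KB (which includes every multiple of 1 MB) would leave the
 * stacks spaced in the way the guideline warns against. */
static int offset_is_64k_multiple(size_t offset)
{
    return offset != 0 && offset % (64u * 1024u) == 0;
}
```

With an assumed base of 1024 bytes, three threads would receive offsets of 0, 1024 and 2048 bytes, none of which is a multiple of 64 KB.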

Page 391

Example 7-9 Adding an Offset to the Stack Pointer of Three Threads

void Func_thread_entry(DWORD *pArg)
{
    DWORD StackOffset = *pArg;
    DWORD var1; // The local variable at this scope may not benefit
    DWORD var2; // from the adjustment of the stack pointer that ensue .

Page 392

Per-instance Stack Offset

Each instance of an application runs in its own linear address space; but the address layout of data for stack segments is identical for both instances.

Page 393

However, the buffer space does enable the first-level data cache to be shared cooperatively when two copies of the same application are executing on the two logical processors in a physical processor package.

Page 394

Front-end Optimization

In the Intel NetBurst microarchitecture family of processors, the instructions are decoded into micro-ops (μops) and sequences of μops (called traces) are stored in the Execution Trace Cache.

Page 395

On Hyper-Threading-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache’s ability to deliver high bandwidth μop streams to the execution engine.

Page 396

initial APIC_ID (See Section 7.10 of IA-32 Intel Architecture Software Developer’s Manual, Volume 3A for more details) associated with a logical processor. The three levels are:
• physical processor package.

Page 397

Affinity masks can be used to optimize shared multi-threading resources.

Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical Processor

// The BIOS and/or OS may limit the number of logical processors
// available to applications after system boot.

Page 398

Arrangements of affinity-binding can benefit performance more than other arrangements. This applies to:
• Scheduling two domain-decomposition threads .

Page 399

first to the primary logical processor of each processor core. This example is also optimized to the situations of scheduling two memory-intensive threads to run on separate cores and scheduling two compute-intensive threads on separate cores.

Page 400

Example 7-12 Assembling a Look up Table to Manage Affinity Masks and Schedule Threads to Each Core First

AFFINITYMASK LuT[64]; // A Look up table to retrieve the affinity
                      // mask we want to use from the thread
                      // scheduling sequence index.

Page 401

Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache

// Logical processors sharing the same cache can.

Page 402

PackageID[ProcessorNum] = PACKAGE_ID;
CoreID[ProcessorNum] = CORE_ID;
SmtID[ProcessorNum] = SMT_ID;
CacheID[ProcessorNum] = CACHE_ID;
// Only the target.

Page 403

for (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {
    ProcessorMask <<= 1;
    for (i = 0; i < CacheNum; i++) {
        // We may.
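
Because the listing above is cut off, the following self-contained sketch illustrates the general idea of grouping logical processors by a shared cache ID and accumulating one affinity mask per cache. It is not the manual's Example 7-13: the function `build_cache_masks`, its parameters, and the assumption that cache IDs have already been extracted for each logical processor are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LPS 64  /* assumption: at most 64 logical processors, one mask bit each */

/* For each distinct cache ID, OR together the affinity bits of the logical
 * processors that reported it. cache_id[lp] is the (already extracted) cache
 * ID of logical processor lp. out_ids and out_masks must each have room for
 * num_lps entries. Returns the number of distinct caches found. */
static unsigned build_cache_masks(const unsigned cache_id[], unsigned num_lps,
                                  unsigned out_ids[], uint64_t out_masks[])
{
    assert(num_lps <= MAX_LPS);
    unsigned cache_num = 0;
    uint64_t processor_mask = 1;
    for (unsigned lp = 0; lp < num_lps; lp++, processor_mask <<= 1) {
        unsigned i;
        for (i = 0; i < cache_num; i++) {
            if (out_ids[i] == cache_id[lp]) {
                out_masks[i] |= processor_mask;  /* sibling on a known cache */
                break;
            }
        }
        if (i == cache_num) {                    /* first processor on this cache */
            out_ids[cache_num] = cache_id[lp];
            out_masks[cache_num] = processor_mask;
            cache_num++;
        }
    }
    return cache_num;
}
```

Given four logical processors with cache IDs 0, 0, 1, 1, the sketch yields two masks, 0x3 and 0xC, which could then be passed to an OS affinity API.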

Page 404

Optimization of Other Shared Resources

Resource optimization in multi-threaded applications depends on the cache topology and execution resources associated within the hierarchy of processor topology.

Page 405

seldom reaches 50% of peak retirement bandwidth. Thus, improving single-thread execution throughput should also benefit multi-threading performance.

Page 406

throughput of a physical processor package. The non-halted CPI metric can be interpreted as the inverse of the throughput of a logical processor.
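
The inverse relationship stated above can be made concrete with a short calculation. In this sketch the two counter values are assumed to have been collected already (for example, non-halted clockticks and instructions retired for one logical processor); the function names are illustrative.

```c
#include <assert.h>

/* Cycles per instruction over the measured interval. */
static double non_halted_cpi(unsigned long long non_halted_clockticks,
                             unsigned long long instructions_retired)
{
    assert(instructions_retired > 0);
    return (double)non_halted_clockticks / (double)instructions_retired;
}

/* Throughput in instructions per cycle: the inverse of CPI. */
static double throughput_ipc(double cpi)
{
    assert(cpi > 0.0);
    return 1.0 / cpi;
}
```

For example, 200 non-halted clockticks and 100 retired instructions give a CPI of 2.0, which corresponds to a throughput of 0.5 instructions per cycle.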

Page 407

Using a function decomposition threading model, a multithreaded application can pair up a thread with critical dependence on a low-throughput resource with other threads that do not have the same dependency.

Page 408

Write-combining buffers are another example of execution resources shared between two logical processors. With two threads running simultaneously on a processor supporting Hyper-Threading Technology, the writes of both threads count toward the limit of four write-combining buffers.

Page 409

Chapter 8 64-bit Mode Coding Guidelines

Introduction

This chapter describes coding guidelines for application software written to run in 64-bit mode. These guidelines should be considered as an addendum to the coding guidelines described in Chapters 2 through 7.

Page 410

This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, EDI. To access the data in registers r8-r15, the REX prefix is required. Using the 32-bit form there does not reduce code size.

Page 411

If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compiler should generate the multiply instruction that produces a 64-bit result.

Page 412

Can be replaced with:

movsx r8, r9w  ;If bits 63:8 do not need to be
               ;preserved.
movsx r8, r10b ;If bits 63:8 do not need to
               ;be preserved.

In the above example, the moves to r8w and r8b both require a merge to preserve the rest of the bits in the register.

Page 413

IMUL RAX, RCX

The 64-bit version above is more efficient than using the following 32-bit version:

MOV EAX, DWORD PTR[X]
MOV ECX, DWORD PTR[Y]
IMUL ECX

In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead of in a single 64-bit register.

Page 414

Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible

The CVTSI2SS and CVTSI2SD instructions convert a signed integer in a general-purpose register or memory location to a single-precision or double-precision floating-point value.

Page 415

Chapter 9 Power Optimization for Mobile Usages

Overview

Mobile computing allows computers to operate anywhere, anytime. Battery life is a key factor in delivering this benefit. Mobile applications require software optimization that considers both performance and power consumption.

Page 416

Pentium M, Intel Core Solo and Intel Core Duo processors implement features designed to enable the reduction of active power and static power consumption.

Page 417

to accommodate demand and adapt power consumption. The interaction between the OS power management policy and performance history is described below:
1. Demand is high and the processor works at its highest possible frequency (P0).

Page 418

ACPI C-States

When computational demands are less than 100%, part of the time the processor is doing useful work and the rest of the time it is idle.

Page 419

The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and lower power consumption.

Page 420

Figure 9-3 Application of C-states to Idle Time

Consider that a processor is in lowest frequency (LFM, low frequency mode) and utilization is low.

Page 421

• In an Intel Core Solo or Duo processor, after staying in C4 for an extended time, the processor may enter into a Deep C4 state to save additional static power. The processor reduces voltage to the minimum level required to safely maintain processor context.

Page 422

Adjust Performance to Meet Quality of Features

When a system is battery powered, applications can extend battery life by reducing the performance or quality of features, turning off background activities, or both.

Page 423

• GetActivePwrScheme: Retrieves the active power scheme (current system power scheme) index. An application can use this API to ensure that the system is running the best power scheme.

Avoid Using Spin Loops

Spin loops are used to wait for short intervals of time or for synchronization.

Page 424

workload (usually that equates to reducing the number of instructions that the processor needs to execute, or optimizing application performance).

Page 425

disk operations over time. Use the GetDevicePowerState() Windows API to test disk state and delay the disk access if it is not spinning.

Handling Sleep State Transitions

In some cases, transitioning to a sleep state may harm an application.

Page 426

Using Enhanced Intel SpeedStep® Technology

Use Enhanced Intel SpeedStep Technology to adjust the processor to operate at a lower frequency and save energy. The basic idea is to divide computations into smaller pieces and use OS power management policy to effect a transition to higher P-states.

Page 427

The same application can be written in such a way that work units are divided into smaller granularity, with the scheduling of each work unit and Sleep() occurring at more frequent intervals (e.g. 100 ms) to deliver the same QOS (operating at full performance 50% of the time).
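
To make the arithmetic in this paragraph concrete, the sketch below splits a scheduling interval into a busy portion and an idle portion for a given duty cycle. It is only an illustration of the numbers in the text: the type `duty_slice_t` and the function `plan_slice` are hypothetical, and the idle portion stands for the duration an application would pass to Sleep().

```c
#include <assert.h>

/* Busy and idle time, in milliseconds, within one scheduling interval. */
typedef struct {
    unsigned busy_ms;
    unsigned idle_ms;
} duty_slice_t;

/* Split interval_ms so that busy_percent of it is spent on one work unit
 * and the remainder is spent sleeping. */
static duty_slice_t plan_slice(unsigned interval_ms, unsigned busy_percent)
{
    assert(busy_percent <= 100);
    duty_slice_t s;
    s.busy_ms = interval_ms * busy_percent / 100;
    s.idle_ms = interval_ms - s.busy_ms;
    return s;
}
```

With the 100 ms interval and 50% figure from the text, each slice would consist of 50 ms of work followed by a 50 ms sleep.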

Page 428

An additional positive effect of continuously operating at a lower frequency is that it avoids frequent changes in power draw (from low to high in our case) and battery current, which eventually harm the battery and accelerate its deterioration.

Page 429

Eventually, if the interval is large enough, the processor will be able to enter deeper sleep and save a considerable amount of power. The following guidelines can help applications take advantage of Intel® Enhanced Deeper Sleep:
• Avoid setting higher interrupt rates.

Page 430

thread enables the physical processor to operate at lower frequency relative to a single-threaded version.

Page 431

demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by such multi-core unaware P-state coordination, resulting in a performance anomaly.

Page 432

processor to enter the lowest possible C-state type (lower-numbered C state has less power saving). For example, if Core 1 meets the requirement to be in ACPI C1 and Core 2 meets the requirement for ACPI C3, multi-core-unaware OS coordination takes the physical processor to ACPI C1.
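
The coordination rule in this example amounts to taking the minimum C-state index across cores. A minimal sketch, assuming each core's eligible C-state is available as a small integer (the function name and array layout are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Returns the C-state index the package ends up in under multi-core-unaware
 * coordination: the lowest-numbered (shallowest) state among all cores. */
static unsigned package_c_state(const unsigned core_c_state[], size_t num_cores)
{
    assert(core_c_state != NULL && num_cores > 0);
    unsigned lowest = core_c_state[0];
    for (size_t i = 1; i < num_cores; i++) {
        if (core_c_state[i] < lowest)
            lowest = core_c_state[i];
    }
    return lowest;
}
```

For the case in the text, cores eligible for C1 and C3 give a package state of C1.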

Page 433

imbalance can be accomplished using performance monitoring events. The Intel Core Duo processor provides an event for this purpose.

Page 435

Appendix A Application Performance Tools

Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture (IA)-based processors. This appendix introduces these tools and explains their capabilities for developing the most efficient programs without having to write assembly code.

Page 436

• Intel Performance Libraries
The Intel Performance Library family consists of a set of software libraries optimized for Intel architecture processors.

Page 437

family. Vectorization, processor dispatch, inter-procedural optimization, profile-guided optimization and OpenMP parallelism are all supported by the Intel compilers and can significantly aid the performance of an application.

Page 438

default, and targets the Intel Pentium 4 processor and subsequent processors. Code produced will run on any Intel architecture 32-bit processor, but will be optimized specifically for the targeted processor.

Page 439

Vectorizer Switch Options

The Intel C++ and Fortran Compiler can vectorize your code using the vectorizer switch options. The options that enable the vectorizer are the -Qx[M,K,W,B,P] and -Qax[M,K,W,B,P] described above.

Page 440

Multithreading with OpenMP*

Both the Intel C++ and Fortran Compilers support shared memory parallelism via OpenMP compiler directives, library functions and environment variables. OpenMP directives are activated by the compiler switch -Qopenmp.

Page 441

The -Qrcd option disables the change to truncation of the rounding mode in floating-point-to-integer conversions. For complete details on all of the code optimization options, refer to the Intel® C++ Compiler User’s Guide.

Page 442

When you use PGO, consider the following guidelines:
• Minimize the changes to your program after instrumented execution and before feedback compilation. During feedback compilation, the compiler ignores dynamic information for functions modified after that information was generated.

Page 443

Sampling

Sampling allows you to profile all active software on your system, including operating system, device driver, and application software. It works by occasionally interrupting the processor and collecting the instruction address, process ID, and thread ID.

Page 444

Figure A-1 provides an example of a hotspots report by location.

Event-based Sampling

Event-based sampling (EBS) can be used to provide detailed information on the behavior of the microprocessor as it executes software.

Page 445

different events at a time. The number of the events that the VTune analyzer can collect at once on the Pentium 4 and Intel Xeon processor depends on the events selected. Event-based samples are collected after a specific number of processor events have occurred.

Page 446

duration of read traffic compared to the duration of the workload is significantly less than unity, it indicates the dominant data locality of the workload is cache access traffic.

Page 447

stride inefficiency is most prominent on memory traffic. A useful indicator for large-stride inefficiency in a workload is to compare the ratio between bus rea.

Page 448

The Call Graph View depicts the caller / callee relationships. Each thread in the application is the root of a call tree. Each node (box) in the call tree represents a function. Each edge (line with an arrow) connecting two nodes represents the call from the parent to the child function.

Page 449

(SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). The library set includes the Intel Math Kernel Library (MKL) and the Intel Integrated Performance Primitives (IPP).

Page 450

• Performance: Highly-optimized routines with a C interface that give Assembly-level performance in a C/C++ development environment (MKL also supports a Fortran interface).
• Platform tuned: Processor-specific optimizations that yield the best performance for each Intel processor.

Page 451

developed with the Intel Performance Libraries benefit from new architectural features of future generations of Intel processors simply by relinking the application with upgraded versions of the libraries.

Page 452

The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes your program and automatically locates threading errors.

Page 453

Figure A-2 shows Intel Thread Checker displaying the source code of the selected instance from a list of detected data race conditions that occurred during threaded execution.

Page 454

Intel® Software College

The Intel® Software College is a valuable resource for classes on Streaming SIMD Extensions 2 (SSE2), Threading and the IA-32 Intel Architecture.

Page 455

Appendix B Using Performance Monitoring Events

Performance monitoring events provide facilities to characterize the interaction between programmed sequences of instructions and different microarchitectural sub-systems.

Page 456

The performance metrics listed in Tables B-1 through B-5 may be applicable to processors that support Hyper-Threading Technology; see the Using Performance Metrics with Hyper-Threading Technology section.

Page 457

Replay

In order to maximize performance for the common case, the Intel NetBurst microarchitecture sometimes aggressively schedules μops for execution before all the conditions for correct execution are guaranteed to be satisfied.

Page 458

miss more than once during its life time, but a Misses Retired metric (for example, 1st-Level Cache Misses Retired) will increment only once for that μop.

Page 459

The first two metrics use performance counters, and thus can be used to cause an interrupt upon overflow for sampling. They may also be useful for those cases where it is easier for a tool to read a performance counter instead of the time stamp counter.

Page 460

Non-Sleep Clockticks

The performance monitoring counters can also be configured to count clocks whenever the performance monitoring hardware is not powered-down. To count “non-sleep clockticks” with a performance-monitoring counter, do the following:
• Select any one of the 18 counters.

Page 461

that logical processor is not halted (it may include some portion of the clock cycles for that logical processor to complete a transition into a halted state). A physical processor that supports Hyper-Threading Technology enters into a power-saving state if all logical processors are halted.

Page 462

Microarchitecture Notes

Trace Cache Events

The trace cache is not directly comparable to an instruction cache. The two are organized very differently. For example, a trace can span many lines' worth of instruction-cache data.

Page 463

There is a simplified block diagram below of the sub-systems connected to the IOQ unit in the front side bus sub-system and the BSQ unit that interfaces to the IOQ.

Page 464

Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus

(Diagram labels recoverable from the figure: Chip Set, System Memory, 1st Level Data Cache, 3rd Level Cache, FSB_.)

Page 465

Core references are nominally 64 bytes, the size of a 1st-level cache line. Smaller sizes are called partials, e.g., uncacheable and write combining reads, uncacheable, write-through and write-protect writes, and all I/O.

Page 466

• IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses

Writebacks (dirty evictions)
• BSQ_cac.

Page 467

transactions of the writeback (WB) memory type for the FSB IOQ and the BSQ can be an indication of how often this happens. It is less likely to occur for applications with poor locality of writes to the 3rd-level cache, and of course cannot happen when no 3rd-level cache is present.

Page 468

Current implementations of the BSQ_cache_reference event do not distinguish between programmatic read and write misses. Programmatic writes that miss must get the rest of the cache line and merge the new data.

Page 469

Usage Notes on Bus Activities

A number of performance metrics in Table B-1 are based on IOQ_active_entries and BSQ_active_entries. The next three paragraphs provide information on various bus transaction underway metrics.

Page 470

accesses (i.e., are also 3rd-level misses). This can decrease the average measured BSQ latencies for workloads that frequently thrash (miss or prefetch a lot into) the 2nd-level cache but hit in the 3rd-level cache.

Page 471

an expression built up from other metrics; for example, IPC is derived from two single-event metrics.
• Column 2 provides a description of the metric in column 1.

Page 472

Table B-1 Pentium 4 Processor Performance Metrics

Columns: Metric; Description; Event Name or Metric Expression; Event Mask Value Required

General Metrics

Non-Sleep Clockticks: The number of clockticks while a processor is not in any sleep modes.

Page 473

Speculative Uops Retired: Number of uops retired (include both instructions executed to completion and speculatively executed in the path of branch mispredictions).

Page 474

Mispredicted returns: The number of mispredicted returns including all causes. Event: retired_mispred_branch_type; event mask: RETURN.

All conditionals: The number of branch.

Page 475

TC Flushes: Number of TC flushes (The counter will count twice for each occurrence. Divide the count by 2 to get the number of flushes.

Page 476

Logical Processor 1 Deliver Mode: The number of cycles that the trace and delivery engine (TDE) is delivering traces associated with logical processor 1, regardless of the operating modes of the TDE for traces associated with logical processor 0.

Page 477

Logical Processor 0 Build Mode: The number of cycles that the trace and delivery engine (TDE) is building traces associated with logical processor 0, regardless of the operating modes of the TDE for traces associated with logical processor 1.

Page 478

Trace Cache Misses: The number of times that significant delays occurred in order to decode instructions and build a trace because of a TC miss.

Page 479

Memory Metrics

Page Walk DTLB All Misses: The number of page walk requests due to DTLB misses from either load or store. Event: page_walk_type; event mask: DTMISS.

1st-Level Cache Load Misses Retired: The number of retired μops that experienced 1st-Level cache load misses.

Page 480

IA-32 Intel® Architecture Optimization B-26 64K Aliasing Conflicts 1: The number of 64K aliasing conflicts. A memory reference causing a 64K aliasing conflict can be counted more than once in this stat. The performance penalty resulting from a 64K aliasing conflict can vary from unnoticeable to considerable.

Page 481

Using Performance Monitoring Events B B-27 MOB Load Replays: The number of replayed loads related to the Memory Order Buffer (MOB). This metric counts only the case where the store-forwarding data is not an aligned subset of the stored data.

Page 482

IA-32 Intel® Architecture Optimization B-28 2nd-Level Cache Reads Hit Shared: The number of 2nd-level cache read references (loads and RFOs) that hit the cache line in shared state.

Page 483

Using Performance Monitoring Events B B-29 3rd-Level Cache Reads Hit Modified: The number of 3rd-level cache read references (loads and RFOs) that hit the cache line in modified state.

Page 484

IA-32 Intel® Architecture Optimization B-30 All WCB Evictions: The number of times a WC buffer eviction occurred due to any cause (This can be used to distinguish 64K aliasing cases that contribute more significantly to performance penalty, e.

Page 485

Using Performance Monitoring Events B B-31 Bus Metrics. Bus Accesses from the Processor: The number of all bus transactions that were allocated in the IO Queue from this processor.

Page 486

IA-32 Intel® Architecture Optimization B-32 Prefetch Ratio: Fraction of all bus transactions (including retires) that were for HW or SW prefetching.

Page 487

Using Performance Monitoring Events B B-33 Writes from the Processor: The number of all write transactions on the bus that were allocated in IO Queue from this processor (excludes RFOs).

Page 488

IA-32 Intel® Architecture Optimization B-34 All WC from the Processor: The number of Write Combining memory transactions on the bus that originated from this processor.

Page 489

Using Performance Monitoring Events B B-35 Bus Accesses from All Agents The number of all bus transactions that were allocated in the IO Queue by all agents.

Page 490

IA-32 Intel® Architecture Optimization B-36 Bus Reads Underway from the processor 7: This is an accrued sum of the durations of all read (includes RFOs) transactions by this processor. Divide by “Reads from the Processor” to get bus read request latency.
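The division described for the "Underway" metrics can be sketched in C as follows. This is only an illustration: the function and parameter names are hypothetical, and the two inputs are assumed to be raw counts already collected for "Bus Reads Underway from the processor" and "Reads from the Processor" over the same measurement interval.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average bus read request latency, in whatever clock units the
 * "underway" counter accrues (the unit is not stated in this excerpt).
 * Both parameter names are hypothetical labels for the two metrics.
 */
double avg_bus_read_latency(uint64_t reads_underway,
                            uint64_t reads_from_processor)
{
    /* A zero read count would make the ratio meaningless. */
    assert(reads_from_processor > 0);
    return (double)reads_underway / (double)reads_from_processor;
}
```

For instance, an accrued sum of 1000 over 10 reads would give an average of 100 per read request. The same pattern would apply to the write and BSQ "Underway" metrics with their matching transaction counts.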

Page 491

Using Performance Monitoring Events B B-37 All UC Underway from the processor 7: This is an accrued sum of the durations of all UC transactions by this processor.

Page 492

IA-32 Intel® Architecture Optimization B-38 Bus Writes Underway from the processor 7: This is an accrued sum of the durations of all write transactions by this processor. Divide by “Writes from the Processor” to get bus write request latency.

Page 493

Using Performance Monitoring Events B B-39 Write WC Full (BSQ): The number of write (but neither writeback nor RFO) transactions to WC-type memory. BSQ_allocation 1. REQ_TYPE1 | REQ_LEN0 | REQ_LEN1 | MEM_TYPE0 | REQ_DEM_TYPE 2. Enable edge filtering 6 in the CCCR.

Page 494

IA-32 Intel® Architecture Optimization B-40 Reads Non-prefetch Full (BSQ): The number of read (excludes RFOs and HW|SW prefetches) transactions to WB-type memory. Beware of granularity issues with this event. BSQ_allocation 1. REQ_LEN0 | REQ_LEN1 | MEM_TYPE1 | MEM_TYPE2 | REQ_CACHE_TYPE | REQ_DEM_TYPE 2.

Page 495

Using Performance Monitoring Events B B-41 UC Write Partial (BSQ): The number of UC write transactions. Beware of granularity issues between BSQ and FSB IOQ events. BSQ_allocation 1. REQ_TYPE0 | REQ_LEN0 | REQ_SPLIT_TYPE | REQ_ORD_TYPE | REQ_DEM_TYPE 2.

Page 496

IA-32 Intel® Architecture Optimization B-42 WB Writes Full Underway (BSQ) 8: This is an accrued sum of the durations of writeback (evicted from cache) transactions to WB-type memory. Divide by Writes WB Full (BSQ) to estimate average request latency.

Page 497

Using Performance Monitoring Events B B-43 Write WC Partial Underway (BSQ) 8: This is an accrued sum of the durations of partial write transactions to WC-type memory. Divide by Write WC Partial (BSQ) to estimate average request latency. User note: Allocated entries of WC partials that originate from DWord operands are not included.

Page 498

IA-32 Intel® Architecture Optimization B-44 SSE Input Assists: The number of occurrences of SSE/SSE2 floating-point operations needing assistance to handle an exception condition. The number of occurrences includes speculative counts. SSE_input_assist ALL. Packed SP Retired 3: Non-bogus packed single-precision instructions retired.

Page 499

Using Performance Monitoring Events B B-45 1. A memory reference causing 64K aliasing conflict can be counted more than once in this stat. The resulting performance penalty can vary from unnoticeable to considerable.

Page 500

IA-32 Intel® Architecture Optimization B-46 4. Most commonly used x87 instructions (e.g., fmul, fadd, fdiv, fsqrt, fstp, etc.) decode into a single μop. However, transcendental and some x87 instructions decode into several μops; in these limited cases, the metrics will count the number of μops that are actually tagged.

Page 501

Using Performance Monitoring Events B B-47 Table B-2 Metrics That Utilize Replay Tagging Mechanism. Replay Metric Tags 1; Bit field to set: IA32_PEBS_ENABLE; Bit field to set: MSR_PEBS_MATRIX_V.

Page 502

IA-32 Intel® Architecture Optimization B-48 Tags for front_end_event. Table B-3 provides a list of the tags that are used by various metrics derived from the front_end_event. The event names referenced in column 2 can be found from the Pentium 4 processor performance monitoring events.

Page 503

Using Performance Monitoring Events B B-49 Table B-4 Metrics That Utilize the Execution Tagging Mechanism. Execution Metric Tags; Upstream ESCR; Tag Value in Upstream ESCR; See Event Mask Parameter for Execution_event. Packed_SP_retired: Set the ALL bit in the event mask and the TagUop bit in the ESCR of packed_SP_uop.

Page 504

IA-32 Intel® Architecture Optimization B-50 Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3). Using Performance Metrics with Hyper-Threading Technology. On Intel Xeo.

Page 505

Using Performance Monitoring Events B B-51 The performance metrics listed in Table B-1 fall into three categories: • Logical processor specific and supporting parallel counting. • Logical processor specific but constrained by ESCR limitations. • Logical processor independent and not supporting parallel counting.

Page 506

IA-32 Intel® Architecture Optimization B-52 Branching Metrics: Branches Retired; Tagged Mispredicted Branches Retired; Mispredicted Branches Retired; All returns; All indirect branches; All calls; All c.

Page 507

Using Performance Monitoring Events B B-53 Memory Metrics: Split Load Replays 1; Split Store Replays 1; MOB Load Replays 1; 64k Aliasing Conflicts; 1st-Level Cache Load Misses Retired; 2nd-Level Cache Lo.

Page 508

IA-32 Intel® Architecture Optimization B-54 Bus Metrics: Bus Accesses from the Processor 1; Non-prefetch Bus Accesses from the Processor 1; Reads from the Processor 1; Writes from the Processor 1; Read.

Page 509

Using Performance Monitoring Events B B-55 Characterization Metrics: x87 Input Assists; x87 Output Assists; Machine Clear Count; Memory Order Machine Clear; Self-Modifying Code Clear; Scalar DP Retired.

Page 510

IA-32 Intel® Architecture Optimization B-56 Using Performance Events of Intel Core Solo and Intel Core Duo processors. There are performance events specific to the microarchitecture of Intel Core Solo and Intel Core Duo processors (see Table A-9 of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 3B).

Page 511

Using Performance Monitoring Events B B-57 There are three cycle-counting events which will not progress on a halted core, even if the halted core is being snooped. These are: Unhalted core cycles, Unhalted reference cycles, and Unhalted bus cycles.

Page 512

IA-32 Intel® Architecture Optimization B-58 • Some events, such as writebacks, may have non-deterministic behavior for different runs. In such a case, only measurements collected in the same run yield meaningful ratio values.

Page 513

Using Performance Monitoring Events B B-59 • Serial_Execution_Cycles, event number 3C, unit mask 02H. This event counts the bus cycles during which the core is actively executing code (non-halted) while the other core in the physical processor is halted.
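An event number and unit mask such as the pair above are typically combined into a single event-select value before being programmed into a counter. The sketch below assumes the architectural IA32_PERFEVTSELx layout from the Software Developer's Manual (event select in bits 7:0, unit mask in bits 15:8, USR at bit 16, OS at bit 17, enable at bit 22). The helper name and macro names are hypothetical, and writing the value to the MSR (which requires ring 0) is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits, assuming the architectural IA32_PERFEVTSELx layout. */
#define EVTSEL_USR (1u << 16)   /* count while in user mode   */
#define EVTSEL_OS  (1u << 17)   /* count while in kernel mode */
#define EVTSEL_EN  (1u << 22)   /* enable the counter         */

/* Hypothetical helper: pack event number, unit mask and flags. */
uint32_t make_evtsel(uint8_t event, uint8_t umask, uint32_t flags)
{
    /* Only the three flags defined above are accepted here. */
    assert((flags & ~(EVTSEL_USR | EVTSEL_OS | EVTSEL_EN)) == 0);
    return (uint32_t)event | ((uint32_t)umask << 8) | flags;
}
```

With event number 3CH and unit mask 02H, counting in both user and kernel mode with the counter enabled, this packing yields 0x43023C.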

Page 514

IA-32 Intel® Architecture Optimization B-60.

Page 515

C-1 C IA-32 Instruction Latency and Throughput. This appendix contains tables of the latency, throughput and execution units that are associated with more-commonly-used IA-32 instructions 1. The instruction timing data varies within the IA-32 family of processors.

Page 516

IA-32 Intel® Architecture Optimization C-2 Overview. The current generation of the IA-32 family of processors uses out-of-order execution with dynamic scheduling and buffering to tolerate poor instruction selection and scheduling that may occur in legacy code.

Page 517

IA-32 Instruction Latency and Throughput C C-3 While several items on the above list involve selecting the right instruction, this appendix focuses on the following issues. These are listed in an expected priority order, though which item contributes most to performance will vary by application.

Page 518

IA-32 Intel® Architecture Optimization C-4 Definitions. The IA-32 instruction performance data are listed in several tables. The tables contain the following information: Instruction Name: The assembly mnemonic of each instruction.

Page 519

IA-32 Instruction Latency and Throughput C C-5 accurately predict realistic performance of actual code sequences based on adding instruction latency data. • The instruction latency data are useful when tuning a dependency chain. However, dependency chains limit the out-of-order core’s ability to execute micro-ops in parallel.
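To make the distinction between latency and throughput concrete, the following C sketch computes two rough lower bounds for a short instruction sequence: the sum of latencies when every instruction depends on the previous one, and the sum of throughput values when the instructions are independent. The struct and function names are hypothetical, and, as the text above cautions, neither figure predicts the real performance of actual code; they only bracket it under idealized assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one table row: clocks of latency and throughput. */
typedef struct {
    double latency;
    double throughput;
} insn_timing;

/* Lower bound when each instruction consumes the previous result. */
double chain_latency_bound(const insn_timing *ops, size_t n)
{
    assert(ops != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += ops[i].latency;
    return total;
}

/* Lower bound when the instructions are independent but compete
 * for the same execution unit. */
double issue_throughput_bound(const insn_timing *ops, size_t n)
{
    assert(ops != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += ops[i].throughput;
    return total;
}
```

Using a latency of 5 and a throughput of 2 (the values the tables in this appendix list for MAXPD in the 0F3n column), four dependent instructions cannot finish in fewer than 20 clocks, whereas four independent ones are bounded by 8. Breaking a dependency chain is therefore what lets the out-of-order core approach the throughput figure.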

Page 520

IA-32 Intel® Architecture Optimization C-6 Latency and Throughput with Register Operands. IA-32 instruction latency and throughput data are presented in Table C-2 through Table C-8.

Page 521

IA-32 Instruction Latency and Throughput C C-7 Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions. Instruction; Latency 1; Throughput; Execution Unit 2. CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0.

Page 522

IA-32 Intel® Architecture Optimization C-8 PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm 2 2 1 2 2 1 MMX_ALU; PEXTRW r32, xmm, imm8 7 7 3 2 2 2 MMX_SHFT, FP_MISC; PINSRW xmm, r32, imm8 4 4 1+1 2 2 2 MMX_SHFT.

Page 523

IA-32 Instruction Latency and Throughput C C-9 PSUBB/PSUBW/PSUBD xmm, xmm 2 2 1 2 2 1 MMX_ALU; PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm 2 2 1 2 2 1 MMX_ALU; PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ xmm, xmm 4 4 .

Page 524

IA-32 Intel® Architecture Optimization C-10 COMISD xmm, xmm 7 6 1 2 2 1 FP_ADD, FP_MISC; CVTDQ2PD xmm, xmm 8 8 4+1 3 3 4 FP_ADD, MMX_SHFT; CVTPD2PI mm, xmm 12 11 5 3 3 3 FP_ADD, MMX_SHFT, MMX_ALU.

Page 525

IA-32 Instruction Latency and Throughput C C-11 DIVPD xmm, xmm 70 69 32+31 70 69 62 FP_DIV; DIVSD xmm, xmm 39 38 32 39 38 31 FP_DIV; MAXPD xmm, xmm 5 4 4 2 2 2 FP_ADD; MAXSD xmm, xmm 5 4 3 2 2 1 FP_ADD.

Page 526

IA-32 Intel® Architecture Optimization C-12 Table C-4 Streaming SIMD Extension Single-precision Floating-point Instructions. Instruction; Latency 1; Throughput; Execution Unit 2. CPUID 0F3n 0F2n 0x69.

Page 527

IA-32 Instruction Latency and Throughput C C-13 MOVLHPS 3 xmm, xmm 4 4 2 2 MMX_SHFT; MOVMSKPS r32, xmm 6 6 2 2 FP_MISC; MOVSS xmm, xmm 4 4 2 2 MMX_SHFT; MOVUPS xmm, xmm 6 6 1 1 FP_MOVE; MULPS x.

Page 528

IA-32 Intel® Architecture Optimization C-14 Table C-5 Streaming SIMD Extension 64-bit Integer Instructions. Instruction; Latency 1; Throughput; Execution Unit. CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n.

Page 529

IA-32 Instruction Latency and Throughput C C-15 PCMPGTB/PCMPGTD/PCMPGTW mm, mm 2 2 1 1 MMX_ALU; PMADDWD 3 mm, mm 9 8 1 1 FP_MUL; PMULHW/PMULLW 3 mm, mm 9 8 1 1 FP_MUL; POR mm, mm 2 2 1 1 .

Page 530

IA-32 Intel® Architecture Optimization C-16 Table C-7 IA-32 x87 Floating-point Instructions. Instruction; Latency 1; Throughput; Execution Unit 2. CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n 0F2n. FABS 3 .

Page 531

IA-32 Instruction Latency and Throughput C C-17 FSCALE 4 60 7; FRNDINT 4 30 11; FXCH 5 0 1 FP_MOVE; FLDZ 6 0; FINCSTP/FDECSTP 6 0. See “Table Footnotes”. Table C-8 IA-32 General Purpose Instructi.

Page 532

IA-32 Intel® Architecture Optimization C-18 Jcc 7 Not Applicable 0.5 ALU; LOOP 8 1.5 ALU; MOV 1 0.5 0.5 0.5 ALU; MOVSB/MOVSW 1 0.5 0.5 0.5 ALU; MOVZB/MOVZW 1 0.5 0.5 0.5 ALU; NEG/NOT/NOP 1 0.5 0.5 0.5 ALU; POP r32 1.5 1 MEM_LOAD, ALU; PUSH 1.5 1 MEM_STORE, ALU; RCL/RCR reg, 1 8 64 1 1; ROL/ROR 1 4 0 .

Page 533

IA-32 Instruction Latency and Throughput C C-19 Table Footnotes. The following footnotes refer to all tables in this appendix. 1. Latency information for many of the instructions that are complex (> 4 μops) is estimated conservatively, based on worst-case assumptions.

Page 534

IA-32 Intel® Architecture Optimization C-20 4. Latency and throughput of transcendental instructions can vary substantially in a dynamic execution environment. Only an approximate value or a range of values is given for these instructions. 5. The FXCH instruction has 0 latency in code sequences.

Page 535

IA-32 Instruction Latency and Throughput C C-21 For the sake of simplicity, all data being requested is assumed to reside in the first level data cache (cache hit).

Page 536

IA-32 Intel® Architecture Optimization C-22.

Page 537

D-1 D Stack Alignment. This appendix details the alignment of the stacks of data for Streaming SIMD Extensions and Streaming SIMD Extensions 2. Stack Frames. This section describes the stack alignment conventions for both esp-based (normal) and ebp-based (debug) stack frames.

Page 538

IA-32 Intel® Architecture Optimization D-2 alignment for __m64 and double type data by enforcing that these 64-bit data items are at least eight-byte aligned (they will now be 16-byte aligned).

Page 539

Stack Alignment D D-3 As an optimization, an alternate entry point can be created that can be called when proper stack alignment is provided by the caller.

Page 540

Stack Alignment D D-4 Example D-1 in the following sections illustrates this technique. Note the entry points foo and foo.aligned; the latter is the alternate aligned entry point.

Page 541

Stack Alignment D D-5 Example D-1 Aligned esp-Based Stack Frames
void _cdecl foo (int k)
{
    int j;
foo:                          // See Note A
    push ebx
    mov  ebx, esp
    sub  esp, 0x00000008
    and  esp, 0xfffffff0
    add  esp, 0x00000008
    jmp  common
foo.

Page 542

Stack Alignment D D-6 Aligned ebp-Based Stack Frames. In ebp-based frames, padding is also inserted immediately before the return address. However, this frame is slightly unusual in that the return address may actually reside in two different places in the stack.

Page 543

Stack Alignment D D-7 Example D-2 Aligned ebp-based Stack Frames
void _stdcall foo (int k)
{
    int j;
foo:
    push ebx
    mov  ebx, esp
    sub  esp, 0x00000008
    and  esp, 0xfffffff0
    add  esp, 0x00000008      // esp is (8 mod 16) after add
    jmp  common
foo.
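The effect of the sub/and/add sequence at the unaligned entry point can be checked with ordinary integer arithmetic. The C sketch below models esp as a 32-bit unsigned value and applies the same three operations; the function name is hypothetical and the model ignores everything else in the prologue.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the three prologue instructions acting on esp. */
uint32_t align_esp_8_mod_16(uint32_t esp)
{
    assert(esp >= 8u);        /* keep the model free of wraparound */
    esp -= 0x00000008u;       /* sub esp, 0x00000008 */
    esp &= 0xfffffff0u;       /* and esp, 0xfffffff0 : round down to 16 */
    esp += 0x00000008u;       /* add esp, 0x00000008 : now 8 mod 16 */
    return esp;
}
```

For an incoming esp of 0x0012ff7c the result is 0x0012ff78. In general the result never exceeds the incoming value (the stack only grows downward) and is always congruent to 8 modulo 16, which matches the comment in the listing.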

Page 544

Stack Alignment D D-8
    // the goal is to make esp and ebp
    // (0 mod 16) here
    j = k;
    mov  edx, [ebx + 8]       // k is (0 mod 16) if caller aligned
                              // its stack
    mov  [ebp - 16], edx      // J is (0 mod 16)
    foo(5);
    add  esp, -4              // normal call sequence to
                              // unaligned entry
    mov  [esp],5
    call foo                  // for stdcall, callee
                              // cleans up stack
foo.

Page 545

Stack Alignment D D-9 Stack Frame Optimizations. The Intel C++ Compiler provides certain optimizations that may improve the way aligned frames are set up and used.

Page 546

IA-32 Intel® Architecture Optimization D-10 Inlined Assembly and ebx. When using aligned frames, the ebx register generally should not be modified in inlined assembly blocks since ebx is used to keep track of the argument block.

Page 547

E-1 E Mathematics of Prefetch Scheduling Distance. This appendix discusses how far away to insert prefetch instructions. It presents a mathematical model allowing you to deduce a simplified equation which you can use for determining the prefetch scheduling distance (PSD) for your application.

Page 548

IA-32 Intel® Architecture Optimization E-2 Ninst is the number of instructions in the scope of one loop iteration. Consider the following example of a heuristic equation assuming that parameters have the values as indicated: where 60 corresponds to Nlookup, 25 to Nxfer, and 1.
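The heuristic equation itself does not survive in this excerpt. The C sketch below is therefore only an assumed reconstruction of its general shape, psd = ceil((Nlookup + Nxfer * (Npref + Nst)) / (CPI * Ninst)), where Npref and Nst are taken to be the number of cache lines prefetched and stored per iteration. Those two names, the formula's form, and the function name are all assumptions; only Nlookup, Nxfer and Ninst appear in the text above, and the CPI value is left as a parameter because the text is cut off at that point.

```c
#include <assert.h>

/*
 * Assumed form (not shown in this excerpt):
 *   psd = ceil((Nlookup + Nxfer * (Npref + Nst)) / (CPI * Ninst))
 */
unsigned long heuristic_psd(double n_lookup, double n_xfer,
                            unsigned n_pref, unsigned n_st,
                            double cpi, unsigned n_inst)
{
    assert(cpi > 0.0 && n_inst > 0);
    double q = (n_lookup + n_xfer * (double)(n_pref + n_st))
             / (cpi * (double)n_inst);
    unsigned long psd = (unsigned long)q;   /* truncate toward zero */
    if ((double)psd < q)
        psd++;                              /* round up to a whole iteration */
    return psd;
}
```

With Nlookup = 60 and Nxfer = 25 as given above, and purely illustrative values of one prefetched line, one stored line, a CPI of 1.0 and 20 instructions per iteration, the quotient is 5.5, so the sketch returns a distance of 6 iterations.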

Page 549

Mathematics of Prefetch Scheduling Distance E E-3 Tb: data transfer latency, which is equal to number of lines per iteration * line burst latency. Note that the potential effects of µop reordering are not factored into the estimations discussed.

Page 550

IA-32 Intel® Architecture Optimization E-4 Memory access plays a pivotal role in prefetch scheduling. For a better understanding of the memory subsystem, consider the Streaming SIMD Extensions and Streaming SIMD Extensions 2 memory pipeline depicted in Figure E-1.

Page 551

Mathematics of Prefetch Scheduling Distance E E-5 Tl varies dynamically and is also system hardware-dependent. The static variants include the core-to-front-side-bus ratio, memory manufacturer and memory controller (chipset).

Page 552

IA-32 Intel® Architecture Optimization E-6 No Preloading or Prefetch. The traditional programming approach does not perform data preloading or prefetch. It is sequential in nature and will experience stalls because the memory is unable to provide the data immediately when the execution pipeline requires it.

Page 553

Mathematics of Prefetch Scheduling Distance E E-7 The iteration latency is approximately equal to the computation latency plus the memory leadoff latency (includes cache miss latency, chipset latency, bus arbitration, and so on) plus the data transfer latency, where transfer latency = number of lines per iteration * line burst latency.
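The sum described above can be written out directly. In this C sketch, tc, tl and the two transfer parameters stand for the computation latency, the memory leadoff latency, the number of lines per iteration and the line burst latency; the function name is hypothetical and all values are in the same (unspecified) clock units.

```c
#include <assert.h>

/* Iteration latency with no preloading or prefetch: Tc + Tl + Tb,
 * where Tb = lines per iteration * line burst latency. */
unsigned iteration_latency_no_prefetch(unsigned tc, unsigned tl,
                                       unsigned lines_per_iter,
                                       unsigned line_burst_latency)
{
    assert(lines_per_iter > 0);
    unsigned tb = lines_per_iter * line_burst_latency;  /* data transfer latency */
    return tc + tl + tb;
}
```

As an illustration with made-up numbers, a computation latency of 10, a leadoff latency of 18 and two lines at a burst latency of 4 each give 10 + 18 + 8 = 36 per iteration, of which only 10 is useful computation.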

Page 554

IA-32 Intel® Architecture Optimization E-8 The following formula shows the relationship among the parameters: It can be seen from this relationship that the iteration latency is equal to the computation latency, which means the memory accesses are executed in background and their latencies are completely hidden.

Page 555

Mathematics of Prefetch Scheduling Distance E E-9 For this particular example the prefetch scheduling distance is greater than 1. Data being prefetched for iteration i will be consumed in iteration i+2.

Page 556

IA-32 Intel® Architecture Optimization E-10 Memory Throughput Bound (Case: Tb >= Tc). When the application or loop is memory throughput bound, there is no way to hide the memory latency. Under such circumstances, the burst latency is always greater than the compute latency.

Page 557

Mathematics of Prefetch Scheduling Distance E E-11 memory to you cannot do much about it. Typically, data copy from one space to another space, for example, graphics driver moving data from writeback memory to write-combining memory, belongs to this category, where performance advantage from prefetch instructions will be marginal.

Page 558

IA-32 Intel® Architecture Optimization E-12 Now for the case Tl = 18, Tb = 8 (2 cache lines are needed per iteration) examine the following graph. Consider the graph of accesses per iteration in example 1, Figure E-6. The prefetch scheduling distance is a step function of Tc, the computation latency.
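The step-function behavior can be illustrated numerically. The C sketch below assumes the relationship psd = ceil((Tl + Tb) / Tc) for loops where Tc is greater than Tb; that formula is not reproduced in this excerpt, so treat it as an assumption, and note that the memory-throughput-bound case (Tb >= Tc) is deliberately left out. The function name is hypothetical.

```c
#include <assert.h>

/* Assumed relationship for Tc > Tb: psd = ceil((Tl + Tb) / Tc). */
unsigned psd_for_compute_latency(unsigned tl, unsigned tb, unsigned tc)
{
    assert(tc > tb);                    /* throughput-bound case not modeled */
    return (tl + tb + tc - 1u) / tc;    /* integer ceiling division */
}
```

With Tl = 18 and Tb = 8 the distance stays at 1 for any Tc of 26 or more, steps to 2 once Tc drops below 26 (for example Tc = 13), and reaches 3 at Tc = 9. Under this assumption the distance changes only at discrete thresholds of Tc, which is the step shape the text describes.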

Page 559

Mathematics of Prefetch Scheduling Distance E E-13 In reality, the front-side bus (FSB) pipelining depth is limited, that is, only four transactions are allowed at a time in the Pentium III and Pentium 4 processors.

Page 560

IA-32 Intel® Architecture Optimization E-14.

Page 561

Index-1 Index 64-bit mode default operand size, 8-1 introduction, 8-1 legacy instructions, 8-1 multiplication notes, 8-2 register usage, 8-2, 8-4 sign-extension, 8-3 software prefetch, 8-6 using CVTS.

Page 562

IA-32 Intel® Architecture Optimization Index-2 coding methodologies, 3-13 coding techniques, 3-12 absolute difference of signed numbers, 4-24 absolute difference of unsigned numbers, 4-23 absolute .

Page 563

Index Index-3 floating-point stalls, 2-72 flow dependency, E-7 flush to zero, 5-22 FXCH instruction, 2-70 G general optimization techniques, 2-1 branch prediction, 2-15 static prediction, 2-19 genera.

Page 564

IA-32 Intel® Architecture Optimization Index-4 L large load stalls, 2-37 latency, 2-72, 6-5 lea instruction, 2-74 loading and storing to and from the same DRAM page, 4-39 loop blocking, 3-34 loop u.

Page 565

Index Index-5 O optimizing cache utilization cache management, 6-44 examples, 6-15 non-temporal store instructions, 6-10 prefetch and load, 6-9 prefetch Instructions, 6-8 prefetching, 6-7 SFENCE ins.

Page 566

IA-32 Intel® Architecture Optimization Index-6 R reciprocal instructions, 5-2 rounding control option, A-6 S sampling event-based, A-10 Self-modifying code, 2-47 SFENCE Instruction, 6-15, 6-16 sign.

Page 567

INTEL SALES OFFICES ASIA PACIFIC Australia Intel Corp. Level 2 448 St Kilda Road Melbourne VIC 3004 Australia Fax:613-9862 5599 China Intel Corp. Rm 709, Shaanxi Zhongda Int'l Bldg No.30 Nandajie Street Xian AX710002 China Fax:(86 29) 7203 356 Intel Corp.

Page 568

Intel Corp. 999 CANADA PLACE, Suite 404,#11 Vancouver BC V6C 3E2 Canada Fax:604-844-2813 Intel Corp. 2650 Queensview Drive, Suite 250 Ottawa ON K2B 8H6 Canada Fax:613-820-5936 Intel Corp. 190 Attwell Drive, Suite 500 Rexcdale ON M9W 6H8 Canada Fax:416-675-2438 Intel Corp.
