Instruction/ maintenance manual of the product ARCHITECTURE IA-32 Intel
IA-32 Intel® Architecture Optimization Reference Manual
Order Number: 248966-013US
April 2006
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
Contents
Introduction
Chapter 1 IA-32 Intel® Architecture Processor Family Overview
SIMD Technology
Summary of SIMD Technologies
Out-of-Order Core
In-Order Retirement
Branch Prediction
Eliminating Branches
Floating-Point Stalls
x87 Floating-point Operations with Integer Operands
x87 Floating-point Comparison Instructions
Considerations for Code Conversion to SIMD Programming
Identifying Hot Spots
Determine If Code Benefits by Conversion to SIMD Execution
Packed Shuffle Word for 64-bit Registers
Packed Shuffle Word for 128-bit Registers
Unpacking/interleaving 64-bit Data in 128-bit Registers
Data Alignment
Data Arrangement
Hardware Prefetch
Example of Effective Latency Reduction with H/W Prefetch
Example of Latency Hiding with S/W Prefetch Instruction
Key Practices of System Bus Optimization
Key Practices of Memory Optimization
Key Practices of Front-end Optimization
Sign Extension to Full 64-Bits
Alternate Coding Rules for 64-Bit Mode
Time-based Sampling
Event-based Sampling
Using Performance Metrics with Hyper-Threading Technology
Using Performance Events of Intel Core Solo and Intel Core Duo processors
Understanding the Results in a Performance Counter
Examples
Example 2-1 Assembly Code with an Unpredictable Branch
Example 2-2 Code Optimization to Eliminate Branches
Example 2-3 Eliminating Branch with CMOV Instruction
Example 3-4 Identification of SSE2 with cpuid
Example 3-5 Identification of SSE2 by the OS
Example 3-6 Identification of SSE3 with cpuid
Example 4-20 Clipping to an Arbitrary Signed Range [high, low]
Example 4-21 Simplified Clipping to an Arbitrary Signed Range
Example 4-22 Clipping to an Arbitrary Unsigned Range [high, low]
Example 6-12 Memory Copy Using Hardware Prefetch and Bus Segmentation
Example 7-1 Serial Execution of Producer and Consumer Work Items
Example 7-2 Basic Structure of Implementing Producer Consumer Threads
Example 7-3 Thread Function for an Interlaced Producer Consumer Model
Figures
Figure 1-1 Typical SIMD Operations
Figure 1-2 SIMD Instruction Register Usage
Figure 1-3 The Intel NetBurst Microarchitecture
Figure 6-2 Memory Access Latency and Execution Without Prefetch
Figure 6-3 Memory Access Latency and Execution With Prefetch
Figure 6-4 Prefetch and Loop Unrolling
Tables
Table 1-1 Pentium 4 and Intel Xeon Processor Cache Parameters
Table 1-3 Cache Parameters of Pentium M, Intel® Core™ Solo and Intel® Core™ Duo Processors
Table C-5 Streaming SIMD Extension 64-bit Integer Instructions
Table C-7 IA-32 x87 Floating-point Instructions
Table C-8 IA-32 General Purpose Instructions
Introduction
The IA-32 Intel® Architecture Optimization Reference Manual describes how to optimize software to take advantage of the performance characteristics of the current generation of IA-32 Intel architecture family of processors.
target the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture.
Tuning Your Application
Tuning an application for high per.
The manual consists of the following parts:
Introduction. Defines the purpose and outlines the contents of this manual.
Chapter 1: IA-32 Intel® Architecture Processor Family Overview.
Chapter 7: Multiprocessor and Hyper-Threading Technology. Describes guidelines and techniques for optimizing multithreaded applications to achieve optimal performance scaling.
Related Documentation
For more information on the Intel architecture, specific techniques, and processor architecture terminology referenced in this manual, see the following doc.
Notational Conventions
This manual uses the following conventions:
This type style: Indicates an element of syntax, a reserved word, a keyword, a filename, instruction, computer output, or part of a program example.
Chapter 1: IA-32 Intel® Architecture Processor Family Overview
This chapter gives an overview of the features relevant to software optimization for the current generations of IA-32 processors, i.
Intel Core Solo and Intel Core Duo processors incorporate microarchitectural enhancements for performance and power efficiency that are in addition to those introduced in the Pentium M processor.
each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are stored as a set of four packed data elements.
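The packed operation described above can be modeled in portable C. This is only a sketch of the semantics, not SIMD code: the type `packed4f` and the function `packed_add` are hypothetical names introduced here, and addition is chosen arbitrarily as the operation applied to each pair.

```c
#include <stddef.h>

/* Hypothetical 4-lane packed type standing in for one 128-bit SIMD register. */
typedef struct {
    float lane[4];
} packed4f;

/* Applies the same operation (addition, chosen for illustration) to each
 * corresponding pair X[i], Y[i] and returns the four results as one packed value. */
packed4f packed_add(packed4f x, packed4f y)
{
    packed4f r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = x.lane[i] + y.lane[i];
    return r;
}
```

A real SIMD instruction performs the four lane operations in parallel; the loop here only documents which elements are paired.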
SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and applications that have the following cha.
SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can improve cache usage and application performance.
SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.
Intel® Extended Memory 64 Technology (Intel® EM64T)
Intel EM64T is an extension of the IA-32 Intel architecture. Intel EM64T increases the linear address space for software to 64 bits and supports physical address space up to 40 bits.
Intel NetBurst® Microarchitecture
The Pentium 4 processor, Pentium 4 processor Extreme Edition supporting Hyper-Threading Technology, Pentium D processor, Pentium processor Extreme Edition and the Intel Xeon processor implement the Intel NetBurst microarchitecture.
• to operate at high clock rates and to scale to higher performance and clock rates in the future
Design advances of the Intel NetBurst mic.
The out-of-order core aggressively reorders µops so that µops whose inputs are ready (and have execution resources available) can execute as soon as possible. The core can issue multiple µops per cycle.
The Front End
The front end of the Intel NetBurst microarchitecture consists of two parts:
• fetch/decode unit
• execution trace cache
I.
The execution trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear address using branch prediction logic and fetched as soon as possible.
correct execution, the results of IA-32 instructions must be committed in original program order before they are retired. Exceptions may be raised as instructions are retired. For this reason, exceptions cannot occur speculatively.
• a mechanism fetches data only and includes two distinct components: (1) a hardware mechanism to fetch the adjacent cache line within a 128-byte sect.
Branch Prediction
Branch prediction is important to the performance of a deeply pipelined processor. It enables the processor to begin executing instructions long before the branch outcome is certain.
To take advantage of the forward-not-taken and backward-taken static predictions, code should be arranged so that the likely target of the branch immediately follows forward branches (see also: “Branch Prediction” in Chapter 2).
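At the source level, this layout advice roughly corresponds to keeping the common path as the fall-through and moving rare cases behind forward branches. The sketch below uses a hypothetical function `sum_checked`; the final branch layout is decided by the compiler, so this shows the intent rather than a guaranteed code shape.

```c
#include <stddef.h>

/* Sums n values into *out. Returns 0 on success, -1 on invalid arguments. */
int sum_checked(const int *v, int n, int *out)
{
    /* Rare error case: a forward branch that is expected not to be taken,
     * so the likely code (the loop) immediately follows it. */
    if (v == NULL || out == NULL)
        return -1;

    int s = 0;
    /* The loop-closing branch jumps backward and is expected to be taken. */
    for (int i = 0; i < n; ++i)
        s += v[i];

    *out = s;
    return 0;
}
```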
Some parts of the core may speculate that a common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains to store-to-load forwarding (see “Store Forwarding” in this chapter).
execution units are not pipelined (meaning that µops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle). The number of µops associated with each instruction provides a basis for selecting instructions to generate.
Caches
The Intel NetBurst microarchitecture supports up to three levels of on-chip cache. At least two levels of on-chip cache are implemented in processors based on the Intel NetBurst microarchitecture.
Levels in the cache hierarchy are not inclusive. The fact that a line is in level i does not imply that it is also in level i+1. All caches use a pseudo-LRU (least recently used) replacement algorithm.
back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion. Each bus cycle equals several processor cycles. The ratio of processor clock speed to the scalable bus clock speed is referred to as bus ratio.
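The bus ratio makes it easy to translate a latency quoted in bus cycles into processor cycles. The helpers below are a minimal sketch with hypothetical names; the 3000 MHz and 200 MHz figures in the usage note are illustrative assumptions, not values taken from this manual.

```c
/* Bus ratio: processor clock divided by scalable bus clock (same units). */
double bus_ratio(double cpu_mhz, double bus_mhz)
{
    return cpu_mhz / bus_mhz;
}

/* Converts a latency expressed in bus cycles into processor cycles. */
double bus_to_cpu_cycles(double bus_cycles, double ratio)
{
    return bus_cycles * ratio;
}
```

For an assumed 3000 MHz processor on a 200 MHz scalable bus clock, `bus_ratio` gives 15, so the 12-bus-cycle end of the range above would cost about 180 processor cycles.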
• avoids the need to access off-chip caches, which can increase the realized bandwidth compared to a normal load-miss, which returns data to all cache.
Hardware prefetching for Pentium 4 processor has the following characteristics:
• works with existing applications
• does not require ext.
Thus, software optimization of a data access pattern should first emphasize tuning for hardware prefetch, favoring greater proportions of smaller-stride data accesses in the workload, before attempting to provide hints to the processor by employing software prefetch instructions.
Reordering loads with respect to each other can prevent a load miss from stalling later loads. Reordering loads with respect to other loads and stores to different addresses can enable more parallelism, allowing the machine to execute operations as soon as their inputs are ready.
Intel® Pentium® M Processor Microarchitecture
Like the Intel NetBurst microarchitecture, the pipeline of the Intel Pentium M processor microarchit.
The Intel Pentium M processor microarchitecture is designed for lower power consumption. There are other specific areas of the Pentium M processor microarchitecture that differ from the Intel NetBurst microarchitecture.
The fetch and decode unit includes a hardware instruction prefetcher and three decoders that enable parallelism.
• Micro-ops (µops) fusion. Some of the most frequent pairs of µops derived from the same instruction can be fused into a single µop.
Data is fetched 64 bytes at a time; the instruction and data translation lookaside buffers support 128 entries. See Table 1-3 for processor cache parameters.
Out-of-Order Core
The processor core dynamically executes µops independent of program order.
In-Order Retirement
The retirement unit in the Pentium M processor buffers completed µops in the reorder buffer (ROB). The ROB updates the architectural state in order. Up to three µops may be retired per cycle.
• Power-optimized bus
The system bus is optimized for power efficiency; increased bus speed supports 667 MHz.
• Data Prefetch
Intel Core Solo and Intel Core Duo processors implement improved hardware prefetch mechanisms: one mechanism can look ahead and prefetch data into L1 from L2.
Data Prefetching
Intel Core Solo and Intel Core Duo processors provide hardware mechanisms to prefetch data from memory to the second-level cache.
The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources.
In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor.
Processor Resources and Hyper-Threading Technology
The majority of microarchitecture resources in a physical processor are shared between the logical processors. Only a few small data structures were replicated for each logical processor.
For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from blocking forward progress.
Microarchitecture Pipeline and Hyper-Threading Technology
This section describes the HT Technology microarchitecture and how instructions from the two logical processors are handled between the front end and the back end of the pipeline.
Execution Core
The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops are placed in the queues waiting for execution, there is no distinction between instructions from the two logical processors.
Pentium Processor Extreme Edition provide four logical processors in a physical package that has two execution cores. Each core provides two logical processors sharing an execution core and a cache hierarchy.
Figure 1-7 Pentium D Processor, Pentium Processor Extreme Edition and Intel Core Duo Processor (figure not reproduced; recoverable labels: System Bus, Architectural State)
Microarchitecture Pipeline and Multi-Core Processors
In general, each core in a multi-core processor resembles a single-core processor implementation of the underlying microarchitecture.
that the cache line that contains the memory location is owned by the first-level data cache of the initiating core (that is, the line is in exclusive or modified state). Then the processor looks for the cache line in the cache and memory sub-systems.
when data is written back to memory, the eviction consumes cache bandwidth and bus bandwidth. For multiple cache misses that require the eviction of modified lines and occur within a short time, there is an overall degradation in response time of these cache misses.
Chapter 2: General Optimization Guidelines
This chapter discusses general optimization techniques that can improve the performance of applications running on the Intel Pentium 4, Intel Xeon, Pentium M processors, as well as on dual-core processors.
The following sections describe practices, tools, coding rules and recommendations associated with these factors that will aid in optimizing the performance on IA-32 processors.
* Streaming SIMD Extensions (SSE)
** Streaming SIMD Extensions 2 (SSE2)
General Practices and Coding Guidelines
This section discusses guidelines derived from the performance factors listed in the “Tuning to Achieve Optimum Performance” section.
Use Available Performance Tools
• Current-generation compiler, such as the Intel C++ Compiler:
  - Set this compiler to produce code for the target processor implementation
  - Use the compiler switches for optimization and/or profile-guided optimization.
Optimize Branch Predictability
• Improve branch predictability and optimize instruction prefetching by arranging code to be consistent with the static branch prediction assumption: backward taken and forward not taken.
• Minimize use of global variables and pointers.
• Use the const modifier; use the static modifier for global variables.
• Avoid longer latency instructions: integer multiplies and divides. Replace them with alternate code sequences (e.g., use shifts instead of multiplies).
• Use the lea instruction and the full range of addressing modes to do address calculation.
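The two bullets above can be illustrated in C with shift-and-add forms of small constant multiplies. The function names are hypothetical, and an optimizing compiler will usually perform these rewrites on its own, so treat this as a sketch of the transformation rather than something to apply by hand everywhere.

```c
#include <stdint.h>

/* x * 8 expressed as a shift. */
uint32_t times8(uint32_t x)
{
    return x << 3;
}

/* x * 5 expressed as x + x*4, the shape that a single
 * lea with a scaled index (base + index*4) can compute. */
uint32_t times5(uint32_t x)
{
    return x + (x << 2);
}

/* x / 16 expressed as a shift; valid as written for unsigned values only. */
uint32_t div16(uint32_t x)
{
    return x >> 4;
}
```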
• Avoid the use of conditionals.
• Keep induction (loop) variable expressions simple.
• Avoid using pointers; try to replace pointers with arrays and indices.
Coding Rules, Suggestions and Tuning Hints
This chapter includes rules, suggestions and hints.
Performance Tools
Intel offers several tools that can facilitate optimizing your application’s performance.
Intel® C++ Compiler
Use the Intel C++ Compiler following the recommendations described here.
General Compiler Recommendations
A compiler that has been extensively tuned for the target microarchitecture can be expected to match or outperform hand-coding in a general case.
The VTune Performance Analyzer also enables engineers to use these counters to measure a number of workload characteristics, including:
• retirement throughp.
Intel Core Solo and Intel Core Duo processors have an enhanced front end that is less sensitive to the 4-1-1 template. The practice has no real impact on processors based on the Intel NetBurst microarchitecture.
• On the Pentium 4 and Intel Xeon processors, the primary code size limit of interest is imposed by the trace cache.
Transparent Cache-Parameter Strategy
If the CPUID instruction supports function leaf 4, also known as the deterministic cache parameter leaf, this function le.
Branch Prediction
Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving the predictability of branches, you can increase the speed of code significantly.
Assembly/Compiler Coding Rule 1. (MH impact, H generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches.
See Example 2-2. The optimized code sets ebx to zero, then compares A and B. If A is greater than or equal to B, ebx is set to one. Then ebx is decreased and “and-ed” with the difference of the constant values.
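The sequence described above can be followed step by step in C. This is a sketch and not the manual's Example 2-2: the function `select_const` and the parameters `c1` and `c2` are hypothetical names, and the final addition of the second constant is assumed from the usual form of this idiom, since the text shown here stops after the “and” step.

```c
#include <stdint.h>

/* Branch-free equivalent of: (a < b) ? c1 : c2 */
int32_t select_const(int32_t a, int32_t b, int32_t c1, int32_t c2)
{
    uint32_t x = (uint32_t)(a >= b);    /* 1 if a >= b, otherwise 0          */
    x -= 1u;                            /* 0 if a >= b, all ones if a < b    */
    x &= (uint32_t)c1 - (uint32_t)c2;   /* keeps (c1 - c2) only when a < b   */
    x += (uint32_t)c2;                  /* assumed last step: c1 or c2       */
    return (int32_t)x;
}
```

Unsigned arithmetic is used so that the wrap-around in the decrement and the subtraction is well defined in C.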
The cmov and fcmov instructions are available on the Pentium II and subsequent processors, but not on Pentium processors and earlier 32-bit Intel architecture processors. Be sure to check whether a processor supports these instructions with the cpuid instruction.
Static Prediction
Branches that do not have a history in the BTB (see the “Branch Prediction” section) are predicted using a static prediction algorithm.
Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-t.
Example 2-6 and Example 2-7 provide basic rules for a static prediction algorithm. In Example 2-6, the backward branch (JC Begin) is not in the BTB the first time through; therefore, the BTB does not issue a prediction.
Inlining, Calls and Returns
The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs.
Assembly/Compiler Coding Rule 6. (H impact, M generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.
Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of adds to memory destinations, which can cause resource conflicts and slow down branch recovery.
indirect branch into a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history.
best performance from a coding effort. An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 2-9.
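In C, the peeling idea can be sketched with a function pointer: the favored target is tested with an ordinary conditional branch and called directly, and only the remaining targets go through the indirect call. This is not the manual's Example 2-9; `handler_fn`, `common_handler`, `rare_handler` and `dispatch` are hypothetical names chosen for illustration.

```c
typedef int (*handler_fn)(int);

/* Hypothetical handlers; common_handler stands for the most favored target. */
int common_handler(int x) { return x + 1; }
int rare_handler(int x)   { return x * 2; }

/* Dispatches to fn, peeling the favored target out of the indirect call. */
int dispatch(handler_fn fn, int arg)
{
    if (fn == common_handler)         /* conditional branch, predicted from history */
        return common_handler(arg);   /* direct call to the peeled target           */
    return fn(arg);                   /* remaining targets stay indirect            */
}
```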
• The Pentium 4 processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop.
In this example, a loop that executes 100 times assigns x to every even-numbered element and y to every odd-numbered element. By unrolling the loop you can make both assignments each iteration, removing one branch in the loop body.
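A C rendering of this transformation might look as follows. It is a sketch written from the description above rather than a copy of the manual's example, and the names `fill_branchy`, `fill_unrolled` and `N` are assumptions.

```c
#define N 100

/* Original form: one conditional branch inside every iteration. */
void fill_branchy(int *a, int x, int y)
{
    for (int i = 0; i < N; ++i) {
        if (i % 2 == 0)
            a[i] = x;   /* even-numbered element */
        else
            a[i] = y;   /* odd-numbered element  */
    }
}

/* Unrolled by two: both assignments happen in each iteration,
 * so the even/odd test disappears from the loop body. */
void fill_unrolled(int *a, int x, int y)
{
    for (int i = 0; i < N; i += 2) {
        a[i] = x;
        a[i + 1] = y;
    }
}
```

The unrolled form relies on N being even; an odd trip count would need a cleanup step after the loop.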
Memory Accesses
This section discusses guidelines for optimizing code and data memory accesses. The most important recommendations are:
• align data, paying a.
Assembly/Compiler Coding Rule 16. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.
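One way to request 16-byte alignment in C is the C11 `alignas` specifier, which postdates this manual and is used here as an assumption about the toolchain. The buffer `samples` and the helpers `get_samples` and `is_aligned16` are hypothetical names.

```c
#include <stdalign.h>
#include <stdint.h>

/* Buffer aligned to 16 bytes so that aligned 128-bit vector
 * loads and stores can be used on it. */
static alignas(16) float samples[64];

/* Returns the start of the aligned buffer. */
const float *get_samples(void)
{
    return samples;
}

/* Returns nonzero if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A check such as `is_aligned16` can be useful in debug builds before choosing between aligned and unaligned code paths.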
Alignment of code is less of an issue for the Pentium 4 processor. Alignment of branch targets to maximize bandwidth of fetching cached instructions is an issue only when not executing out of the trace cache.
Store Forwarding
The processor’s memory system only sends stores to memory (including cache) after store retirement. However, store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency.
If a variable is known not to change between when it is stored and when it is used again, the register that was stored can be copied or used directly. If register pressure is too high, or an unseen function is called before the store and the second load, it may not be possible to eliminate the second load.
The size and alignment restrictions for store forwarding are illustrated in Figure 2-2. Coding rules to help programmers satisfy size and alignment restrictions for store forwarding follow.
Assembly/Compiler Coding Rule 18.
A load that forwards from a store must wait for the store’s data to be written to the store buffer before proceeding, but other, unrelated loads need not wait.
Example 2-14 illustrates a stalled store-forwarding situation that may appear in compiler generated code. Sometimes a compiler generates code similar to that shown in Example 2-14 to handle a byte spilled to the stack and convert the byte to an integer value.
When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are more efficient (if aligned) and can be used to avoid unaligned loads.
Store-forwarding Restriction on Data Availability
The value to be stored must be available before the load operation can be completed. If this restriction is violated, the execution of the load will be delayed until the data is available.
An example of a loop-carried dependence chain is shown in Example 2-17.
Data Layout Optimizations
User/Source Coding Rule 2. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.
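A small C structure shows what the padding rule asks for. The type `struct record` and its fields are hypothetical; the explicit `pad0` member simply makes visible the padding that places each later field on its natural boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Every member starts at an offset that is a multiple of its own size. */
struct record {
    uint8_t  tag;       /* offset 0                                        */
    uint8_t  pad0[3];   /* explicit padding so len starts at offset 4      */
    uint32_t len;       /* offset 4: aligned to 4 bytes                    */
    uint64_t id;        /* offset 8: aligned to 8 bytes                    */
};
```

The layout can be checked with `offsetof`, for example `offsetof(struct record, id) % sizeof(uint64_t) == 0`. Ordering members from largest to smallest is another common way to reach the same result with less padding.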
Cache line size for Pentium 4 and Pentium M processors can impact streaming applications (for example, multimedia).
However, if the access pattern of the array exhibits locality, such as if the array index is being swept through, then the Pentium 4 processor prefetches data from struct_of_array, even if the elements of the structure are accessed together.
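The two layouts being compared can be written out in C as below. Only the name struct_of_array comes from the text; the types `struct elem` and `struct soa`, the field names, and `NUM_ELEMS` are assumptions made for this sketch.

```c
#define NUM_ELEMS 4

/* Array-of-structures layout: a and b of one element are adjacent in memory. */
struct elem {
    int a;
    int b;
};

/* Structure-of-arrays layout (the struct_of_array shape):
 * all a values are contiguous, followed by all b values. */
struct soa {
    int a[NUM_ELEMS];
    int b[NUM_ELEMS];
};

/* Sweeps field a in the array-of-structures layout (stride of one struct). */
long sum_a_aos(const struct elem *v, int n)
{
    long s = 0;
    for (int i = 0; i < n; ++i)
        s += v[i].a;
    return s;
}

/* Sweeps field a in the structure-of-arrays layout (unit stride). */
long sum_a_soa(const struct soa *p, int n)
{
    long s = 0;
    for (int i = 0; i < n; ++i)
        s += p->a[i];
    return s;
}
```

Both functions compute the same sum; they differ only in the memory access pattern that the hardware prefetcher observes.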
non-sequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can recognize up to eight concurrent streams. See Chapter 6 for more information on the hardware prefetcher.
If for some reason it is not possible to align the stack for 64-bits, the routine should access the parameter and save it into a register or known aligned storage, thus incurring the penalty only once.
Capacity Limits in Set-Associative Caches
Capacity limits may occur if the number of outstanding memory references that are mapped to the same set in each way of a given cache exceeds the number of ways of that cache.
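Whether two addresses compete for the same set follows from simple index arithmetic. The sketch below assumes a hypothetical cache geometry (64-byte lines, 32 sets, 4 ways) chosen only to make the numbers easy to follow; it does not describe any particular processor.

```c
#include <stdint.h>

/* Hypothetical cache geometry, for illustration only. */
enum { LINE_SIZE = 64, NUM_SETS = 32, NUM_WAYS = 4 };

/* Set selected by an address: line number modulo the number of sets. */
unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Returns nonzero if both addresses map to the same set. */
int same_set(uintptr_t a, uintptr_t b)
{
    return set_index(a) == set_index(b);
}
```

With this geometry, addresses that are LINE_SIZE * NUM_SETS = 2048 bytes apart share a set, so touching more than NUM_WAYS such lines in a short window forces evictions even though the rest of the cache may be empty.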
Aliasing Cases in the Pentium® 4 and Intel® Xeon® Processors
Aliasing conditions that are specific to the Pentium 4 processor and Intel Xeon processor are:
• 16K for code – there can only be one of these in the trace cache at a time.
Aliasing Cases in the Pentium M Processor
Pentium M, Intel Core Solo and Intel Core Duo processors have the following aliasing case:
• Store forw.
Mixing Code and Data
The Pentium 4 processor’s aggressive prefetching and pre-decoding of instructions has two related effects:
• Self-modifying code works correctly, according to the Intel architecture processor requirements, but incurs a significant performance penalty.
and cross-modifying code (when more than one processor in a multi-processor system is writing to a code page) should be avoided when high performance is desired.
write misses; only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.
Assembly/Compiler Coding Rule 28.
be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see the Intel Architecture Software Developer’s Manual.
Locality enhancement to the last level cache can be accomplished by sequencing the data access pattern to take advantage of hardware prefetching.
Minimizing Bus Latency
The system bus on Intel Xeon and Pentium 4 processors provides up to 6.4 GB/sec of bandwidth at a 200 MHz scalable bus clock rate. (See the MSR_EBC_FREQUENCY_ID register.) The peak bus bandwidth is even higher with higher bus clock rates.
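The 6.4 GB/sec figure can be reproduced with a short calculation. The sketch assumes a bus that transfers data four times per scalable bus clock over an 8-byte-wide data path; those two parameters are not stated in the text above and are supplied here as assumptions, as is the function name.

```c
/* Peak bandwidth in GB/sec (10^9 bytes per second) for a bus running at
 * bus_clock_mhz with the given transfers per clock and bytes per transfer. */
double peak_bus_gb_per_sec(double bus_clock_mhz,
                           int transfers_per_clock,
                           int bytes_per_transfer)
{
    return bus_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9;
}
```

Under these assumptions, `peak_bus_gb_per_sec(200.0, 4, 8)` evaluates to about 6.4, and the same formula shows how the peak scales with a higher bus clock rate.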
User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should pay attention to favor data access pa.
Example 2-21 Non-temporal Stores and 64-byte Bus Write Transactions
Example 2-22 Non-temporal Stores and Partial Bus Write Transactions
#define STRID.
Prefetching
The Pentium 4 processor has three prefetching mechanisms:
• hardware instruction prefetcher
• software prefetch for data
• hardware prefetch for cache lines of data or instructions.
IA-32 Intel® Ar chitectur e Optimization 2-56 access patterns to suit the hardware prefetcher is highly recommended, and should be a higher -priority consideration than using software prefetch instructions. The hardware prefetcher is best fo r small-stride data access patterns in either direction with cache-miss stride not far from 64 bytes.
General Optimization Guidelines 2 2-57 • new cache line flush instruction • new memory fencing instructions For a detailed description of us ing cacheability instructions, see Chapter 6.
Guidelines for Optimizing Floating-point Code
User/Source Coding Rule 10. (M impact, M generality) Enable the compiler's use of SSE, SSE2 or SSE3 instructions with appropriate switches.
to early out). However, be careful of introducing more than a total of two values for the floating point control word, or there will be a large performance penalty. See “Floating-point Modes”.
desired numeric precision, the size of the look-up table, and taking advantage of the parallelism of the Streaming SIMD Extensions and the Streaming SIMD Extensions 2 instructions.
executing SSE/SSE2/SSE3 instructions and when speed is more important than complying with the IEEE standard. The following paragraphs give recommendations on how to optimize your code to reduce performance degradations related to floating-point exceptions.
Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification.
FPU control word (FCW), such as when performing conversions to integers. On Pentium M, Intel Core Solo and Intel Core Duo processors, FLDCW is improved over previous generations. Specifically, the optimization for FLDCW allows programmers to alternate between two constant values efficiently.
Assembly/Compiler Coding Rule 31. (H impact, M generality) Minimize changes to bits 8-12 of the floating point control word.
If there is more than one change to the rounding, precision and infinity bits and the rounding mode is not important to the result, use the algorithm in Example 2-23 to avoid synchronization issues, the overhead of the fldcw instruction and having to change the rounding mode.
Example 2-23 Algorithm to Avoid Changing the Rounding Mode
_fto132proc
lea ecx,[esp-8]
sub esp,16 ; allocate frame
and ecx,-8 ; align pointer on boundar.
Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of more than two values of the set of rounding, precision and infinity bits.
Assembly/Compiler Coding Rule 33. (H impact, L generality) Minimize the number of changes to the precision mode.
Improving Parallelism and the Use of FXCH
The x87 instruction set relies on the floating point stack for one of its operands.
This in turn allows instructions to be reordered to make instructions available to be executed in parallel. Out-of-order execution precludes the need for using fxch to move instructions for very short distances.
x87 vs.
• Scalar floating-point registers may be accessed directly, avoiding fxch and top-of-stack restrictions. On the Pentium 4 processor, the floating-point register stack may be used simultaneously with XMM registers.
Recommendation: Use the compiler switch to generate SSE2 scalar floating-point code over x87 code. When working with scalar SSE/SSE2 code, pay attention to the need for clearing the content of unused slots in an xmm register and the associated performance impact.
Floating-Point Stalls
Floating-point instructions have a latency of at least two cycles. But, because of the out-of-order nature of Pentium II and the subsequent processors, stalls will not necessarily occur on an instruction or µop basis.
Note that transcendental functions are supported only in x87 floating point, not in Streaming SIMD Extensions or Streaming SIMD Extensions 2.
Instruction Selection
This section explains how to generate optimal assembly code.
Complex Instructions
Assembly/Compiler Coding Rule 40. (ML impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four µops and require multiple cycles to decode.
Use of the inc and dec Instructions
The inc and dec instructions modify only a subset of the bits in the flag register.
CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly with a partial flag register stall and alternative code without the stall is shown in Table 2-2.
Integer Divide
Typically, an integer divide is preceded by a cwd or cdq instruction.
(model 9) does incur a penalty. This is because every operation on a partial register updates the whole register. However, this does mean that there may be false dependencies between any references to partial registers.
Table 2-3 illustrates using movzx to avoid a partial register stall when packing three byte values into a register.
Assembly/Compiler Coding Rule 44. (ML impact, L generality) Use simple instructions that are less than eight bytes in length.
less delay than the partial register update problem mentioned above, but the performance gain may vary. If the additional μop is a critical problem, movsx can sometimes be used as an alternative. Sometimes sign-extended semantics can be maintained by zero-extending operands.
Prefixes and Instruction Decoding
An IA-32 instruction can be up to 15 bytes in length. Prefixes can change the length of an instruction that the decoder must recognize. In some situations, using a length-changing prefix (LCP) causes extra delay in decoding the instruction.
• Processing an instruction with the 0x66 prefix that (i) has a modr/m byte in its encoding and (ii) the opcode byte of the instruction happens to be aligned on byte 14 of an instruction fetch line. The performance delay in this case is approximately twice that of the other two situations.
String move/store instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable.
• Cache eviction: If the amount of data to be processed by a memory routine approaches half the size of the last level on-die cache, temporal locality of the cache may suffer. Using streaming store instructions (for example: movntq, movntdq) can minimize the effect of flushing the cache.
improve address alignment, a small piece of prolog code using movsb/stosb with count less than 4 can be used to peel off the non-aligned data moves before starting to use movsd/stosd.
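The peeling idea can be sketched in C. This is only an illustration under stated assumptions, not the library routine the manual refers to: the function name fill_bytes_peeled is hypothetical, the byte loops stand in for stosb, and the 32-bit stores stand in for stosd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill n bytes at dst with value, peeling off the
   unaligned leading bytes so the bulk loop uses aligned 32-bit stores. */
void fill_bytes_peeled(unsigned char *dst, unsigned char value, size_t n)
{
    assert(dst != NULL);

    /* Prolog: at most 3 byte stores until dst is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = value;
        n--;
    }

    /* Bulk: aligned doubleword stores (the stosd-style part). */
    uint32_t pattern = 0x01010101u * value;
    while (n >= 4) {
        memcpy(dst, &pattern, sizeof pattern);  /* one 32-bit store */
        dst += 4;
        n -= 4;
    }

    /* Epilog: the remaining 0 to 3 tail bytes. */
    while (n > 0) {
        *dst++ = value;
        n--;
    }
}
```

The prolog runs at most three times, so its cost is small and fixed, while every store in the bulk loop lands on a 4-byte boundary.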
Memory routines in the runtime library generated by Intel Compilers are optimized across a wide range of address alignment, counter values, and microarchitectures. In most cases, applications should take advantage of the default memory routines provided by Intel Compilers.
In some situations, the byte count of the data to operate on is known by the context (versus from a parameter passed from a call). One can take a simpler approach than those required for a general-purpose library routine.
Clearing Registers
The Pentium 4 processor provides special support for xor, sub, or pxor operations when executed within the same register. This recognizes that clearing a register does not depend on the old value of the register.
Using a test instruction between the instruction that may modify part of the flag register and the instruction that uses the flag register can also help prevent a partial flag register stall. Assembly/Compiler Coding Rule 52.
Use movapd as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the μops for movapd use a different execution port and this port is more likely to be free. The change can impact performance.
Prolog Sequences
Assembly/Compiler Coding Rule 57. (M impact, MH generality) In routines that do not need a frame pointer and that do not have called routines that modify ESP, use ESP as the base register to free up EBP.
Using memory as a destination operand may further reduce register pressure at the slight risk of making trace cache packing more difficult.
Spill Scheduling
The spill scheduling algorithm used by a code generator will be impacted by the Pentium 4 processor memory subsystem. A spill scheduling algorithm is an algorithm that selects what values to spill to memory when there are too many live values to fit in registers.
Because micro-ops are delivered from the trace cache in the common cases, decoding rules are not required.
Scheduling Rules for the Pentium M Processor Decode.
Data elements in parallel. The number of elements which can be operated on in parallel ranges from four single-precision floating point data elements in S.
User/Source Coding Rule 19. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches.
User/Source Coding Rule 20. (M impact, ML generality) Keep induction (loop) variable expressions simple.
The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware.
User/Source Coding Rules
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets, and at least one of.
User/Source Coding Rule 8. (H impact, H generality) To achieve effective amortization of bus latency, software should.
look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing .
order engine. When tuning, note that all IA-32 based processors have very high branch prediction rates. Consistently mispredicted branches are rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in 16-byte chunks. 2-22
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 18. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data. 2-34
Assembly/Compiler Coding Rule 19.
first-level cache working set. Avoid having more than 8 cache lines that are some multiple of 64 KB apart in the same second-level cache working set. Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KB.
Assembly/Compiler Coding Rule 32. (H impact, L generality) Minimize the number of changes to the rounding mode.
Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and dec instructions should be replaced with an add or sub instruction, because add and sub overwrite all flags, whereas inc and dec do not, therefore creating false dependencies on earlier instructions that set the flags.
instead of a cmp of the register to zero; this saves the need to encode the zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register.
Assembly/Compiler Coding Rule 56. (M impact, ML generality) For arithmetic or logical operations that have their source operand in memory and the destinatio.
Tuning Suggestions
Tuning Suggestion 1. Rarely, a performance problem may be noted due to executing data on a code page as instructions. The only condition where this is likely to happen is following an indirect branch that is not resident in the trace cache.
3 Coding for SIMD Architectures
Intel Pentium 4, Intel Xeon and Pentium M processors include support for Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions technology (SSE), and MMX technology.
Checking for Processor Support of SIMD Technologies
This section shows how to check whether a processor supports MMX technology, SSE, SSE2, or SSE3. SIMD technology can be included in your application in three ways: 1.
For more information on cpuid, see Intel® Processor Identification with CPUID Instruction, order number 241618.
Checking for Streaming SIMD Extensions Support
Checking for support of Streaming SIMD Extensions (SSE) on your processor is like checking for MMX technology.
To find out whether the operating system supports SSE, execute an SSE instruction and trap for an exception if one occurs.
Checking for Streaming SIMD Extensions 2 Support
Checking for support of SSE2 is like checking for SSE support. You must also check whether your operating system (OS) supports SSE. The OS requirements for SSE2 support are the same as the requirements for SSE.
Checking for Streaming SIMD Extensions 3 Support
SSE3 includes 13 instructions, 11 of which are suited for SIMD or x87 style programming. Checking for support of these SSE3 instructions is similar to checking for SSE support.
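The processor-side checks can be sketched in C as follows. This is a minimal sketch, not the manual's own examples (which are not reproduced in this excerpt): it assumes a GCC or Clang toolchain on an x86 target, whose cpuid.h header provides __get_cpuid(), and the helper names are hypothetical. The bit positions are the cpuid function 1 feature flags: MMX is EDX bit 23, SSE is EDX bit 25, SSE2 is EDX bit 26, and SSE3 is ECX bit 0.

```c
#include <cpuid.h>  /* GCC/Clang header (assumption); provides __get_cpuid() */

/* Feature flags returned by cpuid function 1. */
#define FEATURE_EDX_MMX   (1u << 23)
#define FEATURE_EDX_SSE   (1u << 25)
#define FEATURE_EDX_SSE2  (1u << 26)
#define FEATURE_ECX_SSE3  (1u << 0)

/* Read the function 1 feature words; returns 0 if the function is unavailable. */
static int read_features(unsigned int *ecx, unsigned int *edx)
{
    unsigned int eax, ebx;
    return __get_cpuid(1, &eax, &ebx, ecx, edx);
}

/* Each helper returns 1 if the processor reports the feature, else 0. */
int cpu_has_mmx(void)
{
    unsigned int ecx, edx;
    return read_features(&ecx, &edx) && (edx & FEATURE_EDX_MMX) != 0;
}

int cpu_has_sse(void)
{
    unsigned int ecx, edx;
    return read_features(&ecx, &edx) && (edx & FEATURE_EDX_SSE) != 0;
}

int cpu_has_sse2(void)
{
    unsigned int ecx, edx;
    return read_features(&ecx, &edx) && (edx & FEATURE_EDX_SSE2) != 0;
}

int cpu_has_sse3(void)
{
    unsigned int ecx, edx;
    return read_features(&ecx, &edx) && (ecx & FEATURE_ECX_SSE3) != 0;
}
```

These helpers cover only the processor half of the check. Operating system support still has to be established separately, as the text describes.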
Example 3-6 Identification of SSE3 with cpuid
SSE3 requires the same support from the operating system as SSE. To find out whether the operating system supports SSE3 (FISTTP and 10 of the SIMD instructions in SSE3), execute an SSE3 instruction and trap for an exception if one occurs.
Example 3-7 Identification of SSE3 by the OS
Considerations for Code Conversion to SIMD Programming
The VTune Performance Enhancement Environment CD provides tools to aid in the evaluation and tuning. But before implementing them, you need answers to the following questions: 1.
Figure 3-1 Converting to Streaming SIMD Extensions Chart (flowchart; recoverable steps: identify hot spots in code; does the code benefit from SIMD?; integer or floating-point?)
To use any of the SIMD technologies optimally, you must evaluate the following situations in your code:
• fragments that are computationally intensiv.
specific optimizations. Where appropriate, the coach displays pseudo-code to suggest the use of highly optimized intrinsics and functions in the Intel® Performance Library Suite.
costly application processing time. However, these routines have potential for increased performance when you convert them to use one of the SIMD technologies.
Coding Methodologies
Software developers need to compare the performance improvement that can be obtained from assembly code versus the cost of those improvements.
The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the SSE. The same techniques may be used for single-precision floating-point, double-precision floating-point, and integer data under SSE2, SSE, and MMX technology.
Assembly
Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) in C/C++ code. The Intel compiler or assembler recognizes the new instructions and registers, then directly generates the corresponding code.
SIMD Extensions 2 integer SIMD and __m128d is used for double precision floating-point SIMD. These types enable the programmer to choose the implementation of an algorithm directly, while allowing the compiler to perform register allocation and instruction scheduling where possible.
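As a small illustration of the intrinsic data types, the following sketch adds two four-element float arrays with a single packed SSE addition. The function name add4 is hypothetical, and unaligned loads and stores are used so that the sketch makes no alignment assumption about its arguments; it is not the manual's own example.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* c[i] = a[i] + b[i] for i = 0..3, computed as one packed addition. */
void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);          /* load four floats from a */
    __m128 vb = _mm_loadu_ps(b);          /* load four floats from b */
    _mm_storeu_ps(c, _mm_add_ps(va, vb)); /* addps, then store four results */
}
```

The programmer picks the instruction (a packed add), while the compiler decides which XMM registers hold va and vb and how the loads are scheduled.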
The intrinsic data types, however, are not basic ANSI C data types, and therefore you must observe the following usage restrictions:
• Use intrinsic data types only on the left-hand side of an assignment as a return value or as a parameter.
Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The “+” and “=” operators are overloaded so that the actual Streaming SIMD Extensions implementation in the previous example is abstracted out, or hidden, from the developer.
The caveat to this is that only certain types of loops can be automatically vectorized, and in most cases user interaction with the compiler is needed to fully enable this. Example 3-12 shows the code for automatic vectorization for the simple four-iteration loop (from Example 3-8).
Stack and Data Alignment
To get the most performance out of code written for SIMD technologies, data should be formatted in memory according to the guidelines described in this section. Assembly code with unaligned accesses is a lot slower than code with aligned accesses.
By adding the padding variable pad, the structure is now 8 bytes, and if the first element is aligned to 8 bytes (64 bits), all following elements will also be aligned.
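The effect of the padding member can be sketched as follows. The structure in the manual's example is not visible in this excerpt, so the member names and types here are assumptions chosen for illustration (three 16-bit fields and one byte, with short assumed to be 2 bytes wide).

```c
#include <stddef.h>

/* Illustrative layout: 7 bytes of payload plus one explicit padding byte. */
typedef struct {
    short x, y, z;
    char  a;
    char  pad;   /* explicit padding: brings the structure to 8 bytes */
} Point;

/* C11 compile-time check that the layout is what the sketch assumes. */
_Static_assert(sizeof(Point) == 8, "Point is expected to occupy 8 bytes");

/* Byte offset of element i in an array of Point; always a multiple of 8. */
size_t point_offset(size_t i)
{
    return i * sizeof(Point);
}
```

Because every element is exactly 8 bytes, an array whose first element starts on an 8-byte boundary keeps all later elements on 8-byte boundaries as well.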
Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation on the first data element will be fully aligned. For the second data element, however, access to the data vector will be misaligned.
• Functions that use Streaming SIMD Extensions or Streaming SIMD Extensions 2 data need to provide a 16-byte aligned stack frame.
• The __m128* parameters need to be aligned to 16-byte boundaries, possibly creating “holes” (due to padding) in the argument block.
Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries. When the data is accessed frequently, this can provide a significant performance improvement.
The __declspec(align(16)) specifications can be placed before data declarations to force 16-byte alignment. This is particularly useful for local or global data declarations that are assigned to 128-bit data types.
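A portable way to express the same request is sketched below. Note the substitution: __declspec(align(16)) is the Microsoft/Intel compiler spelling, while this sketch uses the C11 alignas specifier from stdalign.h so that it builds with any C11 compiler. The array name and helper functions are hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>

/* A 16-byte aligned global, suitable for loading into a 128-bit register. */
static alignas(16) float coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};

/* Returns 1 if p lies on a 16-byte boundary, else 0. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Accessor so that callers can inspect the address of the aligned data. */
const float *aligned_coeffs(void)
{
    return coeffs;
}
```

A check such as is_aligned16() is useful in debug builds before using aligned load instructions, which fault on misaligned addresses.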
In C++ (but not in C) it is also possible to force the alignment of a class/struct/union type, as in the code that follows:
struct __declspec(align(.
Improving Memory Utilization
Memory performance can be improved by rearranging data and algorithms for SSE2, SSE, and MMX technology intrinsics.
There are two options for computing data in AoS format: perform the operation on the data as it stands in AoS format, or re-arrange it (swizzle it) into SoA format dynamically. See Example 3-16 for code samples of each option based on a dot-product computation.
Performing SIMD operations on the original AoS format can require more calculations and some of the operations do not take advantage of all of the SIMD elements available. Therefore, this option is generally less efficient.
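The two layouts can be contrasted with a scalar sketch of the dot-product computation. This is not the manual's Example 3-16; the type names, the array size NUM_VERTS, and the function names are assumptions made for illustration.

```c
#include <stddef.h>

#define NUM_VERTS 4   /* illustrative size */

/* Array of structures: x, y, z of one vertex are adjacent in memory. */
typedef struct { float x, y, z; } VertexAoS;

/* Structure of arrays: all x values are adjacent, then all y, then all z. */
typedef struct {
    float x[NUM_VERTS];
    float y[NUM_VERTS];
    float z[NUM_VERTS];
} VerticesSoA;

/* Dot product of every vertex with a fixed vector (fx, fy, fz), AoS layout. */
void dot_aos(const VertexAoS *v, float fx, float fy, float fz, float *out)
{
    for (size_t i = 0; i < NUM_VERTS; i++)
        out[i] = v[i].x * fx + v[i].y * fy + v[i].z * fz;
}

/* Same computation on SoA data: each term walks one contiguous array, so
   four consecutive i values map directly onto one 4-wide SIMD register. */
void dot_soa(const VerticesSoA *v, float fx, float fy, float fz, float *out)
{
    for (size_t i = 0; i < NUM_VERTS; i++)
        out[i] = v->x[i] * fx + v->y[i] * fy + v->z[i] * fz;
}
```

In the AoS version a 4-wide register loaded from one vertex holds x, y, z and one unused slot, and the three products must then be summed across the register. In the SoA version every slot carries useful work and no horizontal sum is needed.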
but is somewhat inefficient as there is the overhead of extra instructions during computation. Performing the swizzle statically, when the data structures are being laid out, is best as there is no runtime overhead.
Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation that uses arrays x, y, and z in Example 3-15 would require three separate data streams.
Strip Mining
Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance.
The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a transformation routine to update some data, then calls the lighting routine to further work on the data.
In Example 3-19, the computation has been strip-mined to a size strip_size. The value strip_size is chosen such that strip_size elements of array v[Num] fit into the cache hierarchy.
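The transformation can be sketched in C as follows. This is not Example 3-19 itself: the two stage functions are placeholders standing in for the transformation and lighting routines, and all names other than strip_size are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder stages standing in for the transformation and lighting routines. */
static void transform(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] *= 2.0f;
}

static void light(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] += 1.0f;
}

/* Run both stages over v[0..num) one strip at a time, so that the data
   touched by transform() is still cached when light() reads it. */
void process_strip_mined(float *v, size_t num, size_t strip_size)
{
    assert(strip_size > 0);
    for (size_t start = 0; start < num; start += strip_size) {
        size_t remaining = num - start;
        size_t len = remaining < strip_size ? remaining : strip_size;
        transform(v + start, len);
        light(v + start, len);
    }
}
```

Without strip mining, transform() would sweep the whole array before light() starts, and for large arrays the early elements would already have been evicted by then.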
For the first iteration of the inner loop, each access to array B will generate a cache miss. If the size of one row of array A, that is, A[2, 0:MAX-1], is large enough, by the time the second iteration starts, each access to array B will always generate a cache miss.
This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 3-3, a block_size is selected as the loop blocking factor. Suppose that block_size is 8, then the blocked chunk of each array will be eight cache lines (32 bytes each).
As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. If MAX is huge, loop blocking can also help reduce the penalty from DTLB (data translation look-aside buffer) misses.
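A blocked loop nest of this kind might look like the sketch below. The loop body (A[i][j] plus the transposed element B[j][i]) is an assumption about the kind of access pattern being discussed, since the manual's example is not reproduced in this excerpt; MAX and BLOCK_SIZE are illustrative values, and BLOCK_SIZE is assumed to divide MAX evenly.

```c
#include <stddef.h>

#define MAX 16          /* illustrative matrix dimension */
#define BLOCK_SIZE 8    /* illustrative blocking factor; assumed to divide MAX */

/* A[i][j] += B[j][i], processed in BLOCK_SIZE x BLOCK_SIZE tiles so that
   the column-wise walk through B stays within a small set of cache lines. */
void add_transposed_blocked(float A[MAX][MAX], float B[MAX][MAX])
{
    for (size_t ii = 0; ii < MAX; ii += BLOCK_SIZE)
        for (size_t jj = 0; jj < MAX; jj += BLOCK_SIZE)
            for (size_t i = ii; i < ii + BLOCK_SIZE; i++)
                for (size_t j = jj; j < jj + BLOCK_SIZE; j++)
                    A[i][j] += B[j][i];
}
```

Within one tile, the rows of B that are touched are reused for every i in the tile before the loop moves on, which is what removes the redundant misses.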
Note that this can be applied to both SIMD integer and SIMD floating-point code. If there are multiple consumers of an instance of a register, group the consumers together as closely as possible. However, the consumers should not be scheduled near the producer.
Recommendation: When targeting code generation for Intel Core Solo and Intel Core Duo processors, favor instructions consisting of two micro-ops over those with more than two micro-ops.
4 Optimizing for SIMD Integer Applications
The SIMD integer instructions provide performance improvements in applications that are integer-intensive and can take advantage of the SIMD architecture of Pentium 4, Intel Xeon, and Pentium M processors.
For planning considerations of using the new SIMD integer instructions, refer to “Checking for Streaming SIMD Extensions 2 Support” in Chapter 3.
Using SIMD Integer with x87 Floating-point
All 64-bit SIMD integer instructions use the MMX registers, which share register state with the x87 floating-point stack. Because of this sharing, certain rules and considerations apply.
Using emms clears all of the valid bits, effectively emptying the x87 floating-point stack and making it ready for new x87 floating-point operations. The emms instruction ensures a clean transition between using operations on the MMX registers and using operations on the x87 floating-point stack.
• Don't empty when already empty: If the next instruction uses an MMX register, _mm_empty() incurs a cost with no benefit.
• Group Instructions: Try to partition regions that use x87 FP instructions from those that use 64-bit SIMD integer instructions.
Data Alignment
Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned. Referencing unaligned 64-bit SIMD integer data can incur a performance penalty due to accesses that span 2 cache lines.
Signed Unpack
Signed numbers should be sign-extended when unpacking the values. This is similar to the zero-extend shown above except that the psrad instruction (packed shift right arithmetic) is used to effectively sign extend the values.
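The signed unpack can be sketched with SSE2 intrinsics. This is a 128-bit adaptation written for illustration, not the manual's MMX listing: each word is interleaved with itself so that it lands in the upper half of a doubleword, and an arithmetic right shift by 16 (psrad) then replicates the sign bit. The function name is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sign-extend eight packed 16-bit values src[0..7] to 32 bits in dst[0..7]. */
void sign_extend_words(const int16_t *src, int32_t *dst)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)src);

    /* punpcklwd / punpckhwd of v with itself: every doubleword now holds
       the same word in both its upper and lower halves. */
    __m128i lo = _mm_unpacklo_epi16(v, v);   /* words 0..3 */
    __m128i hi = _mm_unpackhi_epi16(v, v);   /* words 4..7 */

    /* psrad by 16: the upper copy moves down and its sign bit fills the top. */
    lo = _mm_srai_epi32(lo, 16);
    hi = _mm_srai_epi32(hi, 16);

    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)(dst + 4), hi);
}
```

Replacing the arithmetic shift with a logical shift (psrld) would give the zero-extending variant instead.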
Interleaved Pack with Saturation
The pack instructions pack two values into the destination register in a predetermined order.
Figure 4-2 illustrates two values interleaved in the destination register, and Example 4-4 shows code that uses the operation. The two signed doublewords are used as source operands and the result is interleaved signed words.
The pack instructions always assume that the source operands are signed numbers. The result in the destination register is always defined by the pack instruction that performs the operation.
Non-Interleaved Unpack
The unpack instructions perform an interleave merge of the data elements of the destination and source operands into the destination register. The following example merges the two operands into the destination registers without interleaving.
The other destination register will contain the opposite combination illustrated in Figure 4-4. Code in Example 4-6 unpacks two packed-word sources in a non-interleaved way.
Extract Word
The pextrw instruction takes the word in the designated MMX register selected by the two least significant bits of the immediate value and moves it to the lower half of a 32-bit integer register; see Figure 4-5 and Example 4-7.
Insert Word
The pinsrw instruction loads a word from the lower half of a 32-bit integer register or from memory and inserts it in the MMX technology destination register at a position defined by the two least significant bits of the immediate constant.
If all of the operands in a register are being replaced by a series of pinsrw instructions, it can be useful to clear the content and break the dependence chain by either using the pxor instruction or loading the register.
Move Byte Mask to Integer
The pmovmskb instruction returns a bit mask formed from the most significant bits of each byte of its source operand. When used with the 64-bit MMX registers, this produces an 8-bit mask, zeroing out the upper 24 bits in the destination register.
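The behavior of the 64-bit form can be modeled in portable C. The sketch below is a scalar model for illustration only, with a hypothetical function name; real code would use the instruction itself.

```c
#include <stdint.h>

/* Scalar model of pmovmskb on a 64-bit source: bit i of the result is the
   most significant bit of byte i of src; bits 8..31 of the result are zero. */
uint32_t movemask_bytes64(uint64_t src)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t msb = (uint32_t)((src >> (8 * i + 7)) & 1u);
        mask |= msb << i;
    }
    return mask;
}
```

A typical use is to follow a packed compare with pmovmskb and then test the resulting integer mask with ordinary branch instructions.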
Figure 4-7 pmovmskb Instruction Example (diagram: 64-bit MM source register, 32-bit R32 destination register)
Example 4-10 pmovmskb Instruction Code
; Input:
; source value
; Output:
; 32-bit register containing the byte mask in the lower
; eight bits
;
movq mm0, [edi]
pmovmskb eax, mm0
Packed Shuffle Word for 64-bit Registers
The pshufw instruction (see Figure 4-8, Example 4-11) uses the immediate (imm8) operand to select between the four words in either two MMX registers or one MMX register and a 64-bit memory location.
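How the immediate selects words can be shown with a scalar model. In the 64-bit shuffle, bits 2i+1:2i of imm8 give the index of the source word that is copied to destination word i. The function below is a portable illustration with a hypothetical name, not a replacement for the instruction.

```c
#include <stdint.h>

/* Scalar model of the 64-bit packed word shuffle: destination word i is
   the source word whose index is stored in bits 2i+1:2i of imm8. */
uint64_t shuffle_words64(uint64_t src, uint8_t imm8)
{
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel  = (imm8 >> (2 * i)) & 3u;         /* source word index */
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu;  /* extract that word */
        dst |= word << (16 * i);                        /* place it in slot i */
    }
    return dst;
}
```

For example, imm8 = 0xE4 (binary 11 10 01 00) leaves the words in place, 0x1B (binary 00 01 10 11) reverses them, and 0x00 broadcasts word 0 to all four positions.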
Packed Shuffle Word for 128-bit Registers
The pshuflw/pshufhw instruction performs a full shuffle of any source word field within the low/high 64 .
Unpacking/interleaving 64-bit Data in 128-bit Registers
The punpcklqdq/punpckhqdq instructions interleave the low/high-order 64 bits of the source operand and the low/high-order 64 bits of the destination operand and write them to the destination register.
Data Movement
There are two additional instructions to enable data movement from the 64-bit SIMD integer registers to the 128-bit SIMD registers. The movq2dq instruction moves the 64-bit integer data from an MMX register (source) to a 128-bit destination register.
pxor MM0, MM0
pcmpeq MM1, MM1
psubb MM0, MM1 [psubw MM0, MM1] (psubd MM0, MM1)
; three instructions above generate
; the constant 1 in every
; packed-by.
Building Blocks
This section describes instructions and algorithms which implement common code building blocks efficiently.
Absolute Difference of Unsigned Numbers
Example 4-16 computes the absolute difference of two unsigned numbers.
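Example 4-16 is not reproduced in this excerpt. One well-known branch-free way to obtain an unsigned absolute difference, sketched here per element in scalar C, is to subtract with unsigned saturation in both directions and OR the two results. Whether this matches the manual's listing exactly is an assumption, and the function names are hypothetical.

```c
#include <stdint.h>

/* Unsigned saturating subtract: the scalar analogue of one psubusb lane. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* |a - b| for unsigned bytes: one of the two saturating differences is
   always zero, so OR-ing them yields the non-zero one. */
uint8_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(sub_sat_u8(a, b) | sub_sat_u8(b, a));
}
```

In packed form this corresponds to two saturating subtracts and one por, applied to all elements of the register at once.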
Absolute Difference of Signed Numbers
Example 4-17 computes the absolute difference of two signed numbers. The technique used here is to first sort the corresponding elements of the input operands into packed words of the maximum values, and packed words of the minimum values.
Absolute Value
Use Example 4-18 to compute |x|, where x is signed. This example assumes signed words to be the operands.
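Example 4-18 is not shown in this excerpt. A common formulation, sketched per element in scalar C below, computes |x| as the signed maximum of x and its negation, which maps onto a packed subtract from zero followed by pmaxsw. Treat the correspondence with the manual's listing as an assumption; the function name is hypothetical.

```c
#include <stdint.h>

/* |x| for a signed word as max(x, -x). The value -32768 has no positive
   counterpart in 16 bits; as in wrapping packed arithmetic, it is
   returned unchanged. */
int16_t abs_word(int16_t x)
{
    if (x == INT16_MIN)
        return x;                 /* a packed psubw would wrap back to -32768 */
    int16_t neg = (int16_t)-x;
    return x > neg ? x : neg;     /* pmaxsw-style signed maximum */
}
```

Callers that cannot tolerate the -32768 case need to clip the input beforehand or widen to 32 bits.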
Clipping to an Arbitrary Range [high, low]
This section explains how to clip values to a range [high, low]. Specifically, if the value is less than low or greater than high, then clip to low or high, respectively.
Highly Efficient Clipping
For clipping signed words to an arbitrary range, the pmaxsw and pminsw instructions may be used. For clipping unsigned bytes to an arbitrary range, the pmaxub and pminub instructions may be used.
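For signed words, the clipping can be sketched with SSE2 intrinsics as one pmaxsw against the lower bound followed by one pminsw against the upper bound. This is an illustrative 128-bit sketch with a hypothetical function name, and it assumes low is not greater than high.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clip eight signed words in place to the range [low, high]. */
void clip_words(int16_t *v, int16_t low, int16_t high)
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    x = _mm_max_epi16(x, _mm_set1_epi16(low));    /* pmaxsw: raise values below low */
    x = _mm_min_epi16(x, _mm_set1_epi16(high));   /* pminsw: lower values above high */
    _mm_storeu_si128((__m128i *)v, x);
}
```

Two instructions handle all eight elements with no branches. The unsigned byte case has the same shape using _mm_max_epu8 and _mm_min_epu8.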
IA-32 Intel® Ar chitectur e Optimization 4-28 The code above converts values to un signed numbers first and then clips them to an unsigned range. The last in struction converts the data back to signed data and places the data with in the signed range.
Optimizing for SIMD Integer Applications 4 4-29 packed-subtract instructions with unsigned saturation, thus this technique can only be used on p acked-bytes and packed-words data types.
Unsigned Byte
The pmaxub instruction returns the maximum between the eight unsigned bytes in either two SIMD registers, or one SIMD register and a memory location.
The subtraction operation presented above is an absolute difference, that is, t = abs(x-y). The byte values are stored in temporary space, all values are summed together, and the result is written into the lower word of the destination register.
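The sum-of-absolute-differences operation described above can be modeled in a few lines of portable C. This sketch assumes eight unsigned byte lanes (the 64-bit register case) and a 16-bit result; the function name sad_u8x8 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences over eight unsigned bytes.
   The largest possible sum is 8 * 255 = 2040, so it fits in a word. */
static uint16_t sad_u8x8(const uint8_t x[8], const uint8_t y[8])
{
    assert(x != NULL && y != NULL);
    uint16_t sum = 0;
    for (size_t i = 0; i < 8; ++i) {
        int d = (int)x[i] - (int)y[i];      /* t = x - y            */
        sum = (uint16_t)(sum + (d < 0 ? -d : d)); /* accumulate abs(t) */
    }
    return sum;
}
```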
The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on packed unsigned words.
Complex Multiply by a Constant
Complex multiplication is an operation which requires four multiplications and two additions.
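The four multiplications and two additions can be seen directly in a scalar version. The sketch below assumes 16-bit inputs and 32-bit outputs; the struct and function names are hypothetical and the code is only a model of the arithmetic, not the manual's example.

```c
#include <stdint.h>

typedef struct { int16_t re, im; } cplx16;  /* 16-bit input element  */
typedef struct { int32_t re, im; } cplx32;  /* 32-bit output element */

/* (x.re + i*x.im) * (c.re + i*c.im), with c held constant by the caller.
   Assumption: the four inputs are not all -32768, since that single
   combination would overflow the 32-bit sum. */
static cplx32 cmul_const(cplx16 x, cplx16 c)
{
    cplx32 r;
    r.re = (int32_t)x.re * c.re - (int32_t)x.im * c.im; /* 2 multiplies, 1 add */
    r.im = (int32_t)x.re * c.im + (int32_t)x.im * c.re; /* 2 multiplies, 1 add */
    return r;
}
```

For example, (1 + 2i)(3 + 4i) gives -5 + 10i. Each output component is a sum of two 16-bit products, which is the shape a multiply-and-add instruction on packed words produces.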
Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input).
Memory Optimizations
You can improve memory accesses using the following techniques:
• Avoiding partial memory accesses
• Increasing the bandwidth of memory fills and video fills
• Prefetching data with Streaming SIMD Extensions (see Chapter 6, “Optimizing Cache Usage”).
Partial Memory Accesses
Consider a case with a large load after a series of small stores to the same area of memory (beginning at memory address mem). The large load will stall in this case as shown in Example 4-24.
Let us now consider a case with a series of small loads after a large store to the same area of memory (beginning at memory address mem) as shown in Example 4-26. Most of the small loads will stall because they are not aligned with the store; see “Store Forwarding” in Chapter 2 for more details.
These transformations, in general, increase the number of instructions required to perform the desired operation.
SSE3 provides an instruction LDDQU for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested.
Increasing Bandwidth of Memory Fills and Video Fills
It is beneficial to understand how memory is accessed and filled.
same DRAM page have shorter latencies than sequential accesses to different DRAM pages. In many systems the latency for a page miss (that is, an acces.
aligned versions; this can reduce the performance gains when using the 128-bit SIMD integer extensions. The general guidelines on the alignment of memory operands are:
— The greatest performance gains can be achieved when all memory streams are 16-byte aligned.
Packed SSE2 Integer versus MMX Instructions
In general, 128-bit SIMD integer instructions should be favored over 64-bit MMX instructions on Intel Core Solo and Intel Core Duo processors.
Chapter 5 Optimizing for SIMD Floating-point Applications
This chapter discusses general rules of optimizing for the single-instruction, multiple-data (SIMD) floating-point instructions available in Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3).
• Use MMX technology instructions and registers for copying data that is not used later in SIMD floating-point computations.
• Use the reciprocal instructions followed by iteration for increased accuracy.
• Is the data arranged for efficient utilization of the SIMD floating-point registers?
• Is this application targeted for processors without SIMD floating-point instructions?
For more details, see the section on “Considerations for Code Conversion to SIMD Programming” in Chapter 3.
When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector form. However, all of the optimizations regarding alignment, scheduling, instruction selection, and other optimizations covered in Chapter 2 and Chapter 3 should be observed.
For some applications, e.g., 3D geometry, the traditional data arrangement requires some changes to fully utilize the SIMD registers and parallel techniques. Traditionally, the data layout has been an array of structures (AoS).
simultaneously referred to as an xyz data representation, see the diagram below) are computed in parallel, and the array is updated one vertex at a time.
To utilize all 4 computation slots, the vertex data can be reorganized to allow computation on each component of 4 separate vertices, that is, processing multiple vectors simultaneously. This can also be referred to as an SoA form of representing vertices data shown in Table 5-1.
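The difference between the two layouts is easier to see in code. The sketch below is an illustration only: the type names VertexAoS and VerticesSoA and both functions are hypothetical, and the inner loops are written in scalar C so that each iteration stands in for one of the four SIMD lanes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } VertexAoS;  /* AoS: one vertex per struct */

typedef struct {                               /* SoA: four vertices per block */
    float x[4];
    float y[4];
    float z[4];
} VerticesSoA;

/* Repack four AoS vertices into one SoA block (xxxx, yyyy, zzzz). */
static void aos_to_soa4(const VertexAoS in[4], VerticesSoA *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < 4; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Dot product of four vertices with a fixed vector (fx, fy, fz).
   With SoA data, every lane does useful work and no horizontal
   add across a register is needed. */
static void dot4(const VerticesSoA *v, float fx, float fy, float fz,
                 float out[4])
{
    assert(v != NULL && out != NULL);
    for (size_t i = 0; i < 4; ++i)
        out[i] = v->x[i] * fx + v->y[i] * fy + v->z[i] * fz;
}
```

In the SIMD form, dot4 becomes three multiplies and two adds on whole registers and yields four results at once, which is the motivation for the SoA arrangement.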
Figure 5-2 shows how 1 result would be computed for 7 instructions if the data were organized as AoS and using SSE alone: 4 results would require 28 instructions.
Now consider the case when the data is organized as SoA. Example 5-2 demonstrates how 4 results are computed for 5 instructions.
To gather data from 4 different memory locations on the fly, follow these steps:
1. Identify the first half of the 128-bit memory location.
2. Group the different halves together using the movlps and movhps to form an xyxy layout in two registers.
y1 x1
movhps xmm7, [ecx+16] // xmm7 = y2 x2 y1 x1
movlps xmm0, [ecx+32] // xmm0 = -- -- y3 x3
movhps xmm0, [ecx+48] // xmm0 = y4 x4 y3 x3
movaps.
Example 5-4 shows the same data-swizzling algorithm encoded using the Intel C++ Compiler’s intrinsics for SSE.
Although the generated result of all zeros does not depend on the specific data contained in the source operand (that is, XOR of a register with itself always produces all zeros), the instruction cannot execute until the instruction that generates xmm0 has completed.
Data Deswizzling
In the deswizzle operation, we want to arrange the SoA format back into AoS format so the xxxx, yyyy, zzzz are rearranged and stored in memory as xyz.
You may have to swizzle data in the registers, but not in memory. This occurs when two different functions need to process the data in different layouts. In lighting, for example, data comes as rrrr gggg bbbb aaaa, and you must deswizzle them into rgba before converting into integers.
// Start deswizzling here
movaps xmm7, xmm4 // xmm7= a1 a2 a3 a4
movhlps xmm7, xmm3 // xmm7= b3 b4 a3 a4
movaps xmm6, xmm2 // xmm6= g1 g2 g3 g4
movlhps x.
Using MMX Technology Code for Copy or Shuffling Functions
If there are some parts in the code that are mainly copying, shuffling, or doing logical manipulations that do not require use of SSE code, consider performing these actions with MMX technology code.
Example 5-8 illustrates how to use MMX technology code for copying or shuffling.
Horizontal ADD Using SSE
Although vertical computations use the SIMD performance better than horizontal computations do, in some cases, the code must use a horizontal operation.
Figure 5-3 Horizontal Add Using movhlps/movlhps
Example 5-9 Horizontal Add Using movhlps/movlhps
void horiz_add(Vertex_soa *in, float *out) { .
// START HORIZONTAL ADD
movaps xmm5, xmm0 // xmm5= A1,A2,A3,A4
movlhps xmm5, xmm1 // xmm5= A1,A2,B1,B2
movhlps xmm1, xmm0 // xmm1= A3,A4,B3,B4
addps xmm5.
Use of cvttps2pi/cvttss2si Instructions
The cvttps2pi and cvttss2si instructions encode the truncate/chop rounding mode implicitly in the instruction, thereby taking precedence over the rounding mode specified in the MXCSR register.
avoided since there is a penalty associated with writing this register; typically, through the use of the cvttps2pi and cvttss2si instructions, the rounding control in MXCSR can always be set to round-nearest.
SSE3 and Complex Arithmetics
The flexibility of SSE3 in dealing with AOS-type of data structure can be demonstrated by the example of multiplication and division of complex numbers. For example, a complex number can be stored in a structure consisting of its real and imaginary part.
instructions to perform multiplications of single-precision complex numbers. Example 5-12 demonstrates using SSE3 instructions to perform division of complex numbers. In both of these examples, the complex numbers are stored in arrays of structures.
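As a reference point for the arithmetic that such a division has to perform, the following scalar C sketch divides complex numbers held in an array of structures. It is not a translation of the manual's SSE3 code; the type cplxf and the functions cdiv and cdiv_array are hypothetical names, and a nonzero divisor is assumed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplxf;  /* AoS element: real, imaginary */

/* (a + ib) / (c + id) = ((ac + bd) + i(bc - ad)) / (c*c + d*d) */
static cplxf cdiv(cplxf n, cplxf d)
{
    float den = d.re * d.re + d.im * d.im;
    assert(den != 0.0f);                 /* assumption: divisor is nonzero */
    cplxf q;
    q.re = (n.re * d.re + n.im * d.im) / den;
    q.im = (n.im * d.re - n.re * d.im) / den;
    return q;
}

/* Element-wise division over two arrays of structures. */
static void cdiv_array(const cplxf *num, const cplxf *den, cplxf *out,
                       size_t count)
{
    for (size_t k = 0; k < count; ++k)
        out[k] = cdiv(num[k], den[k]);
}
```

The numerator mixes sums and differences of cross products of neighboring elements, which is the access pattern the text associates with AoS data.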
Example 5-12 Division of Two Pairs of Single-precision Complex Numbers
// Division of (ak + i bk ) / (ck + i dk )
movshdup xmm0, Src1 ; load imaginary parts into the
                    ; destination, b1, b1, b0, b0
movaps xmm1, src2   ; load the 2nd pair of complex values,
                    ; i.
SSE3 and Horizontal Computation
Sometimes the AOS type of data organization is more natural in many algebraic formulas. SSE3 enhances the flexibility of SIMD programming for applications that rely on the horizontal computation model.
SIMD Optimizations and Microarchitectures
Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel NetBurst® microarchitecture. The following sub-section discusses optimizing SIMD code that targets Intel Core Solo and Intel Core Duo processors.
When targeting complex arithmetics on Intel Core Solo and Intel Core Duo processors, using single-precision SSE3 instructions can deliver higher performance than alternatives.
Chapter 6 Optimizing Cache Usage
Over the past decade, processor speed has increased more than ten times. Memory access speed has increased at a slower pace.
• Memory Optimization Using Hardware Prefetching, Software Prefetch and Cacheability Instructions: discusses techniques for implementing memory optimizations using the above instructions.
• Using deterministic cache parameters to manage cache hierarchy.
• Facilitate compiler optimization:
— Minimize use of global variables and pointers
— Minimize use of complex control flow
— Use the const modifier, avoid register modifier
— Choose data types carefully (see below) and avoid type casting.
• Optimize software prefetch scheduling distance:
— Far ahead enough to allow interim computation to overlap memory access time.
— Near enough that the prefetched data is not replaced from the data cache.
3. Follows only one stream per 4K page (load or store)
4. Can prefetch up to 8 simultaneous independent streams from eight different 4K regions
5. Does not prefetch across 4K boundary; note that this is independent of paging modes
6.
Data reference patterns can be classified as follows:
Temporal: data will be used again soon
Spatial: data will be used in adjacent locations, for example,.
The prefetch instruction is implementation-specific; applications need to be tuned to each implementation to maximize performance.
The Prefetch Instructions – Pentium 4 Processor Implementation
Streaming SIMD Extensions include four flavors of prefetch instructions, one non-temporal, and three temporal. They correspond to two types of operations, temporal and non-temporal.
Currently, the prefetch instruction provides a greater performance gain than preloading because it:
• has no destination register, it only updates cache lines.
• does not stall the normal instruction retirement.
• does not affect the functional behavior of the program.
The Non-temporal Store Instructions
This section describes the behavior of streaming stores and reiterates some of the information presented in the previous section.
• Reduce disturbance of frequently used cached (temporal) data, since they write around the processor caches.
Streaming stores allow cross-aliasing of memory types for a given memory region.
evicting data from all processor caches). The Pentium M processor implements a combination of both approaches. If the streaming store hits a line that is present in the first-level cache, the store data is combined in place within the first-level cache.
possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility.
Streaming Store Usage Models
The two primary usage domains for streaming store are coherent requests and non-coherent requests.
In case the region is not mapped as WC, the streaming store might update in place in the cache and a subsequent sfence would not result in the data being written to system memory.
The maskmovq/maskmovdqu (non-temporal byte mask store of packed integer in an MMX technology or Streaming SIMD Extensions register) instructions store data from a register to the location specified by the edi register.
The degree to which a consumer of data knows that the data is weakly-ordered can vary for these cases. As a result, the sfence instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume this data.
The clflush Instruction
The cache line associated with the linear address specified by the value of byte address is invalidated from all levels of the processor cache hierarchy (data and instruction). The invalidation is broadcast throughout the coherence domain.
Memory Optimization Using Prefetch
The Pentium 4 processor has two mechanisms for data prefetch: software-controlled prefetch and an automatic hardware prefetch.
Hardware Prefetch
The automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. The automatic hardware prefetcher will attempt to prefetch two cache lines ahead of the prefetch stream.
• May consume extra system bandwidth if the application’s memory traffic has significant portions with strides of cache misses greater than the trigger distance threshold of hardware prefetch (large-stride memory traffic).
Example 6-2 Populating an Array for Circular Pointer Chasing with Constant Stride
register char ** p;
char *next;
// Populating pArray for circular pointer
// chasing .
Example of Latency Hiding with S/W Prefetch Instruction
Achieving the highest level of memory optimization using prefetch instructions requires an understanding of the microarchitecture and system architecture of a given machine.
execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution units are processing vertices. This scenario severely decreases the advantage of having a decoupled architecture.
The performance loss caused by poor utilization of resources can be completely eliminated by scheduling the prefetch instructions appropriately. As shown in Figure 6-3, prefetch instructions are issued two vertex iterations ahead.
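Issuing a prefetch a fixed number of iterations ahead can be sketched in C as follows. The example assumes a GCC- or Clang-compatible compiler, where __builtin_prefetch is available, and falls back to a no-op elsewhere. PREFETCH_DISTANCE and FLOATS_PER_ITER are hypothetical tuning constants chosen for illustration (two iterations ahead, one 64-byte line of floats per iteration); real values have to be measured on the target processor.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((addr))
#else
#define PREFETCH(addr) ((void)(addr))   /* no-op on other compilers */
#endif

#define PREFETCH_DISTANCE 2   /* iterations ahead (assumed value)       */
#define FLOATS_PER_ITER   16  /* 64 bytes of floats per iteration       */

/* Sum n_iters blocks of FLOATS_PER_ITER floats. While block i is being
   consumed, the block PREFETCH_DISTANCE iterations ahead is requested,
   so its memory latency overlaps with the computation on block i. */
static float sum_with_prefetch(const float *data, size_t n_iters)
{
    assert(data != NULL || n_iters == 0);
    float total = 0.0f;
    for (size_t i = 0; i < n_iters; ++i) {
        if (i + PREFETCH_DISTANCE < n_iters)
            PREFETCH(&data[(i + PREFETCH_DISTANCE) * FLOATS_PER_ITER]);
        for (size_t j = 0; j < FLOATS_PER_ITER; ++j)
            total += data[i * FLOATS_PER_ITER + j];
    }
    return total;
}
```

The prefetch only affects timing, never the result, so the function returns the same sum with or without it.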
• Balance single-pass versus multi-pass execution
• Resolve memory bank conflict issues
• Resolve cache management issues
The subsequent sections discuss all the above items.
lines of data per iteration. The PSD would need to be increased/decreased if more/less than two cache lines are used per iteration.
Software Prefetch Concatenation
Maximum performance can be achieved when execution pipeline is at maximum throughput, without incurring any memory latency penalties.
This memory de-pipelining creates inefficiency in both the memory pipeline and execution pipeline. This de-pipelining effect can be removed by applying a technique called prefetch concatenation. With this technique, the memory access and execution can be fully pipelined and fully utilized.
Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its associated outer loop.
Minimize Number of Software Prefetches
Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources, even though they require minimal clocks and memory bandwidth.
Figure 6-5 demonstrates the effectiveness of software prefetches in latency hiding. The X axis indicates the number of computation clocks per loop (each iteration is independent). The Y axis indicates the execution time measured in clocks per loop.
Figure 6-5 Memory Access Latency and Execution With Prefetch
[Figure not reproduced: chart for 2 load streams and 1 store stream; axes are computations per loop and effective loop latency.]
Mix Software Prefetch with Computation Instructions
It may seem convenient to cluster all of the prefetch instructions at the beginning of a loop body or before a loop, but this can lead to severe performance degradation.
Example 6-6 Spread Prefetch Instructions
NOTE. To avoid instruction execution stalls due to the over-utilization of the resource, prefetch instructions must be interspersed with computational instructions.
Software Prefetch and Cache Blocking Techniques
Cache blocking techniques, such as strip-mining, are used to improve temporal locality, and thereby cache hit rate. Strip-mining is a one-dimensional temporal locality optimization for memory.
In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache. Prefetch issues aside, this is the preferred situation.
Figure 6-7 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these scenarios.
In the scenario to the right in Figure 6-7, keeping the data in one way of the second-level cache does not improve cache locality.
Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the second pass, that is, the lighting loop. This causes under-utilization of cache lines fetched during the transformation loop as well as bandwidth wasted in the lighting loop.
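The structure of a strip-mined two-pass loop can be sketched in C. In this illustration the two passes are placeholder arithmetic (a scale standing in for transformation, an offset standing in for lighting); the function names and the strip parameter are hypothetical, and the right strip size depends on the cache being targeted.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder first pass (stands in for the transformation loop). */
static void transform_pass(float *v, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        v[i] *= 2.0f;
}

/* Placeholder second pass (stands in for the lighting loop). */
static void lighting_pass(float *v, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        v[i] += 1.0f;
}

/* Run both passes strip by strip, so the second pass finds the strip
   still in cache instead of re-fetching the whole array from memory. */
static void process_strip_mined(float *v, size_t n, size_t strip)
{
    assert(strip > 0);
    for (size_t start = 0; start < n; start += strip) {
        size_t len = (n - start < strip) ? n - start : strip;
        transform_pass(v + start, len);
        lighting_pass(v + start, len);
    }
}
```

The result is identical to running each pass over the full array; only the order of memory accesses changes.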
Table 6-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are:
• Do strip-mining: partition loops so that the dataset fits into second-level cache.
happen to be powers of 2, aliasing condition due to finite number of way-associativity (see “Capacity Limits and Aliasing in Caches” in Chapter 2) will exacerbate the likelihood of cache evictions.
references enables the hardware prefetcher to initiate bus requests to read some cache lines before the code actually references the linear addresses.
selected to ensure that the batch stays within the processor caches through all passes. An intermediate cached buffer is used to pass the batch of vertices from one stage or pass to the next one.
The choice of single-pass or multi-pass can have a number of performance implications. For instance, in a multi-pass pipeline, stages that are limited by bandwidth (either input or output) will reflect more of this performance limitation in overall execution time.
a line burst transaction. To achieve the best possible performance, it is recommended to align data along the cache line boundary and write them consecutively in a cache line size while using non-temporal stores.
The following examples of using prefetching instructions in the operation of video encoder and decoder, as well as in simple 8-byte memory copy, illustrate the performance gain from using the prefetching instructions for efficient cache management.
Later, the processor re-reads the data using prefetchnta, which ensures maximum bandwidth, yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of prefetch.
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:
• alignment of data
• proper layout of pages in memory
• cache size
• interaction of the transaction lookaside buffer (TLB) with memory accesses
• combining prefetch and streaming-store instructions.
Using the 8-byte Streaming Stores and Software Prefetch
Example 6-11 presents the copy algorithm that uses second level cache.
In Example 6-11, eight _mm_load_ps and _mm_stream_ps intrinsics are used so that all of the data prefetched (a 128-byte cache line) is written back. The prefetch and streaming-stores are executed in separate loops to minimize the number of transitions between reading and writing data.
The instruction, temp = a[kk+CACHESIZE], is used to ensure the page table entry for array a is entered in the TLB prior to prefetching. This is essentially a prefetch itself, as a cache line is filled from that memory location with this instruction.
prefetch_loop:
movaps xmm0, [esi+ecx]
movaps xmm0, [esi+ecx+64]
add ecx,128
cmp ecx,BLOCK_SIZE
jne prefetch_loop
xor ecx,ecx
align 16
cpy_loop:
movdqa xmm0,[esi+ecx]
movd.
Performance Comparisons of Memory Copy Routines
The throughput of a large-region, memory copy routine depends on several factors:
• coding techniques.
The baseline for performance comparison is the throughput (bytes/sec) of 8-MByte region memory copy on a first-generation Pentium M processor (CPUID signature 0x69n) with a 400-MHz system bus using byte-sequential technique similar to that shown in Example 6-10.
query each level of the cache hierarchy. Enumeration of each cache level is by specifying an index value (starting from 0) in the ECX register.
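Once the register values for one cache level have been read, they still have to be decoded. The sketch below shows one way to do that in portable C, taking the raw EAX, EBX and ECX values as plain integers so it does not execute CPUID itself. The bit positions follow the deterministic cache parameters leaf as documented in the Intel Software Developer's Manual to the best of my understanding; treat them as an assumption and confirm against that manual. The struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned type;        /* 0 = no more caches, 1 = data, 2 = instruction, 3 = unified */
    unsigned level;       /* cache level, starting at 1 */
    uint32_t line_size;   /* bytes per cache line */
    uint32_t partitions;  /* physical line partitions */
    uint32_t ways;        /* ways of associativity */
    uint32_t sets;        /* number of sets */
    uint64_t size_bytes;  /* ways * partitions * line_size * sets */
} cache_params;

/* Decode register values returned for one cache index.
   Each geometry field is stored as (value - 1), hence the "+ 1". */
static cache_params decode_cache_leaf(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    cache_params p;
    p.type       = eax & 0x1Fu;                 /* EAX bits 4:0   */
    p.level      = (eax >> 5) & 0x7u;           /* EAX bits 7:5   */
    p.line_size  = (ebx & 0xFFFu) + 1u;         /* EBX bits 11:0  */
    p.partitions = ((ebx >> 12) & 0x3FFu) + 1u; /* EBX bits 21:12 */
    p.ways       = ((ebx >> 22) & 0x3FFu) + 1u; /* EBX bits 31:22 */
    p.sets       = ecx + 1u;                    /* ECX bits 31:0  */
    p.size_bytes = (uint64_t)p.ways * p.partitions * p.line_size * p.sets;
    return p;
}
```

A caller would loop over index values from 0 upward and stop at the first result whose type field is 0.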
• Determine multi-threading resource topology in an MP system (See Section 7.10 of IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A).
• Determine cache hierarchy topology in a platform using multi-core processors (See Example 7-13).
platform, software can extract information on the number and the identities of each logical processor sharing that cache level and is made available to application by the OS. This is discussed in detail in “Using Shared Execution Resources in a Processor Core” in Chapter 7 and Example 7-13.
Chapter 7 Multi-Core and Hyper-Threading Technology
This chapter describes software optimization techniques for multithreaded applications running in an environment using either multiprocessor (MP) systems or processors with hardware-based multi-threading support.
cores but shared by two logical processors in the same core if Hyper-Threading Technology is enabled. This chapter covers guidelines that apply to either situation.
Figure 7-1 illustrates how performance gains can be realized for any workload according to Amdahl’s law. The bar in Figure 7-1 represents an individual task unit or the collective workload of an entire application.
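Amdahl's law in its common textbook form gives the speedup as 1 / ((1 - P) + P / N), where P is the fraction of the work that can run in parallel and N is the number of processors. The C sketch below evaluates that form; it ignores threading overhead, and the function name amdahl_speedup is hypothetical.

```c
#include <assert.h>

/* Ideal speedup for a workload whose fraction p (0..1) is parallelized
   across n processors. The serial fraction (1 - p) is unaffected. */
static double amdahl_speedup(double p, unsigned n)
{
    assert(n > 0);
    assert(p >= 0.0 && p <= 1.0);
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With half of the work parallelized (p = 0.5), two processors give a speedup of about 1.33, and no number of processors can exceed 2, which is why reducing the serial portion matters as much as adding processors.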
When optimizing application performance in a multithreaded environment, control flow parallelism is likely to have the largest impact on performance scaling with respect to the number of physical processors and to the number of logical processors per physical processor.
terms of time of completion relative to the same task when in a single-threaded environment) will vary, depending on how much shared execution resources and memory are utilized.
When two applications are employed as part of a multi-tasking workload, there is little synchronization overhead between these two processes. It is also important to ensure each application has minimal synchronization overhead within itself.
Parallel Programming Models
Two common programming models for transforming independent task requirements into application threads are:
• domain .
Functional Decomposition
Applications usually process a wide variety of tasks with diverse functions and many unrelated data sets. For example, a video codec needs several different processing functions. These include DCT, motion estimation and color conversion.
overhead when buffers are exchanged between the producer and consumer. To achieve optimal scaling with the number of cores, the synchronization overhead must be kept low.
Producer-Consumer Threading Models
Figure 7-3 illustrates the basic scheme of interaction between a pair of producer and consumer threads. The horizontal direction represents time. Each block represents a task unit, processing the buffer assigned to a thread.
It is possible to structure the producer-consumer model in an interlaced manner such that it can minimize bus traffic and be effective on multi-core processors without shared second-level cache.
corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in parallel in two threads. As long as the data generated by the producer reside in either the first or second level cache of the same core, the consumer can access them without incurring bus traffic.
Example 7-3 Thread Function for an Interlaced Producer Consumer Model
// master thread starts the first iteration, the other thread must wait
// .
Tools for Creating Multithreaded Applications
Programming directly to a multithreading application programming interface (API) is not the only method for creating multithreaded applications.
Automatic Parallelization of Code. While OpenMP directives allow programmers to quickly transform serial applications into parallel applications, programmers must identify specific portions of the application code that contain parallelism and add compiler directives.
Optimization Guidelines
This section summarizes optimization guidelines for tuning multithreaded applications.
• Place each synchronization variable alone, separated by 128 bytes or in a separate cache line.
• Adjust the private stack of each thread in an application so the spacing between these stacks is not offset by multiples of 64 KB or 1 MB (prevents unnecessary cache line evictions) when targeting IA-32 processors supporting Hyper-Threading Technology.
• For each processor supporting Hyper-Threading Technology, consider adding functionally uncorrelated threads to increase the hardware resource utilization of each physical processor package.
The best practice to reduce the overhead of thread synchronization is to start by reducing the application’s requirements for synchronization.
the white paper “Developing Multi-threaded Applications: A Platform Consistent Approach” (referenced in the Introduction chapter).
Synchronization for Short Periods
The frequency and duration that a thread needs to synchronize with other threads depends on application characteristics. When a synchronization loop needs very fast response, applications may use a spin-wait loop.
the processor must guarantee no violations of memory order occur. The necessity of maintaining the order of outstanding memory operations inevitably costs the processor a severe penalty that impacts all threads.
Example 7-4 Spin-wait Loop and PAUSE Instructions
(a) An un-optimized spin-wait loop experiences performance penalty when exiting the loop. It consumes execution resources without contributing computational work.
User/Source Coding Rule 21. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance.
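A bounded spin-wait loop that follows this rule can be sketched in C11. The example below is an illustration rather than the manual's Example 7-4: it uses the _mm_pause intrinsic on x86-64 builds and an empty statement elsewhere, and the names CPU_RELAX and spin_wait as well as the max_spins parameter are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()   /* emits the PAUSE instruction */
#else
#define CPU_RELAX() ((void)0)     /* no PAUSE on other targets   */
#endif

/* Spin until *flag becomes nonzero or max_spins polls have been made.
   Returns true if the flag was observed set, false if the spin budget
   ran out, in which case the caller should stop spinning. */
static bool spin_wait(atomic_int *flag, unsigned max_spins)
{
    assert(flag != NULL);
    for (unsigned i = 0; i < max_spins; ++i) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return true;
        CPU_RELAX();              /* one PAUSE between polls */
    }
    return false;
}
```

Keeping the loop bounded matches the second half of the rule: a thread that exhausts its budget can yield or block instead of continuing to consume execution resources.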
To reduce the performance penalty, one approach is to reduce the likelihood of many threads competing to acquire the same lock. Apply a software pipelining technique to handle data that must be shared between multiple threads.
If an application thread must remain idle for a long time, the application should use a thread blocking API or other method to release the idle processor.
Avoid Coding Pitfalls in Thread Synchronization
Synchronization between multiple threads must be designed and implemented with care to achieve good performance scaling with respect to the number of discrete processors and the number of logical processors per physical processor.
In general, OS function calls should be used with care when synchronizing threads. When using OS-supported thread synchronization objects (critica.
Prevent Sharing of Modified Data and False-Sharing
On an Intel Core Duo processor, sharing of modified data incurs a performance penalty when a thread running on one core tries to read or write data that is currently present in modified state in the first level cache of the other core.
User/Source Coding Rule 24. (H impact, M generality) Beware of false sharing within a cache line (64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, Intel Core Duo processors), and within a sector (128 bytes on Pentium 4 and Intel Xeon processors).
• Objects allocated dynamically by different threads may share cache lines. Make sure that the variables used locally by one thread are allocated in a manner to prevent sharing the cache line with other threads.
• In managed environments that provide automatic object allocation, the object allocators and garbage collectors are responsible for layout of the objects in memory so that false sharing through two objects does not happen.
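One common way to keep per-thread variables from sharing a cache line or sector is to align and pad each thread's slot. The following C11 sketch assumes a 128-byte granule (the sector size quoted in Rule 24, which also covers the 64-byte line size); the names FALSE_SHARING_GRANULE, MAX_THREADS, per_thread_counter and slot_distance are hypothetical and exist only for illustration.

```c
#include <stdalign.h>
#include <stddef.h>

/* 128 bytes covers both the 64-byte cache line and the 128-byte sector
 * sizes quoted in Rule 24. */
#define FALSE_SHARING_GRANULE 128
#define MAX_THREADS 4   /* hypothetical thread count for this sketch */

/* alignas pads the struct so that its size is a multiple of its alignment;
 * each array element therefore occupies its own 128-byte region. */
struct per_thread_counter {
    alignas(FALSE_SHARING_GRANULE) unsigned long hits;
};

static struct per_thread_counter counters[MAX_THREADS];

/* Distance in bytes between the slots used by thread a and thread b. */
static ptrdiff_t slot_distance(int a, int b)
{
    return (char *)&counters[b] - (char *)&counters[a];
}
```

The cost is memory: each counter now occupies 128 bytes, so this layout is usually reserved for data that is written frequently by different threads.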
Conserve Bus Bandwidth
In a multi-threading environment, bus bandwidth may be shared by memory traffic originated from multiple bus agents (these agents can be several logical processors and/or several processor cores).
reads. An approximate working guideline for software to operate below bus saturation is to check if bus read queue depth is significantly below 5. Some MP platforms may have a chipset that provides two buses, with each bus servicing one or more physical processors.
Avoid Excessive Software Prefetches
Pentium 4 and Intel Xeon processors have an automatic hardware prefetcher. It can bring data and instructions into the unified second-level cache based on prior reference patterns.
latency of scattered memory reads can be improved by issuing multiple memory reads back-to-back to overlap multiple outstanding memory read transactions.
Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software write-combining technique to separate WC store operations from competing with WB store traffic.
block size for loop blocking should be determined by dividing the target cache size by the number of logical processors available in a physical processor package.
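The division described above can be sketched as a small helper. This is only an illustration: the function name blocking_size and the choice to treat a processor count of zero as "no sharing" are assumptions, and the cache size and logical processor count would have to come from whatever detection mechanism the application already uses.

```c
#include <stddef.h>

/* Hypothetical helper: per-thread block size for loop blocking, computed as
 * target cache size divided by the number of logical processors that share
 * a physical processor package. */
static size_t blocking_size(size_t cache_bytes, unsigned logical_per_package)
{
    if (logical_per_package == 0)   /* defensive: treat as no sharing */
        return cache_bytes;
    return cache_bytes / logical_per_package;
}
```

For instance, a 512 KB target cache shared by two logical processors gives each thread a 256 KB block budget.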
User/Source Coding Rule 33. (H impact, M generality) Minimize the sharing of data between threads that execute on different bus agents sharing a common bus.
Example 7-8 shows the batched implementation of the producer and consumer thread functions.
Example 7-8 Batched Implementation of the Producer Con.
Eliminate 64-KByte Aliased Data Accesses
The 64 KB aliasing condition is discussed in Chapter 2. Memory accesses that satisfy the 64 KB aliasing condition can cause excessive evictions of the first-level data cache.
Preventing Excessive Evictions in First-Level Data Cache
Cached data in a first-level data cache are indexed to linear addresses but physically tagged. Data in second-level and third-level caches are tagged and indexed to physical addresses.
Per-thread Stack Offset
To prevent private stack accesses in concurrent threads from thrashing the first-level data cache, an application can use a per-thread stack offset for each of its threads. The size of these offsets should be multiples of a common base offset.
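The "multiples of a common base offset" rule can be sketched as follows. The helper name stack_offset_for_thread and its parameters are hypothetical, and the base offset is left as a caller-supplied tuning value because the text above does not fix one.

```c
#include <stddef.h>

/* Hypothetical helper: thread 0 gets no offset, thread 1 gets one base
 * offset, thread 2 gets two, and so on. base_offset is a tuning parameter
 * chosen by the application. */
static size_t stack_offset_for_thread(unsigned thread_index, size_t base_offset)
{
    return (size_t)thread_index * base_offset;
}
```

A thread entry function would then reserve that many bytes on its own stack before calling the function that does the real work, so that the working locals of different threads start at different offsets.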
Example 7-9 Adding an Offset to the Stack Pointer of Three Threads
void Func_thread_entry(DWORD *pArg)
{DWORD StackOffset = *pArg;
DWORD var1; // The local variable at this scope may not benefit
DWORD var2; // from the adjustment of the stack pointer that ensue.
Per-instance Stack Offset
Each instance of an application runs in its own linear address space; but the address layout of data for stack segments is identical for both instances.
However, the buffer space does enable the first-level data cache to be shared cooperatively when two copies of the same application are executing on the two logical processors in a physical processor package.
Front-end Optimization
In the Intel NetBurst microarchitecture family of processors, the instructions are decoded into micro-ops (μops) and sequences of μops (called traces) are stored in the Execution Trace Cache.
On Hyper-Threading-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache’s ability to deliver high bandwidth μop streams to the execution engine.
initial APIC_ID (See Section 7.10 of IA-32 Intel Architecture Software Developer’s Manual, Volume 3A for more details) associated with a logical processor. The three levels are:
• physical processor package.
Affinity masks can be used to optimize shared multi-threading resources.
Example 7-11 Assembling 3-level IDs, Affinity Masks for Each Logical Processor
// The BIOS and/or OS may limit the number of logical processors
// available to applications after system boot.
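Separately from the manual's Example 7-11, the idea of splitting an initial APIC ID into three levels can be sketched in C as below. The sketch assumes the SMT field occupies the lowest bits, the core field sits above it, and the remaining upper bits identify the package; the field widths are passed in as parameters because they are processor-specific and must be discovered at run time. The struct topo_ids and the function split_apic_id are hypothetical names.

```c
#include <stdint.h>

struct topo_ids {
    uint32_t package_id;
    uint32_t core_id;
    uint32_t smt_id;
};

/* Hypothetical helper: split an initial APIC ID into package, core and SMT
 * IDs. smt_bits and core_bits are the widths of the SMT and core fields and
 * are assumed to be small enough that their sum is below 32. */
static struct topo_ids split_apic_id(uint32_t apic_id,
                                     unsigned smt_bits, unsigned core_bits)
{
    struct topo_ids ids;
    uint32_t smt_mask  = (1u << smt_bits) - 1u;
    uint32_t core_mask = (1u << core_bits) - 1u;

    ids.smt_id     = apic_id & smt_mask;
    ids.core_id    = (apic_id >> smt_bits) & core_mask;
    ids.package_id = apic_id >> (smt_bits + core_bits);
    return ids;
}
```

With one SMT bit and one core bit, an APIC ID of binary 1011 decomposes into package 2, core 1, SMT 1.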
Arrangements of affinity-binding can benefit performance more than other arrangements. This applies to:
• Scheduling two domain-decomposition threads.
first to the primary logical processor of each processor core. This example is also optimized to the situations of scheduling two memory-intensive threads to run on separate cores and scheduling two compute-intensive threads on separate cores.
Example 7-12 Assembling a Look up Table to Manage Affinity Masks and Schedule Threads to Each Core First
AFFINITYMASK LuT[64]; // A Look up table to retrieve the affinity
// mask we want to use from the thread
// scheduling sequence index.
Example 7-13 Discovering the Affinity Masks for Sibling Logical Processors Sharing the Same Cache
// Logical processors sharing the same cache can.
PackageID[ProcessorNum] = PACKAGE_ID;
CoreID[ProcessorNum] = CORE_ID;
SmtID[ProcessorNum] = SMT_ID;
CacheID[ProcessorNum] = CACHE_ID;
// Only the target.
for (ProcessorNum = 1; ProcessorNum < NumStartedLPs; ProcessorNum++) {
ProcessorMask <<= 1;
for (i = 0; i < CacheNum; i++) {
// We may.
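Because the listing above is truncated, the following self-contained C sketch illustrates the general idea of grouping logical processors by cache ID into affinity masks. It is not a reconstruction of Example 7-13: the function build_cache_masks, its parameters and the MAX_LPS limit are hypothetical, and the cache IDs are assumed to have been collected already, one per logical processor, with logical processor i corresponding to bit i of a mask.

```c
#include <stdint.h>

#define MAX_LPS 64   /* one mask bit per logical processor */

/* Hypothetical helper: build one affinity mask per distinct cache ID.
 * cache_id[i] is the cache ID of logical processor i.
 * ids_out[k] and masks[k] describe the k-th distinct cache found; both
 * arrays must have room for num_lps entries.
 * Returns the number of distinct caches. */
static unsigned build_cache_masks(const uint32_t *cache_id, unsigned num_lps,
                                  uint32_t *ids_out, uint64_t *masks)
{
    unsigned cache_num = 0;

    for (unsigned lp = 0; lp < num_lps && lp < MAX_LPS; lp++) {
        unsigned k;
        /* Look for a cache ID that has been seen before. */
        for (k = 0; k < cache_num; k++) {
            if (ids_out[k] == cache_id[lp])
                break;
        }
        if (k == cache_num) {          /* first processor on this cache */
            ids_out[k] = cache_id[lp];
            masks[k] = 0;
            cache_num++;
        }
        masks[k] |= (uint64_t)1 << lp; /* add this processor to the mask */
    }
    return cache_num;
}
```

For four logical processors with cache IDs 0, 0, 1, 1, the sketch yields two masks, 0x3 and 0xC, identifying the sibling pairs that share each cache.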
Optimization of Other Shared Resources
Resource optimization in multi-threaded applications depends on the cache topology and execution resources associated within the hierarchy of processor topology.
seldom reaches 50% of peak retirement bandwidth. Thus, improving single-thread execution throughput should also benefit multi-threading performance.
throughput of a physical processor package. The non-halted CPI metric can be interpreted as the inverse of the throughput of a logical processor.
Using a function decomposition threading model, a multithreaded application can pair up a thread with critical dependence on a low-throughput resource with other threads that do not have the same dependency.
Write-combining buffers are another example of execution resources shared between two logical processors. With two threads running simultaneously on a processor supporting Hyper-Threading Technology, the writes of both threads count toward the limit of four write-combining buffers.
8-1 8 64-bit Mode Coding Guidelines
Introduction
This chapter describes coding guidelines for application software written to run in 64-bit mode. These guidelines should be considered as an addendum to the coding guidelines described in Chapters 2 through 7.
This optimization holds true for the lower 8 general purpose registers: EAX, ECX, EBX, EDX, ESP, EBP, ESI, EDI. To access the data in registers r8-r15, the REX prefix is required. Using the 32-bit form there does not reduce code size.
If the compiler can determine at compile time that the result of a multiply will not exceed 64 bits, then the compiler should generate the multiply instruction that produces a 64-bit result.
Can be replaced with:
movsx r8, r9w ;If bits 63:8 do not need to be
;preserved.
movsx r8, r10b ;If bits 63:8 do not need to
;be preserved.
In the above example, the moves to r8w and r8b both require a merge to preserve the rest of the bits in the register.
IMUL RAX, RCX
The 64-bit version above is more efficient than using the following 32-bit version:
MOV EAX, DWORD PTR[X]
MOV ECX, DWORD PTR[Y]
IMUL ECX
In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead of in a single 64-bit register.
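At the source level, the same idea can be expressed in C by widening both 32-bit operands before multiplying, so the product lands in a single 64-bit value. This is a sketch only: the function name mul_wide is hypothetical, and the instructions actually generated depend on the compiler and its options.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two 32-bit signed values and return the
 * full 64-bit product. Both operands are sign-extended to 64 bits first,
 * so the multiplication cannot overflow. A compiler targeting 64-bit mode
 * is free to implement this with a single 64-bit multiply. */
static int64_t mul_wide(int32_t x, int32_t y)
{
    return (int64_t)x * (int64_t)y;
}
```

Without the casts, the multiplication would be done in 32-bit arithmetic and could overflow before the result is widened.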
Use 32-Bit Versions of CVTSI2SS and CVTSI2SD When Possible
The CVTSI2SS and CVTSI2SD instructions convert a signed integer in a general-purpose register or memory location to a single-precision or double-precision floating-point value.
9-1 9 Power Optimization for Mobile Usages
Overview
Mobile computing allows computers to operate anywhere, anytime. Battery life is a key factor in delivering this benefit. Mobile applications require software optimization that considers both performance and power consumption.
Pentium M, Intel Core Solo and Intel Core Duo processors implement features designed to enable the reduction of active power and static power consumption.
to accommodate demand and adapt power consumption. The interaction between the OS power management policy and performance history is described below:
1. Demand is high and the processor works at its highest possible frequency (P0).
ACPI C-States
When computational demands are less than 100%, part of the time the processor is doing useful work and the rest of the time it is idle.
The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and lower power consumption.
Figure 9-3 Application of C-states to Idle Time
Consider that a processor is in lowest frequency (LFM, low frequency mode) and utilization is low.
• In an Intel Core Solo or Duo processor, after staying in C4 for an extended time, the processor may enter into a Deep C4 state to save additional static power. The processor reduces voltage to the minimum level required to safely maintain processor context.
Adjust Performance to Meet Quality of Features
When a system is battery powered, applications can extend battery life by reducing the performance or quality of features, turning off background activities, or both.
• GetActivePwrScheme: Retrieves the active power scheme (current system power scheme) index. An application can use this API to ensure that the system is running the best power scheme.
Avoid Using Spin Loops
Spin loops are used to wait for short intervals of time or for synchronization.
workload (usually that equates to reducing the number of instructions that the processor needs to execute, or optimizing application performance).
disk operations over time. Use the GetDevicePowerState() Windows API to test disk state and delay the disk access if it is not spinning.
Handling Sleep State Transitions
In some cases, transitioning to a sleep state may harm an application.
Using Enhanced Intel SpeedStep® Technology
Use Enhanced Intel SpeedStep Technology to adjust the processor to operate at a lower frequency and save energy. The basic idea is to divide computations into smaller pieces and use OS power management policy to effect a transition to higher P-states.
The same application can be written in such a way that work units are divided into smaller granularity, with each work unit and the following Sleep() scheduled at more frequent intervals (e.g. 100 ms), to deliver the same QOS (operating at full performance 50% of the time).
An additional positive effect of continuously operating at a lower frequency is that it avoids frequent changes in power draw (from low to high in our case) and battery current. Such changes eventually harm the battery and accelerate its deterioration.
Eventually, if the interval is large enough, the processor will be able to enter deeper sleep and save a considerable amount of power. The following guidelines can help applications take advantage of Intel® Enhanced Deeper Sleep:
• Avoid setting higher interrupt rates.
thread enables the physical processor to operate at lower frequency relative to a single-threaded version.
demands only 50% of processor resources (based on idle history). The processor frequency may be reduced by such multi-core unaware P-state coordination, resulting in a performance anomaly.
processor to enter the lowest possible C-state type (lower-numbered C state has less power saving). For example, if Core 1 meets the requirement to be in ACPI C1 and Core 2 meets the requirement for ACPI C3, multi-core-unaware OS coordination takes the physical processor to ACPI C1.
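The coordination rule in this example amounts to taking the minimum C-state index across cores. The C sketch below models only that arithmetic; the function name package_c_state and the representation of C-states as plain integers are assumptions made for illustration.

```c
/* Hypothetical model of multi-core-unaware coordination: the physical
 * processor is taken to the lowest-numbered (shallowest) C-state that any
 * core qualifies for. core_c_state[i] is the C-state index for core i;
 * num_cores must be at least 1. */
static int package_c_state(const int *core_c_state, int num_cores)
{
    int lowest = core_c_state[0];
    for (int i = 1; i < num_cores; i++) {
        if (core_c_state[i] < lowest)
            lowest = core_c_state[i];
    }
    return lowest;
}
```

With Core 1 at C1 and Core 2 at C3, the function returns 1, matching the ACPI C1 outcome described above.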
imbalance can be accomplished using performance monitoring events. The Intel Core Duo processor provides an event for this purpose.
A-1 A Application Performance Tools
Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture (IA)-based processors. This appendix introduces these tools and explains their capabilities for developing the most efficient programs without having to write assembly code.
• Intel Performance Libraries
The Intel Performance Library family consists of a set of software libraries optimized for Intel architecture processors.
family. Vectorization, processor dispatch, inter-procedural optimization, profile-guided optimization and OpenMP parallelism are all supported by the Intel compilers and can significantly aid the performance of an application.
default, and targets the Intel Pentium 4 processor and subsequent processors. Code produced will run on any Intel architecture 32-bit processor, but will be optimized specifically for the targeted processor.
Vectorizer Switch Options
The Intel C++ and Fortran Compiler can vectorize your code using the vectorizer switch options. The options that enable the vectorizer are the -Qx[M,K,W,B,P] and -Qax[M,K,W,B,P] described above.
Multithreading with OpenMP*
Both the Intel C++ and Fortran Compilers support shared memory parallelism via OpenMP compiler directives, library functions and environment variables. OpenMP directives are activated by the compiler switch -Qopenmp.
The -Qrcd option disables the change to truncation of the rounding mode in floating-point-to-integer conversions. For complete details on all of the code optimization options, refer to the Intel® C++ Compiler User’s Guide.
When you use PGO, consider the following guidelines:
• Minimize the changes to your program after instrumented execution and before feedback compilation. During feedback compilation, the compiler ignores dynamic information for functions modified after that information was generated.
Sampling
Sampling allows you to profile all active software on your system, including operating system, device driver, and application software. It works by occasionally interrupting the processor and collecting the instruction address, process ID, and thread ID.
Figure A-1 provides an example of a hotspots report by location.
Event-based Sampling
Event-based sampling (EBS) can be used to provide detailed information on the behavior of the microprocessor as it executes software.
different events at a time. The number of the events that the VTune analyzer can collect at once on the Pentium 4 and Intel Xeon processor depends on the events selected. Event-based samples are collected after a specific number of processor events have occurred.
duration of read traffic compared to the duration of the workload is significantly less than unity, it indicates the dominant data locality of the workload is cache access traffic.
stride inefficiency is most prominent on memory traffic. A useful indicator for large-stride inefficiency in a workload is to compare the ratio between bus rea.
The Call Graph View depicts the caller / callee relationships. Each thread in the application is the root of a call tree. Each node (box) in the call tree represents a function. Each edge (line with an arrow) connecting two nodes represents the call from the parent to the child function.
(SSE), Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3). The library set includes the Intel Math Kernel Library (MKL) and the Intel Integrated Performance Primitives (IPP).
• Performance: Highly-optimized routines with a C interface that give Assembly-level performance in a C/C++ development environment (MKL also supports a Fortran interface).
• Platform tuned: Processor-specific optimizations that yield the best performance for each Intel processor.
developed with the Intel Performance Libraries benefit from new architectural features of future generations of Intel processors simply by relinking the application with upgraded versions of the libraries.
The Intel Thread Checker product is an Intel VTune Performance Analyzer plug-in data collector that executes your program and automatically locates threading errors.
Figure A-2 shows Intel Thread Checker displaying the source code of the selected instance from a list of detected data race conditions that occurred during threaded execution.
Intel® Software College
The Intel® Software College is a valuable resource for classes on Streaming SIMD Extensions 2 (SSE2), Threading and the IA-32 Intel Architecture.
B-1 B Using Performance Monitoring Events
Performance monitoring events provide facilities to characterize the interaction between programmed sequences of instructions and different microarchitectural sub-systems.
The performance metrics listed in Tables B-1 through B-5 may be applicable to processors that support Hyper-Threading Technology; see the Using Performance Metrics with Hyper-Threading Technology section.
Replay
In order to maximize performance for the common case, the Intel NetBurst microarchitecture sometimes aggressively schedules μops for execution before all the conditions for correct execution are guaranteed to be satisfied.
miss more than once during its lifetime, but a Misses Retired metric (for example, 1st-Level Cache Misses Retired) will increment only once for that μop.
The first two metrics use performance counters, and thus can be used to cause an interrupt upon overflow for sampling. They may also be useful for those cases where it is easier for a tool to read a performance counter instead of the time stamp counter.
Non-Sleep Clockticks
The performance monitoring counters can also be configured to count clocks whenever the performance monitoring hardware is not powered-down. To count “non-sleep clockticks” with a performance-monitoring counter, do the following:
• Select any one of the 18 counters.
that logical processor is not halted (it may include some portion of the clock cycles for that logical processor to complete a transition into a halted state). A physical processor that supports Hyper-Threading Technology enters into a power-saving state if all logical processors are halted.
Microarchitecture Notes
Trace Cache Events
The trace cache is not directly comparable to an instruction cache. The two are organized very differently. For example, a trace can span many lines' worth of instruction-cache data.
There is a simplified block diagram below of the sub-systems connected to the IOQ unit in the front side bus sub-system and the BSQ unit that interfaces to the IOQ.
Figure B-1 Relationships Between the Cache Hierarchy, IOQ, BSQ and Front Side Bus (diagram blocks include Chip Set, System Memory, 1st Level Data Cache and 3rd Level Cache)
Core references are nominally 64 bytes, the size of a 1st-level cache line. Smaller sizes are called partials, e.g., uncacheable and write combining reads, uncacheable, write-through and write-protect writes, and all I/O.
• IOQ_allocation, IOQ_active_entries: 64 bytes for hits or misses, smaller for partials' hits or misses
Writebacks (dirty evictions)
• BSQ_cac.
transactions of the writeback (WB) memory type for the FSB IOQ and the BSQ can be an indication of how often this happens. It is less likely to occur for applications with poor locality of writes to the 3rd-level cache, and of course cannot happen when no 3rd-level cache is present.
Current implementations of the BSQ_cache_reference event do not distinguish between programmatic read and write misses. Programmatic writes that miss must get the rest of the cache line and merge the new data.
Usage Notes on Bus Activities
A number of performance metrics in Table B-1 are based on IOQ_active_entries and BSQ_active_entries. The next three paragraphs provide information on various bus transaction underway metrics.
accesses (i.e., are also 3rd-level misses). This can decrease the average measured BSQ latencies for workloads that frequently thrash (miss or prefetch a lot into) the 2nd-level cache but hit in the 3rd-level cache.
an expression built up from other metrics; for example, IPC is derived from two single-event metrics.
• Column 2 provides a description of the metric in column 1.
Table B-1 Pentium 4 Processor Performance Metrics
(Columns: Metric; Description; Event Name or Metric Expression; Event Mask Value Required)
General Metrics
Metric: Non-Sleep Clockticks
Description: The number of clockticks while a processor is not in any sleep modes.
Metric: Speculative Uops Retired
Description: Number of uops retired (include both instructions executed to completion and speculatively executed in the path of branch mispredictions).
Metric: Mispredicted returns
Description: The number of mispredicted returns including all causes.
Event name: retired_mispred_branch_type
Event mask: RETURN
Metric: All conditionals
Description: The number of branch.
Metric: TC Flushes
Description: Number of TC flushes (The counter will count twice for each occurrence. Divide the count by 2 to get the number of flushes.
Metric: Logical Processor 1 Deliver Mode
Description: The number of cycles that the trace and delivery engine (TDE) is delivering traces associated with logical processor 1, regardless of the operating modes of the TDE for traces associated with logical processor 0.
Metric: Logical Processor 0 Build Mode
Description: The number of cycles that the trace and delivery engine (TDE) is building traces associated with logical processor 0, regardless of the operating modes of the TDE for traces associated with logical processor 1.
Metric: Trace Cache Misses
Description: The number of times that significant delays occurred in order to decode instructions and build a trace because of a TC miss.
Memory Metrics
Metric: Page Walk DTLB All Misses
Description: The number of page walk requests due to DTLB misses from either load or store.
Event name: page_walk_type
Event mask: DTMISS
Metric: 1st-Level Cache Load Misses Retired
Description: The number of retired μops that experienced 1st-Level cache load misses.
Metric: 64K Aliasing Conflicts (note 1)
Description: The number of 64K aliasing conflicts. A memory reference causing a 64K aliasing conflict can be counted more than once in this stat. The performance penalty resulting from a 64K-aliasing conflict can vary from being unnoticeable to considerable.
Metric: MOB Load Replays
Description: The number of replayed loads related to the Memory Order Buffer (MOB). This metric counts only the case where the store-forwarding data is not an aligned subset of the stored data.
Metric: 2nd-Level Cache Reads Hit Shared
Description: The number of 2nd-level cache read references (loads and RFOs) that hit the cache line in shared state.
Metric: 3rd-Level Cache Reads Hit Modified
Description: The number of 3rd-level cache read references (loads and RFOs) that hit the cache line in modified state.
Metric: All WCB Evictions
Description: The number of times a WC buffer eviction occurred due to any causes (This can be used to distinguish 64K aliasing cases that contribute more significantly to performance penalty, e.
Bus Metrics
Metric: Bus Accesses from the Processor
Description: The number of all bus transactions that were allocated in the IO Queue from this processor.
Metric: Prefetch Ratio
Description: Fraction of all bus transactions (including retires) that were for HW or SW prefetching.
Metric: Writes from the Processor
Description: The number of all write transactions on the bus that were allocated in IO Queue from this processor (excludes RFOs).
Metric: All WC from the Processor
Description: The number of Write Combining memory transactions on the bus that originated from this processor.
Metric: Bus Accesses from All Agents
Description: The number of all bus transactions that were allocated in the IO Queue by all agents.
Metric: Bus Reads Underway from the processor (note 7)
Description: This is an accrued sum of the durations of all read (includes RFOs) transactions by this processor. Divide by “Reads from the Processor” to get bus read request latency.
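The division described for this metric can be sketched as a small helper. The function name avg_bus_read_latency is hypothetical, the two arguments are assumed to be raw counter values already read by a profiling tool, and the result is expressed in whatever unit the underway metric accrues its durations in.

```c
#include <stdint.h>

/* Hypothetical helper: average bus read request latency, computed as
 * "Bus Reads Underway from the processor" divided by
 * "Reads from the Processor". Returns 0.0 when no reads were counted,
 * to avoid dividing by zero. */
static double avg_bus_read_latency(uint64_t reads_underway_sum,
                                   uint64_t reads_count)
{
    if (reads_count == 0)
        return 0.0;
    return (double)reads_underway_sum / (double)reads_count;
}
```

The same shape of calculation applies to the other underway metrics that are paired with a matching transaction count.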
Metric: All UC Underway from the processor (note 7)
Description: This is an accrued sum of the durations of all UC transactions by this processor.
Metric: Bus Writes Underway from the processor (note 7)
Description: This is an accrued sum of the durations of all write transactions by this processor. Divide by “Writes from the Processor” to get bus write request latency.
Metric: Write WC Full (BSQ)
Description: The number of write (but neither writeback nor RFO) transactions to WC-type memory.
Event name: BSQ_allocation
Event mask: 1. REQ_TYPE1 | REQ_LEN0 | REQ_LEN1 | MEM_TYPE0 | REQ_DEM_TYPE 2. Enable edge filtering (note 6) in the CCCR.
Metric: Reads Non-prefetch Full (BSQ)
Description: The number of read (excludes RFOs and HW|SW prefetches) transactions to WB-type memory. Beware of granularity issues with this event.
Event name: BSQ_allocation
Event mask: 1. REQ_LEN0 | REQ_LEN1 | MEM_TYPE1 | MEM_TYPE2 | REQ_CACHE_TYPE | REQ_DEM_TYPE 2.
Metric: UC Write Partial (BSQ)
Description: The number of UC write transactions. Beware of granularity issues between BSQ and FSB IOQ events.
Event name: BSQ_allocation
Event mask: 1. REQ_TYPE0 | REQ_LEN0 | REQ_SPLIT_TYPE | REQ_ORD_TYPE | REQ_DEM_TYPE 2.
Metric: WB Writes Full Underway (BSQ) (note 8)
Description: This is an accrued sum of the durations of writeback (evicted from cache) transactions to WB-type memory. Divide by Writes WB Full (BSQ) to estimate average request latency.
Metric: Write WC Partial Underway (BSQ) (note 8)
Description: This is an accrued sum of the durations of partial write transactions to WC-type memory. Divide by Write WC Partial (BSQ) to estimate average request latency. User note: Allocated entries of WC partials that originate from DWord operands are not included.
Metric: SSE Input Assists
Description: The number of occurrences of SSE/SSE2 floating-point operations needing assistance to handle an exception condition. The number of occurrences includes speculative counts.
Event name: SSE_input_assist
Event mask: ALL
Metric: Packed SP Retired (note 3)
Description: Non-bogus packed single-precision instructions retired.
1. A memory reference causing 64K aliasing conflict can be counted more than once in this stat. The resulting performance penalty can vary from unnoticeable to considerable.
4. Most commonly used x87 instructions (e.g., fmul, fadd, fdiv, fsqrt, fstp, etc.) decode into a single μop. However, transcendental and some x87 instructions decode into several μops; in these limited cases, the metrics will count the number of μops that are actually tagged.
Table B-2 Metrics That Utilize Replay Tagging Mechanism
Columns: Replay Metric Tags (footnote 1); Bit field to set: IA32_PEBS_ENABLE; Bit field to set: MSR_PEBS_MATRIX_V.
Tags for front_end_event
Table B-3 provides a list of the tags that are used by various metrics derived from the front_end_event. The event names referenced in column 2 can be found from the Pentium 4 processor performance monitoring events.
Table B-4 Metrics That Utilize the Execution Tagging Mechanism
Columns: Execution Metric Tags; Upstream ESCR; Tag Value in Upstream ESCR; See Event Mask Parameter for Execution_event
Packed_SP_retired: Set the ALL bit in the event mask and the TagUop bit in the ESCR of packed_SP_uop.
Table B-5 New Metrics for Pentium 4 Processor (Family 15, Model 3)
Using Performance Metrics with Hyper-Threading Technology
On Intel Xeo.
The performance metrics listed in Table B-1 fall into three categories:
• Logical processor specific and supporting parallel counting.
• Logical processor specific but constrained by ESCR limitations.
• Logical processor independent and not supporting parallel counting.
Branching Metrics: Branches Retired; Tagged Mispredicted Branches Retired; Mispredicted Branches Retired; All returns; All indirect branches; All calls; All c.
Memory Metrics: Split Load Replays (footnote 1); Split Store Replays (footnote 1); MOB Load Replays (footnote 1); 64k Aliasing Conflicts; 1st-Level Cache Load Misses Retired; 2nd-Level Cache Lo.
Bus Metrics: Bus Accesses from the Processor (footnote 1); Non-prefetch Bus Accesses from the Processor (footnote 1); Reads from the Processor (footnote 1); Writes from the Processor (footnote 1); Read.
Characterization Metrics: x87 Input Assists; x87 Output Assists; Machine Clear Count; Memory Order Machine Clear; Self-Modifying Code Clear; Scalar DP Retired .
Using Performance Events of Intel Core Solo and Intel Core Duo processors
There are performance events specific to the microarchitecture of Intel Core Solo and Intel Core Duo processors (see Table A-9 of the IA-32 Intel® Architecture Software Developer's Manual, Volume 3B).
There are three cycle-counting events which will not progress on a halted core, even if the halted core is being snooped. These are: Unhalted core cycles, Unhalted reference cycles, and Unhalted bus cycles.
• Some events, such as writebacks, may have non-deterministic behavior for different runs. In such a case, only measurements collected in the same run yield meaningful ratio values.
• Serial_Execution_Cycles, event number 3C, unit mask 02H. This event counts the bus cycles during which the core is actively executing code (non-halted) while the other core in the physical processor is halted.
C IA-32 Instruction Latency and Throughput
This appendix contains tables of the latency, throughput and execution units that are associated with more-commonly-used IA-32 instructions (footnote 1). The instruction timing data varies within the IA-32 family of processors.
Overview
The current generation of the IA-32 family of processors uses out-of-order execution with dynamic scheduling and buffering to tolerate poor instruction selection and scheduling that may occur in legacy code.
While several items on the above list involve selecting the right instruction, this appendix focuses on the following issues. These are listed in an expected priority order, though which item contributes most to performance will vary by application.
Definitions
The IA-32 instruction performance data are listed in several tables. The tables contain the following information:
Instruction Name: The assembly mnemonic of each instruction.
accurately predict realistic performance of actual code sequences based on adding instruction latency data.
• The instruction latency data are useful when tuning a dependency chain. However, dependency chains limit the out-of-order core's ability to execute micro-ops in parallel.
Latency and Throughput with Register Operands
IA-32 instruction latency and throughput data are presented in Table C-2 through Table C-8.
Table C-2 Streaming SIMD Extension 2 128-bit Integer Instructions
Columns: Instruction; Latency (footnote 1); Throughput; Execution Unit (footnote 2)
CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0.
PCMPGTB/PCMPGTD/PCMPGTW xmm, xmm    2 2 1    2 2 1    MMX_ALU
PEXTRW r32, xmm, imm8    7 7 3    2 2 2    MMX_SHFT, FP_MISC
PINSRW xmm, r32, imm8    4 4 1+1    2 2 2    MMX_SHFT .
PSUBB/PSUBW/PSUBD xmm, xmm    2 2 1    2 2 1    MMX_ALU
PSUBSB/PSUBSW/PSUBUSB/PSUBUSW xmm, xmm    2 2 1    2 2 1    MMX_ALU
PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ xmm, xmm    4 4 .
COMISD xmm, xmm    7 6 1    2 2 1    FP_ADD, FP_MISC
CVTDQ2PD xmm, xmm    8 8 4+1    3 3 4    FP_ADD, MMX_SHFT
CVTPD2PI mm, xmm    12 11 5    3 3 3    FP_ADD, MMX_SHFT, MMX_ALU.
DIVPD xmm, xmm    70 69 32+31    70 69 62    FP_DIV
DIVSD xmm, xmm    39 38 32    39 38 31    FP_DIV
MAXPD xmm, xmm    5 4 4    2 2 2    FP_ADD
MAXSD xmm, xmm    5 4 3    2 2 1    FP_ADD.
Table C-4 Streaming SIMD Extension Single-precision Floating-point Instructions
Columns: Instruction; Latency (footnote 1); Throughput; Execution Unit (footnote 2)
CPUID 0F3n 0F2n 0x69.
MOVLHPS (footnote 3) xmm, xmm    4 4    2 2    MMX_SHFT
MOVMSKPS r32, xmm    6 6    2 2    FP_MISC
MOVSS xmm, xmm    4 4    2 2    MMX_SHFT
MOVUPS xmm, xmm    6 6    1 1    FP_MOVE
MULPS x.
Table C-5 Streaming SIMD Extension 64-bit Integer Instructions
Columns: Instruction; Latency (footnote 1); Throughput; Execution Unit
CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n.
PCMPGTB/PCMPGTD/PCMPGTW mm, mm    2 2    1 1    MMX_ALU
PMADDWD (footnote 3) mm, mm    9 8    1 1    FP_MUL
PMULHW/PMULLW (footnote 3) mm, mm    9 8    1 1    FP_MUL
POR mm, mm    2 2    1 1 .
Table C-7 IA-32 x87 Floating-point Instructions
Columns: Instruction; Latency (footnote 1); Throughput; Execution Unit (footnote 2)
CPUID 0F3n 0F2n 0x69n 0F3n 0F2n 0x69n 0F2n
FABS 3 .
FSCALE (footnote 4)    60    7
FRNDINT (footnote 4)    30    11
FXCH (footnote 5)    0    1    FP_MOVE
FLDZ (footnote 6)    0
FINCSTP/FDECSTP (footnote 6)    0
See "Table Footnotes"
Table C-8 IA-32 General Purpose Instructi.
Jcc (footnote 7)    Not Applicable    0.5    ALU
LOOP    8    1.5    ALU
MOV    1 0.5    0.5 0.5    ALU
MOVSB/MOVSW    1 0.5    0.5 0.5    ALU
MOVZB/MOVZW    1 0.5    0.5 0.5    ALU
NEG/NOT/NOP    1 0.5    0.5 0.5    ALU
POP r32    1.5    1    MEM_LOAD, ALU
PUSH    1.5    1    MEM_STORE, ALU
RCL/RCR reg, 1 8 64 1 1
ROL/ROR 1 4 0 .
Table Footnotes
The following footnotes refer to all tables in this appendix.
1. Latency information for many instructions that are complex (> 4 μops) is estimated conservatively, based on worst-case estimates.
4. Latency and throughput of transcendental instructions can vary substantially in a dynamic execution environment. Only an approximate value or a range of values is given for these instructions.
5. The FXCH instruction has 0 latency in code sequences.
For the sake of simplicity, all data being requested is assumed to reside in the first level data cache (cache hit).
D Stack Alignment
This appendix details the alignment of the stacks of data for Streaming SIMD Extensions and Streaming SIMD Extensions 2.
Stack Frames
This section describes the stack alignment conventions for both esp-based (normal) and ebp-based (debug) stack frames.
alignment for __m64 and double type data by enforcing that these 64-bit data items are at least eight-byte aligned (they will now be 16-byte aligned).
As an optimization, an alternate entry point can be created that can be called when proper stack alignment is provided by the caller.
Example D-1 in the following sections illustrates this technique. Note the entry points foo and foo.aligned; the latter is the alternate aligned entry point.
Example D-1 Aligned esp-Based Stack Frames
    void _cdecl foo (int k)
    {
    int j;
    foo:                       // See Note A
        push ebx
        mov ebx, esp
        sub esp, 0x00000008
        and esp, 0xfffffff0
        add esp, 0x00000008
        jmp common
    foo.
Aligned ebp-Based Stack Frames
In ebp-based frames, padding is also inserted immediately before the return address. However, this frame is slightly unusual in that the return address may actually reside in two different places in the stack.
Example D-2 Aligned ebp-based Stack Frames
    void _stdcall foo (int k)
    {
    int j;
    foo:
        push ebx
        mov ebx, esp
        sub esp, 0x00000008
        and esp, 0xfffffff0
        add esp, 0x00000008    // esp is (8 mod 16) after add
        jmp common
    foo.
        // the goal is to make esp and ebp
        // (0 mod 16) here
    j = k;
        mov edx, [ebx + 8]     // k is (0 mod 16) if caller aligned
                               // its stack
        mov [ebp - 16], edx    // J is (0 mod 16)
    foo(5);
        add esp, -4            // normal call sequence to
                               // unaligned entry
        mov [esp],5
        call foo               // for stdcall, callee
                               // cleans up stack
    foo.
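The arithmetic performed by the sub/and/add sequence in the prologues of Examples D-1 and D-2 can be checked with a small C model. This is a sketch only: realign_esp is a hypothetical name, and the function merely mimics the three instructions on a 32-bit value; it does not touch a real stack pointer.

```c
#include <stdint.h>

/* Hypothetical model of the prologue sequence
 *     sub esp, 0x00000008
 *     and esp, 0xfffffff0
 *     add esp, 0x00000008
 * applied to a 32-bit stack pointer value. */
uint32_t realign_esp(uint32_t esp)
{
    esp -= 0x00000008u;   /* reserve 8 bytes below the current top */
    esp &= 0xfffffff0u;   /* round down to a 16-byte boundary */
    esp += 0x00000008u;   /* result is congruent to 8 modulo 16 */
    return esp;
}
```

Whatever the incoming value, the result is never above the input and is always 8 mod 16, which matches the comment "esp is (8 mod 16) after add" in Example D-2. For instance, 0x1000 maps to 0x0ff8 and 0x100c maps to 0x1008.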
Stack Frame Optimizations
The Intel C++ Compiler provides certain optimizations that may improve the way aligned frames are set up and used.
Inlined Assembly and ebx
When using aligned frames, the ebx register generally should not be modified in inlined assembly blocks since ebx is used to keep track of the argument block.
E Mathematics of Prefetch Scheduling Distance
This appendix discusses how far away to insert prefetch instructions. It presents a mathematical model allowing you to deduce a simplified equation which you can use for determining the prefetch scheduling distance (PSD) for your application.
Ninst is the number of instructions in the scope of one loop iteration. Consider the following example of a heuristic equation assuming that parameters have the values as indicated: where 60 corresponds to Nlookup, 25 to Nxfer, and 1.
Tb: data transfer latency, which is equal to number of lines per iteration * line burst latency.
Note that the potential effects of µop reordering are not factored into the estimations discussed.
Memory access plays a pivotal role in prefetch scheduling. To better understand the memory subsystem, consider the Streaming SIMD Extensions and Streaming SIMD Extensions 2 memory pipeline depicted in Figure E-1.
Tl varies dynamically and is also system hardware-dependent. The static variants include the core-to-front-side-bus ratio, memory manufacturer and memory controller (chipset).
No Preloading or Prefetch
The traditional programming approach does not perform data preloading or prefetch. It is sequential in nature and will experience stalls because the memory is unable to provide the data immediately when the execution pipeline requires it.
The iteration latency is approximately equal to the computation latency plus the memory leadoff latency (includes cache miss latency, chipset latency, bus arbitration, and so on) plus the data transfer latency, where transfer latency = number of lines per iteration * line burst latency.
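The no-prefetch relationship described above can be written as a short C sketch. The function and parameter names are hypothetical, and all latencies are assumed to be expressed in the same unit (for example, processor clocks).

```c
/* Hypothetical helpers illustrating the no-prefetch model.
 * All latencies must use the same time unit. */

/* Data transfer latency: lines per iteration times line burst latency. */
unsigned transfer_latency(unsigned lines_per_iteration,
                          unsigned line_burst_latency)
{
    return lines_per_iteration * line_burst_latency;
}

/* Iteration latency without preloading or prefetch:
 * computation + memory leadoff + data transfer. */
unsigned iteration_latency_no_prefetch(unsigned computation_latency,
                                       unsigned leadoff_latency,
                                       unsigned lines_per_iteration,
                                       unsigned line_burst_latency)
{
    return computation_latency
         + leadoff_latency
         + transfer_latency(lines_per_iteration, line_burst_latency);
}
```

As an illustration with made-up inputs, a computation latency of 10, a leadoff latency of 18, and 2 lines per iteration at a burst latency of 4 each give a transfer latency of 8 and an iteration latency of 36, because nothing overlaps in this model.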
The following formula shows the relationship among the parameters: It can be seen from this relationship that the iteration latency is equal to the computation latency, which means the memory accesses are executed in background and their latencies are completely hidden.
For this particular example the prefetch scheduling distance is greater than 1. Data being prefetched for iteration i will be consumed in iteration i+2.
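A loop with a prefetch scheduling distance of 2 might look like the following C sketch, in which iteration i issues the prefetch for the data consumed in iteration i+2. Several things here are assumptions rather than content from the manual: the function name, the 64-byte line size (16 floats), and the use of the GCC/Clang builtin __builtin_prefetch instead of a specific prefetch instruction.

```c
#include <stddef.h>

#define LINE_FLOATS 16   /* assumed: 64-byte cache line / 4-byte float */
#define PSD 2            /* prefetch scheduling distance, in iterations */

/* Hypothetical example: sums an array one cache line per iteration,
 * prefetching the line that will be consumed PSD iterations later. */
float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;

    for (size_t i = 0; i < n; i += LINE_FLOATS) {
        /* Start of the line needed PSD iterations from now. */
        size_t ahead = i + (size_t)PSD * LINE_FLOATS;
        if (ahead < n) {
            /* GCC/Clang builtin: read access, low temporal locality. */
            __builtin_prefetch(&data[ahead], 0, 0);
        }

        /* Consume the current line (the last one may be partial). */
        size_t end = (i + LINE_FLOATS < n) ? i + LINE_FLOATS : n;
        for (size_t j = i; j < end; ++j) {
            total += data[j];
        }
    }
    return total;
}
```

The bounds check keeps the prefetch address inside the array; the final PSD iterations simply issue no prefetch. The prefetch is only a hint, so the result is the same as a plain summation loop.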
Memory Throughput Bound (Case: Tb >= Tc)
When the application or loop is memory throughput bound, there is no way to hide the memory latency. Under such circumstances, the burst latency is always greater than the compute latency.
memory to you cannot do much about it. Typically, copying data from one space to another (for example, a graphics driver moving data from writeback memory to write-combining memory) belongs to this category, where the performance advantage from prefetch instructions will be marginal.
Now for the case Tl = 18, Tb = 8 (2 cache lines are needed per iteration) examine the following graph. Consider the graph of accesses per iteration in example 1, Figure E-6. The prefetch scheduling distance is a step function of Tc, the computation latency.
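The step-function behavior can be illustrated with a small C sketch. It rests on an assumption that is not spelled out in the text above: once prefetching is in effect, one iteration takes max(Tc, Tb), and the prefetch must be issued enough iterations ahead to cover Tl + Tb. The function name is hypothetical.

```c
/* Hypothetical sketch: prefetch scheduling distance in iterations,
 * assuming psd = ceil((Tl + Tb) / max(Tc, Tb)) with a minimum of 1.
 * tl: memory leadoff latency, tb: data transfer latency,
 * tc: computation latency, all in the same time unit. */
unsigned prefetch_scheduling_distance(unsigned tl, unsigned tb, unsigned tc)
{
    /* Effective iteration latency once memory latency is overlapped. */
    unsigned il = (tc > tb) ? tc : tb;
    if (il == 0) {
        return 1;   /* degenerate input: nothing to schedule around */
    }

    /* Integer ceiling of (tl + tb) / il. */
    unsigned psd = (tl + tb + il - 1) / il;
    return (psd == 0) ? 1 : psd;
}
```

With Tl = 18 and Tb = 8, this sketch yields 1 for Tc = 26 or more, 2 for Tc = 13, 3 for Tc = 10, and 4 once Tc drops to 8 or below (the memory throughput bound case), so the distance steps up as the computation latency shrinks.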
In reality, the front-side bus (FSB) pipelining depth is limited, that is, only four transactions are allowed at a time in the Pentium III and Pentium 4 processors.
Index
64-bit mode default operand size, 8-1 introduction, 8-1 legacy instructions, 8-1 multiplication notes, 8-2 register usage, 8-2, 8-4 sign-extension, 8-3 software prefetch, 8-6 using CVTS.
coding methodologies, 3-13 coding techniques, 3-12 absolute difference of signed numbers, 4-24 absolute difference of unsigned numbers, 4-23 absolute .
floating-point stalls, 2-72 flow dependency, E-7 flush to zero, 5-22 FXCH instruction, 2-70 G general optimization techniques, 2-1 branch prediction, 2-15 static prediction, 2-19 genera.
L large load stalls, 2-37 latency, 2-72, 6-5 lea instruction, 2-74 loading and storing to and from the same DRAM page, 4-39 loop blocking, 3-34 loop u.
O optimizing cache utilization cache management, 6-44 examples, 6-15 non-temporal store instructions, 6-10 prefetch and load, 6-9 prefetch Instructions, 6-8 prefetching, 6-7 SFENCE ins.
R reciprocal instructions, 5-2 rounding control option, A-6 S sampling event-based, A-10 Self-modifying code, 2-47 SFENCE Instruction, 6-15, 6-16 sign.