Intel(R) C++ and Fortran Compiler 9.0 for Itanium
-------------------------------------------------

Optimization Levels

       The Intel C++ and Intel Fortran Compilers apply the following
       optimizations when you invoke the -O1, -O2, or -O3 options:
       constant propagation, copy propagation, dead-code elimination,
       global register allocation, instruction scheduling, loop
       unrolling (-O2, -O3 only), loop-invariant code motion, partial
       redundancy elimination, strength reduction/induction variable
       simplification, variable renaming, exception handling
       optimizations, tail recursions, peephole optimizations,
       structure assignment lowering and optimizations, and dead store
       elimination.

       Depending on the Intel architecture, optimization options can
       have different effects. To specify optimizations for your
       target architecture, refer to the following:

       -O1
              Optimizes to favor code size and code locality. Disables
              loop unrolling. -O1 may improve performance for
              applications with very large code size, many branches,
              and execution time not dominated by code within loops.
              In most cases, -O2 is recommended over -O1.

              Itanium architecture-based systems: Disables software
              pipelining and global code scheduling.

       -O2 (DEFAULT)
              Optimizes for code speed. This is the generally
              recommended optimization level.

              Itanium architecture-based systems: Turns software
              pipelining ON.

       -O3
              Enables -O2 optimizations and, in addition, more
              aggressive optimizations such as loop and memory access
              transformation. The -O3 optimizations may slow down code
              in some cases compared to -O2 optimizations. Recommended
              for applications that have loops with heavy use of
              floating-point calculations and that process large data
              sets.

       -fast
              Enhances speed across the entire program. Sets the
              following command options that can improve run-time
              performance: -O3, -ipo, and -static. The default is
              -nofast.
       -O
              Same as -O2.

       -Ob<n>
              Control inline expansion, where <n> is one of the
              following values:

              0 -- Disables inlining.
              1 -- (DEFAULT) Enables inlining of functions declared
                   with the __inline keyword. Also enables inlining
                   according to the C++ language.
              2 -- Inline any function, at the compiler's discretion.
                   Enables interprocedural optimizations and has the
                   same effect as -ip.

       -O0
              Disables optimizations.

Floating Point Optimization Options

       -mp
              Maintain floating-point precision (disables some
              optimizations). The -mp option restricts optimization to
              maintain declared precision and to ensure that
              floating-point arithmetic conforms more closely to the
              ANSI and IEEE standards. For most programs, specifying
              this option adversely affects performance. If you are
              not sure whether your application needs this option, try
              compiling and running your program both with and without
              it to evaluate the effects on both performance and
              precision.

       -mp1
              Improve floating-point precision. -mp1 disables fewer
              optimizations and has less impact on performance than
              -mp.

       -ftz
              Flush denormal results to zero.

       -IPF_fma[-]
              Enable [disable] the contraction of floating-point
              multiply and add/subtract operations into a single
              operation. Unless -mp is specified, the compiler
              contracts these operations whenever possible. The -mp
              option disables the contractions. -IPF_fma and -IPF_fma-
              can be used to override the default compiler behavior.

       -IPF_fp_relaxed
              Enables use of faster but slightly less accurate code
              sequences for math functions, such as divide and sqrt.
              When compared to strict IEEE* precision, this option
              slightly reduces the accuracy of floating-point
              calculations performed by these functions, usually
              limited to the least significant digit.
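       The contraction controlled by -IPF_fma above can change
       numerical results, which is why -mp turns it off. The Python
       sketch below is only an illustration of the arithmetic, not of
       the compiler: it assumes IEEE double precision and emulates a
       fused multiply-add by computing the exact result with rational
       numbers and rounding once. The function names are hypothetical.

```python
from fractions import Fraction


def mul_add_separate(a, b, c):
    """Multiply, round, then add, round again (no contraction)."""
    return a * b + c


def mul_add_fused(a, b, c):
    """Emulate a fused multiply-add: exact a*b+c, rounded only once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single, correctly rounded conversion


# Inputs chosen so that the product a*b = 1 - 2**-54 is not
# representable as a double and rounds to exactly 1.0.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

separate = mul_add_separate(a, b, c)  # the small term is rounded away
fused = mul_add_fused(a, b, c)        # the small term survives

print("separate:", separate)
print("fused:   ", fused)
```

       With these inputs the separate version yields 0.0 while the
       fused version yields -2**-54. Neither is wrong in itself, but
       the two differ, so code that must reproduce results bit for bit
       may need contraction disabled.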
       -IPF_fp_speculation
              Enable floating-point speculations with the following
              conditions:

              fast   -- Speculate floating-point operations
              safe   -- Speculate only when safe
              strict -- Same as off
              off    -- Disables speculation of floating-point
                        operations

       -IPF_flt_eval_method0
              Directs the compiler to evaluate the expressions
              involving floating-point operands in the precision
              indicated by the variable types declared in the program.

       -IPF_fltacc[-]
              Enable [disable] optimizations that affect
              floating-point accuracy. By default (-IPF_fltacc-) the
              compiler may apply optimizations that reduce
              floating-point accuracy. You may use -IPF_fltacc or -mp
              to improve floating-point accuracy, but at the cost of
              disabling some optimizations.

Optimizing Non-Exclusively for Specific Processors

       The -tpp options optimize your application's performance for a
       specific processor. The resulting binary will also run on other
       processors in the same architecture (IA-32 or Itanium
       architecture). The Intel C++ Compiler includes gcc-compatible
       versions of the -tpp options. These options are listed below,
       following the -tpp options.

       -tpp1
              Target optimization to the Itanium processor.

       -tpp2
              Target optimization to the Itanium 2 processor. (DEFAULT
              on Itanium architecture-based systems)

       The Intel C++ Compiler includes gcc-compatible versions of the
       -tpp options, listed as follows:

       -mcpu=<cpu>
              Optimize for a specific cpu, where <cpu> is one of the
              following:

              itanium  -- Optimize for Intel Itanium processor
              itanium2 -- Optimize for Intel Itanium 2 processor

Interprocedural Optimizations (IPO)

   Enable and Specify the Scope of IPO

       -ip
              Enable single-file IP optimizations (within files). With
              this option, the compiler performs inline function
              expansion for calls to functions defined within the
              current source file.

       -ipo
              Enable multi-file IP optimizations (between files). When
              you use -ipo to specify multifile IPO, the compiler
              performs inline function expansion for calls to
              functions defined in separate files.
              For this reason, it is important to compile the entire
              application or multiple, related source files together
              when you specify -ipo.

       -ipo_c
              Generate a multi-file object file (ipo_out.o) that can
              be used in further link steps.

       -ipo_S
              Generate a multi-file assembly file (ipo_out.s) that can
              be used in further link steps.

   Modify the Behavior of IPO

       -ip_no_inlining
              Disables inlining that would result from the -ip
              interprocedural optimization, but has no effect on other
              interprocedural optimizations.

       -ipo_obj
              Force the compiler to create real object files when used
              with -ipo (requires -ipo).

       -nolib_inline
              Disable inline expansion of intrinsic functions.

       -auto_ilp32
              Specifies that the application should run within a
              32-bit address space. Also tells the compiler to use
              32-bit pointers whenever possible. To use this option,
              you must specify -ipo.

Profile-guided Optimizations (PGO)

       -prof_gen[x]
              Instructs the compiler to produce instrumented code in
              your object files in preparation for instrumented
              execution. With the x qualifier, extra information is
              gathered. This option is used in Phase 1 of PGO.
              Parallel make is automatically supported for -prof_genx
              compilations.

       -prof_use
              Instructs the compiler to produce a profile-optimized
              executable and merge available dynamic information
              (.dyn) files into a pgopti.dpi file. Use the -prof_use
              option in Phase 3 of PGO.

       -prof_dir
              Specify directory for profiling output files (*.dyn and
              *.dpi). Use the -prof_dir option with -prof_gen as
              recommended for most programs, especially if the
              application includes source files located in multiple
              directories. -prof_dir ensures that the profile
              information is generated in one consistent place.

       -prof_file
              Specify file name for profiling summary file.

       -fnsplit[-]
              Enable [disable] function splitting.
              Function splitting is enabled by -prof_use in Phase 3 to
              improve code locality by splitting routines into
              different sections: one section to contain the cold or
              very infrequently executed code, and one section to
              contain the rest of the code (hot code).

              You can use -fnsplit- to disable function splitting for
              the following reasons:

              Most importantly, to get improved debugging capability.
              In the debug symbol table, it is difficult to represent
              a split routine, that is, a routine with some of its
              code in the hot code section and some of its code in the
              cold code section. The -fnsplit- option disables the
              splitting within a routine but enables function
              grouping, an optimization in which entire routines are
              placed either in the cold code section or the hot code
              section. Function grouping does not degrade debugging
              capability.

              Another reason can arise when the profile data does not
              represent the actual program behavior, that is, when the
              routine is actually used frequently rather than
              infrequently.

High-level Language Optimizations (HLO)

       -O3
              Turn on high-level optimizations. Enable -O2 plus more
              aggressive optimizations, such as loop transformation
              and prefetching. -O3 optimizes for maximum speed, but
              may not improve performance for some programs.

       -unroll<n>
              The only allowed value for n is 0. Disable loop
              unrolling.

       -ivdep_parallel
              For Itanium architecture-based applications, the
              -ivdep_parallel option indicates there is absolutely no
              loop-carried memory dependency in the loop where the
              IVDEP directive is specified. This technique is useful
              for some sparse matrix applications.

Auto Parallelization Options

       -parallel
              Enable the auto-parallelizer to generate multi-threaded
              code for loops that can be safely executed in parallel.
              The -parallel option enables the auto-parallelizer if
              either the -O2 or -O3 optimization option is also on
              (the default is -O2).
              The -parallel option detects loops that can be executed
              safely in parallel and automatically generates
              multithreaded code for these loops.

       -par_report[n]
              Control the diagnostic messages from the
              auto-parallelizer, where n is one of the following:

              0 -- No diagnostic information is displayed.
              1 -- Indicates loops successfully auto-parallelized
                   (DEFAULT). Issues a "LOOP AUTO-PARALLELIZED"
                   message for parallel loops.
              2 -- Indicates successfully auto-parallelized loops as
                   well as unsuccessful loops.
              3 -- Same as 2 plus additional information about any
                   proven or assumed dependencies inhibiting
                   auto-parallelization (reasons for not
                   parallelizing).

       -par_threshold[n]
              Set a threshold for the auto-parallelization of loops
              based on the probability of profitable execution of the
              loop in parallel. This option is used for loops whose
              computation work volume cannot be determined at
              compile-time. The threshold is usually relevant when the
              loop trip count is unknown at compile-time.
              n=0-100. (DEFAULT: n=75)

              The compiler applies a heuristic that tries to balance
              the overhead of creating multiple threads against the
              amount of work available to be shared amongst the
              threads.

Conformance Options

       -ansi_alias
              Directs the compiler to assume that the program adheres
              to the rules defined in the ISO C Standard. If your
              program adheres to these rules, this option allows the
              compiler to optimize more aggressively. If it does not
              adhere to these rules, the option can cause the compiler
              to generate incorrect code.

       -Xc, -ansi
              Select strict ANSI C/C++ conformance dialect.

       -Xa
              (Itanium architecture-based systems) Select extended
              ANSI C dialect.

       -c99[-]
              Enable (DEFAULT) [disable] C99 support for C programs.

       -std=c99
              Enable C99 support for C programs.

       -mp
              This option is described above in Floating Point
              Optimization Options.
Miscellaneous Optimization Options

       -qp, -p
              Compile and link for function profiling with the Linux
              gprof* tool.

       -alias_args[-]
              This option implies arguments may be aliased [not
              aliased]. DEFAULT: -alias_args

       -mserialize-volatile
              (Itanium architecture-based systems) Enable strict
              memory access ordering for volatile data object
              references.

       -mno-serialize-volatile
              (Itanium architecture-based systems) Memory access
              ordering for volatile data object references may be
              suppressed.

       -complex_limited_range
              This option causes the compiler to use the highest
              performance formulations of complex arithmetic
              operations, which may not produce acceptable results for
              input values near the top or bottom of the legal range.
              Without this option, the compiler uses a better
              formulation of complex arithmetic operations, which
              produces acceptable results for the full range of input
              values, at some loss in performance.

Language Options

       -[no]restrict
              Enable [disable] the 'restrict' keyword for
              disambiguating pointers.

       -Kc++
              Compile all source or unrecognized file types as C++
              source files.

       -fno-rtti
              Disable RTTI support.

       -[no]align
              (IA-32 systems only) Analyze and reorder memory layout
              for variables and arrays.

       -Zp[n]
              Specify alignment constraint for structure and union
              types, where n is one of the following: 1, 2, 4, 8, 16.

       -fshort-enums
              Allocate as many bytes as needed for enumerated types.

       -fsyntax-only
              Same as -syntax.

       -funsigned-char
              Change default char type to unsigned.

       -funsigned-bitfields
              Change default bitfield type to unsigned.

Linker Options

       -L
              Instruct linker to search for libraries.

       -i_dynamic
              Link Intel provided libraries dynamically.

       -dynamic-linker
              Select a dynamic linker (filename) other than the
              default.

       -mrelax
              Pass -relax to the linker.

       -mnorelax
              Do not pass -relax to the linker.

       -no_cpprt
              Do not link in C++ runtime libraries.

       -cxxlib-gcc
              Use C++ header files provided by gcc during compilation,
              and use C++ libraries provided by gcc. Also, see gcc
              Interoperability above.
       -cxxlib-icc
              Use C++ libraries provided by the Intel compiler.

       -gcc-name
              Use this option to specify the location of g++ when the
              compiler cannot locate the gcc C++ libraries. For use
              with the -cxxlib-gcc configuration.

       -nodefaultlibs
              Do not use standard libraries when linking.

       -nostartfiles
              Do not use standard startup files when linking.

       -nostdlib
              Do not use standard libraries and startup files when
              linking.

       -static
              Prevent linking with shared libraries. Causes the
              executable to link all libraries statically, as opposed
              to dynamically.

       -shared
              Produce a shared object.

       -static-libcxa
              Link Intel libcxa C++ library statically. By default,
              the Intel-provided libcxa C++ library is linked in
              dynamically. Use -static-libcxa on the command line to
              link libcxa statically, while still allowing the
              standard libraries to be linked in by the default
              behavior.

       -shared-libcxa
              Link Intel libcxa C++ library dynamically, overriding
              the static linking behavior when -static is used. This
              option has the opposite effect of -static-libcxa.

       -u <symbol>
              Pretend the <symbol> is undefined.

       -T <file>
              Direct linker to read link commands from <file>.

       -Xlinker <val>
              Pass <val> directly to the linker for processing.

       -Wl,o1[,o2,...]
              Pass options o1, o2, etc. to the linker for processing.

ENVIRONMENT VARIABLES

       You can customize your environment by editing the following
       environment variables. You can specify paths where the compiler
       can search for special files such as libraries and include
       files.

       LD_LIBRARY_PATH
              Specifies the location for all Intel-provided libraries.
              The LD_LIBRARY_PATH environment variable contains a
              colon-separated list of directories in which the linker
              will search for library (.a) files. If you want the
              linker to search additional libraries, you can add their
              names to LD_LIBRARY_PATH, to the command line, to a
              response file, or to the configuration file.
              In each case, the names of these libraries are passed to
              the linker before the names of the Intel libraries that
              the driver always specifies.

       PATH
              Specifies the directories the system searches for binary
              executable files.

       ICCCFG
              Specifies the configuration file for customizing
              compilations with the icc compiler.

       ICPCCFG
              Specifies the configuration file for customizing
              compilations with the icpc compiler.

       TMP
              Specifies the directory in which to store temporary
              files. If the directory specified by TMP does not exist,
              the compiler places the temporary files in the current
              directory.

       IA32ROOT
              (IA32-based systems) Points to the directory containing
              the bin, lib, include and substitute header directories.

       IA64ROOT
              (Itanium architecture-based systems) Points to the
              directory containing the bin, lib, include and
              substitute header directories.

GNU* Environment Variables

       The Intel C++ Compiler supports the following GNU environment
       variables:

       CPATH
              Specifies a list of directories to search following the
              directories specified by -I.

       C_INCLUDE
              Specifies a list of directories to search following the
              directories specified by -isystem.

       CPLUS_INCLUDE_PATH
              Same as C_INCLUDE.

       DEPENDENCIES_OUTPUT
              Specifies how to output dependencies for Make based on
              preprocessed, non-system header files.

       SUNPRO_DEPENDENCIES
              Same as DEPENDENCIES_OUTPUT, except that system header
              files are not ignored.

Compilation Environment Options

       The Intel C++ Compiler installation includes shell scripts that
       you can use to set environment variables. See the Intel C++
       User's Guide for more information.

Standard OpenMP Environment Variables

       OMP_SCHEDULE
              Specifies the type of runtime scheduling.
              DEFAULT: static

       OMP_NUM_THREADS
              Sets the number of threads to use during execution.
              DEFAULT: Number of processors currently installed in the
              system while generating the executable

       OMP_DYNAMIC
              Enables (TRUE) or disables (FALSE) the dynamic
              adjustment of the number of threads.
              DEFAULT: FALSE

       OMP_NESTED
              Enables (TRUE) or disables (FALSE) nested parallelism.
              DEFAULT: FALSE

Intel Extensions to OpenMP Environment Variables

       KMP_LIBRARY
              Selects the OpenMP run-time library throughput. The
              options for the variable value are: serial, turnaround,
              or throughput indicating the execution mode. The default
              value of throughput is used if this variable is not
              specified.
              DEFAULT: throughput (execution mode)

       KMP_STACKSIZE
              Sets the number of bytes to allocate for each parallel
              thread to use as its private stack. Use the optional
              suffix b, k, m, g, or t, to specify bytes, kilobytes,
              megabytes, gigabytes, or terabytes.
              DEFAULT:
              IA-32 systems: 2m
              Itanium architecture-based systems: 4m

PGO Environment Variables

       The following environment values determine the directory in
       which to store dynamic information files or whether to
       overwrite pgopti.dpi.

       PROF_DIR
              Specifies the directory in which dynamic information
              files are created. This variable applies to all three
              phases of the profiling process.

       PROF_NO_CLOBBER
              Alters the feedback compilation phase slightly. By
              default, during the feedback compilation phase, the
              compiler merges the data from all dynamic information
              files and creates a new pgopti.dpi file if .dyn files
              are newer than an existing pgopti.dpi file. When this
              variable is set, the compiler does not overwrite the
              existing pgopti.dpi file. Instead, the compiler issues a
              warning and you must remove the pgopti.dpi file if you
              want to use additional dynamic information files.

       PROF_DUMP_INTERVAL
              Initiate Interval Profile Dumping in an instrumented
              application. The
              _PGOPTI_Set_Interval_Prof_Dump(int interval) function
              activates Interval Profile Dumping and sets the
              approximate frequency at which dumps will occur. The
              interval parameter is measured in milliseconds and
              specifies the time interval at which profile dumping
              will occur. An alternative method of initiating Interval
              Profile Dumping is by setting this environment variable.
              Set this environment variable to the desired interval
              value prior to starting the application.

COPYRIGHT INFORMATION

       Copyright (C) 1985-2003, Intel Corporation. All rights
       reserved.

       * Other brands and names are the property of their respective
       owners.

Portability flags for SPEC CPU2000:

       -Dalloca=__builtin_alloca
              Replace occurrences of alloca() with __builtin_alloca.

       -DSPEC_CPU2000_LP64
              Compile using LP64 programming model.

       -D_LIBC
              Don't include GNU C libraries.

       -DLINUX_i386
              Linux Intel system, use "long long" as 64bit variable.

       -DHAS_ERRLIST
              Prog env provides specification for "sys_errlist[]".

       -DFMAX_IS_DOUBLE
              Specifies whether FMAX is double or float.

       -DSPEC_CPU2000_NEED_BOOL
              Use SPEC provided definition of the boolean type.

       -DSPEC_CPU2000_LINUX_IA64
              Compile for an IA64 system running Linux.

       -DPSEC_CPU2000_GLIBC22
              Compatibility with 2.2 & later versions of glibc.

       -DSYS_IS_USG
              Specifies that the operating system is USG compliant.

       -DSYS_HAS_TIME_PROTO
              Do not explicitly declare time().

       -DSYS_HAS_SIGNAL_PROTO
              Do not explicitly #include

       -DSYS_HAS_IOCTL_PROTO
              Do not explicitly declare ioctl().

       -DSYS_HAS_ANSI
              System is ANSI compliant.

       -DSYS_HAS_CALLOC_PROTO
              Do not explicitly declare calloc().

       -FI
              Fixed-format F90 source code.

PEXEC

NAME
       pexec, pcreate, pchange - tools to control placement of
       processes on CPUs

SYNOPSIS
       pexec -np N [-cf FILE]
             [--mode=s[equential]|c[yclic]|u[ser-defined]]
             [--width=w] [--align] [--strict|--nostrict]
             [--auto-clean|--noauto-clean] PROGRAM [PROGRAM_ARGS]

       pcreate -np N [-cf FILE]
             [--mode=s[equential]|c[yclic]|u[ser-defined]]
             [--width=w] [--align] [--strict|--nostrict]
             [--auto-clean|--noauto-clean]

DESCRIPTION
       pexec and pcreate allocate an area containing N CPUs. The
       allocation policy can be sequential, but also cyclic, to spread
       processes in various ways among processors. If you want to use
       some of these advanced features, you must define the geometry
       of your machine.
       The basic view of processors is a rectangle (w is the width of
       the rectangle):

              Cpus
              ______________________
              |0   |1   |2   |3   |..
              |w   |w+1 |w+2 |w+3 |..
              |2w  |2w+1|2w+2|2w+3|..
              |... |    |    |    |

       Processes can be assigned using three methods:

       - horizontal ranges (sequential mode)
         CPUs will be taken in order 0, 1, 2, 3, ..., w, w+1, w+2, ...

       - vertical ranges (cyclic mode)
         CPUs will be taken in order 0, w, 2w, 3w, ..., 1, w+1, 2w+1,
         ...

       - a user-defined method
         CPUs will be taken in the order defined by the user

       Then, pexec runs PROGRAM in this area. To run programs in areas
       created using pcreate, use passign(1).

       Programs launched within this environment will be able to run
       on any CPU of the area. Within these programs, any call to
       sched_setaffinity(2) will bind processes to CPUs inside this
       set of CPUs. So basically, they should try to bind the first
       process to CPU 0, the second to CPU 1, ...

       After the execution of PROGRAM, the area is destroyed by pexec.
       To destroy areas created with pcreate, use pdestroy(1).

OPTIONS
       -np N
              Specifies the number of CPUs to allocate.

       -cf FILE
              Read FILE to get the configuration. See pexec.conf(5)
              for further information. If not specified, pexec will
              read /etc/pexec.conf.

       --mode=s[equential]|c[yclic]|u[ser-defined]
              Selects the process binding policy.

              sequential : processes will be bound to a continuous
              range of CPUs.

              cyclic : processes will be bound to a section of CPUs,
              according to the width value. Example: if you request 4
              CPUs on a 16-CPU machine with width=4, processes will be
              bound to processors n, n+4, n+8 and n+12 (depending on
              free resources).

              user-defined : processes will be bound in the order
              defined by --order.

       --width=w
              Defines the geometry of your processors. It has an
              impact on the cyclic mode, and also on alignments.

       --align
              Allocated processors will be in the same physical row of
              your machine.
              This option, which only works in sequential mode, tries
              to group processes as much as possible so that they use
              the same rows of processors.

       --[no]strict
              The created area will be exclusive. No other area will
              be able to share CPUs with this one. If this is
              impossible, the program will exit.

       --[no]auto-clean
              The created area may be automatically destroyed when
              there is no remaining process using it. Use this option
              to set this. By default, pcreate uses noauto-clean and
              pexec uses auto-clean.

       --order="a b c d .."
              The area will contain the first N CPUs of this set (N
              being given by -np). If there are not enough available
              CPUs among the ones on the list, the allocation will
              fail.

EXAMPLES
       To launch a job on 6 consecutive reserved processors:

              pexec -np 6 --mode=s --strict

       (processes should be assigned to processors 0 to 5)

       To launch another job on 4 other processors:

              pexec -np 4 --mode=s --strict

       (processes should be allocated to processors 6 to 9)

       If you had wanted to launch the job on processors of the same
       row:

              pexec -np 4 --align --strict --mode=s

       (so, processes should be allocated to processors 8 to 11
       instead of 6 to 9)

       To create an area containing the first, the fourth and the
       seventh CPU:

              pcreate -np 3 --mode=u --order="0 3 6"

SEE ALSO
       pexec.conf(5), pdestroy(1), passign(1)

APPLICATION TO SPECCPU
       submit = pexec -np 1 --mode=u --order="$SPECUSERNUM" $command

       Each speccpu process will be assigned to the processor chosen
       by $SPECUSERNUM.
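
       The sequential, cyclic and user-defined orderings described
       under DESCRIPTION can be sketched in a few lines of Python.
       This is only a model of the documented CPU order, not the
       pexec implementation: the function names, the busy parameter
       and the rule of skipping CPUs that are already taken are
       assumptions made for illustration, and --align is not modelled.

```python
def cpu_order(total_cpus, width, mode, order=None):
    """Return the order in which CPUs are considered for a mode."""
    if mode == "sequential":
        # Horizontal ranges: 0, 1, 2, ..., w, w+1, ...
        return list(range(total_cpus))
    if mode == "cyclic":
        # Vertical ranges: 0, w, 2w, ..., then 1, w+1, 2w+1, ...
        return [cpu
                for column in range(width)
                for cpu in range(column, total_cpus, width)]
    if mode == "user-defined":
        # Order given explicitly, as with --order="a b c d .."
        return list(order)
    raise ValueError("unknown mode: %r" % (mode,))


def allocate(n, total_cpus, width, mode, order=None, busy=()):
    """Pick the first n free CPUs in the order of the chosen mode.

    Assumption: CPUs listed in 'busy' belong to another strict area
    and are skipped. Fails if fewer than n CPUs remain.
    """
    taken = set(busy)
    free = [cpu for cpu in cpu_order(total_cpus, width, mode, order)
            if cpu not in taken]
    if len(free) < n:
        raise RuntimeError("not enough available CPUs")
    return free[:n]


# 4 CPUs on a 16-CPU machine with width=4, cyclic mode.
print(allocate(4, 16, 4, "cyclic"))
# Second sequential job after CPUs 0 to 5 were reserved.
print(allocate(4, 16, 4, "sequential", busy=range(6)))
# Equivalent of: pcreate -np 3 --mode=u --order="0 3 6"
print(allocate(3, 16, 4, "user-defined", order=[0, 3, 6]))
```

       Under these assumptions the three calls give CPUs 0, 4, 8, 12,
       then 6 to 9, then 0, 3, 6, which matches the cyclic example in
       OPTIONS and the allocations listed in EXAMPLES.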