Intel's compilers generate highly optimized code for Intel CPUs. As with all compilers, programs compiled with optimization should have their output double-checked for accuracy. If the numeric output is incorrect or lacks the desired accuracy, less aggressive compile options should be tried. The following table summarizes some relevant commands on the SCC:
|module avail intel||List available versions of the Intel compiler.|
|module load intel/2016||Load a particular version.|
|icc||C and C++ compiler.|
|ifort||Fortran compiler.|
Both compilers use the same optimization flags, and both compilers have manuals available:
man icc
man ifort
Intel also has a document that makes recommendations for optimization options.
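The advice above to double-check optimized output can be sketched as a small shell helper. This is a minimal sketch, not an SCC-provided tool: the function name outputs_match, the file names, and the assumption that each program writes one number per line (with both files the same length) are all illustrative.

```shell
#!/bin/sh
# outputs_match REF NEW [TOL]   (hypothetical helper, not an SCC or Intel tool)
# Succeeds when every value in NEW agrees with the value on the same line of
# REF to within a relative tolerance TOL (default 1e-9).
# Assumes one number per line and the same number of lines in both files.
outputs_match() (
    ref=$1; new=$2; tol=${3:-1e-9}
    paste "$ref" "$new" | awk -v tol="$tol" '
        {
            diff  = $1 - $2; if (diff < 0)  diff  = -diff
            scale = $1;      if (scale < 0) scale = -scale
            if (scale < 1) scale = 1
            if (diff > tol * scale) {
                printf "line %d: %s vs %s\n", NR, $1, $2
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }'
)

# Typical use (file names are placeholders):
#   icc -O0 mycode.c -o mycode_ref && ./mycode_ref > ref.txt
#   icc -O3 mycode.c -o mycode_opt && ./mycode_opt > opt.txt
#   outputs_match ref.txt opt.txt 1e-9 || echo "optimized output differs"
```

The tolerance is relative for values larger than 1 and absolute otherwise; an appropriate value depends on the application.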
General Compiler Optimization Flags
The Intel compilers' optimization flags deliberately mimic many of those used with the GNU family of compilers. The basic optimization flags are summarized below. Using these flags does not result in any incompatibility between CPU architectures. Note that using the Intel compiler is not recommended when the program will be run on AMD processors, because the resulting executables perform poorly there.
|-O2||More extensive optimization. Recommended by Intel for general use.|
|-O3||More aggressive than -O2 with longer compile times. Recommended for codes with loops involving intensive floating-point calculations.|
|-Ofast||-O3 plus some extras.|
|-ipo||Interprocedural optimization, a step that examines function calls between files when the program is linked. This flag must be used both when compiling and when linking. Compile times are very long with this flag; however, depending on the application, there may be appreciable performance improvements when it is combined with the -O* flags.|
|-mtune=processor||Performs additional tuning for a specific processor type. It does not generate extra SIMD instructions, so there are no architecture compatibility issues. The tuning involves optimizations for processor cache sizes, preferred ordering of instructions, and so on. The useful values for processor on the SCC are: broadwell, haswell, ivybridge, sandybridge, or corei7.|
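Because -ipo has to be present at both the compile and the link step, a separate-compilation build might look like the following sketch. The function name build_with_ipo, the default -O3 level, and the file names are assumptions made for illustration, and source file names are assumed to contain no spaces.

```shell
#!/bin/sh
# build_with_ipo OUTPUT SOURCE...   (hypothetical helper)
# Compiles each source file to an object file, then links them, passing
# -ipo at both stages. CC and OPT may be overridden from the environment;
# the defaults (icc, -O3) are assumptions for this sketch.
build_with_ipo() (
    cc=${CC:-icc}
    opt=${OPT:--O3}
    out=$1; shift
    objs=""
    for src in "$@"; do
        obj=${src%.*}.o
        "$cc" $opt -ipo -c "$src" -o "$obj" || exit 1
        objs="$objs $obj"
    done
    # -ipo appears again here so the link step can optimize across files.
    "$cc" $opt -ipo $objs -o "$out"
)

# Example (hypothetical file names):
#   build_with_ipo myprog main.c util.c
```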
Flags to Specify SIMD Instructions
These flags will produce executables that contain specific SIMD instructions, which may affect compatibility with compute nodes on the SCC.
|-xHost||Must be used with at least -O2. Creates an executable that uses SIMD instructions based on the CPU that is compiling the code.|
|-fast||A combination of -Ofast, -ipo, -static (for static linking), and -xHost.|
|-xarch||Must be used with at least -O2. Specifies the type of SIMD instructions to be generated. The useful values for arch on the SCC are: SSE4.2, AVX, CORE-AVX-I, CORE-AVX2|
|-axarch||Must be used with at least -O2 and with -xarch. The -xarch flag produces a baseline set of SIMD instructions, and the -axarch flag adds support for additional instruction sets. Every function that can be compiled with SIMD instructions will have a separate copy created for each instruction set, and the executable detects the CPU's instruction support at runtime to choose which version to run. Compile times can be very long because functions are compiled multiple times, and the resulting binary will be large. The useful values for -ax are the same as for -x. Several instruction sets can be included with this flag as a comma-separated list. For example: icc -c -O3 -xSSE4.2 -axAVX,CORE-AVX2 mycode.cpp|
Default Optimization Behavior
Most open source programs that compile from source code use the -O2 or -O3 flags. This will result in fast code that can run on any compute node on the SCC. The -fast flag can be problematic (due to its inclusion of the -xHost flag) when used on the login nodes, as they have Broadwell architecture CPUs that support AVX2 instructions. Codes compiled there with -fast will only be able to execute on Broadwell architecture compute nodes on the SCC.
Most codes will be well-optimized with the -O2 or -O3 flags. Programs that involve intensive floating-point calculations inside of loops can additionally be compiled with the -xarch flag. For maximum compatibility across the SCC compute nodes and likely the highest performance, a combination of flags should be used:
icc -Ofast -xSSE4.2 -axAVX,CORE-AVX2 -c mycode.cpp
If benchmarking and testing of the compiled code does not show any improvement with the -x and -ax flags then they can be removed to improve compilation times.
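A rough way to run such a benchmark is to time each build over several runs and compare the means. The helper below is only a sketch: the name time_run and the binary names are hypothetical, and it assumes GNU date (for the %N format), which is standard on Linux.

```shell
#!/bin/sh
# time_run N CMD [ARGS...]   (hypothetical helper)
# Runs CMD N times and prints the mean wall-clock time per run in seconds.
# Assumes GNU date, which supports %N (nanoseconds).
time_run() (
    n=$1; shift
    start=$(date +%s.%N)
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" > /dev/null 2>&1
        i=$((i + 1))
    done
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" -v n="$n" \
        'BEGIN { printf "%.3f\n", (e - s) / n }'
)

# Example (hypothetical binaries built without and with the -x/-ax flags):
#   time_run 5 ./mycode_plain
#   time_run 5 ./mycode_simd
```

Timings on a shared login node can be noisy, so measurements taken inside a batch job on the target node type are likely to be more representative.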
Note that selecting specific SIMD instructions with the -xarch flag alone will restrict compatibility with compute nodes unless the job is submitted with this qsub flag: -l cpu_arch=compatible_arch. The compatible_arch value is an architecture name that matches the SIMD instructions. In this example a code is compiled with AVX instructions and a Haswell architecture CPU is requested with qsub:
icc -Ofast -xAVX mycode.cpp -o mycode
qsub -l cpu_arch=haswell -b y mycode
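As an extra safeguard, a job script could confirm that the node it landed on actually reports the instruction set the code was built for before running it. The sketch below reads the Linux /proc/cpuinfo feature flags; the function name cpu_supports and the mapping from Intel -x values to Linux flag names are this example's own, and only three values are handled.

```shell
#!/bin/sh
# cpu_supports ARCH [CPUINFO]   (hypothetical helper)
# Succeeds if the CPU described in CPUINFO (default /proc/cpuinfo) reports
# the Linux feature flag that corresponds to the given -x/-ax value.
# Only SSE4.2, AVX, and CORE-AVX2 are mapped in this sketch.
cpu_supports() (
    info=${2:-/proc/cpuinfo}
    case $1 in
        SSE4.2)    flag=sse4_2 ;;
        AVX)       flag=avx ;;
        CORE-AVX2) flag=avx2 ;;
        *) echo "unknown arch: $1" >&2; exit 2 ;;
    esac
    # The "flags" line lists features separated by spaces; split it into
    # one word per line and match the whole word (so avx does not match avx2).
    grep -m 1 '^flags' "$info" | tr ' \t' '\n\n' | grep -qx "$flag"
)

# Example at the top of a job script:
#   cpu_supports AVX || { echo "this node lacks AVX" >&2; exit 1; }
```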
If a code is relatively small in scope it can be compiled as part of a queue job. For example, a job that is submitted to run on a Buy-in node equipped with an Ivybridge architecture CPU could be compiled with tunings for that node. As a precaution the source is copied into $TMPDIR: