PGI Compiler

The PGI compiler family is produced by The Portland Group, which is owned by NVIDIA Corporation. It is available on the SCC. The following table summarizes the relevant commands on the SCC:

Command               Description
module avail pgi      List available versions of the PGI compiler.
module load pgi/16.5  Load a particular version.
pgcc                  C compiler.
pg++                  C++ compiler.
pgf90                 Fortran compiler.
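
As a quick check that the module has been loaded, the short helper below reports whether the three compiler drivers from the table are on the PATH. It is only a sketch: the function name check_pgi is made up for this example, and the hint it prints assumes the pgi/16.5 version shown above.

```shell
# Hypothetical helper: report which PGI compiler drivers are on the PATH.
check_pgi() {
    for tool in pgcc pg++ pgf90; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: not found (try: module load pgi/16.5)"
        fi
    done
}

check_pgi
```

After running module load pgi/16.5, all three drivers should be reported as found, and pgcc -V can then be used to print the version of the loaded compiler.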

The C/C++ and Fortran compilers use the same optimization flags, and both compilers have manuals available:

man pgcc
man pgf90

The Portland Group has an online reference manual that describes their compiler flags in detail.

General Compiler Optimization Flags

The basic optimization flags are summarized below.

Flag  Description
-O0   Optimization level 0. Usually for debugging.
-O1   Optimization level 1. Scheduling within extended basic blocks is performed. No global optimizations are performed. This is the default level if no optimization flag is specified.
-O    Optimization level 2. All level 1 optimizations are performed. In addition, traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer.
-O2   All -O optimizations are performed. In addition, more advanced optimizations such as SIMD code generation, cache alignment and partial redundancy elimination are enabled.
-O3   All -O1 and -O2 optimizations are performed. In addition, this level enables more aggressive code hoisting and scalar replacement optimizations that may or may not be profitable.
-O4   All -O1, -O2, and -O3 optimizations are performed. In addition, hoisting of guarded invariant floating point expressions is enabled.
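
One way to keep debug and optimized builds apart is to choose the optimization flags from a build type. The helper below is a minimal sketch of that idea; the function name pgi_opt_flags and the build-type names are hypothetical, and pairing -O0 with -g for debugging is a common convention rather than a requirement of the table above.

```shell
# Hypothetical helper: map a build type to PGI optimization flags.
pgi_opt_flags() {
    case "$1" in
        debug)   echo "-O0 -g" ;;   # no optimization, with debug symbols
        release) echo "-O2" ;;      # level 2 plus SIMD code generation
        *)       echo "unknown build type: $1" >&2; return 1 ;;
    esac
}

pgi_opt_flags debug
pgi_opt_flags release
```

With the pgi module loaded, the result can be passed straight to the compiler, for example: pgcc $(pgi_opt_flags debug) mycode.c -o myexe_debug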

Flags to Specify SIMD Instructions

These flags produce executables that contain specific SIMD instructions, which may affect compatibility with compute nodes on the SCC.

Flag                Description
-tp=nehalem-64      For Intel Nehalem architecture Core processors.
-tp=sandybridge-64  For Intel Sandy Bridge and Ivy Bridge architecture Core processors.
-tp=haswell-64      For Intel Haswell and Broadwell architecture Core processors.
-tp=bulldozer-64    For AMD Bulldozer processors.
-tp=x64             For all Intel 64-bit processors and AMD 64-bit processors.
-tp=px              For any x86-compatible processors (including all above).
-fast               Includes: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mcache_align -Mflushz -Mpre. Chooses generally optimal flags for the target platform and selects SIMD instructions that are available on the compiling computer.
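
To see which of the Intel targets in the table matches a given node, the CPU feature flags reported by Linux can be inspected. The sketch below maps a list of feature flags to a -tp value. The function name suggest_tp is hypothetical, and the mapping rests on assumptions: AVX2 first appeared with Haswell, AVX with Sandy Bridge, and SSE4.2 with Nehalem. It is meant for Intel nodes only, since AMD Bulldozer processors also report avx and would be misclassified.

```shell
# Hypothetical helper: suggest a -tp value from a space-separated list of
# CPU feature flags, such as the "flags" line of /proc/cpuinfo (Intel only).
suggest_tp() {
    case " $1 " in
        *" avx2 "*)   echo "-tp=haswell-64" ;;      # Haswell, Broadwell
        *" avx "*)    echo "-tp=sandybridge-64" ;;  # Sandy Bridge, Ivy Bridge
        *" sse4_2 "*) echo "-tp=nehalem-64" ;;      # Nehalem
        *)            echo "-tp=x64" ;;             # fall back to the portable target
    esac
}

suggest_tp "fpu sse2 sse4_2 avx avx2"
suggest_tp "fpu sse2 sse4_2 avx"
```

On a Linux node, the real flags can be passed in with: suggest_tp "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"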

Default Optimization Behavior

By default, the PGI compilers produce executables that are tuned for the architecture of the compiling computer. The CPU architecture of all of the login nodes on the SCC is Broadwell. This means that, without the -tp=x64 or -tp=px flag, an executable compiled on the SCC login nodes will only be compatible with the Broadwell architecture.

Recommendations

Here are recommendations for compiling codes on the SCC. Either the -tp=x64 or the -tp=px flag should be used for compute node compatibility. The -tp=x64 flag will generally produce faster code at the cost of longer compile times. The -tp=px flag will usually compile notably faster.

It is recommended that these flags be used to build executables on the SCC. Here is an example:

pgcc -fast -tp=x64 mycode.c -o myexe

The generated executable will run on any compute node, and it is optimized with most of the -fast flags; some are automatically removed if they conflict with -tp=x64. The -tp=x64 option produces a program with multiple execution paths, one for each of the supported architectures. The program selects the correct path when it is run on a compute node.

To build an optimized executable for Intel Broadwell on the login nodes on the SCC:

pgcc -fast mycode.c -o myexe

The executable is optimized for Intel Broadwell because the login nodes use that architecture and the -tp flag was not used. To run this executable, submit a batch job to a Broadwell compute node:

qsub -l cpu_arch=broadwell -b y ./myexe
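
The same request can also be kept in a job script instead of on the qsub command line. The snippet below writes a minimal script as a sketch. The file name run_myexe.qsub and the job name are placeholders chosen for this example, -N and -j are standard Grid Engine options, and loading the pgi module inside the job is an assumption that the executable needs the PGI runtime libraries.

```shell
# Write a minimal job script that requests a Broadwell node.
# run_myexe.qsub is a placeholder file name chosen for this sketch.
cat > run_myexe.qsub <<'EOF'
#!/bin/bash -l
# Request a Broadwell node, as in the qsub command above.
#$ -l cpu_arch=broadwell
# Job name (hypothetical) and merged stdout/stderr.
#$ -N myexe
#$ -j y

# Assumption: the PGI runtime libraries are needed at run time.
module load pgi/16.5
./myexe
EOF

cat run_myexe.qsub
```

The script would then be submitted with: qsub run_myexe.qsub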

Another option is to compile the code as part of a batch job, which avoids architectural issues entirely and allows the maximum level of optimization. For example, a job that is submitted to run on a Buy-in node equipped with an Ivy Bridge architecture CPU could be compiled with tunings for that node. As a precaution, the source is copied into $TMPDIR: