Architecture of processors

PROTEUS is a cluster of nodes with x86_64 Intel Xeon processors, connected by a dual communications network (1 GbE and InfiniBand FDR). It also has two NVIDIA Tesla graphics cards for GPGPU computing.

For storage, it has a Ceph distributed file system for general use and a Lustre-based file system for applications requiring higher bandwidth.



The current composition of PROTEUS is the result of a series of extensions and improvements. New nodes have been added alongside the existing ones, ensuring that they coexist smoothly.

Currently, we have the following families of processors (labelled with the nicknames we have assigned to them):

Nick        Architecture (model)        #Nodes   #Cores/Node   RAM (GB)   Year
Artemis     Clovertown (E5345)              10             8          4   2007
Calypso     Harpertown (E5410)              51             8          8   2008
Kratos      Westmere (X5690)                42            12         48   2012
Hermes v1   Haswell (E5-2660 v3)            13            20         64   2015
Hermes v2   Broadwell (E5-2640 v4)           4            20         64   2015
Metis v1    Skylake (Gold 6132)              8            28         96   2019
Metis v2    Cascade Lake (Gold 6226)        40            24         96   2019

We also have two nodes with a larger amount of memory, for applications that need it:

Nick       Architecture (model)        #Nodes   #Cores/Node   RAM (GB)   Year
Hermes00   Haswell (E5-2698 v3)             1            32        256   2015
Metis00    Cascade Lake (Gold 6226)         1            24        386   2019



These different families have very different performance. The same program can take different times to complete depending on the node it is running on. To help estimate execution time, a comparison of the computational power of the CPUs in each family has been made.
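As a rough illustration of how such a comparison can be used, the expected runtime of a purely CPU-bound job can be scaled by the ratio of node throughputs. This is only a sketch: the GFLOPS figures are the measured HPL "complete node" results listed further down this page, the function name is our own, and real programs rarely scale this cleanly.

```python
# Rough runtime estimate when moving a CPU-bound job between families,
# assuming runtime is inversely proportional to node throughput.
def estimated_time(hours_on_a, gflops_a, gflops_b):
    """Estimated hours on node B, given the runtime on node A."""
    return hours_on_a * gflops_a / gflops_b

# A 10-hour job on a Kratos node (151.6 GFLOPS, measured with HPL)
# would take roughly 1 hour on a Metis v2 node (1534.51 GFLOPS).
print(estimated_time(10, 151.6, 1534.51))
```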

The computing power of a system is usually measured in FLOPS (Floating Point Operations per Second), normally counted in double precision. This measure is independent of the CPU instruction set and is used to compare systems with different architectures.

The computing power of today’s processors reaches billions of operations per second, so the multiple GFLOPS is often used.



We can calculate the theoretical computing power of a CPU if we know the details of its architecture. This value is known as the theoretical maximum or Rpeak. The simplest way to calculate it is to multiply the number of cores by their frequency and by the maximum number of double-precision floating-point operations each core can perform per cycle. Thus, for a node, the expression would be:

Rpeak (GFLOPS) = #sockets · #cores/socket · GHz · #ops/cycle

However, this is the theoretical upper limit and is very unlikely to be reached in real situations. The theoretical maximum helps us to know to what degree we are exploiting the CPU’s potential.
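As a quick check, the formula above can be evaluated in a few lines. In this sketch, the 2.6 GHz base clock and 32 ops/cycle for the Metis v1 Gold 6132 are taken from Intel's published specifications, and the result matches the Rpeak table further down:

```python
# Sketch of the Rpeak formula above.
def rpeak_gflops(sockets, cores_per_socket, ghz, ops_per_cycle):
    """Theoretical peak in GFLOPS for one node."""
    return sockets * cores_per_socket * ghz * ops_per_cycle

# Metis v1 node: 2 sockets of Gold 6132 (14 cores, 2.6 GHz base clock,
# AVX-512 with two FMA units -> 32 double-precision ops per cycle).
print(rpeak_gflops(2, 14, 2.6, 32))   # ~2329.6 GFLOPS per node
print(rpeak_gflops(1, 1, 2.6, 32))    # ~83.2 GFLOPS for a single core
```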



Performance tests are programmes created to test the characteristics of a computer system. They come in many types and can measure different aspects of the machine, such as the storage system, memory bandwidth or graphics processing, or the system as a whole.

In this scientific computing environment, the most relevant aspects are the computing power of the CPU and the bandwidth to main memory. To evaluate these components, two benchmarks widely used in HPC environments have been used: Linpack and HPCG.

Linpack is a library for solving linear algebra problems and has long been used as a CPU benchmark. HPL (High Performance Linpack) is a parallel, optimised version of it, used to measure the performance of the supercomputers that appear on the TOP500 list.

HPCG (High Performance Conjugate Gradient) is a newer benchmark, also used on supercomputers, created to model the way real programs access main memory. Memory accesses often act as a bottleneck, so this benchmark achieves only a fraction of the CPU’s raw power. It complements Linpack, which is more compute-bound.



The following tables show the results obtained by calculating the theoretical maximum and evaluating the nodes using the Linpack and HPCG benchmarks.

Two evaluations have been performed: one using a single core and the other using all the cores contained in the node.

Theoretical maximum – Rpeak (GFLOPS)


Calculated with the above formula, and taking into account the number of double-precision operations per cycle for each architecture (4 for the SSE generations up to Westmere, 16 for Haswell and Broadwell with AVX2 and FMA, 32 for Skylake and Cascade Lake with AVX-512), we obtain the following Rpeaks:

Nick        Single core   Complete node
Artemis            9.32           74.56
Calypso            9.32           74.56
Kratos            13.88          166.56
Hermes v1          41.6             832
Hermes v2          38.4             768
Metis v1           83.2          2329.6
Metis v2           86.4          2073.6



We used the version of HPL offered by Intel, which is specially optimised for its processors. The tests were performed using a single core and then all the cores available in the node. In addition, for those CPUs that support Turbo Boost, results were collected with this option enabled and disabled. The percentage columns express each measured result as a fraction of the corresponding Rpeak:

Nick        Single core       %   No Turbo      %   Complete node       %
Artemis            8.44    90.5          –      –           50.41    67.6
Calypso            8.87    95.1          –      –           66.09    88.6
Kratos            14.09   102.5       13.3   95.8           151.6      91
Hermes v1         42.68   109.7      38.99   93.7           671.2    80.6
Hermes v2             –       –          –      –               –       –
Metis v1              –       –          –      –               –       –
Metis v2              –       –          –      –         1534.51      74
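As a sanity check on the percentage columns, here is a minimal sketch; the 2073.6 GFLOPS Rpeak for a Metis v2 node comes from the Rpeak table above, and the function name is our own:

```python
# The % columns express measured HPL throughput as a share of Rpeak.
def efficiency_pct(measured_gflops, rpeak_gflops):
    return 100 * measured_gflops / rpeak_gflops

# Metis v2 complete node: 1534.51 GFLOPS measured vs 2073.6 GFLOPS Rpeak.
print(round(efficiency_pct(1534.51, 2073.6), 1))  # 74.0, as in the table
```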



This table is under construction… Sorry for the inconvenience.

Nick        Single core       %   No Turbo      %   Complete node       %
Hermes v1             –       –          –      –               –       –
Hermes v2             –       –          –      –               –       –
Metis v1              –       –          –      –               –       –
Metis v2              –       –          –      –               –       –
