Access and operation

This section explains how to work with PROTEUS. To do so, you need an account on the system. If you do not have one yet, the information below explains how to get one. For any questions, you can reach us through the channels listed in the “Contact” section.

Once you have your account, you can access and use PROTEUS as indicated below.

Getting an Account

Any member of the Institute Carlos I for Theoretical and Computational Physics (iC1) is entitled to use this computational service. In addition, the iC1 maintains collaboration and development agreements with other institutions and research centres.

If you meet the requirements, you can request an account by email, using the subject line “Request account Proteus”. In the message, indicate your name, a contact email address (to which notices and other news related to the cluster will be sent), your relationship with the iC1 and the username you would like to use.

If everything is in order, you will shortly receive a notification telling you that you can use the cluster. You will be assigned a password, which you must change on first access.

Working with PROTEUS

The cluster is configured to run applications reliably (fault tolerance, robustness, security, etc.) while remaining simple for users.

The system is designed to run programs that need large amounts of resources, whether CPU time, memory or disk. There are no restrictions on the use of these resources, nor on how long a program may run.

To simplify use of the cluster, a queue manager schedules and executes the jobs that users submit. The cluster has a gateway (the proteus.ugr.es node) from which jobs are sent to the execution queue. The queue manager decides which jobs to run (according to priorities), chooses the machine on which each one will run (according to availability), monitors the execution and collects results and any errors. This way, users do not have to look for an available machine, write scripts to run several jobs one after another, or keep checking when they finish: the queue manager takes care of all of it.

To compile programs, recent versions of the GNU (gcc, gfortran) and Intel (icc, ifort) compilers for C, C++ and Fortran are available. OpenMP and MPI libraries for parallel and distributed computing, as well as BLAS mathematical libraries optimized for the cluster architecture, are also available.

Any free software can be installed on request. If the software requires a license and the interested group provides it, it can also be installed, as long as all cluster users can make use of it.

Usage policy

The goal of this infrastructure is to achieve maximum efficiency and usage, so access has been kept as simple as possible.

For this reason, the separate queues for short, medium and long jobs, very common in systems of this type, have been removed. All programs run in a single queue, which makes the cluster easier to use for everyone.

Communication with users of the service is done by email to the address indicated at registration. Whenever there is news, it will be announced to everyone by this means. Announcements also appear on the homepage of this website, in the “Latest News” section.

Users are asked to acknowledge the computational services of the Instituto Carlos I in all publications and works obtained using these resources. This is very useful when requesting new grants and subsidies to extend and improve the hardware and to hire technical staff.

Access

To work on PROTEUS it is necessary to connect via SSH to its gateway:

$ ssh username@proteus.ugr.es

There are no restrictions on where you can connect from (unlike UGRGRID, which only allows access from particular IP addresses). There is, however, a security filter: if you mistype your username or password 3 times in a row, the system treats it as an attack and blocks your IP. In that case, you will need to contact the administrator to have it unblocked.

NOTE: proteus is the entry machine, but it does NOT execute jobs. Jobs must be submitted to the queue manager, which will run them on one of the execution machines.

Queue Manager

The queue manager responsible for managing the cluster resources and ensuring fair and equitable access for all members is HTCondor, open-source software developed and actively maintained by the University of Wisconsin-Madison. The decision of which program runs on which machine is made according to user priorities and program requirements.

The main features of HTCondor are:

  • control over job execution
  • logging and event notifications
  • ordering of job execution
  • priorities
  • checkpointing and failover

Execution Queue

There is a single queue to which all programs are submitted. No distinction is made between short- and long-running programs, nor by the amount of other resources they may need. This scheduling greatly simplifies use for everyone.

Program requirements

Programs may have different hardware requirements for execution, such as processor architecture, number of processors, amount of main memory or disk storage. These are specified in the job description file. The program will wait in the queue until the requested resources become available; obviously, the lower the requirements, the more likely they are to be free at any given moment. If they can never be satisfied, the program will remain queued indefinitely.
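As an illustration, this is a minimal sketch of what such a description file might contain (the lanzarv scripts described below normally generate it for you; the executable name, file names and values here are purely illustrative):

# illustrative HTCondor job description, normally generated by lanzarv
executable     = my_program
arguments      = input.dat
request_cpus   = 4
request_memory = 2048
output         = my_program.$(Cluster).out
error          = my_program.$(Cluster).err
log            = my_program.$(Cluster).log
queue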

Priorities

HTCondor enforces equitable use of the cluster resources among all users. It keeps a history of the resources consumed by each user and, based on it, computes each user's priority, so that users who have been using the cluster heavily receive a lower priority in favour of those who have used it less. In this way, in the long run all users can consume a similar share of the cluster's resources.

If a new job enters the queue and the resources it requests are in use by another job (either because the cluster is fully occupied or because they are specific resources available only on a few nodes), the priorities of the users currently running jobs on those resources are compared with that of the new job's owner. If the latter is higher, the running job is put back into the waiting state (it stops executing) and its place is taken by the new one.

To check your priority at any time, you can use the command condor_userprio, which returns a list of users along with their priority. The greater the value, the lower the effective priority.

You can also assign different priorities to your own jobs, giving preference to the ones you want to run first; that is, if you have several jobs, you can make some of them execute before others. To do this, use the command condor_prio.
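For example (the job identifier below is illustrative), you could list the current user priorities and then raise the priority of one of your queued jobs; note that for condor_prio, unlike user priorities, a larger value means the job is preferred:

condor_userprio
condor_prio -p 10 1234.0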

Program compilation

To run a program on the cluster, copy the source code to it and compile it there; this avoids problems caused by different library versions.
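For example, the source code can be copied to your home directory on the gateway with scp (the file name below is illustrative):

$ scp my_program.c username@proteus.ugr.es: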

The available compilers are the Intel and GNU suites for C, C++ and Fortran (icc and ifort; gcc and gfortran).

All PROTEUS processors are Intel x86_64, from several families. The environment is configured so that the Intel compilers generate code optimized for them. In addition, further optimizations can be enabled with the general optimization flags (-O1, -O2, -O3).

The Intel compilers are recommended, since we have observed empirically that the programs they produce run faster.
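For instance, a C or Fortran source file (the names here are illustrative) could be compiled on the gateway with:

$ icc -O2 -o my_program my_program.c
$ ifort -O2 -o my_simulation my_simulation.f90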

Program execution

Once the program is compiled, the next step is to submit it to the cluster. Several scripts are provided to make this task easier.

lanzarv

With lanzarv you can send a job to the queue automatically. Syntax:

lanzarv [-cCores] [-mMemory] executable [params]

where

  • -cCores is the number of CPUs (cores) requested for the job. Optional. Default: 1. Use this option only if your job is parallel; in that case, up to 32 cores can be requested.
  • -mMemory is the amount of memory requested (in MB). Optional. Default: 450 MB. HTCondor allocates this amount of memory to the program; if it is exceeded, the program is aborted.
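For example, a parallel job that needs 4 cores and 2 GB of memory (the executable name and arguments are illustrative) could be submitted as:

lanzarv -c4 -m2000 ./my_simulation input.dat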

If a job is removed from the queue for priority reasons or because of some error, when it resumes it would normally run again from the beginning, losing all its progress; it may even overwrite the files it had generated by starting over. For this reason, the default behaviour has been modified so that, if any failure occurs, the job is held, pending the intervention of its owner. This way, the work stored on disk up to the moment of the failure is not lost.

This mode is most suitable for jobs whose execution period is short (on the order of days).

lanzarv2

To avoid the problem described above, support for automatic checkpoints has been added: the execution state of the program is saved periodically and can be resumed if necessary. This also addresses cluster crashes (mainly due to power outages). Its usage is identical:

lanzarv2 [-cCores] [-mMemory] executable [params]

To use lanzarv2 (and its checkpoints), the executable must not be compiled with the -static option. Also, some programs cannot be checkpointed because of the kind of functions they use.
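For example, a long-running job (again with illustrative names) could be submitted with checkpointing as:

lanzarv2 -m2000 ./my_long_simulation input.dat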

This mode is appropriate for long-running programs (weeks or months).

When jobs are sent to the cluster with either of these scripts (lanzarv or lanzarv2), several files are generated automatically:

  • exename.submit: the job description file required by HTCondor
  • exename.idcondor.out: the job's standard output (what would appear on screen) is redirected to this file. The identifier is the execution number assigned by HTCondor, which prevents the file from being overwritten if the same job is submitted several times (see the example after this list)
  • exename.idcondor.err: stores any errors
  • exename.idcondor.log: a history of all the stages and events the job goes through
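For instance, assuming an executable called my_program whose submission was assigned HTCondor id 1234 (both names are illustrative), the generated files would be:

my_program.submit
my_program.1234.out
my_program.1234.err
my_program.1234.log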

lanzarc

lanzarc is the script to use for submitting PROTEUS programs that run, totally or partially, on graphics cards (programs written in CUDA or OpenCL). Its usage is similar to the previous scripts, except that the amount of resources is fixed: a machine with two NVIDIA Tesla C2050 graphics cards, 8 cores and 24 GB of RAM is available. This machine has been divided into 2 HTCondor slots, each of which has been assigned:

  • one Tesla card
  • 2 cores
  • 10 GB of RAM

Thus, each program submitted with lanzarc receives exactly those resources, and they cannot be changed.

These programs run exactly like any other within HTCondor, except that they can only run on one of these 2 slots. However, the program needs to know the ID of the graphics card it has been assigned in order to work with it. This can be obtained in two ways: through the environment variable GPU_DEVICE_ID, or as a command-line argument:

lanzarc executable --device=X [params]

Releasing failed jobs

If for some reason a job has failed, its execution can be resumed from its checkpoint (created with lanzarv2). For this, the following command is available:

relanzarv2 id

where id is the id number that HTCondor assigned to the job to be restored.

Program states

When a job is submitted, it enters the queue and waits. When it gets a “slot”, it moves to the running state. If an error occurs during the process, it moves to the held state. These are the three main HTCondor states, denoted by the letters I (Idle), R (Running) and H (Hold).

To check the status of your jobs, the command condor_q returns information about all programs in the queue. There is also the script mi_condor_q, which shows only information about your own jobs.

The information returned, by columns, is the following:
Id | User | Submit date | Run time | State | Priority | Memory used | Executable
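For example, to list only the jobs that are in the held (H) state, together with the reason HTCondor gives for holding them, you can run:

condor_q -hold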

Nodes execution state

It is also very useful to know the state of the resources, especially when submitting a large number of jobs. Besides the “State” section of this website, HTCondor provides a command that shows the current status of the nodes: condor_status. The output is a listing of each node and its cores, with its operating system and architecture (LINUX X86_64 for all of PROTEUS), its state (Unclaimed or Claimed), memory, etc., followed by a summary table with the number of free cores per node, the free memory and the free-memory-per-core ratio. This way, you can quickly see how many free cores there are and how much memory is available to them.
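For instance, to list only the slots that are currently available (unclaimed) and therefore able to accept new jobs, you can run:

condor_status -avail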

Removing jobs from the execution pool

If a job no longer needs to run, it can be removed from the HTCondor queue with:

condor_rm id

The job is stopped and disappears from the queue; the files it has generated are kept. To remove all of your jobs at once:

condor_rm -all

Email notifications

Changes, news, notices and errors that occur in the service will be communicated by email to the address given when you signed up for the service.

In addition to these notifications, HTCondor will inform you of your programs' events by means of an automatic email with their details. If you do not wish to receive these emails, you can ask the technical staff to disable them, using the email address on the Contact tab of the menu.