testing_libigcm_to_run_regipsl

Objectives of libIGCM

  • Automatically generates the scripts which submit the simulations needed for the experiment.
  • Manages the generated files in a standard way.
  • Provides a meta-level for configuring models for the experiment.
  • Post-processes the simulation on the fly.
  • Prepares quick-look graphics for monitoring the progress of the simulations.
  • Manages experiments on 3 computer centres : MesoCentre IPSL, IDRIS/Ada, TGCC/Curie

Elements of the libIGCM implementation

libIGCM provides a set of scripts in the directory libIGCM, placed at the same level as RegIPSL. This directory also contains the definition files for the various computer centres to which libIGCM has been ported. Finally, it contains a user-level script (ins_job) which generates the script to be launched on the computer.

For each experiment there is a directory which contains the following elements once the experiment is set up :

  • config.card : This is the file configuring the experiment and containing all the parameters which should apply to all components of the model.
  • COMP/*.card and COMP/*.driver : This directory contains the configuration files (*.card) and scripts (*.driver) which set up the various elements needed to run each component. The scripts will for instance be used to update the configuration files provided in the PARAM directory for the period to be simulated.
  • PARAM/* : Contains all the configuration files for the components. For the models it will contain the run.def or Namelist. For OASIS it will provide the namcouple. To configure the XIOS server it will contain the iodef.xml and file_def_component.xml.
  • POST/* : All scripts and configurations needed for post-processing.

In the following description we use shell-style notation to refer to the variables defined in config.card. Thus {JobName} refers to the job name defined in config.card.

With this set-up in place, the ins_job script provided in the libIGCM directory needs to be executed in order to generate the jobs needed to run this experiment on the machine which you are using. This creates a file named Job_{JobName} and the file run.card.init, which serves to track the evolution of the experiment. For our computer centres the following ins_job invocations work well for RegIPSL :

  • IDRIS/Ada : ../../libIGCM/ins_job -m Intel MPI ada
  • Curie : ????

When the model starts running (once Job_{JobName} has been submitted to the queuing system) some extra information is written into this directory ({SUBMIT_DIR} in libIGCM talk) :

  • run.card : contains the list of the simulations already executed for this experiment.
  • Script_Output_{JobName}_00000 : output of the jobs
  • Debug/* : debug output from the components.

Main elements of the config.card file

These are some of the main variables in config.card which change the behaviour of the scripts executing the experiment.

  • JobName = Name of the Jobs to be submitted to the computers.
  • ExperimentName = Name of the experiment
  • DateBegin = Date at which the experiment begins in the format YYYY-MM-DD
  • DateEnd = End date in the same format.
  • PeriodLength = Length of each simulation for the experiment to be carried out (1Y means that a long experiment is divided up into one-year simulations). The best value therefore depends on the queue classes of the computer and on which of them offers the fastest turnaround.
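
To make the effect of PeriodLength concrete, the sketch below estimates how many simulations (and thus job periods) an experiment is split into. It is only an illustration of the arithmetic, not libIGCM's own calendar code: the function names period_months and count_periods are hypothetical, it assumes DateEnd is inclusive and month-aligned, and it only handles lengths given in months (M) or years (Y).

```shell
# Convert a PeriodLength such as 1Y or 6M to months.
# Lengths in days (e.g. 5D) are not handled in this sketch.
period_months() {
    n=${1%[YM]}
    case $1 in
        *Y) echo $((n * 12)) ;;
        *M) echo "$n" ;;
        *)  echo "unsupported PeriodLength: $1" >&2; return 1 ;;
    esac
}

# Count the periods between DateBegin and DateEnd (both YYYY-MM-DD).
# Assumption: DateEnd is the last simulated day, at the end of a month.
count_periods() {
    begin=$1
    end=$2
    len=$(period_months "$3") || return 1
    # Extract year and month; strip a leading zero so 08 is not read as octal.
    by=${begin%%-*}; bm=${begin#*-}; bm=${bm%%-*}; bm=${bm#0}
    ey=${end%%-*};   em=${end#*-};   em=${em%%-*}; em=${em#0}
    months=$(( (ey - by) * 12 + em - bm + 1 ))
    # Round up so that a partial last period still counts as one job.
    echo $(( (months + len - 1) / len ))
}

count_periods 1981-01-01 1982-12-31 6M   # prints 4
count_periods 1981-01-01 1982-12-31 1Y   # prints 2
```

A shorter PeriodLength means more, shorter jobs; which choice returns results fastest depends on the queue classes of the machine.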

For the configuration of libIGCM for regional modelling at IPSL the following parameters of config.card should not change :

  • TagName = RegIPSL : This sets apart the regional and global simulations in the archive.
  • CalendarType=leap : There is no need to use simpler calendars in regional modelling.

The above lists are to be expanded as a better understanding of the application of libIGCM for RegIPSL is gained.
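
As an illustration of the variables listed above, the sketch below writes a small config.card fragment to a temporary file and reads one value back. The values (MyRun01, the dates, 1M) are hypothetical, the [UserChoices] section name is an assumption about the usual libIGCM layout, and get_option is a helper invented for this example; a real config.card contains many more sections and options.

```shell
# Write an illustrative fragment to a temporary file (not a complete config.card).
card=$(mktemp)
cat > "$card" <<'EOF'
[UserChoices]
JobName=MyRun01
ExperimentName=FirstBench
SpaceName=DEVT
TagName=RegIPSL
CalendarType=leap
DateBegin=1981-01-01
DateEnd=1981-12-31
PeriodLength=1M
EOF

# Hypothetical helper: print the value of one option from the fragment.
get_option() {
    sed -n "s/^$1[[:space:]]*=[[:space:]]*//p" "$card"
}

echo "ins_job would generate: Job_$(get_option JobName)"
```

With these values the job file created by ins_job would be named Job_MyRun01, following the Job_{JobName} convention described earlier.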

Main elements of the COMP/comp.card and COMP/comp.driver files

comp.card (where comp is the name of one of the executables to be coupled) contains the list of the various files needed to run the model, which will be copied to the temporary directory where the model runs. It also contains the list of files written by the model and how they should be named once written to the output directory. The file can also specify the post-processing to be done for a list of variables and how often it should be done. Finally, it can contain higher-level configuration parameters which affect more aspects of the execution of the model than just the run.def or Namelist.

comp.driver provides a number of functions which will be called at well determined moments by the Job as it is being executed (In the following COMP stands for the code of the component : ATM, SRF, OCE … or other) :

  • COMP_Initialize : Called at the very beginning of the Job before we know which period will be run.
  • COMP_PeriodStart : Called once the period is determined but before the files are copied to the temporary directory for the execution.
  • COMP_Update : Called once all files are present, and thus the ideal moment to update the configuration files for the period to be simulated and for the higher-level configuration parameters selected in comp.card.
  • COMP_Finalize : Wraps up the simulation (it is not yet clear whether it is called before or after the files are copied to the output directory).
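
The call order described above can be sketched with stub functions. This is not the libIGCM job: run_job and record are hypothetical helpers, SRF is used only as an example component code, and calling COMP_Finalize once per period is an assumption of this sketch.

```shell
# Stub driver functions for a hypothetical component code "SRF".
# Real drivers do actual work; these only record the order of the calls.
calls=""
record() { calls="${calls:+$calls }$1"; }

SRF_Initialize()  { record Initialize; }
SRF_PeriodStart() { record PeriodStart; }
SRF_Update()      { record Update; }
SRF_Finalize()    { record Finalize; }

# Hypothetical stand-in for the job: initialise once, then loop over periods.
run_job() {
    comp=$1
    shift
    "${comp}_Initialize"
    for period in "$@"; do
        "${comp}_PeriodStart"
        # (input files would be copied to the temporary run directory here)
        "${comp}_Update"
        # (the executable would run here)
        "${comp}_Finalize"
    done
}

run_job SRF 1981-01 1981-02
echo "$calls"
```

For two periods the recorded sequence is one Initialize followed by PeriodStart, Update and Finalize for each period.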

Resulting output files of the experiment

The files produced by the models will be distributed on either the disk storage or the mass store, depending on the infrastructure provided by the computer centre and the user's selection (in config.card) of the post-processing, packing and monitoring frequencies.

The selection of SpaceName between DEVT/TEST/PROD will also dictate that distribution. A production run will store most things on the mass store while a development run will keep files on disks.

Below is an example from IDRIS :

  • ~/IGCM_OUT/RegIPSL/PROD/ : The main directory for production runs
  • ~/IGCM_OUT/RegIPSL/PROD/{ExperimentName}/{JobName} : Contains all the files of your experiment
  • ~/IGCM_OUT/RegIPSL/PROD/{ExperimentName}/{JobName}/{COMP} : The directories for the files of the various components. These components are listed in config.card.
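
The directory layout above can be reproduced from the config.card variables, as in this minimal sketch. The values are taken from the WRFORCH benchmark for illustration, archive_root is a hypothetical variable standing in for ~/IGCM_OUT, and the component list (ATM, SRF) is an assumption.

```shell
# Illustrative values; in a real experiment they come from config.card.
archive_root="IGCM_OUT"
TagName="RegIPSL"
SpaceName="PROD"
ExperimentName="FirstBench"
JobName="WRFORCH"

# Directory holding all the files of the experiment.
job_dir="$archive_root/$TagName/$SpaceName/$ExperimentName/$JobName"

# One sub-directory per component.
for comp in ATM SRF; do
    echo "$job_dir/$comp"
done
```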

The parameter PackFrequency of config.card decides, on IDRIS for instance, how often the output directory created on ADA in the WORKDIR is transferred to ERGON after packing (which reduces the number of files and generates larger ones).

Current Implementation tests

Some standard implementations are available in SourceSup and will be delivered with the code. These should be benchmarks with which we regularly test the standard configurations of the model. They are characterized by the experiment name FirstBench (ExperimentName=FirstBench).

ORCHIDEEOL

  • This test runs two executables : orchideedriver and XIOS (server mode).
  • It runs without restart file.
  • It runs a simple ORCHIDEE simulation over the standard Mediterranean domain.
  • STOMATE is switched off.
  • No routing is activated.
  • The N year experiment is executed in 6 month periods.
  • The output is a single file of monthly averages.

ORCHOASIS

  • In this case three executables are coupled with OASIS : driver2oasis, orchideeoasis, XIOS.
  • The routing is activated.
  • This case needs a restart file because of the memory cost of the routing initialisation.
  • STOMATE is switched off.
  • The N year experiment is executed in 6 month periods.
  • The driver2oasis code will output daily discharge values into the largest estuaries of the domain. These files will be collected in the ATM/Output/DA part of the output directory.
  • ORCHIDEE will output a file of daily values in the SRF/Output/DA directory and another with 3hourly sampled variables in SRF/Output/HF.

WRFORCH

  • In this case we also have three executables : wrf, orchidee and XIOS.
  • The routing is activated.
  • It starts from a spin-up simulation which is available on ergon under : ~rron972/IGCM_OUT/RegIPSL/PROD/FirstBench/WRFORCH
  • The reliable distribution of processors is : SRF=orchideeoasis : 32 MPI, ATM=wrf.exe : 212 MPI, IOS=xios_server.exe : 12 MPI. These numbers can be doubled to run quicker.
  • The simulation is divided into monthly runs. Every 3 months the data is packed, transferred to ergon and the monitoring produced.
  • The jobs require 4 hours of wall clock time per month.
  • WRF produces monthly, daily and 3hourly files.
  • ORCHIDEE produces daily and 3hourly files.

Output of these first benchmarks of the model

The output of these simulations, useful for evaluating the organization of the data and output, is available on ergon at ~rron972/IGCM_OUT/RegIPSL/PROD/FirstBench/

The monitoring of the spin-up of WRFORCH is available here : https://prodn.idris.fr/thredds/fileServer/ipsl_public/rron972/RegIPSL/PROD/FirstBench/WRFORCH/MONITORING/index.html

Solving regular issues with simulations

Here are some of the standard problems encountered with the simulation management of libIGCM.

One of the components crashed and I just need to restart ?

In this case you have to update your run.card in the Home directory of your simulation. This is the file which contains the status of your simulation. In this file you will find the variable PeriodState, which will have the value “Fatal”. Replacing this value with “Running” and then re-submitting the job will allow the simulation to continue.
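
The edit can also be scripted. The sketch below works on a throw-away copy: the run.card content it creates is a simplified stand-in, and the exact layout of the PeriodState line (section name, spacing around "=") is an assumption to be checked against your own file.

```shell
# Create a simplified, hypothetical run.card in a temporary directory.
workdir=$(mktemp -d)
run_card="$workdir/run.card"
cat > "$run_card" <<'EOF'
[Configuration]
PeriodState= Fatal
EOF

# Replace only the PeriodState value and keep a backup as run.card.bak.
sed -i.bak 's/^\(PeriodState[[:space:]]*=[[:space:]]*\)Fatal/\1Running/' "$run_card"

# Show the result before re-submitting the job.
grep '^PeriodState' "$run_card"
```

Keeping the .bak copy makes it easy to compare the two versions if the re-submitted job does not start as expected.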

The simulation crashed and files are everywhere ?

In some cases the model crash can leave the entire simulation in a more complex situation with un-packed files still in the WORKDIR, incomplete time series and out of date monitoring.

In this case the missed packings need to be re-done and all the files migrated from the WORKDIR to the mass store. This can be done with the pack_output.job from libIGCM. Here is a simple cookbook for recovering from such a situation.

  • Copy pack_output.job from libIGCM to the Home of the simulation.
  • Edit the job in order to correct the following variables : DateBegin, DateEnd, PeriodPack.
  • You will find there default values which need to be replaced by the values corresponding to your simulation and the period which needs to be redone.
  • Once the job has been updated it can be submitted.
  • If more than one packing period has been missed, then a corresponding number of packing jobs need to be created, one per period.
  • Submit them in the right order.
  • The last one will then launch the time series construction and update the monitoring.
  • In the run.card you will find the period over which the model was running when it crashed. The corresponding files need to be deleted if they exist.
  • To remove existing files for the current period use the following command : find . -name "*19810531*" -exec rm -r {} \; The number has to be replaced by the one matching the year and month you want to delete.
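
Because the removal is irreversible, it can help to list the matches before deleting them. The sketch below demonstrates this on a throw-away directory; the file names and the date stamp are hypothetical and only mimic the pattern used in the command above.

```shell
# Build a throw-away directory tree with hypothetical output files.
out_dir=$(mktemp -d)
mkdir -p "$out_dir/ATM/Output" "$out_dir/SRF/Output"
touch "$out_dir/ATM/Output/run_19810501_19810531_1M.nc" \
      "$out_dir/SRF/Output/run_19810501_19810531_1D.nc" \
      "$out_dir/SRF/Output/run_19810401_19810430_1D.nc"

stamp=19810531   # hypothetical date stamp of the crashed period

# 1. Dry run: list what would be removed.
find "$out_dir" -name "*${stamp}*" -print

# 2. Remove the matches. -depth processes directory contents before the
#    directory itself, which avoids errors if a matching directory is removed.
find "$out_dir" -depth -name "*${stamp}*" -exec rm -r {} \;

# Count the files that survive.
remaining=$(find "$out_dir" -type f | wc -l | tr -d ' ')
echo "files left: $remaining"
```

Only the file from the earlier, completed period remains. Note that the pattern must be written with plain double quotes; typographic quotes are not understood by the shell.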

testing_libigcm_to_run_regipsl.txt · Last modified: 2017/12/04 08:21 by jan.polcher@lmd.jussieu.fr