12.1 Core configuration

12.1.1 Top-level structure

The first line of the XML file should contain version and encoding information.

<?xml version="1.0" encoding="UTF-8"?>

The rest of the XML file should be surrounded by <pecan>...</pecan> tags.

<pecan>
  ...XML body here...
</pecan>
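
A quick way to check this top-level structure before handing a file to PEcAn is to parse it and inspect the root element. The following is a minimal sketch in Python using only the standard library; the check_root helper and the sample file contents are illustrative assumptions, not part of PEcAn.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal settings file; the <outdir> value is taken from this page.
PECAN_XML = """<?xml version="1.0" encoding="UTF-8"?>
<pecan>
  <outdir>/data/workflows/PEcAn_99000000006</outdir>
</pecan>
"""


def check_root(xml_text):
    """Parse the text and return the root element, raising if it is not <pecan>."""
    # Encode to bytes so the parser honours the encoding declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != "pecan":
        raise ValueError(f"expected <pecan> root element, found <{root.tag}>")
    return root


root = check_root(PECAN_XML)
print(root.tag)
```

A file that is not well-formed XML fails earlier, inside ET.fromstring, with an xml.etree.ElementTree.ParseError.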

12.1.2 info: Run metadata

This section contains run metadata. This information is not essential to a successful model run, but is useful for tracking run provenance.

  <info>
    <notes>Example run</notes>
    <userid>-1</userid>
    <username>guestuser</username>
    <date>2018/09/18 19:12:28 +0000</date>
  </info>

The <notes> tag will be filled in by the web GUI if you provide notes, or you can add notes yourself within these tags. We suggest adding notes that help identify your run and a brief description of what the run is for. Because these notes are searchable within the PEcAn database and web interface, they can be a useful way to distinguish between similar runs.

The <userid> and <username> tags are filled in by the GUI if you are signed in. If you are not using the GUI, enter the user name and ID of your account in the PEcAn database.

The <date> tag is filled in automatically by the GUI at the time of your run. If you are not using the GUI, add the date on which you execute the run. Note that this tag records when the run was launched; it does not set the dates of the model simulation itself.

12.1.3 outdir: Output directory

The <outdir> tag configures the output folder used by PEcAn. This is the directory where all model input and output files will be stored. By default, the web interface names this folder PEcAn_<workflow ID>, and its parent location is set by the $output_folder variable in the web/config.php file. If no outdir is specified, PEcAn defaults to the working directory from which it is called, which may be counterintuitive.

  <outdir>/data/workflows/PEcAn_99000000006</outdir>

12.1.4 database: PEcAn database settings

12.1.4.1 bety: PEcAn database (Bety) configuration

The bety tag defines the driver used to connect to the database (the PostgreSQL driver, which is the default, and the Postgres driver are supported) and the parameters required to connect to the database. Note that connection parameters are passed exactly as entered to the underlying R database driver, and any invalid or extra parameters will result in an error.

In other words, this configuration…

  <database>
    ...
    <bety>
      <user>bety</user>
      <password>bety</password>
      <host>postgres</host>
      <dbname>bety</dbname>
      <driver>PostgreSQL</driver>
      <port>5432</port>
      <write>true</write>
    </bety>
    ...
  </database>

…will be translated (by way of PEcAn.DB::db.open(settings$database$bety)) into R code like the following:

con <- DBI::dbConnect(
  drv = RPostgreSQL::PostgreSQL(),
  user = "bety",
  password = "bety",
  dbname = "bety",
  host = "postgres",
  port = "5432",
  write = TRUE
)

Common parameters are described as follows:

  • driver: The driver to use to connect to the database. This should always be set to PostgreSQL, unless you absolutely know what you’re doing.
  • dbname: The name of the database (formerly name), corresponding to the -d argument to psql. In most cases, this should be set to bety, and will only be different if you named your Bety instance something else (e.g. if you have multiple instances running at once). If unset, it will default to the user name of the current user, which is usually wrong!
  • user: The username to connect to the database (formerly userid), corresponding to the -U argument to psql. The default value is the username of the user currently logged in (PostgreSQL uses user for this field).
  • password: The password to connect to the database (formerly passwd). Unlike the other parameters, it has no direct psql command-line argument (the -p argument to psql sets the port). If unspecified, no password is used. On standard PEcAn installations, the username and password are both bety (all lowercase).
  • host: The hostname of the bety database, corresponding to the -h argument to psql. On the VM, this will be localhost (the default). If using docker, this will be the name of the PostgreSQL container, which is postgres if using our standard docker-compose. If connecting to the PEcAn database on a remote server (e.g. psql-pecan.bu.edu), this should be the same as the hostname used for ssh access.
  • write: Logical. If true (the default), write results to the database. If false, PEcAn will run but will not store any information to bety.
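
Because the parameter names changed over time (name, userid and passwd became dbname, user and password), older pecan.xml files may still carry the former names. The sketch below shows one way to read a <bety> node into a dictionary of connection parameters while renaming the old tags. It is written in Python with the standard library only; the connection_params helper is hypothetical and is not how PEcAn.DB::db.open is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical <bety> node that still uses the former tag names.
BETY_XML = """
<bety>
  <userid>bety</userid>
  <passwd>bety</passwd>
  <host>postgres</host>
  <name>bety</name>
  <driver>PostgreSQL</driver>
</bety>
"""

# Former tag name -> current tag name, as listed in the parameter descriptions above.
LEGACY_NAMES = {"name": "dbname", "userid": "user", "passwd": "password"}


def connection_params(bety_xml):
    """Return {parameter: value} for a <bety> node, renaming former tag names."""
    node = ET.fromstring(bety_xml)
    params = {}
    for child in node:
        key = LEGACY_NAMES.get(child.tag, child.tag)
        # All values stay strings, exactly as entered in the XML.
        params[key] = (child.text or "").strip()
    return params


params = connection_params(BETY_XML)
print(params)
```

Any tag that is not in LEGACY_NAMES is passed through unchanged, which mirrors the note above that connection parameters are forwarded exactly as entered.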

When using the web interface, this section is configured by the web/config.php file. The default config.php settings on any given platform (VM, Docker, etc.) or in example files (e.g. config.php.example) are a good place to get default values for these fields if writing pecan.xml by hand.

Key R functions using these parameters are as follows:

  • PEcAn.DB::db.open – Open a database connection and create a connection object, which is used by many other functions for communicating with the PEcAn database.

12.1.4.2 dbfiles: Location of database files

The dbfiles tag is the path to the local location of files needed to run models using PEcAn, including model executables and inputs.

  <database>
    ...
    <dbfiles>/data/dbfiles</dbfiles>
    ...
  </database>

12.1.4.3 (Experimental) fia: FIA database connection parameters

If a version of the FIA database is available, it can be configured using the <fia> node, whose syntax is identical to that of the <bety> node.

  <database>
    ...
    <fia>
        <dbname>fia5data</dbname>
        <user>bety</user>
        <password>bety</password>
        <host>localhost</host>
    </fia>
    ...
  </database>

Currently, this is only used for extraction of specific site vegetation information (notably, for ED2 css, pss, and site files). Stability not ensured as of 1.5.3.

12.1.5 pft: Plant functional type selection

The PEcAn system requires at least 1 plant functional type (PFT) to be specified inside the <pfts> section.

  <pfts>
    <pft>
      <name>tundra.grasses</name> 
      <constants>
        <num>1</num>
      </constants>
      <posterior.files>Path to a post.distns.*.Rdata or prior.distns.Rdata</posterior.files>
    </pft>
  </pfts>
  • name: (required) The name of the PFT, which must exactly match the name in the PEcAn database.
  • outdir: (optional) Directory path in which PFT-specific output will be stored during meta-analysis and sensitivity analysis. If not specified (recommended), it will be written into <outdir>/<pftname>.
  • constants: (optional) This section contains information that will be written directly into the model-specific configuration files. For example, some models like ED2 use PFT numbers instead of names for PFTs, and those numbers can be specified here. See documentation for model-specific code for details.
  • posterior.files: (optional) This tag signals the write.config functions to use specific posterior/prior files (such as those from HPDA or MA analysis) for generating samples, without needing access to the bety database.

This information is currently used by the following PEcAn workflow function:

  • get.traits - ??????

12.1.6 meta.analysis: Trait Meta Analysis

The meta.analysis section needs to exist for a meta-analysis to be executed, even though all tags inside it are optional. Conversely, if you do not want to do a trait meta-analysis (e.g. if you want to manually set all parameters), you should omit this node.

  <meta.analysis>
    <iter>3000</iter>
    <random.effects>
      <on>FALSE</on>
      <use_ghs>TRUE</use_ghs>
    </random.effects>
  </meta.analysis>

Some of the tags that can go in this section are:

  • iter: MCMC (Markov Chain Monte Carlo) chain length, i.e. the total number of posterior samples in the meta-analysis, default is 3000. Smaller numbers will run faster but produce larger errors.
  • random.effects: Settings related to whether to include random effects (site, treatment) in meta-analysis model.
  • on: (inside random.effects) Default is FALSE, to work around convergence problems caused by an over-parameterized model (e.g. too many sites, not enough data). Can be set to TRUE to include hierarchical random effects.
  • use_ghs: (inside random.effects) Default is TRUE, which includes greenhouse measurements. Can be set to FALSE to exclude cases where all data is from greenhouse.
  • update: Whether to rerun the analysis instead of reusing previous results of meta.analysis and get.traits. If set to TRUE, the meta-analysis and get.trait.data will always be executed. Setting this to FALSE will try to reuse existing results. Future versions will also allow AUTO, which will try to reuse results if the PFT/traits have not changed. The default value is FALSE.
  • threshold: threshold for Gelman-Rubin convergence diagnostic (MGPRF); default is 1.2.
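
Since every tag in this section is optional, it can help to see which values are in effect once the documented defaults are filled in. The following Python sketch (standard library only) reads a <meta.analysis> node and overlays it on the defaults listed above. The meta_analysis_settings helper and the dotted key names are assumptions made for illustration; PEcAn itself handles these settings in R.

```python
import xml.etree.ElementTree as ET

# Defaults as documented in the tag list above.
DEFAULTS = {
    "iter": 3000,
    "random.effects.on": False,
    "random.effects.use_ghs": True,
    "update": False,
    "threshold": 1.2,
}


def as_bool(text):
    """Interpret TRUE/FALSE (any case) as a Python bool."""
    return text.strip().upper() == "TRUE"


def child(node, tag):
    """Return the first direct child with this tag, or None."""
    return next((c for c in node if c.tag == tag), None)


def meta_analysis_settings(xml_text):
    """Return the documented defaults, overridden by any tags that are present."""
    node = ET.fromstring(xml_text)
    settings = dict(DEFAULTS)
    for tag, convert in (("iter", int), ("update", as_bool), ("threshold", float)):
        elem = child(node, tag)
        if elem is not None:
            settings[tag] = convert(elem.text)
    random_effects = child(node, "random.effects")
    if random_effects is not None:
        for tag in ("on", "use_ghs"):
            elem = child(random_effects, tag)
            if elem is not None:
                settings["random.effects." + tag] = as_bool(elem.text)
    return settings


example = """
<meta.analysis>
  <iter>5000</iter>
  <random.effects>
    <on>TRUE</on>
  </random.effects>
</meta.analysis>
"""
settings = meta_analysis_settings(example)
print(settings)
```

In this example only iter and random.effects/on are overridden; use_ghs, update and threshold keep their documented defaults.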

This information is currently used by the following PEcAn workflow function:

  • PEcAn.MA::run.meta.analysis - ???

12.1.7 model: Model configuration

This section describes which model PEcAn should run and some instructions for how to run it.

<model>
    <id>7</id>
    <type>ED2</type>
    <binary>/usr/local/bin/ed2.r82</binary>
    <prerun>module load hdf5</prerun>
    <config.header>
        <!--...xml code passed directly to config file...-->
    </config.header>
</model>

Some important tags are as follows:

  • id – The unique numeric ID of the model in the PEcAn database models table. If this is present, then type and binary are optional since they can be determined from the PEcAn database.
  • type – The model “type”, matching the PEcAn database modeltypes table name column. This also refers to which PEcAn model-specific package will be used. In PEcAn, a “model” refers to a specific version (e.g. release, git commit) of a specific model, and “model type” is used to link different releases of the same model. Model “types” also have specific PFT definitions and other requirements associated with them (e.g. the ED2 model “type” requires a global land cover database).
  • binary – The file path to the model executable. If omitted, PEcAn will use whatever path is registered in the PEcAn database for the current machine.
  • prerun – Additional options added to the job.sh script, which is used to execute the model. This is useful for setting specific environment variables, load modules, etc.

This information is currently used by the following PEcAn workflow function:

12.1.7.1 Model-specific configuration

See the following:

12.1.7.2 ED2 specific tags

The following variables are ED2-specific and are used in the ED2 configuration.

Starting at 1.3.7, the tags for inputs have moved to <run><inputs>. This includes veg, soil, psscss, and inputs.

    <edin>/home/carya/runs/PEcAn_4/ED2IN.template</edin>
    <config.header>
        <radiation>
            <lai_min>0.01</lai_min>
        </radiation>
        <ed_misc>
            <output_month>12</output_month>
        </ed_misc> 
    </config.header>
    <phenol.scheme>0</phenol.scheme>
  • edin: [required] template used to write the ED2IN file
  • veg: OBSOLETE [required] location of the VEG database, now part of <run><inputs>
  • soil: OBSOLETE [required] location of the soil database, now part of <run><inputs>
  • psscss: OBSOLETE [required] location of site information, now part of <run><inputs>. Should be specified as <pss>, <css> and <site>.
  • inputs: OBSOLETE [required] location of additional input files (e.g. data assimilation data), now part of <run><inputs>. Should be specified as <lu> and <thsums>.

12.1.8 run: Run Setup

This section provides detailed configuration for the model run, including the site and time period for the simulation and what input files will be used.

  <run>
    <site>
      <id>1000000098</id>
      <met.start>2004/01/01</met.start>
      <met.end>2004/12/31</met.end>
      <site.pft>
        <pft.name>temperate.needleleaf.evergreen</pft.name>
        <pft.name>temperate.needleleaf.evergreen.test</pft.name>
      </site.pft>
    </site>
    <inputs>
      <met>
        <source>CRUNCEP</source>
        <output>SIPNET</output>
      </met>
      <poolinitcond>
        <source>NEON_veg</source>
        <output>poolinitcond</output>
        <ensemble>100</ensemble>
        <startdate>2019-01-01</startdate>
        <enddate>2019-12-31</enddate>
        <storedir>/projectnb/dietzelab/neon_store</storedir>
      </poolinitcond>
    </inputs>
    <start.date>2004/01/01</start.date>
    <end.date>2004/12/31</end.date>
    <stop_on_error>TRUE</stop_on_error>
  </run>

12.1.8.1 site: Where to run the model

This contains the following tags:

  • id – This is the numeric ID of the site in the PEcAn database (table sites, column id). PEcAn can automatically fill in other relevant information for the site (e.g. name, lat, lon) using the site ID, so those fields are optional if ID is provided.
  • name – The name of the site, as a string.
  • lat, lon – The latitude and longitude coordinates of the site, as decimals.
  • met.start, met.end – ???
  • <site.pft>: (optional) If this tag is found under the site tag, PEcAn automatically ensures that only the PFTs defined under this tag are used for generating parameter samples. The following example shows how this tag can be added to the PEcAn XML:
    <site.pft>
     <pft.name>temperate.needleleaf.evergreen</pft.name>
     <pft.name>temperate.needleleaf.evergreen</pft.name>
    </site.pft>

For multi-site runs, if the pft.site tag (see the inputs section, 12.1.8.2) is defined under inputs, the above process is done automatically during the prepare-settings step of the main PEcAn workflow, and there is no need to add these tags manually. Using the pft.site tag, however, requires a lookup table as an input (also described in the inputs section).

12.1.8.2 inputs: Model inputs

Models require several different types of inputs to run. Exact requirements differ from model to model, but common inputs include meteorological/climate drivers, site initial conditions (e.g. vegetation composition, carbon pools), and land use drivers.

In general, all inputs should have the following tags:

  • id: Numeric ID of the input in the PEcAn database (table inputs, column id). If not specified, PEcAn will try to figure this out based on the source tag (described below).
  • path: The file path of the input. Usually, PEcAn will set this automatically based on the id (which, in turn, is determined from the source). However, this can be set manually for data that PEcAn does not know about (e.g. data that you have processed yourself and have not registered with the PEcAn database).
  • source: The input data type. This tag name needs to match the names in the corresponding conversion functions. If you are using PEcAn’s automatic input processing, this is the only field you need to set. However, this field is ignored if id and/or path are provided.
  • output: ???

The following are the most common types of inputs, along with their corresponding tags:

12.1.8.2.1 met: Meteorological inputs

(Under construction. See the PEcAn.data.atmosphere package, located in modules/data.atmosphere, for more details.)

12.1.8.2.2 (Experimental) soil: Soil inputs

(Under construction. See the PEcAn.data.land package, located in modules/data.land, for more details).

12.1.8.2.3 (Experimental) veg: Vegetation initial conditions

12.1.8.2.4 poolinitcond: Initial condition inputs

  • source: Data source of the initial condition .rd veg files, e.g. NEON_veg or FIA.
  • output: This tag must match the corresponding tag in the modeltypes_formats table for your model type; e.g. for the SIPNET model, the tag is poolinitcond.
  • ensemble: Number of initial condition ensemble member files, e.g. 30.

(Under construction. Follow developments in the PEcAn.data.land package, located in modules/data.land in the source code).

12.1.8.2.5 pft.site: Multi-site site / PFT mapping

When performing multi-site runs, it is not uncommon to find that different sites need to be run with different PFTs, rather than running all PFTs at all sites. If you want to use specific PFTs for your site or sites, you can use the following tag to tell PEcAn which PFT to use at which site.

<pft.site>
  <path>site_pft.csv</path>
</pft.site>

For example, with the above tag, the user needs to have a CSV file named site_pft.csv stored in the pecan folder. At the moment, the supporting functions handle only comma-separated .csv and .txt files in the following format:

site_id, pft_name
1000025731,temperate.broadleaf.deciduous
764,temperate.broadleaf.deciduous

PEcAn then uses this lookup table to inform the write.ensemble.config function about which PFTs need to be used at which sites.
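
To check that a lookup table has the expected shape before starting a run, it can be read into a site-to-PFT mapping. The sketch below does this in Python with the standard library; the read_site_pft helper is hypothetical, and the table contents are the example rows shown above.

```python
import csv
import io
from collections import defaultdict

# The example lookup table from above (note the space after the comma in the header).
SITE_PFT_CSV = """site_id, pft_name
1000025731,temperate.broadleaf.deciduous
764,temperate.broadleaf.deciduous
"""


def read_site_pft(handle):
    """Return {site_id: [pft_name, ...]} from a comma-separated lookup table."""
    # skipinitialspace drops the blank that follows a comma, e.g. in the header row.
    reader = csv.DictReader(handle, skipinitialspace=True)
    lookup = defaultdict(list)
    for row in reader:
        lookup[row["site_id"].strip()].append(row["pft_name"].strip())
    return dict(lookup)


lookup = read_site_pft(io.StringIO(SITE_PFT_CSV))
for site_id, pfts in lookup.items():
    print(site_id, pfts)
```

Site IDs are kept as strings here so that large IDs such as 1000025731 are compared exactly as written. A site that appears on several rows ends up with several PFTs in its list.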

12.1.8.3 start.date and end.date

The start and end date for the run, in a format parseable by R (e.g. YYYY/MM/DD or YYYY-MM-DD). These dates are inclusive; in other words, they refer to the first and last days of the run, respectively.

NOTE: Any time-series inputs (e.g. meteorology drivers) must contain all of these dates. PEcAn tries to detect and throw informative errors when dates are out of bounds for the inputs, but it may not know about some edge cases.
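
The coverage requirement can be checked by hand before a run. The following Python sketch (standard library only) parses dates in either of the two formats mentioned above and tests whether an input's date range contains the inclusive run period. The parse_date and covers helpers are illustrative assumptions and do not reproduce PEcAn's own checks.

```python
from datetime import datetime


def parse_date(text):
    """Parse YYYY/MM/DD or YYYY-MM-DD into a date object."""
    for fmt in ("%Y/%m/%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def covers(input_start, input_end, run_start, run_end):
    """True if the input range contains the whole (inclusive) run period."""
    return (parse_date(input_start) <= parse_date(run_start)
            and parse_date(run_end) <= parse_date(input_end))


# Meteorology and run dates from the <run> example above: an exact match is sufficient.
exact_match = covers("2004/01/01", "2004/12/31", "2004/01/01", "2004/12/31")
# An input that starts one day late does not cover the run.
late_start = covers("2004-01-02", "2004-12-31", "2004/01/01", "2004/12/31")
print(exact_match, late_start)
```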

12.1.8.4 Other tags

The following tags are optional run settings that apply to any model:

  • jobtemplate: the template used when creating the job.sh file, which launches the actual model. Each model has its own default template in the inst folder of the corresponding R package (the ED2 template is one example). The following variables can be used, all of which come from variables in the pecan.xml file: @SITE_LAT@, @SITE_LON@, @SITE_MET@, @START_DATE@, @END_DATE@, @OUTDIR@, @RUNDIR@. The following two commands can be used to copy and clean the results from a scratch folder (specified as scratch in the run section below; for example, a local disk rather than a network disk): @SCRATCH_COPY@, @SCRATCH_CLEAR@.

  • stop_on_error: (logical) Whether the workflow should immediately terminate if any of the model runs fail. If unset, this defaults to TRUE unless you are running an ensemble simulation (and ensemble size is greater than 1).
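
The @NAME@ placeholders in a job template are replaced by plain text substitution. The sketch below illustrates the idea in Python with the standard library; the template text and the fill_template helper are hypothetical (real templates live in each model package), while the date and directory values are taken from the examples on this page.

```python
import re

# Hypothetical job template; real templates ship with each model package.
TEMPLATE = (
    "cd @RUNDIR@\n"
    "./model --start @START_DATE@ --end @END_DATE@ > @OUTDIR@/log.txt\n"
)


def fill_template(template, values):
    """Replace each @NAME@ placeholder, failing loudly on unknown names."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder @{name}@")
        return values[name]
    return re.sub(r"@([A-Z_]+)@", replace, template)


job_script = fill_template(TEMPLATE, {
    "RUNDIR": "/fs/data3/guestuser/pecan/testworkflow/run",
    "OUTDIR": "/fs/data3/guestuser/pecan/testworkflow/out",
    "START_DATE": "2004/01/01",
    "END_DATE": "2004/12/31",
})
print(job_script)
```

Raising an error for an unknown placeholder, rather than leaving it in place, makes a misspelled variable name visible before the job is submitted.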

Some models also have model-specific tags, which are described in the PEcAn Models section.

12.1.9 host: Host information for remote execution

This section provides settings for remote model execution, i.e. any execution that happens on a machine (including “virtual” machines, like Docker containers) different from the one on which the main PEcAn workflow is running. A common use case for this section is to submit model runs as jobs to a high-performance computing cluster. If no host tag is provided, PEcAn assumes models are run on localhost, a.k.a. the same machine as PEcAn itself.

For detailed instructions on remote execution, see the Remote Execution page. For detailed information on configuring this for RabbitMQ, see the RabbitMQ page. The following provides a quick overview of XML tags related to remote execution.

NOTE: Any paths specified in the pecan.xml refer to paths on the host specified in this section, /not/ the machine on which PEcAn is running (unless models are running on localhost or this section is omitted).

    <host>
        <name>pecan2.bu.edu</name>
        <rundir>/fs/data3/guestuser/pecan/testworkflow/run</rundir>
        <outdir>/fs/data3/guestuser/pecan/testworkflow/out</outdir>
        <scratchdir>/tmp/carya</scratchdir>
        <clearscratch>TRUE</clearscratch>
        <qsub>qsub -N @NAME@ -o @STDOUT@ -e @STDERR@ -S /bin/bash</qsub>
        <qsub.jobid>Your job ([0-9]+) .*</qsub.jobid>
        <qstat>'qstat -j @JOBID@ &amp;> /dev/null || echo DONE'</qstat>
        <prerun>module load udunits R/R-3.0.0_gnu-4.4.6</prerun>
        <cdosetup>module load cdo/2.0.6</cdosetup>
        <modellauncher>
            <binary>/usr/local/bin/modellauncher</binary>
            <qsub.extra>-pe omp 20</qsub.extra>
        </modellauncher>
    </host>

The host section has the following tags:

  • name: [optional] name of host server where model is located and executed, if not specified localhost is assumed.
  • folder: [required for remote hosts] The location to store files on the remote machine. For localhost this is optional (<outdir> is the default); for any other host it is required.
  • rundir: [optional] location where all the configuration files are written. If not specified, this will be generated starting at the path specified in folder.
  • outdir: [optional] location where all the outputs of the model are written. If not specified, this will be generated starting at the path specified in folder.
  • scratchdir: [optional] location where output is written. If specified the output from the model is written to this folder and copied to the outdir when the model is finished, this could significantly speed up the model execution (by using local or ram disk).
  • clearscratch: [optional] if set to TRUE, the scratch folder is cleaned up after the results are copied to the outdir; otherwise the folder is left in place. The default is to clean up after copying.
  • qsub: [optional] the command used to submit a job to the queuing system. Three parameters can be used in the command: @NAME@ (the pretty name of the job), @STDOUT@ (where to write stdout), and @STDERR@ (where to write stderr). You can add further options for your specific setup (for example, -l walltime to specify the walltime). If you specify an empty element (<qsub/>), the default value is used: qsub -V -N @NAME@ -o @STDOUT@ -e @STDERR@ -S /bin/bash.
  • qsub.jobid: [optional] the regular expression used to find the job ID returned from qsub. If not specified (and qsub is), the default value is used: Your job ([0-9]+) .*
  • qstat: [optional] the command executed to check whether a job is finished; it should return DONE if the job is finished. The command takes one parameter, @JOBID@, which is the ID of the job as matched by qsub.jobid. If not specified (and qsub is), the default value is used: qstat -j @JOBID@ || echo DONE
  • prerun: [optional] additional options to add to the job.sh at the top.
  • cdosetup: [optional] additional commands added at the top of job.sh, specifically to set up the CDO Linux command-line package. This allows the user to generate the full-year netCDF file through the model2netcdf.SIPNET function (this addresses an issue in which the function iteratively overlaps the previous netCDF file, resulting in partial data loss).
  • modellauncher: [optional] this is an experimental section that will allow you to submit all the runs as a single job to a HPC system.
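
The qsub.jobid and qstat tags work together: the regular expression pulls the job ID out of the text printed by the submit command, and that ID is then substituted into the status command. The Python sketch below (standard library only) walks through this with the default values quoted above. The sample qsub output line is an assumed example of what a Grid Engine style scheduler prints, and the helper names are hypothetical.

```python
import re

# Default values quoted in the tag descriptions above.
QSUB_JOBID = r"Your job ([0-9]+) .*"
QSTAT = "qstat -j @JOBID@ || echo DONE"


def extract_jobid(qsub_output, pattern=QSUB_JOBID):
    """Return the first captured group of the pattern, or None if it does not match."""
    match = re.search(pattern, qsub_output)
    return match.group(1) if match else None


def status_command(jobid, template=QSTAT):
    """Fill the @JOBID@ placeholder of the qstat command."""
    return template.replace("@JOBID@", jobid)


# Assumed example of scheduler output; the exact wording depends on your cluster.
sample_output = 'Your job 123456 ("pecan_run") has been submitted'
jobid = extract_jobid(sample_output)
command = status_command(jobid)
print(jobid)
print(command)
```

If your scheduler prints a different message on submission, the regular expression in qsub.jobid needs to be adapted so that its first capture group still contains only the job ID.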

The modellauncher section, if specified, groups all runs together and submits only a single job to the HPC cluster. This single job relies on an MPI program that executes the individual runs. Some HPC systems place a limit on the number of jobs that can be executed in parallel; with modellauncher, only a single job (using multiple nodes) is submitted. Where there is no such limit, a single PEcAn run could otherwise submit a large number of jobs and occupy the full cluster, preventing others from executing on it.

The modellauncher has 3 arguments:

  • binary: [required] The full path to the modellauncher binary. Source code for this file can be found in pecan/contrib/modellauncher (https://github.com/PecanProject/pecan/tree/develop/contrib/modellauncher).
  • qsub.extra: [optional] Additional flags to pass to qsub besides those specified in the qsub tag in host. This option can be used to specify that the MPI environment needs to be used and the number of nodes that should be used.
  • mpirun: [optional] Additional commands to be added to the front of launcher.sh. Default is mpirun <path to binary> <path to joblist.txt>. Edit this to, for example, load necessary modules to run mpirun.