11 The PEcAn XML

The PEcAn system is configured using a XML file, often called pecan.xml. It contains the following major sections (“nodes”):

A basic example looks like this:

In the following sections, we step through each of these sections in detail.

11.0.1 Core configuration Top-level structure

The first line of the XML file should contain version and encoding information.

The rest of the XML file should be surrounded by <pecan>...</pecan> tags. info: Run metadata

This section contains run metadata. This information is not essential to a successful model run, but is useful for tracking run provenance.

The <notes> tag will be filled in by the web GUI if you provide notes, or you can add notes yourself within these tags. We suggest adding notes that help identify your run and a brief description of what the run is for. Because these notes are searchable within the PEcAn database and web interface, they can be a useful way to distinguish between similar runs.

The <userid> and <username> section is filled in from the GUI if you are signed in. If you are not using the GUI, add the user name and ID you are associated with that exists within the PEcAn database.

The <date></date> tag is filled automatically at the time of your run from the GUI. If you are not using the GUI, add the date you execute the run. This tag is not the tag for the dates you would like to run your model simulation.

11.0.2 outdir: Output directory

The <outdir> tag is used to configure the output folder used by PEcAn. This is the directory where all model input and output files will be stored. By default, the web interface names this folder PEcAn_<workflow ID>, and higher-level location is set by the $output_folder$ variable in the web/config.php file. If no outdir is specified, PEcAn defaults to the working directory from which it is called, which may be counterintuitive.

11.0.3 database: PEcAn database settings bety: PEcAn database (Bety) configuration

The bety tag defines the driver to use to connect to the database (the only driver we support, and the default, is PostgreSQL) and parameters required to connect to the database. Note that connection parameters are passed exactly as entered to the underlying R database driver, and any invalid or extra parameters will result in an error.

In other words, this configuration…

…will be translated into R code like the following:

Common parameters are described as follows:

  • driver: The driver to use to connect to the database. This should always be set to PostgreSQL, unless you absolutely know what you’re doing.
  • dbname: The name of the database (formerly name), corresponding to the -d argument to psql. In most cases, this should be set to bety, and will only be different if you named your Bety instance something else (e.g. if you have multiple instances running at once). If unset, it will default to the user name of the current user, which is usually wrong!
  • user: The username to connect to the database (formerly userid), corresponding to the -U argument to psql. default value is the username of the current user logged in (PostgreSQL uses user for this field).
  • password: The password to connect to the database (was passwd), corresponding to the -p argument to psql. If unspecified, no password is used. On standard PEcAn installations, the username and password are both bety (all lowercase).
  • host: The hostname of the bety database, corresponding to the -h argument to psql. On the VM, this will be localhost (the default). If using docker, this will be the name of the PostgreSQL container, which is postgres if using our standard docker-compose. If connecting to the PEcAn database on a remote server (e.g. psql-pecan.bu.edu), this should be the same as the hostname used for ssh access.
  • write: Logical. If true (the default), write results to the database. If false, PEcAn will run but will not store any information to bety.

When using the web interface, this section is configured by the web/config.php file. The default config.php settings on any given platform (VM, Docker, etc.) or in example files (e.g. config.php.example) are a good place to get default values for these fields if writing pecan.xml by hand.

Key R functions using these parameters are as follows:

  • PEcAn.DB::db.open – Open a database connection and create a connection object, which is used by many other functions for communicating with the PEcAn database. dbfiles: Location of database files

The dbfiles is a path to the local location of files needed to run models using PEcAn, including model executables and inputs. (Experimental) fia: FIA database connection parameters

If a version of the FIA database is available, it can be configured using <fia> node, whose syntax is identical to that of the <bety> node.

Currently, this is only used for extraction of specific site vegetation information (notably, for ED2 css, pss, and site files). Stability not ensured as of 1.5.3.

11.0.4 pft: Plant functional type selection

The PEcAn system requires at least 1 plant functional type (PFT) to be specified inside the <pfts> section.

  • name : (required) the name of the PFT, which must exactly match the name in the PEcAn database.
  • outdir: (optional) Directory path in which PFT-specific output will be stored during meta-analysis and sensitivity analysis. If not specified (recommended), it will be written into <outdir>/<pftname>.
  • contants: (optional) this section contains information that will be written directly into the model specific configuration files. For example, some models like ED2 use PFT numbers instead of names for PFTs, and those numbers can be specified here. See documentation for model-specific code for details.

This information is currently used by the following PEcAn workflow function:

  • get.traits - ??????

11.0.5 meta.analysis: Trait Meta Analysis

The section meta.analysis needs to exists for a meta.analysis to be executed, even though all tags inside are optional. Conversely, if you do not want to do a trait meta-analysis (e.g. if you want to manually set all parameters), you should omit this node.

Some of the tags that can go in this section are:

  • iter: MCMC (Markov Chain Monte Carlo) chain length, i.e. the total number of posterior samples in the meta-analysis, default is 3000. Smaller numbers will run faster but produce larger errors.
  • random.effects: Settings related to whether to include random effects (site, treatment) in meta-analysis model.
  • on: Default is set to FALSE to work around convergence problems caused by an over parameterized model (e.g. too many sites, not enough data). Can be turned to TRUE for including hierarchical random effects.
  • use_ghs: Default is set to TRUE to include greenhouse measurements. Can be set to FALSE to exclude cases where all data is from greenhouse.
  • update: Should previous results of meta.analysis and get.traits be re-used. If set to TRUE the meta-analysis and get.trait.data will always be executed. Setting this to FALSE will try and reuse existing results. Future versions will allow for AUTO as well which will try and reuse if the PFT/traits have not changed. The default value is FALSE.
  • threshold: threshold for Gelman-Rubin convergence diagnostic (MGPRF); default is 1.2.

This information is currently used by the following PEcAn workflow function:

  • PEcAn.MA::run.meta.analysis - ???

11.0.6 model: Model configuration

This section describes which model PEcAn should run and some instructions for how to run it.

Some important tags are as follows:

  • id – The unique numeric ID of the model in the PEcAn database models table. If this is present, then type and binary are optional since they can be determined from the PEcAn database.
  • type – The model “type”, matching the PEcAn database modeltypes table name column. This also refers to which PEcAn model-specific package will be used. In PEcAn, a “model” refers to a specific version (e.g. release, git commit) of a specific model, and “model type” is used to link different releases of the same model. Model “types” also have specific PFT definitions and other requirements associated with them (e.g. the ED2 model “type” requires a global land cover database).
  • binary – The file path to the model executable. If omitted, PEcAn will use whatever path is registered in the PEcAn database for the current machine.
  • job.sh – Additional options added to the job.sh script, which is used to execute the model. This is useful for setting specific environment variables, load modules, etc.

This information is currently used by the following PEcAn workflow function:

  • PEcAn.<MODEL>::write.config.<MODEL> – Write model-specific configuration files {#pecan-write-configs}
  • PEcAn.remote::start.model.runs – Begin model run execution Model-specific configuration

See the following:

  • [ED2][ED2 Configuration]
  • [SIPNET][SIPNET Configuration]
  • [BIOCRO][BioCro Configuration] ED2 specific tags

Following variables are ED specific and are used in the ED2 Configuration.

Starting at 1.3.7 the tags for inputs have moved to <run><inputs>. This includes, veg, soil, psscss, inputs.

  • edin : [required] template used to write ED2IN file
  • veg : OBSOLETE [required] location of VEG database, now part of <run><inputs>
  • soil : OBSOLETE [required] location of soild database, now part of <run><inputs>
  • psscss : OBSOLETE [required] location of site inforation, now part of <run><inputs>. Should be specified as <pss>, <css> and <site>.
  • inputs : OBSOLETE [required] location of additional input files (e.g. data assimilation data), now part of <run><inputs>. Should be specified as <lu> and <thsums>.

11.0.7 run: Run Setup

This section provides detailed configuration for the model run, including the site and time period for the simulation and what input files will be used. site: Where to run the model

This contains the following tags:

  • id – This is the numeric ID of the site in the PEcAn database (table sites, column id). PEcAn can automatically fill in other relevant information for the site (e.g. name, lat, lon) using the site ID, so those fields are optional if ID is provided.
  • name – The name of the site, as a string.
  • lat, lon – The latitude and longitude coordinates of the site, as decimals.
  • met.start, met.end – ???
  • <site.pft> (optional) If this tag is found under the site tag, then PEcAn automatically makes sure that only PFTs defined under this tag is used for generating parameter’s samples. Following shows an exmaple of how this tag can be added to the PEcAn xml :

For multi-site runs if the pft.site tag (see {#xml-run-inputs}) is defined under input, then the above process will be done automatically under prepare settings step in PEcAn main workflow and there is no need for adding the tags manually. Using the pft.site tag however, requires a lookup table as an input (see {#xml-run-inputs}). inputs: Model inputs

Models require several different types of inputs to run. Exact requirements differ from model to model, but common inputs include meteorological/climate drivers, site initial conditions (e.g. vegetation composition, carbon pools), and land use drivers.

In general, all inputs should have the following tags:

  • id: Numeric ID of the input in the PEcAn database (table inputs, column id). If not specified, PEcAn will try to figure this out based on the source tag (described below).
  • path: The file path of the input. Usually, PEcAn will set this automatically based on the id (which, in turn, is determined from the source). However, this can be set manually for data that PEcAn does not know about (e.g. data that you have processed yourself and have not registered with the PEcAn database).
  • source: The input data type. This tag name needs to match the names in the corresponding conversion functions. If you are using PEcAn’s automatic input processing, this is the only field you need to set. However, this field is ignored if id and/or path are provided.
  • output: ???

The following are the most common types of inputs, along with their corresponding tags: met: Meteorological inputs

(Under construction. See the PEcAn.data.atmosphere package, located in modules/data.atmosphere, for more details.) (Experimental) soil: Soil inputs

(Under construction. See the PEcAn.data.land package, located in modules/data.land, for more details). (Experimental) veg: Vegetation initial conditions

(Under construction. Follow developments in the PEcAn.data.land package, located in modules/data.land in the source code). ##### pft.site Multi-site site / PFT mapping

When performing multi-site runs, it is not uncommon to find that different sites need to be run with different PFTs, rather than running all PFTs at all sites. If you’re interested to use a specific PFT for your site/sites you can use the following tag to tell PEcAn which PFT needs to be used for what site.


For example using the above tag, user needs to have a csv file named site_pft stored in the pecan folder. At the moment we have functions supporting just the .csv and .txt files which are comma separated and have the following format:

site_id, pft_name

Then pecan would use this lookup table to inform write.ensemble.config function about what PFTs need to be used for what sites. start.date and end.date

The start and end date for the run, in a format parseable by R (e.g. YYYY/MM/DD or YYYY-MM-DD). These dates are inclusive; in other words, they refer to the first and last days of the run, respectively.

NOTE: Any time-series inputs (e.g. meteorology drivers) must contain all of these dates. PEcAn tries to detect and throw informative errors when dates are out of bounds inputs, but it may not know about some edge cases. Other tags

The following tags are optional run settings that apply to any model:

  • jobtemplate: the template used when creating a job.sh file, which is used to launch the actual model. Each model has its own default template in the inst folder of the corresponding R package (for instance, here is the one for ED2). The following variables can be used: @SITE_LAT@, @SITE_LON@, @SITE_MET@, @START_DATE@, @END_DATE@, @OUTDIR@, @RUNDIR@ which all come variables in the pecan.xml file. The following two command can be used to copy and clean the results from a scratch folder (specified as scratch in the run section below, for example local disk vs network disk) : @SCRATCH_COPY@, @SCRATCH_CLEAR@.

Some models also have model-specific tags, which are described in the PEcAn Models section.

11.0.8 host: Host information for remote execution

This section provides settings for remote model execution, i.e. any execution that happens on a machine (including “virtual” machines, like Docker containers) different from the one on which the main PEcAn workflow is running. A common use case for this section is to submit model runs as jobs to a high-performance computing cluster. If no host tag is provided, PEcAn assumes models are run on localhost, a.k.a. the same machine as PEcAn itself.

For detailed instructions on remote execution, see the Remote Execution page. For detailed information on configuring this for RabbitMQ, see the RabbitMQ page. The following provides a quick overview of XML tags related to remote execution.

NOTE: Any paths specified in the pecan.xml refer to paths on the host specified in this section, /not/ the machine on which PEcAn is running (unless models are running on localhost or this section is omitted).

The host section has the following tags:

  • name: [optional] name of host server where model is located and executed, if not specified localhost is assumed.
  • rundir: [optional/required] location where all the configuration files are written. For localhost this is optional (<outdir>/run is the default), for any other host this is required.
  • outdir: [optional/required] location where all the outputs of the model are written. For localhost this is optional (<outdir>/out is the default), for any other host this is required.
  • scratchdir: [optional] location where output is written. If specified the output from the model is written to this folder and copied to the outdir when the model is finished, this could significantly speed up the model execution (by using local or ram disk).
  • clearscratch: [optional] if set to TRUE the scratchfolder is cleaned up after copying the results to the outdir, otherwise the folder will be left. The default is to clean up after copying.
  • qsub: [optional] the command to submit a job to the queuing system. There are 3 parameters you can use when specifying the qsub command, you can add additional values for your specific setup (for example -l walltime to specify the walltime, etc). You can specify @NAME@ the pretty name, @STDOUT@ where to write stdout and @STDERR@, where to write stderr. You can specify an empty element (<qsub/>) in which case it will use the default value is qsub -V -N @NAME@ -o @STDOUT@ -e @STDERR@ -s /bin/bash.
  • qsub.jobid: [optional] the regular expression used to find the jobid returned from qsub. If not specified (and qsub is) it will use the default value is Your job ([0-9]+) .*
  • qstat: [optional] the command to execute to check if a job is finished, this should return DONE if the job is finished. There is one parameter this command should take @JOBID@ which is the ID of the job as returned by qsub.jobid. If not specified (and qsub is) it will use the default value is qstat -j @JOBID@ || echo DONE
  • job.sh: [optional] additional options to add to the job.sh at the top.
  • modellauncher: [optional] this is an experimental section that will allow you to submit all the runs as a single job to a HPC system.

The modellauncher section if specified will group all runs together and only submit a single job to the HPC cluster. This single job will leverage of a MPI program that will execute a single run. Some HPC systems will place a limit on the number of jobs that can be executed in parallel, this will only submit a single job (using multiple nodes). In case there is no limit on the number of jobs, a single PEcAn run could potentially submit a lot of jobs resulting in the full cluster running jobs for a single PEcAn run, preventing others from executing on the cluster.

The modellauncher has 2 arguements: * binary : [required] The full path to the binary modellauncher. Source code for this file can be found in pecan/contrib/modellauncher](https://github.com/PecanProject/pecan/tree/develop/contrib/modellauncher). * qsub.extra : [optional] Additional flags to pass to qsub besides those specified in the qsub tag in host. This option can be used to specify that the MPI environment needs to be used and the number of nodes that should be used.

11.0.9 Advanced features

11.0.10 ensemble: Ensemble Runs

As with meta.analysis, if this section is missing, then PEcAn will not do an ensemble analysis.

An alternative configuration is as follows:

Tags in this block can be broken down into two categories: Those used for setup (which determine how the ensemble analysis runs) and those used for post-hoc analysis and visualization (i.e. which do not affect how the ensemble is generated).

Tags related to ensemble setup are:

  • size : (required) the number of runs in the ensemble.
  • samplingspace: (optional) Contains tags for defining how the ensembles will be generated.

Each piece in the sampling space can potentially have a method tag and a parent tag. Method refers to the sampling method and parent refers to the cases where we need to link the samples of two components. When no tag is defined for one component, one sample will be generated and used for all the ensembles. This allows for partitioning/studying different sources of uncertainties. For example, if no met tag is defined then, one met path will be used for all the ensembles and as a result the output uncertainty will come from the variability in the parameters. At the moment no sampling method is implemented for soil and vegetation. Available sampling methods for parameters can be found in the documentation of the PEcAn.utils::get.ensemble.samples function. For the cases where we need simulations with a predefined set of parameters, met and initial condition we can use the restart argument. Restart needs to be a list with name tags of runid, inputs, new.params (parameters), new.state (initial condition), ensemble.id (ensemble ids), start.time, and stop.time.

The restart functionality is developed using model specific functions by called write_restart.modelname. You need to make sure first that this function is already exist for your desired model.

Note: if the ensemble size is set to 1, PEcAn will select the posterior median parameter values rather than taking a single random draw from the posterior

Tags related to post-hoc analysis and visualization are:

  • variable: (optional) name of one (or more) variables the analysis should be run for. If not specified, sensitivity.analysis variable is used, otherwise default is GPP (Gross Primary Productivity).

(NOTE: This static visualization functionality will soon be deprecated as PEcAn moves towards interactive visualization tools based on Shiny and htmlwidgets).

This information is currently used by the following PEcAn workflow functions:

  • PEcAn.<MODEL>::write.config.<MODEL> - See above.
  • PEcAn.uncertainty::write.ensemble.configs - Write configuration files for ensemble analysis
  • PEcAn.uncertainty::run.ensemble.analysis - Run ensemble analysis

11.0.11 sensitivity.analysis: Sensitivity analysis

Only if this section is defined a sensitivity analysis is done. This section will have <quantile> or <sigma> nodes. If neither are given, the default is to use the median +/- [1 2 3] x sigma (e.g. the 0.00135 0.0228 0.159 0.5 0.841 0.977 0.999 quantiles); If the 0.5 (median) quantile is omitted, it will be added in the code.

  • quantiles/sigma : [optional] The number of standard deviations relative to the standard normal (i.e. “Z-score”) for which to perform the ensemble analysis. For instance, <sigma>1</sigma> corresponds to the quantile associated with 1 standard deviation greater than the mean (i.e. 0.681). Use a separate <sigma> tag, all under the <quantiles> tag, to specify multiple quantiles. Note that we do not automatically add the quantile associated with -sigma – i.e. if you want +/- 1 standard deviation, then you must include both <sigma>1</sigma> and <sigma>-1</sigma>.
  • start.date : [required?] start date of the sensitivity analysis (in YYYY/MM/DD format)
  • end.date : [required?] end date of the sensitivity analysis (in YYYY/MM/DD format)
    • NOTE: start.date and end.date are distinct from values set in the run tag because this analysis can be done over a subset of the run.
  • variable : [optional] name of one (or more) variables the analysis should be run for. If not specified, sensitivity.analysis variable is used, otherwise default is GPP.
  • perpft : [optional] if TRUE a sensitivity analysis on PFT-specific outputs will be run. This is only possible if your model provides PFT-specific outputs for the variable requested. This tag only affects the output processing, not the number of samples proposed for the analysis nor the model execution.

This information is currently used by the following PEcAn workflow functions:

  • PEcAn.<MODEL>::write.configs.<MODEL> – See above
  • PEcAn.uncertainty::run.sensitivity.analysis – Executes the uncertainty analysis

11.0.12 Parameter Data Assimilation

The following tags can be used for parameter data assimilation. More detailed information can be found here: Parameter Data Assimilation Documentation

11.0.13 Multi-Settings

Multi-settings allows you to do multiple runs across different sites. This customization can also leverage site group distinctions to expedite the customization. It takes your settings and applies the same settings, changing only the site level tags across sites.

To start, add the multisettings tag within the <run></run> section of your xml


Additional tags for this section exist and can fully be seen here:


These tags correspond to different pecan analysis that need to know that there will be multiple settings read in.

Next you’ll want to add the following tags to denote the group of sites you want to use. It leverages site groups, which are defined in BETY.

If you add this tag, you must remove the <site> </site> tags from the <run> tag portion of your xml. The id of your sitegroup can be found by lookig up your site group within BETY.

You do not have to use the sitegroup tag. You can manually add multiple sites using the structure in the example below.

Lastly change the top level tag to <pecan.multi>, meaning the top and bootom of your xml should look like this:

<?xml version="1.0"?>

Once you have defined these tags, you can run PEcAn, but there may be further specifications needed if you know that different data sources have different dates available.

Run workflow.R up until

# Write pecan.CHECKED.xml
PEcAn.settings::write.settings(settings, outputfile = "pecan.CHECKED.xml")

Once this section is run, you’ll need to open pecan.CHECKED.xml. You will notice that it has expanded from your original pecan.xml.

  • The ... replaces the rest of the site settings for however many sites are within the site group.

Looking at the example above, take a close look at the <met.start></met.start> and <met.end></met.end>. You will notice that for both sites, the dates are different. In this example they were edited by hand to include the dates that are available for that site and source. You must know your source prior. Only the source CRUNCEP has a check that will tell you if your dates are outside the range available. PEcAn will automatically populate these dates across sites according the original setting of start and end dates.

In addition, you will notice that the <path></path> section contains the model specific meteorological data file. You can add that in by hand or you can you can leave the normal tags that met process workflow will use to process the data into your model specific format:


11.0.14 (experimental) State Data Assimilation

The following tags can be used for state data assimilation. More detailed information can be found here: State Data Assimilation Documentation

  • process.variance : [optional] TRUE/FLASE flag for if process variance should be estimated (TRUE) or not (FALSE). If TRUE, a generalized ensemble filter will be used. If FALSE, an ensemble Kalman filter will be used. Default is FALSE.
  • sample.parameters : [optional] TRUE/FLASE flag for if parameters should be sampled for each ensemble member or not. This allows for more spread in the intial conditions of the forecast.
  • NOTE: If TRUE, you must also assign a vector of trait names to pick.trait.params within the sda.enkf function.
  • state.variable : [required] State variable that is to be assimilated (in PEcAn standard format). Default is “AGB” - Above Ground Biomass.
  • spin.up : [required] start.date and end.date for model spin up.
  • NOTE: start.date and end.date are distinct from values set in the run tag because spin up can be done over a subset of the run.
  • forecast.time.step : [optional] start.date and end.date for model spin up.
  • start.date : [required?] start date of the state data assimilation (in YYYY/MM/DD format)
  • end.date : [required?] end date of the state data assimilation (in YYYY/MM/DD format)
  • NOTE: start.date and end.date are distinct from values set in the run tag because this analysis can be done over a subset of the run.

11.0.15 (experimental) Brown Dog

This section describes how to connect to Brown Dog. This facilitates processing and conversions of data.

  • url: (required) endpoint for Brown Dog to be used.
  • username: (optional) username to be used with the endpoint for Brown Dog.
  • password: (optional) password to be used with the endpoint for Brown Dog.

This information is currently used by the following R functions:

  • PEcAn.data.atmosphere::met.process – Generic function for processing meteorological input data.
  • PEcAn.benchmark::load_data – Generic, versatile function for loading data in various formats.

11.0.16 (experimental) Benchmarking

Coming soon…