11.2 Advanced features

11.2.1 `ensemble`: Ensemble Runs

As with meta.analysis, if this section is missing, then PEcAn will not do an ensemble analysis.

  <ensemble>
    <size>1</size>
    <variable>NPP</variable>
    <samplingspace>
      <parameters>
        <method>uniform</method>
      </parameters>
      <met>
        <method>sampling</method>
      </met>
    </samplingspace>
  </ensemble>

An alternative configuration is as follows:

<ensemble>
  <size>5</size>
  <variable>GPP</variable>
  <start.year>1995</start.year>
  <end.year>1999</end.year>
  <samplingspace>
  <parameters>
    <method>lhc</method>
  </parameters>
  <met>
    <method>sampling</method>
  </met>
  </samplingspace>
</ensemble>

Tags in this block can be broken down into two categories: Those used for setup (which determine how the ensemble analysis runs) and those used for post-hoc analysis and visualization (i.e. which do not affect how the ensemble is generated).

Tags related to ensemble setup are:

size : (required) the number of runs in the ensemble.
samplingspace: (optional) Contains tags for defining how the ensembles will be generated.

Shared sampling design: In multi-site workflows, PEcAn now generates one joint ensemble design and reuses the same input sample indices (parameters, meteorology, etc.) across all sites to ensure consistent draws; previously, inputs were sampled independently per site. This change does not introduce new XML tags and applies whenever multiple sites are processed together. The joint sampling design is created once at the start of configuration using generate_joint_ensemble_design(), called from the run configuration module, and the resulting indices are threaded through to write.ensemble.configs().

Each piece in the sampling space can potentially have a method tag and a parent tag. Method refers to the sampling method and parent refers to the cases where we need to link the samples of two components. When no tag is defined for one component, one sample will be generated and used for all the ensembles. This allows for partitioning/studying different sources of uncertainties. For example, if no met tag is defined then, one met path will be used for all the ensembles and as a result the output uncertainty will come from the variability in the parameters. At the moment no sampling method is implemented for soil and vegetation. Available sampling methods for parameters can be found in the documentation of the PEcAn.utils::get.ensemble.samples function. For the cases where we need simulations with a predefined set of parameters, met and initial condition we can use the restart argument. Restart needs to be a list with name tags of runid, inputs, new.params (parameters), new.state (initial condition), ensemble.id (ensemble ids), start.time, and stop.time.

The restart functionality is developed using model specific functions by called write_restart.modelname. You need to make sure first that this function is already exist for your desired model.

Note: if the ensemble size is set to 1, PEcAn will select the posterior median parameter values rather than taking a single random draw from the posterior

Tags related to post-hoc analysis and visualization are:

variable: (optional) name of one (or more) variables the analysis should be run for. If not specified, sensitivity.analysis variable is used, otherwise default is GPP (Gross Primary Productivity).

(NOTE: This static visualization functionality will soon be deprecated as PEcAn moves towards interactive visualization tools based on Shiny and htmlwidgets).

This information is currently used by the following PEcAn workflow functions:

PEcAn.<MODEL>::write.config.<MODEL> - See above.
PEcAn.uncertainty::write.ensemble.configs - Write configuration files for ensemble analysis
PEcAn.uncertainty::run.ensemble.analysis - Run ensemble analysis

11.2.2 `sensitivity.analysis`: Sensitivity analysis

Only if this section is defined a sensitivity analysis is done. This section will have <quantile> or <sigma> nodes. If neither are given, the default is to use the median +/- [1 2 3] x sigma (e.g. the 0.00135 0.0228 0.159 0.5 0.841 0.977 0.999 quantiles); If the 0.5 (median) quantile is omitted, it will be added in the code.

<sensitivity.analysis>
    <quantiles>
        <sigma>-3</sigma>
        <sigma>-2</sigma>
        <sigma>-1</sigma>
        <sigma>1</sigma>
        <sigma>2</sigma>
        <sigma>3</sigma>
    </quantiles>
  <variable>GPP</variable>
  <perpft>TRUE</perpft>
    <start.year>2004</start.year>
    <end.year>2006</end.year>
</sensitivity.analysis>

quantiles/sigma : [optional] The number of standard deviations relative to the standard normal (i.e. “Z-score”) for which to perform the ensemble analysis. For instance, <sigma>1</sigma> corresponds to the quantile associated with 1 standard deviation greater than the mean (i.e. 0.681). Use a separate <sigma> tag, all under the <quantiles> tag, to specify multiple quantiles. Note that we do not automatically add the quantile associated with -sigma – i.e. if you want +/- 1 standard deviation, then you must include both <sigma>1</sigma> and <sigma>-1</sigma>.
start.date : [required?] start date of the sensitivity analysis (in YYYY/MM/DD format)
end.date : [required?] end date of the sensitivity analysis (in YYYY/MM/DD format)
- NOTE: start.date and end.date are distinct from values set in the run tag because this analysis can be done over a subset of the run.
variable : [optional] name of one (or more) variables the analysis should be run for. If not specified, sensitivity.analysis variable is used, otherwise default is GPP.
perpft : [optional] if TRUE a sensitivity analysis on PFT-specific outputs will be run. This is only possible if your model provides PFT-specific outputs for the variable requested. This tag only affects the output processing, not the number of samples proposed for the analysis nor the model execution.

This information is currently used by the following PEcAn workflow functions:

PEcAn.<MODEL>::write.configs.<MODEL> – See above
PEcAn.uncertainty::run.sensitivity.analysis – Executes the uncertainty analysis

11.2.3 Parameter Data Assimilation

The following tags can be used for parameter data assimilation. More detailed information can be found here: Parameter Data Assimilation Documentation

11.2.4 Multi-Settings

Multi-settings allows you to do multiple runs across different sites. This customization can also leverage site group distinctions to expedite the customization. It takes your settings and applies the same settings, changing only the site level tags across sites.

To start, add the multisettings tag within the <run></run> section of your xml

<multisettings>
  <multisettings>run</multisettings>
<multisettings>

Additional tags for this section exist and can fully be seen here:

 <multisettings>
  <multisettings>assim.batch</multisettings>
  <multisettings>ensemble</multisettings>
  <multisettings>sensitivity.analysis</multisettings>
  <multisettings>run</multisettings>
 </multisettings>

These tags correspond to different pecan analysis that need to know that there will be multiple settings read in.

Next you’ll want to add the following tags to denote the group of sites you want to use. It leverages site groups, which are defined in BETY.

 <sitegroup>
   <id>1000000022</id>
 </sitegroup>

If you add this tag, you must remove the <site> </site> tags from the <run> tag portion of your xml. The id of your sitegroup can be found by lookig up your site group within BETY.

You do not have to use the sitegroup tag. You can manually add multiple sites using the structure in the example below.

Lastly change the top level tag to <pecan.multi>, meaning the top and bootom of your xml should look like this:

<?xml version="1.0"?>
<pecan.multi>
...
</pecan.multi>

Once you have defined these tags, you can run PEcAn, but there may be further specifications needed if you know that different data sources have different dates available.

Run workflow.R up until

# Write pecan.CHECKED.xml
PEcAn.settings::write.settings(settings, outputfile = "pecan.CHECKED.xml")

Once this section is run, you’ll need to open pecan.CHECKED.xml. You will notice that it has expanded from your original pecan.xml.

 <run>
  <settings.1>
   <site>
    <id>796</id>
    <met.start>2005/01/01</met.start>
    <met.end>2011/12/31</met.end>
    <name>Bartlett Experimental Forest (US-Bar)</name>
    <lat>44.06464</lat>
    <lon>-71.288077</lon>
   </site>
   <start.date>2005/01/01</start.date>
   <end.date>2011/12/31</end.date>
   <inputs>
    <met>
     <path>/fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-796/AMF_US-Bar_BASE_HH_4-1.2005-01-01.2011-12-31.clim</path>
    </met>
   </inputs>
  </settings.1>
  <settings.2>
   <site>
    <id>767</id>
    <met.start>2001/01/01</met.start>
    <met.end>2014/12/31</met.end>
    <name>Morgan Monroe State Forest (US-MMS)</name>
    <lat>39.3231</lat>
    <lon>-86.4131</lon>
   </site>
   <start.date>2001/01/01</start.date>
   <end.date>2014/12/31</end.date>
   <inputs>
    <met>
     <path>/fs/data1/pecan.data/dbfiles/AmerifluxLBL_SIPNET_site_0-767/AMF_US-MMS_BASE_HR_8-1.2001-01-01.2014-12-31.clim</path>
    </met>
   </inputs>
  </settings.2>
....
</run>

The ... replaces the rest of the site settings for however many sites are within the site group.

Looking at the example above, take a close look at the <met.start></met.start> and <met.end></met.end>. You will notice that for both sites, the dates are different. In this example they were edited by hand to include the dates that are available for that site and source. You must know your source prior. Only the source CRUNCEP has a check that will tell you if your dates are outside the range available. PEcAn will automatically populate these dates across sites according the original setting of start and end dates.

In addition, you will notice that the <path></path> section contains the model specific meteorological data file. You can add that in by hand or you can you can leave the normal tags that met process workflow will use to process the data into your model specific format:

<met>
  <source>AmerifluxLBL</source>
  <output>SIPNET</output>
  <username>pecan</username>
</met>

11.2.5 (experimental) State Data Assimilation

The following tags can be used for state data assimilation. More detailed information can be found here: State Data Assimilation Documentation

<state.data.assimilation>
    <process.variance>TRUE</process.variance>
    <aqq.Init>1</aqq.Init>
    <bqq.Init>1</bqq.Init>
  <sample.parameters>FALSE</sample.parameters>
  <adjustment>TRUE</adjustment>
  <censored.data>FALSE</censored.data>
  <FullYearNC>TRUE</FullYearNC>
  <NC.Overwrite>FALSE</NC.Overwrite>
  <NC.Prefix>sipnet.out</NC.Prefix>
  <q.type>SINGLE</q.type>
  <free.run>FALSE</free.run>
  <Localization.FUN>Local.support</Localization.FUN>
  <scalef>1</scalef>
  <chains>5</chains>
  <state.variables>
   <variable>
    <variable.name>AbvGrndWood</variable.name>
    <unit>MgC/ha</unit>
    <min_value>0</min_value>
    <max_value>9999</max_value>
   </variable>
   <variable>
    <variable.name>LAI</variable.name>
    <unit></unit>
    <min_value>0</min_value>
    <max_value>9999</max_value>
   </variable>
   <variable>
  <variable.name>SoilMoistFrac</variable.name>
  <unit></unit>
  <min_value>0</min_value>
  <max_value>100</max_value>
  </variable>
   <variable>
    <variable.name>TotSoilCarb</variable.name>
    <unit>kg/m^2</unit>
    <min_value>0</min_value>
    <max_value>9999</max_value>
   </variable>
  </state.variables>
  <Obs_Prep>
   <Landtrendr_AGB>
    <AGB_input_dir>/projectnb/dietzelab/dongchen/Multi-site/download_500_sites/AGB</AGB_input_dir>
    <allow_download>TRUE</allow_download>
    <export_csv>TRUE</export_csv>
    <timestep>
     <unit>year</unit>
     <num>1</num>
    </timestep>
   </Landtrendr_AGB>
   <MODIS_LAI>
    <search_window>30</search_window>
    <export_csv>TRUE</export_csv>
    <run_parallel>TRUE</run_parallel>
    <timestep>
     <unit>year</unit>
     <num>1</num>
    </timestep>
    <sd_threshold>20</sd_threshold>
    <boundary>
     <upper_boundary>0.95</upper_boundary>
     <lower_boundary>0.05</lower_boundary>
    </boundary>
   </MODIS_LAI>
   <SMAP_SMP>
    <search_window>30</search_window>
    <export_csv>TRUE</export_csv>
    <update_csv>FALSE</update_csv>
    <timestep>
     <unit>year</unit>
     <num>1</num>
    </timestep>
   </SMAP_SMP>
   <Soilgrids_SoilC>
    <timestep>
     <unit>year</unit>
     <num>1</num>
    </timestep>
   </Soilgrids_SoilC>
   <outdir>/projectnb/dietzelab/dongchen/All_NEON_SDA/test_OBS</outdir>
   <start.date>2012-07-15</start.date>
   <end.date>2021-07-15</end.date>
  </Obs_Prep>
  <spin.up>
    <start.date>2004/01/01</start.date>
      <end.date>2006/12/31</end.date>
  </spin.up>
  <forecast.time.step>1</forecast.time.step>
    <start.date>2004/01/01</start.date>
    <end.date>2006/12/31</end.date>
</state.data.assimilation>

process.variance : [optional] TRUE/FLASE flag for if process variance should be estimated (TRUE) or not (FALSE). If TRUE, a generalized ensemble filter will be used. If FALSE, an ensemble Kalman filter will be used. Default is FALSE.
aqq.Init : [optional] The initial value of aqq used for estimate the Q distribution, the default value is 1 (note that, the aqq.init and bqq.init right now only work on the VECTOR q type, and we didn’t account for the variabilities of them across sites or variables, meaning we initialize the aqq and bqq given single value).
bqq.Init : [optional] The initial value of bqq used for estimate the Q distribution, the default value is 1.
sample.parameters : [optional] TRUE/FLASE flag for if parameters should be sampled for each ensemble member or not. This allows for more spread in the intial conditions of the forecast.
adjustment : [optional] Bool variable decide if you want to adjust analysis results by the likelihood.
censored.data : [optional] Bool variable decide if you want to do MCMC sampling for the forecast ensemble space, the default is FALSE.
FullYearNC : [optional] Bool variable decide if you want to generate the full-year netcdf file when there is a overlap in time, the default is TRUE.
NC.Overwrite : [optional] Bool variable decide if you want to overwrite the previous netcdf file when there is a overlap in time, the default is FALSE.
NC.Prefix : [optional] The prefix for the generation of the full-year netcdf file, the default is sipnet.out.
q.type : [optional] The type of process variance that will be estimated, the default is SINGLE.
free.run : [optional] If it’s a free run without any observations, the default is FALSE.
Localization.FUN : [optional] The localization function name for the localization operation, the default is Local.support.
scalef : [optional] The scale parameter used for the localization operation, the smaller the value is, the sites are more isolated.
chains : [optional] The number of chains needed to be estimated during the MCMC sampling process.
NOTE: If TRUE, you must also assign a vector of trait names to pick.trait.params within the sda.enkf function.
state.variable : [required] State variable that is to be assimilated (in PEcAn standard format, with pre-specified variable name, unit, and range). Four variables can be assimilated so far: including Aboveground biomass (AbvGrndWood), LAI, SoilMoistFrac, and Soil carbon (TotSoilCarb).
Obs_Prep : [required] This section will be handled through the SDA_Obs_Assembler function, if you want to proceed with this function, this section is required.
spin.up : [required] start.date and end.date for model spin up.
NOTE: start.date and end.date are distinct from values set in the run tag because spin up can be done over a subset of the run.
forecast.time.step : [optional] start.date and end.date for model spin up.
start.date : [required?] start date of the state data assimilation (in YYYY/MM/DD format)
end.date : [required?] end date of the state data assimilation (in YYYY/MM/DD format)
NOTE: start.date and end.date are distinct from values set in the run tag because this analysis can be done over a subset of the run.

11.2.6 State Variables for State Data Assimilation

The following tags can be used for documentation of state variables required for the SDA workflow.

<state.data.assimilation>
  <state.variables>
   <variable>
    <variable.name>AbvGrndWood</variable.name>
    <unit>MgC/ha</unit>
    <min_value>0</min_value>
    <max_value>9999</max_value>
   </variable>
   <variable>
    <variable.name>LAI</variable.name>
    <unit></unit>
    <min_value>0</min_value>
    <max_value>9999</max_value>
   </variable>
   <variable>
  <variable.name>SoilMoistFrac</variable.name>
  <unit></unit>
  <min_value>0</min_value>
  <max_value>100</max_value>
  </variable>
   <variable>
    <variable.name>TotSoilCarb</variable.name>
    <unit>kg/m^2</unit>
    <min_value>0</min_value>
    <max_value>9999</max_value>
   </variable>
  </state.variables>
</state.data.assimilation>

variable : [required] Each of this section should include each state variables separately.
variable.name : [required] The name of the state variable (same as what we call it in the model output file. for example, “SIPNET.out” file).
unit : [required] The unit of the state variable (can be empty if it’s unitless).
min_value : [required] The minimum range of the state variable.
max_value : [required] The maximum range of the state variable.

11.2.7 Observation Preparation for State Data Assimilation

The following tags can be used for observation preparation for the SDA workflow.

<state.data.assimilation>
  <Obs_Prep>
   <Landtrendr_AGB>
    <AGB_input_dir>/projectnb/dietzelab/dongchen/Multi-site/download_500_sites/AGB</AGB_input_dir>
    <allow_download>TRUE</allow_download>
    <export_csv>TRUE</export_csv>
    <timestep>
     <unit>year</unit>
     <num>1</num>
    </timestep>
   </Landtrendr_AGB>
   
   <MODIS_LAI>
    <search_window>30</search_window>
    <export_csv>TRUE</export_csv>
    <run_parallel>TRUE</run_parallel>
    <timestep>
     <unit>year</unit>
     <num>1</num>
    </timestep>
    <sd_threshold>20</sd_threshold>
    <boundary>
     <upper_boundary>0.95</upper_boundary>
     <lower_boundary>0.05</lower_boundary>
    </boundary>
   </MODIS_LAI>
   
   <SMAP_SMP>
    <search_window>30</search_window>
    <export_csv>TRUE</export_csv>
    <update_csv>FALSE</update_csv>
    <timestep>
     <unit>year</unit>
     <num>1</num>
    </timestep>
   </SMAP_SMP>
   
   <Soilgrids_SoilC>
    <timestep>
     <unit>year</unit>
     <num>1</num>
    </timestep>
   </Soilgrids_SoilC>
   <outdir>/projectnb/dietzelab/dongchen/All_NEON_SDA/test_OBS</outdir>
   <start.date>2012-07-15</start.date>
   <end.date>2021-07-15</end.date>
  </Obs_Prep>
</state.data.assimilation>

The Obs_Prep section is specifically designed for the SDA_OBS_Assembler function within the state.data.assimilation section, which helps construct the obs.mean and obs.cov objects more efficiently. Within the Obs_Prep section, the user needs to specify data streams that will be proceeded in the assembler function as DataSource_AbbreviationOfData (for example: Landtrendr_AGB for Landtrendr aboveground biomass observations, or Soilgrids_SoilC for soilgrids soil organic carbon concentration). There are four functions that can be used for preparing observations: MODIS_LAI_prep, Soilgrids_SoilC_prep, SMAP_SMP_prep, and Landtrendr_AGB_prep functions. Other than that, the output directory, and start and end date need to be set within the Obs_Prep section, note that if you choose to export csv file for any data, you will get the corresponding csv file just under the outdir, for the Rdata of obs.mean and obs.cov objects, they will be stored whthin the Rdata folder within the outdir, if the Rdata folder is not created, then the assembler will help to create this folder automatically. * outdir : [required] This tag is for every observation preparation function, which points to the directory where the csv file associated with each data stream should be generated, read, or updated. * start.date : [required] This tag defines the exact start point of time, which differs from the start.date in the state.data.assimilation section in a way that it is the time when first observation should be taken. * end.date : [required] This tag defines the exact end point of time, which differs from the end.date in the state.data.assimilation section in a way that it is the time when last observation should be taken.

Within each data-specific section (for example, within Landtrendr_AGB section), there are few general settings need to be set like the export_csv and timestep arguments. * export_csv : [optional] This tag is for every observation preparation function, which decide if we want to export the csv file to the outdir. Normally the outdir should be given in order to set export_csv as TRUE. The default value is TRUE. * timestep : [optional] This tag defines the time unit and how many unit time per each time step. It supports year, month, week, and day as time unit. If you assign a general timestep just under the Obs_Prep section, all variable will be handled with the same timestep. If you assign each variable with each timestep, then the assembler function will create obs.mean and obs.cov with mixed temporal resolution. The default value is list(unit=“year”, num=1).

There are also several settings that are data-specific, for example, the AGB_indir only applies to Landtrendr AGB data that points to the directory where the AGB files are stored. Beyond that, the allow_download is also AGB specific, which decide if you want to download the AGB files if some are missing. * AGB_indir : [required] This tag is only for Landtrendr_AGB_prep function, which points to the directory where the pre-downloaded AGB files existed. * allow_download : [optional] This tag is only for Landtrendr_AGB_prep function, which decide if we want to proceed to download the AGB files (that will take a while, and a lot of space) if some of them are detected to be empty. The default value is FALSE.

For the MODIS LAI and SMAP soil moisture observations, the search_window specifies how many days you want to search for the data with desired dates. Beyond that, the update_csv is specifically used for the SMAP data, which decide if you want to reproduce formatted csv file based on the downloaded SMAP_gee.csv file. For the run_parallel in MODIS LAI settings, it specify if you want to proceed the function parallely. * search_window : [optional] This tag is used for LAI and SMAP observation preparation functions, which defines the search window within which the closest observation to the target date is selected. The default value is 30 days. * update_csv : [optional] This tag is used for SMAP preparation function only, which decide if we want to update the CSV file. Normally, we will only set it to be TRUE if we redownload the SMAP_GEE csv file from Google Earth Engine, the tutorial can be found in the SMAP_SMP_prep function itself. The default value is FALSE. * run_parallel : [optional] This tag defines if you want to proceed the MODIS LAI function parallely, the default value is FALSE. * sd_threshold : [optional] This tag defines the maximum allowable value for the MODIS LAI. Any records that have higher uncertainty will be removed. * boundary : [optional] This tag defines the upper and lower boundaries (in quantiles) for filtering the noisy LAI time series.

11.2.8 (experimental) Benchmarking

Coming soon…

11.2.9 Remote data module

This section describes the tags required for configuring remote_process.

  <remotedata>
  <out_get_data>...</out_get_data>
  <source>...</source>
  <collection>...</collection>
  <scale>...</scale>
  <projection>...</projection>
  <qc>...</qc>
  <algorithm>...</algorithm>
  <credfile>...</credfile>
  <out_process_data>...</out_process_data>
  <overwrite>...</overwrite>
  </remotedata>

out_get_data: (required) type of raw output requested, e.g, bands, smap
source: (required) source of remote data, e.g., gee or AppEEARS
collection: (required) dataset or product name as it is provided on the source, e.g. “COPERNICUS/S2_SR” for gee or “SPL3SMP_E.003” for AppEEARS
scale: (optional) pixel resolution required for some gee collections, recommended to use 10 for Sentinel 2
projection: (optional) type of projection. Only required for AppEEARS polygon AOI type
qc: (optional) quality control parameter, required for some gee collections
overwrite: (optional) if TRUE database checks will be skipped and existing data of same type will be replaced entirely. When processed data is requested, the raw data required for creating it will also be replaced. By default FALSE

These tags are only required if processed data is requested:

out_process_data: (optional) type of processed output requested, e.g, LAI
algorithm: (optional) algorithm used for processing data, currently only SNAP is implemented to estimate LAI from Sentinel-2 bands
credfile: (optional) absolute path to JSON file containing Earthdata username and password, only required for AppEEARS
pro_mimetype: (optional) MIME type of the processed file
pro_formatname: (optional) format name of the processed file

Additional information for the module are taken from the registration files located at data.remote/inst/registration

The output data from the module are returned in the following tags:

raw_id: input id of the raw file
raw_path: absolute path to the raw file
pro_id: input id of the processed file
pro_path: absolute path to the processed file