PEcAn.assim.batch Vignette
Istem Fer
Ryan Kelly
January 5, 2022
Source:vignettes/AssimBatchVignette.Rmd
AssimBatchVignette.RmdInstall package from Github
Only needs to be done the first time :
library(devtools)
install_github("PecanProject/pecan", subdir="all")Parameter Data Assimilation (a.k.a. calibration) in PEcAn
Currently, there are three ways of doing Parameter Data Assimilation (PDA) in PEcAn :
- bayesian.tools (bruteforce algorithms)
- emulator
- multi-site hierarchical PDA (HPDA) with emulators
Note that, there used to be Metropolis-Hastings Markov Chain
Monte Carlo (MCMC) algorithms implemented natively in PEcAn, namely
/modules/assim.batch/R/pda.mcmc.R and
pda.mcmc.bs.R. These are deprecated and they will no longer
be maintained.
Which one to use?
bayesian.tools : The
pda.bayesian.tools.R is the main script of this PDA
workflow that performs a bruteforce (MCMC methods that require running
your models at least couple of thousand of times) calibration. This
script uses the BayesianTools R-package (Hartig, Minuno and
Paul, 2019) that includes MCMC and SMC samplers and other tools for
Bayesian parameter calibration with external models. If you choose
bayesian.tools option, PEcAn framework will hand the PDA
calculations over to the BayesianTools package. Although
this package includes algorithms that are designed to explore the
parameter space more efficiently than the traditional MH-MCMC, you would
still need a relatively faster model to use these algorithms. Note that
the BayesianTools R-package itself is constantly under
development, and its integration with PEcAn may not be able to leverage
all BayesianTools functionality.
emulator : The pda.emulator.R is the
main script of this PDA workflow. When a model is slow, it is
practically not possible to run it hundreds of thousands of times
sequentially to explore the parameter space and draw enough samples to
converge the target distribution with bruteforce algorithms. Instead, we
can run the model for a relatively smaller number of times with
parameter combinations that have been carefully chosen to give good
coverage of parameter space. Then we can interpolate the likelihood
calculated for each of those runs to get a surface that “emulates” the
true likelihood and perform regular MCMC (just like the “bruteforce”
approach), except instead of actually running the model on every
iteration to obtain a likelihood value, this time we will just get an
approximation from the likelihood emulator. A lot more details about the
emulator approach can be found in Fer et al. 2018.
Multi-site HPDA : The pda.emulator.ms.R
is the main script of this PDA workflow which performs a multi-site
hierarchical Bayesian calibration using the emulator approach. In the
multi-site hierarhical calibration, global posteriors are fitted
simultaneously using an additional Gibbs update step, and used as the
new prior hyperparameters in the next iteration of MCMC to obtain
site-level hierarchical posteriors. This PDA workflow has its own
vignette (see
/modules/assim.batch/vignettes/MultiSitePDAVignette.Rmd)
and will not be one of the examples here. For more information also see
Fer et al. 2021
bioRxiv preprint.
Adding PDA tags to pecan.xml
The easiest way to use PEcAn’s parameter data assimilation module is
to add an <assim.batch> block to pecan.xml, load the
file with read.settings, and pass the resulting settings
object to pda.bayesian.tools() or
pda.emulator(). There are some differences in the settings
for using different PDA methods (we will go through test cases, at this
point just familiarize yourself with the general pattern), but here is
an example <assim.batch> block :
<assim.batch>
<iter>100000</iter>
<method>emulator</method> <--- WHICH PDA APPROACH
<chain>3</chain>
<param.names>
<temperate.coniferous> <--- YOUR PFT(S)
<param>PARAM1</param> <--- PARAMETER(S) YOU TARGET IN PDA
<param>PARAM2</param>
</temperate.coniferous>
</param.names>
<jump>
<adapt>250</adapt>
<adj.min>0.1</adj.min>
<ar.target>0.3</ar.target>
</jump>
<inputs>
<file>
<input.id> ... </input.id> <--- YOUR CONSTRAINT ID FROM BETY DB
<path>
<path> ... </path> <--- YOUR CONSTRAINT FILE PATH(S)
<path> ... </path>
</path>
<likelihood>Laplace</likelihood> <--- LIKELIHOOD FORM FOR THIS CONSTRAINT
<variable.id>298</variable.id> <--- (CONSTRAINT) VARIABLE ID FROM BETY DB
<variable.name>
<variable.name>LE</variable.name> <--- (CONSTRAINT) VARIABLE
<variable.name>UST</variable.name> <--- HELPER VARIABLE SPECIFIC TO FLUX CASE
</variable.name>
</file>
<file>
<input.id>...</input.id>
<path>
<path>...</path>
</path>
<likelihood>multipGauss</likelihood>
<hyper.pars>
<parama>0.001</parama> <--- HYPERPARAMETERS SPECIFIC TO GAUSSIAN FORM
<paramb>0.001</paramb>
</hyper.pars>
<ss.positive>TRUE</ss.positive>
<variable.id>...</variable.id>
<variable.name>
<variable.name>...</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Here are some more details about the PDA tags:
<iter>Specifies the number of MCMC iterations to run. If continuing a previous MCMC, this is the number of additional iterations, which will be added to the previous total. Defaults to 100 if missing.<chain>Specifies the number of MCMC chains to be run. Defaults to 2. Note that some of the post-processing functions will look for at least 2 chains for diagnostics.-
<prior>Identifies the prior to be used for PDA. Can be one of either:+ `<posterior.id>` A posterior ID in BETY specifying the posterior from a previous PEcAn analysis (e.g., meta-analysis or previous PDA) to be used as the prior for PDA. Defaults to the most recent relevant posterior in the database if omitted (and no `<path>` specified instead; see below). + `<path>` As an alternative to using a posterior ID, can specify a file path to either a `prior.distns.Rdata` or `post.distns.Rdata` file generated from an earlier analysis. Conceptually, using a posterior distribution as the prior for PDA is preferred, as this allows the multiple analyses to work together to iteratively constrain parameters. In practice, previous analyses may have over-constrained parameters to ranges that do not actually optimize model outputs, so using a less informative prior for PDA might yield better results. NOTE: It is recommended to leave the `<prior>` tag empty if you are doing this for the first time. PEcAn workflow will handle it for you. <param.names>The names of parameters to be constrained by assimilation, listed in individual<param>tags. These must be the standard names given in BETY DB (e.g. wood_density), not your model specific parameter names (e.g. wooddens), i.e. check out theidcolumn of thetrait.dictionary, i.e. :
## id figid
## 1 plant_min_temp Plant Min Temp
## 2 GDD Growing Degree Days
## 3 leafC Leaf C conc.
## 4 c2n_leaf Leaf C:N
## 5 leaf_respiration_rate_m2 Leaf Respiration Rate
## 6 dark_respiration_factor Dark Respiration Factor
You can inspect your write.config.MODEL.R script for
standard parameter names and their mappings to model parameter
names.
NOTE : `<param.names>` chunk should always be on your xml file regardless of the method you use in the PDA.
-
<inputs>Observation (constraint) data to be compared to the model. In principle, can be one or more datasets, specified in a variety of ways. In practice, the code is tested for assimilating flux datasets numerous times using the NEE/FC or LE variables.-
<file>Denotes a set of tags for a single input (constraint data). Would be repeated for multiple datasets/variables, e.g. in this case note the differences in<variable.id>and<variable.name>:
... <inputs> <file> <input.id>15000000028</input.id> <path> <path>/data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path> </path> <likelihood>Laplace</likelihood> <variable.id>297</variable.id> <variable.name> <variable.name>NEE</variable.name> <variable.name>UST</variable.name> </variable.name> </file> <file> <input.id>15000000028</input.id> <path> <path>/data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path> </path> <likelihood>Laplace</likelihood> <variable.id>298</variable.id> <variable.name> <variable.name>LE</variable.name> <variable.name>UST</variable.name> </variable.name> </file> </inputs> ...<input.id>BETY input ID for looking up the input.<path>File path to the input. Both<id>and<path>of the observation data should be supplied for the PDA.<source>A standardized source of input data (e.g., Ameriflux). Not implemented yet, but the idea would be similar to the met workflow, PEcAn would be able to use standard data sources automatically where available. Only used if no<id>or<path>is given.<likelihood>Identifier for the likelihood to use. E.g., the Ameriflux NEE/FC and LE data use a Laplacian likelihood.<hyper.pars>Optional. Hyperparameters for your likelihood. E.g. Prior parameters of the precision for Gaussian likelihood. Defaults to scaled values if not provided.<ss.positive>When using the emulator, it is important to let the algorithm know that you have a likelihood whose sufficient statistics is zero bound.-
<variable.id>The BETY variable ID associated with this dataset. The idea is that specific preprocessing steps (e.g., estimating heteroskedastic error for tower NEE) would be associated with particular IDs. Could automate further by assigning default<likelihood>to variable.id values (allowing<likelihood>to be omitted from pecan.xml).NOTE : `<inputs>` chunk should always be on your xml file regardless of the method you use in the PDA.
-
-
<jump>Settings for the specifics of the proposal schema and adaptation functionality. It is used in theemulatorapproach, it’s possible to set adaptation options usingbayesian.toolsspecific tags as well (see below for<bt.settings>).-
<ar.target>Target acceptance rate for the adaptive jump algorithm. Defaults to 0.5 if missing. -
<adapt>Number of iterations between jump variance adaptations. Defaults tofloor(iter/10)if missing. If set equal to the number of<iter>, basically turns off the adaptation functionality, it is not recommended to turn-off this functionality for the emulator approach. -
<adj.min>Minimum factor by which to reduce jump variance when adapting. Prevents jump variances from degenerating to 0. Defaults to 0.1 if missing.
-
Method specific settings
Now we will go through 4 examples of how you could arrange your
<assim.batch> tags and run a PDA in PEcAn:
- Using existing PEcAn data pipelines (e.g. Ameriflux) and variables in the emulator-based calibration.
- Using existing PEcAn data pipelines and variables in the BayesianTools-based (bruteforce) calibration.
- Using your own pre-formatted data and variables in the emulator-based calibration.
- Using your own pre-formatted data and variables in the BayesianTools-based (bruteforce) calibration.
1. PEcAn data pipeline with emulator
Assuming you already familiarized yourself with Demo 1 and Demo 2,
and ran SIPNET at Niwot Ridge (US-NR1), in this section we will go
through the pecan.xml tags for calibrating SIPNET with
Ameriflux NEE data with the emulator approach at the US-NR1 site. The
SIPNET PFT for this site was temperate.coniferous, let’s
assume the top three parameters contributing to the model NEE output
uncertainty (Demo 2) were
psnTOpt,half_saturation_PAR, and
Vm_low_temp, so we decided to constrain these in the PDA.
Go to your workflow directory where you performed the uncertainty
analysis, open up your pecan.CONFIGS.xml, insert the
following tags at the top of your xml after the
<pecan> tag (don’t forget to remove
<--- comments), modify the paths according to
your setup, and save the file as
pecan.PDA.xml:
<assim.batch>
<method>emulator</method>
<n.knot>60</n.knot> <--- 20 x p (= no. of paramaters you target)
<iter>100000</iter>
<chain>3</chain>
<param.names>
<temperate.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</temperate.coniferous>
</param.names>
<jump>
<adapt>250</adapt>
<adj.min>0.1</adj.min>
<ar.target>0.3</ar.target>
</jump>
<inputs>
<file>
<input.id>1000011238</input.id> <--- your input ID (of the DB record of where the AMF csv file is)
<path>
<path>/.../AmerifluxLBL_site_0-772/AMF_US-NR1_BASE_HH_12-5.csv</path> <--- your path
</path>
<likelihood>Laplace</likelihood>
<variable.id>1000000042</variable.id>
<variable.name>
<variable.name>FC</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
For reference, these are the tags for a similar workflow for SIPNET at ICOS-NEE (FI-Var site) combination. Note the differences in PFT name, input ID and path, variable ID and name:
<assim.batch>
<method>emulator</method>
<n.knot>60</n.knot>
<iter>100000</iter>
<chain>3</chain>
<param.names>
<boreal.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</boreal.coniferous>
</param.names>
<jump>
<adapt>250</adapt>
<adj.min>0.1</adj.min>
<ar.target>0.3</ar.target>
</jump>
<inputs>
<file>
<input.id>15000000028</input.id>
<path>
<path>/.../ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path>
</path>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Now you can read in your pecan.PDA.xml and run PDA (This
will take some time, go grab a drink. The first round especially takes
time because this is when the constraint data is loaded and its
effective sample size is calculated):
settings <- read.settings("pecan.PDA.xml")
settings_PDA_round1 <- pda.emulator(settings)
The PDA workflow has populated you settings list object
with couple of relevant tags. Now settings_PDA_round1 has
these additional tags (note that this post-PDA settings
list is also automatically written to your workflow directory as
pecan.pdaR1_ENS_ID.xml). These tags look like this:
<prior>
<prior.id>
<boreal.coniferous>INITIAL_PDA_PRIOR_ID</boreal.coniferous>
</prior.id>
</prior>
<ensemble.id>
<id>R1_ENS_ID</id>
</ensemble.id>
<emulator.path>/.../PEcAn_WORKFLOWID/emulator.pdaR1_ENS_ID.Rdata</emulator.path>
<ss.path>/.../PEcAn_WORKFLOWID/ss.pdaR1_ENS_ID.Rdata</ss.path>
<mcmc.path>/.../PEcAn_WORKFLOWID/mcmc.list.pdaR1_ENS_ID.Rdata</mcmc.path>
<resume.path>/.../PEcAn_WORKFLOWID/resume.pdaR1_ENS_ID.Rdata</resume.path>
<round_counter>1</round_counter>
<extension>round</extension>
<prior.id>This is the initial prior that went into your first PDA round. Even though the PDA workflow updates the<posteriorid>under your PFT block, if you do another emulator round only the initial prior will be used in the MCMC (otherwise you use data twice!). Hence, we keep track of its ID in the xml. But it is still perfectly fine to propose the new training points for the next emulator round from the posterior of the previous round, so that you can refine your emulator surface, that’s when the PDA workflow uses<posteriorid>under the hood.<ensemble.id>Each emulator round is assigned a new ensemble ID. You can see this ID in every file recorded for the associated emulator round, e.g.:history.pdaR1_ENS_ID.Rdata,pecan.pdaR1_ENS_ID.xml,post.distns.pda.YOURPFT_R1_ENS_ID.Rdataetc.
Rest of the tags are for the emulator workflow to be able to find
previous rounds’ information and understand what to do in the next
round. The emulator workflow automatically adds
<extension>round</extension> because if you
forget to add this tag manually after the first round, you will just be
repeating the first round while you almost certainly intend to do an
iterative round if you are using the settings of your previous round.
Forgetting to add this tag has happened enough times that we now add
this tag automatically. But note that in PDA we have a second type of
extension which is
<extension>longer</extension>. This is to run
your MCMC chain longer in case it didn’t converge. This will be more of
an issue for the bruteforce calibration (see below), but it is also
possible to run your MCMC chains longer if your emulator-based MCMC also
didn’t converge:
-
<extension>Specifies the extension type of additional emulator runs, should be skipped if this a first emulator run of its own. Otherwise it is possible to extend emulator PDA runs in two ways:<extension>longer</extension>using the same emulator, this extension run takes the MCMC sampling from where it was left in the previous emulator run and runs a longer MCMC chain.<extension>round</extension>this extension run proposes new points in the parameter space in addition to the previous ones, and builds a new emulator including these additional points for a new MCMC sampling. These new points can come from both your inital PDA prior and the posterior of your first round of emulator run. You can determine the percentage of new knots coming from the posterior of your previous run in the<knot.par>tag. If you leave it empty, 90% of your new points will be drawn from the posterior of your previous run by default.
Now you can either pass the post PDA settings in your environment to
the pda.emulator() function again (yes you can put this in
a loop) or you can re-read your pecan.pdaR1_ENS_ID.xml at
any time to perform an iterative emulator round.
### settings_PDA_round1 <- read.settings("pecan.pdaR1_ENS_ID.xml")
settings_PDA_round2 <- pda.emulator(settings_PDA_round1)
Before moving on to the BayesianTools case, here is what
your pecan.PDA.xml could look like for a case where you
constrain parameters of more than one model PFT with more than one
Ameriflux data stream, note the ordering in
<assim.batch><param.names> and
<pfts>:
...
<assim.batch>
...
<param.names>
<PFT1>
<param>PFT1_PDA_param1</param>
<param>PFT1_PDA_param2</param>
</PFT1>
<PFT2>
<param>PFT2_PDA_param1</param>
<param>PFT2_PDA_param2</param>
<param>PFT2_PDA_param3</param>
</PFT2>
</param.names>
...
<inputs>
<file>
<input.id>YOURINPUTID</input.id>
<path>
<path>/YOURPATH/AmerifluxLBL_site_***/AMF_***_BASE_HH_***.csv</path>
</path>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
<file>
<input.id>YOURINPUTID</input.id>
<path>
<path>/YOURPATH/AmerifluxLBL_site_***/AMF_***_BASE_HH_***.csv</path>
</path>
<likelihood>Laplace</likelihood>
<variable.id>298</variable.id>
<variable.name>
<variable.name>LE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
...
</assim.batch>
...
<pfts>
<pft>
<name>PFT1</name>
<outdir>/YOUR/WORKFLOWS/PATH/PEcAn_WORKFLOWID/pft/PFT1</outdir>
<posteriorid>PFT1_POSTID</posteriorid>
</pft>
<pft>
<name>PFT2</name>
<outdir>/YOUR/WORKFLOWS/PATH/PEcAn_WORKFLOWID/pft/PFT2</outdir>
<posteriorid>PFT2_POSTID</posteriorid>
</pft>
</pfts>
...
2. PEcAn data pipeline with BayesianTools
Using the bruteforce PDA approach with one of the MCMC-flavors as
implemented in the BayesianTools R-package requires slightly different
tags to configure its setup. pda.bayesian.tools() would
look for sampler specific settings that can be passed through the
pecan.xml as a block under the <bt.settings> tag.
Currently, the available samplers in the BayesianTools package are:
- Metropolis
- Standard MH-MCMC
- With pre-optimization
- Adaptive MCMC
- Delayed rejection
- Gibbs updating
- M : Another implementation of standard MH-MCMC.
- AM : Adaptive Metropolis
- DR : Delayed Rejection
- DRAM : Delayed Rejection Adaptive Metropolis
- DE : Differential Evolution
- DEzs : Differential Evolution with a snooker updater
- DREAM : Differential Evolution Adaptive Metropolis
- DREAMzs : Differential Evolution Adaptive Metropolis with a snooker updater
- Twalk : “traverse” or “thoughtful” walk, a general purpose sampling algorithm
- SMC : Sequential Monte Carlo
Note that it is not possible to use some of the samplers in this package for univariate cases. The ones that you can use for univariate cases are: “Metropolis”, “DE”, “DEzs” and “Twalk”. Some of the samplers are also restartable : “Metropolis”, “DE”, “DEzs”, “DREAM”, “DREAMzs”, “Twalk”. For further details please see the BayesianTools documentation which itself has a helpful vignette.
The name of the chosen sampler would be passed under
<sampler> tag within the
<bt.settings> block. E.g. for Metropolis
variations:
...
<method>bayesian.tools</method>
<bt.settings>
<iter>10000</iter>
<sampler>Metropolis</sampler>
<DRlevels>1</DRlevels> <-- set 2 for Delayed Rejection
<optimize>FALSE</optimize> <-- set TRUE for pre-optimization
<adapt>FALSE</adapt> <-- set TRUE for Adaptive Metropolis
<adaptationNotBefore>500</adaptationNotBefore> <-- set for Adaptive Metropolis
</bt.settings>
...
The rest of the <assim.batch> block is the same
(except you can now omit the <jump> tag as it takes
no effect in this case). While all the algorithms can be accessed by the
“Metropolis” sampler in BayesianTools package by specifying the
sampler’s settings, it could be easier to directly pass the sampler name
explicitly. Below is an example for DREAMzs (recommended) algorithm:
<assim.batch>
<method>bayesian.tools</method>
<bt.settings>
<sampler>DREAMzs</sampler>
<iter>10000</iter>
<chain>3</chain>
</bt.settings>
<param.names>
<soil>
<param>som_respiration_rate</param>
</soil>
<boreal.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</boreal.coniferous>
</param.names>
<inputs>
<file>
<input.id>15000000028</input.id>
<path>
<path>/data/dbfiles/ICOS_site_1-266/FLX_FI-Var_FLUXNET2015_FULLSET_HH_2016-2018_beta-3.csv</path>
</path>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Assuming you save these tags under a pecan.PDA.xml
settings <- read.settings("pecan.PDA.xml")
settings_postPDA <- pda.bayesian.tools(settings)
In case of no convergence and/or if you would like to run the same chains longer, you can add the extension option:
settings_postPDA$assim.batch$extension <- 'longer'
settings_postPDA_longer <- pda.bayesian.tools(settings_postPDA)
3. Own data with emulator
Sometimes you might want to use your own local data (e.g. from a
tower that is not necessarily part of any flux tower network PEcAn has
access to, or some chamber measurements etc.) for constraining your
model. Although we still do recommend using formal PEcAn data
registration and processing pipelines, sometimes you may not have access
to the DB and/or may want to bypass other data processing steps to
perform a quick PDA analysis to plan ahead. In that case, you can use
external.* arguments of the pda.emulator()
function. You can prepare the external.* arguments
manually, but the easiest way to generate them is to use
pda.generate.externals() function that you can find under
the script with the same name. pda.emulator() function
accepts the following external.* arguments:
external.dataThis is a list containing your constraining variables in the format expected by the PDA functions (see alsopda.generate.externals()function). The length of this list is the number of your data streams (e.g. if you are using NEE and LE in calibration, it will have two sublists). Hint: You may want to perform case #1 and load thehistory.pdafile to check the format of theinputsobject out, because ifexternal.datais passed to thepda.emulator()it will simply be assined to this variable,inputs <- external.data. To be able to prepareexternal.datawith thepda.generate.externals()function you need your data as a(n ordered) list where each sublist corresponds to a data frame of your constraining variable with two columns, variable name - posix.external.formatsThis is a list of variable names in the format expected by the PDA functions (see alsopda.generate.externals()function). The length of this list is the number of your data streams (e.g. if you are using NEE and LE in calibration, it will have two sublists). Depending on expressions provided in this object, PDA functions are capable of performing simple derivations on data. Hence, this list has a slightly more complex structure than a simple list of names, but don’t worry about it. To be able to prepareexternal.formatswith thepda.generate.externals()it is enough to passvarnvector that has the standard variable names (when in doubt, check the standard variable names under/base/utils/data/standard_vars.csv).external.priorsThis is a list of pecan probability distribution dataframes to be used as PDA priors, one sublist per PFT, e.g.:
[[1]]
distn parama paramb n
frozenSoilEff beta 1.000 1.00 NA
soilRespMoistEffect unif 0.000 2.00 NA
...
[[2]]
distn parama paramb n
growth_resp_factor beta 2.630 6.5e+00 0
root_respiration_rate norm 8.700 9.5e-01 NA
...
-
external.knotsThis is a list of dataframes containing emulator training points. Only pass it if you want to build the emulator on specific knots over which you want to have control, otherwise knots are generated from the PDA priors using Latin Hypercube Sampling. Each sublist should be anknot x nPFTparamdataframe, e.g. if you are building an emulator with 250 knots and your PFT has 30 parameters, this will be a 250 x 30 dataframe where each row is a parameter vector. Therefore only the columns of parameters that you are targeting in PDA should have varying parameter values. E.g. if you are calibrating thepsnTOptparameter, it would look like this:
> external.knots
$temperate.deciduous
... half_saturation_PAR fine_root_respiration_Q10 psnTOpt AmaxFrac ...
[1,] ... 15.5 3.2 24.83626 0.8605507 ...
[2,] ... 15.5 3.2 25.18056 0.8605507 ...
[3,] ... 15.5 3.2 26.10204 0.8605507 ...
[4,] ... 15.5 3.2 25.03353 0.8605507 ...
... ... ... ... ... ... ...
Once you prepare external.data and
external.format, now you can omit the
<input.id></input.id> and
<path></path> tags in the
<assim.batch> block (they will be ignored even if you
pass them when external.data is not NULL:
<assim.batch>
<method>emulator</method>
<n.knot>60</n.knot>
<iter>100000</iter>
<chain>3</chain>
<param.names>
<boreal.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</boreal.coniferous>
</param.names>
<jump>
<adapt>250</adapt>
<adj.min>0.1</adj.min>
<ar.target>0.3</ar.target>
</jump>
<inputs>
<file>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Save and read your pecan.PDA.xml accordingly. Next, pass
the external data and formats you prepared explicitly to the
pda.emulator() function:
settings <- read.settings("pecan.PDA.xml")
pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE)
settings_postPDA <- pda.emulator(settings, external.data = pda.externals$external.data, external.formats = pda.externals$external.formats)
Note that this way we provided external constraint data and its
format information to the PDA functions but we are still using database
connection. If you want to use PDA functions with no database
connection, then you need to provide prior list to the function as an
external object as well and run it in remote = TRUE mode.
In that case, it is also recommended to pass an ensemble.id
as an identifier for your PDA run (normally this is given by the PDA
workflow), it can even be a string:
settings <- read.settings("pecan.PDA.xml")
prior.list <- list(data.frame(distn = c("unif", "unif", "norm"),
parama = c(5, 4, 0),
paramb = c(40, 27, 3),
n = rep(NA, 3),
row.names = c("psnTOpt", "half_saturation_PAR", "Vm_low_temp")))
pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE, external.priors = TRUE, prior.list = prior.list)
settings_postPDA <- pda.emulator(settings, external.data = pda.externals$external.data,
external.formats = pda.externals$external.formats,
external.priors = pda.externals$external.priors,
remote = TRUE, ensemble.id = "PDA_VIGNETTE")
Two more notes here before moving onto the next case. Above, in the
prior.list object, we only passed prior distribution for
the three parameters we target. In this case rest of the model
parameters will be kept as their defaults (whichever way it is
implemented in the model package), they won’t be kept at the medians
of their prior range as it is normally the case in PDA because we
haven’t passed a prior (range) for them. If you don’t want that to
happen, pass a more comprehensive list or try to use prior (or meta
analysis posterior) objects generated by PEcAn workflows. Finally, we
haven’t used the external.knots argument yet. We only use
it when we need to overwrite and enforce the training points for the
emulator and don’t necessarily need to populate this argument when doing
PDA without database connection. We leave testing with
external.knots as an exercise.
4. Own data with BayesianTools
It’s all very much the same as the cases above, in combination of #2
(prepare BayesianTools specific xml tags) and #3 (prepare
external arguments).
<assim.batch>
<method>bayesian.tools</method>
<bt.settings>
<sampler>DREAMzs</sampler>
<iter>10000</iter>
<chain>3</chain>
</bt.settings>
<param.names>
<boreal.coniferous>
<param>psnTOpt</param>
<param>half_saturation_PAR</param>
<param>Vm_low_temp</param>
</boreal.coniferous>
</param.names>
<inputs>
<file>
<likelihood>Laplace</likelihood>
<variable.id>297</variable.id>
<variable.name>
<variable.name>NEE</variable.name>
<variable.name>UST</variable.name>
</variable.name>
</file>
</inputs>
</assim.batch>
Save and read your pecan.PDA.xml accordingly. Next, pass
the external data and formats you prepared explicitly to the
pda.bayesian.tools() function:
settings <- read.settings("pecan.PDA.xml")
pda.externals <- pda.generate.externals(external.data = TRUE, obs = obs, varn = "NEE", varid = 297, external.formats = TRUE, external.priors = TRUE, prior.list = prior.list)
settings_postPDA <- pda.bayesian.tools(settings, external.data = pda.externals$external.data,
external.formats = pda.externals$external.formats,
external.priors = pda.externals$external.priors,
remote = TRUE, ensemble.id = "PDA_VIGNETTE")