43 Parameter Data Assimilation
- Gain hands-on experience in using Bayesian MCMC techniques to calibrate a simple ecosystem model using parameter data assimilation (PDA)
- Set up and run a PDA in PEcAn using model emulation technique, assimilating NEE data from Niwot Ridge
- Examine outputs from a PDA for the SIPNET model and evaluation of the calibrated model against i) data used to constrain model, ii) additional data for the same site
43.2 Larger Context
Parameter data assimilation (PDA) occurs as one step in the larger process of model calibration, validation, and application. The goal of PDA is to update our estimates of the posterior distributions of the model parameters using data that correspond to model outputs. This differs from our previous use of PEcAn to constrain a simple model using data that map directly to the model parameters. Briefly, the recommended overall approach for model calibration and validation consists of the following steps:
- Assemble and process data sets required by the model as drivers
- Perform an initial test-run of the model as a basic sanity check
- Were there errors in drivers? (return to 1)
- Is the model in the same ballpark as the data?
- Construct priors for model parameters
- Collect/assemble the data that can be used to constrain model parameters and outputs
- Sensitivity analysis (SA)
- Variance Decomposition (VD)
- Determine what parameters need further constraint
- Does this data exist in the literature? (repeat 4-8)
- Can I collect this data in the field? (repeat 4-8)
- Ensemble Analysis
- Is reality within the range of the uncertainty in the model?
- Evaluate/estimate uncertainties in the data
- Parameter Data Assimilation:
- Propose new parameter values
- Evaluate L(data | param) & prior(param)
- Accept or reject the proposed parameter values
- Repeat many times until a histogram of accepted parameter values approximates the true posterior distribution.
- Model evaluation [preferably ensemble based]
- Against data used to constrain model
- Against additional data for this site
- Same variable, different time
- Different variables
- Against data at a new site
- Do I need more data? Repeat 4-9 (direct data constraint) or 6-11 (parameter data assimilation).
- Application [preferably ensemble forecast or hindcast]
43.3 Connect to Rstudio
Today, we’re again going to work mostly in Rstudio, in order to easily edit advanced PEcAn settings and browse files. So if you haven’t already, connect now to the Rstudio server on your VM ([URL]/rstudio).
This tutorial assumes you have successfully completed an ensemble and a sensitivity analysis (Demo 2) before.
43.4 Defining variables
The following variables need to be set specific to the site being run and the workflow being run
43.5 Initial Ensemble Analysis
A good place to start when thinking about a new PDA analysis is to look at the current model fit to observed data. In fact, we want to compare data to a full ensemble prediction from the model. This is important because our current parameter distributions will be the priors for PDA. While the analysis will translate these priors into more optimal (in terms of producing model output that matches observations) and more confident (i.e. narrower) posterior distributions, these results are inherently constrained by the current parameter distributions. Thus, if reality falls far outside the prior ensemble confidence interval (which reflects the current uncertainty of all model parameters), data assimilation will not be able to fix this. In such cases, the prior parameter estimates must already be over-constrained, or there are structural errors in the model itself that need fixing. To begin, let’s load up some NEE observations so that we can plot them along with our ensemble predictions. In the code below the elements in bold may vary depending on site and your previous runs.
Now let’s load up our ensemble outputs from the previous ensemble analysis (Demo 2) and plot our ensemble predictions against our NEE observations.
When interpreting your results it is important to remember the difference between a confidence interval, which just includes parameter uncertainties, and a predictive interval, which includes parameter and residual uncertainties. Your ensemble analysis plot illustrates the former—i.e., the confidence in the mean NEE. By contrast, the data reflect both changes in mean NEE, and random variability. As such, we can’t expect all the data to fall within the CI; in fact, if we had unlimited data to constrain mean NEE, the CI would collapse to a single line and none of the data would be contained! However, your plot will give you an idea of how much uncertainty there is in your model currently, and help to identify systematic errors like bias (values consistently too high or low) or poorly represented seasonal patterns.
- Does your ensemble agree well with the data?
- If so, how much room for improvement is there, in terms of tightening the CI?
- If not, what are the greatest discrepancies?
- What are some of the problems (with model, data, and/or PEcAn) that might explain the data-model disparity you see?
43.6 Choosing Parameters
Beyond exploratory exercises, the first step of PDA analysis is to choose the model parameters you will target for optimization. PDA is computationally expensive (even when using an emulator), and the cost increases exponentially with the number of parameters targeted. The number you can handle in any given analysis completely depends on the complexity of the model and your available computational resources, but in practice it’s going to be rather small (~1–10) relative to the large number of parameters in a mechanistic ecosystem model (~10–100). Given this limitation, it is important to target parameters that can contribute substantially to improving model fit. If you recall, identifying those parameters was the goal of the uncertainty analysis you conducted previously, in the second PEcAn demo. Let’s revisit that analysis now. Open your variance decomposition graph from Demo 2 From this figure decide which variables you will target with PDA. As noted, an obvious criterion is that the parameter should be contributing a large amount of uncertainty to the current model, because otherwise it simply can’t change the model output much no matter how much you try to optimize it. But there are other considerations too. For example, if two parameters have similar or competing roles in the model, you may have trouble optimizing both simultaneously. In practice, there will likely be some guess-and-testing involved, though a good understanding of how the model works will help. It may also help to look at the shape of the Sensitivity responses and details of model fit to data (your ensemble analysis from the previous section). For the purposes of this demo, choose eight to ten parameters (in total, if you have more than one PFT) that contribute high uncertainty to model output and/or seem like good choices for some other rational reason.
- Which parameters did you choose, and why?
43.7 Editing PEcAn settings
Now let’s add settings to tell PEcAn how to run the PDA with emulator, we will come to the details of model emulation later. Open up the pecan.CONFIGS.xml file you located previously, and choose
File > Save as... from the menu to save a new copy as pecan.PDA.xml. Now add the block of XML listed below to the file, immediately after the
<param.names><param> tags to identify the parameters you’ve chosen for PDA (it’s up to you to choose the number of parameters you want to constrain, then you can set the
<n.knot> to be >= 10 per parameter you choose, e.g. 200 knots for 10 parameters). Here, you need to use PEcAn’s standard parameter names, which are generally not the same as what’s printed on your variance decomposition graph. To find your parameters look at the row names in the
prior.distns.csv file for each PFT under the PFT pulldown menu. Insert the variable name (exactly, and case sensitive) into the
<param> tags of the XML code below.
In addition, you may need to edit
<inputs><file><path>, depending on the site and year you ran previously. The rest of the settings control options for the PDA analysis (how long to run, etc.), and also identify the data to be used for assimilation. For more details, see the assim.batch vignette on the PEcAn GitHub page (https://goo.gl/9hYVPQ).
<?xml version="1.0"?> <-- These lines are already in there. Don't duplicate them, <pecan> <-- just paste the <assim.batch> block below right after them. <assim.batch> <method>emulator</method> <n.knot>160</n.knot> <-- FILL IN <iter>25000</iter> <chain>3</chain> <param.names> <soil> <param>YOUR_PFT_1_PARAM_1</param> <-- FILL IN <param>YOUR_PFT_1_PARAM_2</param> <-- FILL IN </soil> <temperate.coniferous> <param>YOUR_PFT_2_PARAM_1</param> <-- FILL IN <param>YOUR_PFT_2_PARAM_2</param> <-- FILL IN <param>YOUR_PFT_2_PARAM_3</param> <-- FILL IN <param>YOUR_PFT_2_PARAM_4</param> <-- FILL IN <param>YOUR_PFT_2_PARAM_5</param> <-- FILL IN <param>YOUR_PFT_2_PARAM_6</param> <-- FILL IN </temperate.coniferous> </param.names> <jump> <adapt>100</adapt> <adj.min>0.1</adj.min> <ar.target>0.3</ar.target> </jump> <inputs> <file> <path> <path>/home/carya/output/dbfiles/AmerifluxLBL_site_0-772/AMF_US-NR1_BASE_HH_9-1.csv</path> </path> <format>5000000002</format> <input.id>1000011238</input.id> <-- FILL IN, from BETY inputs table, this is *NOT* the workflow ID <likelihood>Laplace</likelihood> <variable.name> <variable.name>NEE</variable.name> <variable.name>UST</variable.name> </variable.name> <variable.id>297</variable.id> </file> </inputs> </assim.batch>
Once you’ve made and saved the changes to your XML, load the file and check that it contains the new settings:
If the printed list contains everything you just added to pecan.PDA.xml, you’re ready to proceed.
43.8 Investigating PEcAn function pda.emulator (optional)
Before we run the data assimilation, let’s take a high-level look at the organization of the code. Use the Rstudio file browser to open up
~/pecan/modules/assim.batch/R/pda.emulator.R. This code works in much the same way as the pure statistical models that we learned about earlier in the week, except that the model being fit is a statistical model that emulates a complicated process-based computer simulation (i.e., an ecosystem model). We could have directly used the ecosystem model (indeed PEcAn’s other PDA functions perform MCMC by actually running the ecosystem model at each iteration, see pda.mcmc.R script as an example), however, this would require a lot more computational time than we have today. Instead here we will use a technique called model emulation. This technique allows us to run the model for a relatively smaller number of times with parameter values that have been carefully chosen to give a good coverage of parameter space. Then we can interpolate the likelihood calculated for each of those runs to get a surface that “emulates” the true likelihood and perform regular MCMC, except instead of actually running the model on every iteration to get a likelihood, this time we will just get an approximation from the likelihood emulator. The general algorithm of this method can be further expressed as:
- Propose initial parameter set sampling design
- Run full model for each parameter set
- Evaluate the likelihoods
- Construct emulator of multivariate likelihood surface
- Use emulator to estimate posterior parameter distributions
- (Optional) Refine emulator by proposing new design points, goto 2)
For now, we just want you to get a glimpse at the overall structure of the code, which is laid out in the comment headers in
pda.emulator(). Most of the real work gets done by the functions this code calls, which are all located in the file
~/pecan/modules/assim.batch/R/pda.utils.R and the MCMC will be performed by the
mcmc.GP() function in
~/pecan/modules/emulator/R/minimize.GP.R. To delve deeper into how the code works, take a look at these files when you have the time.
43.9 Running a demo PDA
Now we can go ahead and run a data assimilation MCMC with emulator. Since you’ve already loaded the settings containing your modified <assim.batch> XML block, all you need to do to start the PDA is run
pda.emulator(settings). But, in the emulator workflow, there is a bit of a time consuming step where we calculate the effective sample size of the input data, and we have already done this step for you. You could load it up and pass it to the function explicitly in order to skip this step:
After executing the code above, you will see print-outs to the console. The code begins with loading the prior values which in this case are the posterior distributions coming from your previous meta analysis. Then, normally, it loads the observational data and carries out necessary conversions and formatting to align it with model outputs, as we did separately above, but today it will skip this step as we are passing data externally. After this step, you will see a progress bar where the actual model is run n.knot times with the proposed parameter sets and then the outputs from these runs are read. Next, this model output is compared to the specified observational data, and the likelihood is calculated using the heteroskedastic Laplacian discussed previously. Once we calculate the likelihoods, we fit an emulator which interpolates the model output in parameter space between the points where the model has actually been run. Now we can put this emulator in the MCMC algorithm instead of the model itself. Within the MCMC loop the code proposes new parameter value from a multivariate normal jump distribution. The corresponding likelihood will be approximated by the emulator and the new parameter value is accepted or rejected based on its posterior probability relative to the current value.
43.10 Outputs from PEcAn’s Parameter Data Assimilation
When the PDA is finished, a number of outputs are automatically produced that are either the same as or similar to posterior outputs that we’ve seen before. These are located in the
PEcAn_[workflow_id]/pft/* output directory and are identified by
pda.[PFT]_[workflow_id] in the filenames:
- posteriors.pda.[PFT]*.pdf shows the posterior distributions resulting from your PDA
- trait.mcmc.pda.[PFT]*.Rdata contains all the parameter samples contained in the PDA posterior
- mcmc.pda.[PFT]*.Rdata is essentially the same thing in a different format
- mcmc.diagnostics.pda.[PFT]*.pdf shows trace plots and posterior densities for each assimilated parameter, as well as pairs plots showing all pairwise parameter correlations.
Together, these files allow you to evaluate whether a completed PDA analysis has converged and how the posterior distributions compare to the priors, and to use the posterior samples in further analyses, including additional PDA. If you haven’t done so already, take a look at all of the outputs described here.
- Do the diagnostic figures indicate that your likelihood at least improved over the course of the analysis?
- Does the MCMC appear to have converged?
- Are the posterior distributions well resolved?
43.11 Post-PDA analyses
In addition to the outputs of the PDA itself, you may want to conduct ensemble and/or sensitivity analyses based on the posteriors of the data assimilation, in order to check progress towards improved model fit and/or changing sensitivity. For this, you need to generate new model runs based on parameters sampled from the updated (by PDA) posterior, which is a simple matter of rerunning several steps of the PEcAn workflow.
The PDA you ran has automatically produced an updated XML file (
pecan.pda***.xml) that includes the posterior id to be used in the next round of runs. Locate this file in your run directory and load the file for the post-pda ensemble/sensitivity analysis (if you already have the
settings list in your working environment you don’t need to re-read the settings):
Now you can check the new figures produced by your analyses under
PEcAn_[workflow_id]/pft/*/sensitivity.analysis.*.pdf, and compare them to the previous ones. Also, take a look at the comparison of model outputs to data when we run SIPNET with pre- and post-PDA parameter (mean) values under
- Looking at the ensemble analysis outputs in order (i.e., in order of increasing ID in the filenames), qualitatively how did the model fit to data change over the course of the analysis?
- Based on the final ensemble analysis, what are the major remaining discrepancies between model and data?
- Can you think of the processes / parameters that are likely contributing to the differences?
- What would be your next steps towards evaluating or improving model performance?