9 Parameter Data Assimilation
9.1 All functions pertaining to Parameter Data Assimilation are housed within: pecan/modules/assim.batch
This is the main PDA code. It performs Bayesian MCMC on model parameters by proposing parameter values, running the model, calculating a likelihood (between model output and supplied observations), and accepting or rejecting the proposed parameters (Metropolis algorithm). Additional notes:
The first argument is settings, followed by others that all default to NULL.settings is a list used throughout Pecan, which contains all the user options for whatever analyses are being done. The easiest thing to do is just pass that whole object all around the Pecan code and let different functions access whichever settings they need. That’s what a lot of the rest of the Pecan code does. But the flexibility to override most of the relevant settings in settings is there by providing them directly as arguments to the function.
The if(FALSE)… : If you’re trying to step through the function you probably will have the settings object around, but those other variables will be undefined. If you set them all to NULL then they’ll be ignored without causing errors. It is there for debugging purposes.
The next step calls pda.settings(), which is in the file pda.utils.R (see below). It checks whether any settings are being overridden by arguments, and in most cases supplies default values if it can’t find either.
- In the MCMC setup section
- The code is set up to allow you to start a new MCMC chain, or to continue a previous chain as specified in settings.
- The code writes a simple text file of parameter samples at every iteration, which lets you get some results and even re-start an MCMC that fails for some reason.
- The code has adaptive jump distributions. So you can see some initialization of the jump distributions and associated variables here.
- Finally, note that after all this setup a new XML settings file is saved. The idea is that the original pecan.xml you create is preserved for provenance, and then periodically throughout the workflow the settings (likely containing new information) are re-saved with descriptive filenames.
- MCMC loop
- Periodically adjust jump distribution to make acceptance rate closer to target
- Propose new parameters one at a time. For each:
- First, note that Pecan may be handling many more parameters than are actually being targeted by PDA. Pecan puts priors on any variables it has information for (in the BETY database), and then these get passed around throughout the analysis and every step (meta-, sensitivity, ensemble analyses, etc.). But for PDA, you specify a separate list of probably far fewer parameters to constrain with data. These are the ones that get looped over and varied here. The distinction between all parameters and only those dealt with in PDA is dealt with in the setup code above.
- First a new value is proposed for the parameter of interest.
- Then, a new model run is set up, identical to the previous except with the new proposed value for the one parameter being updated on this run.
- The model run is started, and outputs collected after waiting for it to finish.
- A new likelihood is calculated based on the model outputs and the observed dataset provided.
- Standard Metropolis acceptance criteria is used to decide whether to keep the proposed parameter.
- Periodically (at interval specified in settings), a diagnostic figure is saved to disk so you can check on progress.
- This works only for NEE currently
This file is basically identical to pda.mcm.R, but rather than propose parameters one at a time, it proposes new values for all parameters at once (“bs” stands for “block sampling”). You choose which option to use by specifying settings\(assim.batch\)method: * “bruteforce” means sample parameters one at a time * “bruteforce.bs” means use this version, sampling all parameters at once * “emulator” means use the emulated-likelihood version
This version of the PDA code again looks quite similar to the basic “bruteforce” one, but its mechanics are very different. The basic idea is, rather than running thousands of model iterations to explore parameter space via MCMC, run a relatively smaller number of runs that have been carefully chosen to give good coverage of parameter space. Then, basically interpolate the likelihood calculated for each of those runs (actually, fit a Gaussian process to it), to get a surface that “emulates” the true likelihood. Now, perform regular MCMC (just like the “bruteforce” approach), except instead of actually running the model on every iteration to get a likelihood, just get an approximation from the likelihood emulator. Since the latter step takes virtually no time, you can run as long of an MCMC as you need at little computational cost, once you have done the initial model runs to create the likelihood emulator.
This function is for recovering a failed PDA MCMC run.
This file contains most of the individual functions used by the main PDA functions (pda.mcmc.*.R).
- assim.batch is the main function Pecan calls to do PDA. It checks which method is requested (bruteforce, bruteforce.bs, or emulator) and call the appropriate function described above.
- pda.setting handles settings. If a setting isn’t found, the code can usually supply a reasonable default.
- pda.load.priors is fairly self explanatory, except that it handles a lot of cases and gives different options priority over others. Basically, the priors to use for PDA parameters can come from either a Pecan prior.distns or post.distns object (the latter would be, e.g., the posteriors of a meta-analysis or previous PDA), or specified either by file path or BETY ID. If not told otherwise, the code tries to just find the most recent posterior in BETY, and use that as prior for PDA.
- pda.create.ensemble gets an ensemble ID for the PDA. All model runs associated with an individual PDA (any of the three methods) are considered part of a single ensemble. This function does is register a new ensemble in BETY, and return the ID that BETY gives it.
- pda.define.prior.fn creates R functions for all of the priors the PDA will use.
- pda.init.params sets up the parameter matrix for the run, which has one row per iteration, and one column per parameter. Columns include all Pecan parameters, not just the (probably small) subset that are being updated by PDA. This is for compatibility with other Pecan components. If starting a fresh run, the returned matrix is just a big empty matrix to fill in as the PDA runs. If continuing an existing MCMC, then it will be the previous params matrix, with a bunch of blank rows added on for filling in during this round of PDA.
- pda.init.run This is basically a big wrapper for Pecan’s write.config function (actually functions [plural], since every model in Pecan has its own version). For the bruteforce and bruteforce.bs methods this will be run once per iteration, whereas the emulator method knows about all its runs ahead of time and this will be a big batch of all runs at once.
- pda.adjust.jumps tweaks the jump distributions for the standard MCMC method, and pda.adjust.jumps.bs does the same for the block-sampled version.
- pda.calc.llik calculates the log-likelihood of the model given all datasets provided to compare it to.
- pda.generate.knots is for the emulator version of PDA. It uses a Latin hypercube design to sample a specified number of locations in parameter space. These locations are where the model will actually be run, and then the GP interpolates the likelihood surface in between.
- pda.plot.params provides basic MCMC diagnostics (trace and density) for parameters being sampled.
- pda.postprocess prepares the posteriors of the PDA, stores them to files and the database, and performs some other cleanup functions.
- pda.load.data.r This is the function that loads in data that will be used to constrain the PDA. It’s supposed to be eventually more integrated with Pecan, which will know how to load all kinds of data from all kinds of sources. For now, it can do NEE from Ameriflux.
- pda.define.llik.r A simple helper function that defines likelihood functions for different datasets. Probably in the future this should be queried from the database or something. For now, it is extremely limited. The original test case of NEE assimilation uses a heteroskedastic Laplacian distribution.
- pda.get.model.output.R Another function that will eventually grow to handle many more cases, or perhaps be replaced by a better system altogether. For now though, it again just handles Ameriflux NEE.