# 9 Settings-configured analyses

These analyses can be run through the web interface, but lack graphical interfaces and currently can only be configured throughthe XML settings. To run these analyses use the Edit pecan.xml checkbox on the Input configuration page. Eventually, these modules will be integrated into the web user interface.

• Parameter Data Assimilation (PDA)
• State Data Assimilation (SDA)
• MultiSettings
• Benchmarking

## 9.1 Parameter data assimilation (PDA)

All functions pertaining to Parameter Data Assimilation are housed within: pecan/modules/assim.batch

### 9.1.1pda.mcmc.R

This is the main PDA code. It performs Bayesian MCMC on model parameters by proposing parameter values, running the model, calculating a likelihood (between model output and supplied observations), and accepting or rejecting the proposed parameters (Metropolis algorithm). Additional notes:

• The first argument is settings, followed by others that all default to NULL.settings is a list used throughout Pecan, which contains all the user options for whatever analyses are being done. The easiest thing to do is just pass that whole object all around the Pecan code and let different functions access whichever settings they need. That’s what a lot of the rest of the Pecan code does. But the flexibility to override most of the relevant settings in settings is there by providing them directly as arguments to the function.

• The if(FALSE)… : If you’re trying to step through the function you probably will have the settings object around, but those other variables will be undefined. If you set them all to NULL then they’ll be ignored without causing errors. It is there for debugging purposes.

• The next step calls pda.settings(), which is in the file pda.utils.R (see below). It checks whether any settings are being overridden by arguments, and in most cases supplies default values if it can’t find either.

• In the MCMC setup section
• The code is set up to allow you to start a new MCMC chain, or to continue a previous chain as specified in settings.
• The code writes a simple text file of parameter samples at every iteration, which lets you get some results and even re-start an MCMC that fails for some reason.
• The code has adaptive jump distributions. So you can see some initialization of the jump distributions and associated variables here.
• Finally, note that after all this setup a new XML settings file is saved. The idea is that the original pecan.xml you create is preserved for provenance, and then periodically throughout the workflow the settings (likely containing new information) are re-saved with descriptive filenames.
• MCMC loop
• Periodically adjust jump distribution to make acceptance rate closer to target
• Propose new parameters one at a time. For each:
• First, note that Pecan may be handling many more parameters than are actually being targeted by PDA. Pecan puts priors on any variables it has information for (in the BETY database), and then these get passed around throughout the analysis and every step (meta-, sensitivity, ensemble analyses, etc.). But for PDA, you specify a separate list of probably far fewer parameters to constrain with data. These are the ones that get looped over and varied here. The distinction between all parameters and only those dealt with in PDA is dealt with in the setup code above.
• First a new value is proposed for the parameter of interest.
• Then, a new model run is set up, identical to the previous except with the new proposed value for the one parameter being updated on this run.
• The model run is started, and outputs collected after waiting for it to finish.
• A new likelihood is calculated based on the model outputs and the observed dataset provided.
• Standard Metropolis acceptance criteria is used to decide whether to keep the proposed parameter.
• Periodically (at interval specified in settings), a diagnostic figure is saved to disk so you can check on progress.
• This works only for NEE currently

### 9.1.2pda.mcmc.bs.R

This file is basically identical to pda.mcm.R, but rather than propose parameters one at a time, it proposes new values for all parameters at once (“bs” stands for “block sampling”). You choose which option to use by specifying settings$$assim.batch$$method: * “bruteforce” means sample parameters one at a time * “bruteforce.bs” means use this version, sampling all parameters at once * “emulator” means use the emulated-likelihood version

### 9.1.3pda.emulator

This version of the PDA code again looks quite similar to the basic “bruteforce” one, but its mechanics are very different. The basic idea is, rather than running thousands of model iterations to explore parameter space via MCMC, run a relatively smaller number of runs that have been carefully chosen to give good coverage of parameter space. Then, basically interpolate the likelihood calculated for each of those runs (actually, fit a Gaussian process to it), to get a surface that “emulates” the true likelihood. Now, perform regular MCMC (just like the “bruteforce” approach), except instead of actually running the model on every iteration to get a likelihood, just get an approximation from the likelihood emulator. Since the latter step takes virtually no time, you can run as long of an MCMC as you need at little computational cost, once you have done the initial model runs to create the likelihood emulator.

### 9.1.4pda.mcmc.recover.R

This function is for recovering a failed PDA MCMC run.

### 9.1.5pda.utils.R

This file contains most of the individual functions used by the main PDA functions (pda.mcmc.*.R).

• assim.batch is the main function Pecan calls to do PDA. It checks which method is requested (bruteforce, bruteforce.bs, or emulator) and call the appropriate function described above.
• pda.setting handles settings. If a setting isn’t found, the code can usually supply a reasonable default.
• pda.load.priors is fairly self explanatory, except that it handles a lot of cases and gives different options priority over others. Basically, the priors to use for PDA parameters can come from either a Pecan prior.distns or post.distns object (the latter would be, e.g., the posteriors of a meta-analysis or previous PDA), or specified either by file path or BETY ID. If not told otherwise, the code tries to just find the most recent posterior in BETY, and use that as prior for PDA.
• pda.create.ensemble gets an ensemble ID for the PDA. All model runs associated with an individual PDA (any of the three methods) are considered part of a single ensemble. This function does is register a new ensemble in BETY, and return the ID that BETY gives it.
• pda.define.prior.fn creates R functions for all of the priors the PDA will use.
• pda.init.params sets up the parameter matrix for the run, which has one row per iteration, and one column per parameter. Columns include all Pecan parameters, not just the (probably small) subset that are being updated by PDA. This is for compatibility with other Pecan components. If starting a fresh run, the returned matrix is just a big empty matrix to fill in as the PDA runs. If continuing an existing MCMC, then it will be the previous params matrix, with a bunch of blank rows added on for filling in during this round of PDA.
• pda.init.run This is basically a big wrapper for Pecan’s write.config function (actually functions [plural], since every model in Pecan has its own version). For the bruteforce and bruteforce.bs methods this will be run once per iteration, whereas the emulator method knows about all its runs ahead of time and this will be a big batch of all runs at once.
• pda.adjust.jumps tweaks the jump distributions for the standard MCMC method, and pda.adjust.jumps.bs does the same for the block-sampled version.
• pda.calc.llik calculates the log-likelihood of the model given all datasets provided to compare it to.
• pda.generate.knots is for the emulator version of PDA. It uses a Latin hypercube design to sample a specified number of locations in parameter space. These locations are where the model will actually be run, and then the GP interpolates the likelihood surface in between.
• pda.plot.params provides basic MCMC diagnostics (trace and density) for parameters being sampled.
• pda.postprocess prepares the posteriors of the PDA, stores them to files and the database, and performs some other cleanup functions.
• pda.load.data.r This is the function that loads in data that will be used to constrain the PDA. It’s supposed to be eventually more integrated with Pecan, which will know how to load all kinds of data from all kinds of sources. For now, it can do NEE from Ameriflux.
• pda.define.llik.r A simple helper function that defines likelihood functions for different datasets. Probably in the future this should be queried from the database or something. For now, it is extremely limited. The original test case of NEE assimilation uses a heteroskedastic Laplacian distribution.
• pda.get.model.output.R Another function that will eventually grow to handle many more cases, or perhaps be replaced by a better system altogether. For now though, it again just handles Ameriflux NEE.

### 9.1.6get.da.data.*.R, plot.da.R

Old codes written by Carl Davidson. Defunct now, but may contain good ideas so currently left in.

## 9.3 State data assimilation (SDA)

sda.enkf.R is housed within: /pecan/modules/assim.sequential/R The tree ring tutorial is housed within: /pecan/documentation/tutorials/StateAssimilation

### 9.3.1sda.enkf.R Description

This is the main ensemble Kalman filter and generalized filter code. Originally, this was just ensemble Kalman filter code. Mike Dietze and Ann Raiho added a generalized ensemble filter to avoid filter divergence. The output of this function will be all the of run outputs, a PDF of diagnostics, and an Rdata object that includes three lists:

• FORECAST will be the ensemble forecasts for each year
• ANALYSIS will be the updated ensemble sample given the NPP observations
• enkf.params contains the prior and posterior mean vector and covariance matrix for each time step.

### 9.3.2sda.enkf.R Arguments

• settings - (required) pecan.SDA.xml settings object

• obs.mean - (required) a list of observation means named with dates in YYYY/MM/DD format

• obs.cov - (required) a list of observation covariances names with dates in YYYY/MM/DD format

• IC - (optional) initial condition matrix (dimensions: ensemble memeber # by state variables). Default is NULL.

• Q - (optional) process covariance matrix (dimensions: state variable by state variables). Defualt is NULL.

### 9.3.3 State Data Assimilation Workflow

Before running sda.enkf, these tasks must be completed (in no particular order),

• Read in a pecan.SDA.xml settings file with tags listed below. i.e. read.settings(‘pecan.SDA.xml’)

• Load data means (obs.mean) and covariances (obs.cov) as lists with PEcAn naming and unit conventions. Each observation must have a date in YYYY/MM/DD format (optional time) associated with it. If there are missing data, the date must still be represented in the list with an NA as the list object.

• Create initial conditions matrix (IC) that is state variables columns by ensemble members rows in dimension. sample.IC.MODEL can be used to create the IC matrix, but it is not required. This IC matrix is fed into write.configs for the initial model runs.

The main parts of the SDA function are:

Setting up for initial runs:

• Set parameters

• Load initial run inputs via split.inputs.MODEL

• Open database connection

• Get new workflow ids

• Create ensemble ids

Performing the initial set of runs

Set up for data assimilation

Loop over time

• read.restart.MODEL - read model restart files corresponding to start.time and stop.time that you want to assimilate data into

• Analysis - There are four choices based on if process variance is TRUE or FALSE and if there is data or not. See explaination below.

• write.restart.MODEL - This function has two jobs. First, to insert adjusted state back into model restart file. Second, to update start.time, stop.time, and job.sh.

• run model

Save outputs

Create diagnostics

### 9.3.4 State Data Assimilation Tags Example

<state.data.assimilation>
<n.ensemble>25</n.ensemble>
<process.variance>FALSE</process.variance>
<sample.parameters>FALSE</sample.parameters>
<state.variables>
<variable>
<variable.name>AGB.pft</variable.name>
<unit>MgC/ha/yr</unit>
<min_value>0</min_value>
<max_value>100000000</max_value>
</variable>
<variable>
<variable.name>TotSoilCarb</variable.name>
<unit>KgC/m^2</unit>
<min_value>0</min_value>
<max_value>100000000</max_value>
</variable>
</state.variables>
<spin.up>
<start.date>1950/01/01</start.date>
<end.date>1960/12/31</end.date>
</spin.up>
<forecast.time.step>1</forecast.time.step>
<start.date>1961/01/01</start.date>
<end.date>2010/12/31</end.date>
</state.data.assimilation>

### 9.3.5 State Data Assimilation Tags Descriptions

• n.ensemble : [required] Number of ensemble members. Should be much larger than number of state variables you are assimilating.
• process.variance : [optional] TRUE/FLASE flag for if process variance should be estimated (TRUE) or not (FALSE). If TRUE, a generalized ensemble filter will be used. If FALSE, an ensemble Kalman filter will be used. Default is FALSE.
• sample.parameters : [optional] TRUE/FLASE flag for if parameters should be sampled for each ensemble member or not. This allows for more spread in the initial conditions of the forecast.
• state.variable : [required] State variable that is to be assimilated (in PEcAn standard format).
• spin.up : [required] start.date and end.date for initial model runs.
• NOTE: start.date and end.date are distinct from values set in the run tag because initial runs can be done over a subset of the full run.
• forecast.time.step : [optional] In the future, this will be used to allow the forecast time step to vary from the data time step.
• start.date : [optional] start date of the state data assimilation (in YYYY/MM/DD format)
• end.date : [optional] end date of the state data assimilation (in YYYY/MM/DD format)
• NOTE: start.date and end.date are distinct from values set in the run tag because this analysis can be done over a subset of the run.

### 9.3.6 Model Specific Functions for SDA Workflow

The purpose of read.restart is to read model restart files and return a matrix that is site rows by state variable columns. The state variables must be in PEcAn names and units. The arguments are:

• outdir - output directory

• runid - ensemble member run ID

• stop.time - used to determine which restart file to read (in POSIX format)

• settings - pecan.SDA.xml settings object

• var.names - vector with state variable names with PEcAn standard naming. Example: c(‘AGB.pft’, ‘TotSoilCarb’)

• params - parameters used by ensemble member (same format as write.configs)

#### 9.3.6.2 write.restart.MODEL.R

This model specific function takes in new state and new parameter matrices from sda.enkf.R after the analysis step and translates new variables back to the model variables. Then, updates start.time, stop.time, and job.sh so that start.model.runs() does the correct runs with the new states. In write.restart.LINKAGES and write.restart.SIPNET, job.sh is updated by using write.configs.MODEL.

• outdir - output directory

• runid - run ID for ensemble member

• start.time - beginning of model run (in POSIX format)

• stop.time - end of model run (in POSIX format)

• settings - pecan.SDA.xml settings object

• new.state - matrix from analysis of updated state variables with PEcAn names (dimensions: site rows by state variables columns)

• new.params - In the future, this will allow us to update parameters based on states (same format as write.configs)

• inputs - model specific inputs from split.inputs.MODEL used to run the model from start.time to stop.time

• RENAME - [optional] Flag used in write.restart.LINKAGES.R for development.

#### 9.3.6.3 split.inputs.MODEL.R

This model specific function gives the correct met and/or other model inputs to settings$$run$$inputs. This function returns settings$$run$$inputs to an inputs argument in sda.enkf.R. But, the inputs will not need to change for all models and should return settings$$run$$inputs unchanged if that is the case.

• settings - pecan.SDA.xml settings object

• start.time - start time for model run (in POSIX format)

• stop.time - stop time for model run (in POSIX format)

#### 9.3.6.4 sample.IC.MODEL.R

This model specific function is optional. But, it can be used to create initial condition matrix (IC) with # state variables columns by # ensemble rows. This IC matrix is used for the initial runs in sda.enkf.R in the write.configs.MODEL function.

• ne - number of ensemble members

• state - matrix of state variables to get initial conditions from

• year - used to determine which year to sample initial conditions from

### 9.3.7 Analysis Options

There are four options depending on whether process variance is TRUE/FALSE and whether or not there is data or not.

• If there is no data and process variance = FALSE, there is no analysis step.

• If there is no data and process variance = TRUE, process variance is added to the forecast.

• If there is data and process variance = TRUE, the generalized ensemble filter is implemented with MCMC.

• If there is data and process variance = FALSE, the Kalman filter is used and solved analytically.

### 9.3.8 The Generalized Ensemble Filter

An ensemble filter is a sequential data assimilation algorithm with two procedures at every time step: a forecast followed by an analysis. The forecast ensembles arise from a model while the analysis makes an adjustment of the forecasts ensembles from the model towards the data. An ensemble Kalman filter is typically suggested for this type of analysis because of its computationally efficient analytical solution and its ability to update states based on an estimate of covariance structure. But, in some cases, the ensemble Kalman filter fails because of filter divergence. Filter divergence occurs when forecast variability is too small, which causes the analysis to favor the forecast and diverge from the data. Models often produce low forecast variability because there is little internal stochasticity. Our ensemble filter overcomes this problem in a Bayesian framework by including an estimation of model process variance. This methodology also maintains the benefits of the ensemble Kalman filter by updating the state vector based on the estimated covariance structure.

This process begins after the model is spun up to equilibrium.

The likelihood function uses the data vector $$\left(\boldsymbol{y_{t}}\right)$$ conditional on the estimated state vector $$\left(\boldsymbol{x_{t}}\right)$$ such that

$$\boldsymbol{y}_{t}\sim\mathrm{multivariate\:normal}(\boldsymbol{x}_{t},\boldsymbol{R}_{t})$$

where $$\boldsymbol{R}_{t}=\boldsymbol{\sigma}_{t}^{2}\boldsymbol{I}$$ and $$\boldsymbol{\sigma}_{t}^{2}$$ is a vector of data variances. To obtain an estimate of the state vector $$\left(\boldsymbol{x}_{t}\right)$$, we use a process model that incorporates a process covariance matrix $$\left(\boldsymbol{Q}_{t}\right)$$. This process covariance matrix differentiates our methods from past ensemble filters. Our process model contains the following equations

$$\boldsymbol{x}_{t} \sim \mathrm{multivariate\: normal}(\boldsymbol{x}_{model_{t}},\boldsymbol{Q}_{t})$$

$$\boldsymbol{x}_{model_{t}} \sim \mathrm{multivariate\: normal}(\boldsymbol{\mu}_{forecast_{t}},\boldsymbol{P}_{forecast_{t}})$$

where $$\boldsymbol{\mu}_{forecast_{t}}$$ is a vector of means from the ensemble forecasts and $$\boldsymbol{P}_{forecast_{t}}$$ is a covariance matrix calculated from the ensemble forecasts. The prior for our process covariance matrix is $$\boldsymbol{Q}_{t}\sim\mathrm{Wishart}(\boldsymbol{V}_{t},n_{t})$$ where $$\boldsymbol{V}_{t}$$ is a scale matrix and $$n_{t}$$ is the degrees of freedom. The prior shape parameters are updated at each time step through moment matching such that

$$\boldsymbol{V}_{t+1} = n_{t}\bar{\boldsymbol{Q}}_{t}$$

$$n_{t+1} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{v_{ijt}^{2}+v_{iit}v_{jjt}}{Var(\boldsymbol{\bar{Q}}_{t})}}{I\times J}$$

where we calculate the mean of the process covariance matrix $$\left(\bar{\boldsymbol{Q}_{t}}\right)$$ from the posterior samples at time t. Degrees of freedom for the Wishart are typically calculated element by element where $$v_{ij}$$ are the elements of $$\boldsymbol{V}_{t}$$. $$I$$ and $$J$$ index rows and columns of $$\boldsymbol{V}$$. Here, we calculate a mean number of degrees of freedom for $$t+1$$ by summing over all the elements of the scale matrix $$\left(\boldsymbol{V}\right)$$ and dividing by the count of those elements $$\left(I\times J\right)$$. We fit this model sequentially through time in the R computing environment using R package ‘rjags.’

## 9.4 MultiSettings

(TODO: Under construction…)

## 9.5 Benchmarking

Benchmarking is the process of comparing model outputs against either experimental data or against other model outputs as a way to validate model performance. We have a suit of statistical comparisons that provide benchmarking scores as well as visual comparisons that help in diagnosing data-model and/or model-model differences.

### 9.5.1 Data Preparation

All data that you want to compare with model runs must be registered in the database. This is currently a step that must be done by hand either from the command line or through the online BETY interface. The data must have three records:

1. An input record (Instructions here)

2. A database file record (Instructions here)

3. A format record (Instructions here)

### 9.5.2 Model Runs

Model runs can be setup and executed - Using the PEcAn web interface online or with a VM (see setup) - By hand using the pecan.xml

### 9.5.3 The Benchmarking Shiny App

The entire benchmarking process can be done through the Benchmarking R Shiny app.

When the model run has completed, navigate to the workflow visualization Shiny app.

• Select the workflow and run id
• Again make sure that you can see the uploaded data plotted alongside the model output. In the future there will be more tools for double checking that your uploaded data is appropriate for benchmarking, but for now you may need to do the sanity checks by hand.

#### 9.5.3.1 Create a reference run record

• Navigate to the Benchmarking tab
• The first step is to register the new model run as a reference run in the database. Benchmarking cannot be done before this step is completed. When the reference run record has been created, additional menus for benchmarking will appear.

#### 9.5.3.2 Setup Benchmarks and metrics

• The variables in the uploaded data that you wish to compare with model output.
• The numerical metrics you would like to use in your comparison.
• Additional comparison plots that you would like to see.
• Note: All these selections populate the benchmarking section of the pecan.BENCH.xml which is then saved in the same location as the original run output. This xml is purely for reference.
##### 9.5.3.2.1 Benchmarking Output
• All benchmarking results are stored in the benchmarking directory which is created in the same folder as the original model run.
• The benchmaking directory contains subdirectories for each of the datasets compared with the model output. The names of these directories are the same as the corresponding data set’s input id in BETY.
• Each input directory contains benchmarking.output.Rdata, an Rdata file contianing all the results of the benchmarking workflow. load(benchmarking.output.Rdata) loads a list called result.out which contains the following:
• bench.results: a data frame of all numeric benchmarking scores
• format: a data frame that can be used to see how the input data was transformed to make it comparable to the model output. This involves converting from the original variable names and units to the internal pecan standard.
• aligned.dat: a data frame of the final aligned model and input values.
• All plots are saved as pdf files with names with “benchmark_plot-type_variable_input-id.pdf”

• To view interactive results, naviage to the Benchmarking Plots tab in the shiny app.

### 9.5.4 Benchmarking in pecan.xml

Before reading this section, it is recommended that you familiarize yourself with basics of the pecan.xml file.

The pecan.xml has an optional benchmarking section. Below are all the tags in the benchmarking section explained. Many of these field are filled in automatically during the benchmarking process when using the benchmarking shiny app.

The only time one should edit the benchmarking section by hand is for performing clone runs. See clone run documentation.

<benchmarking> settings:

• ensemble_id: the id of the ensemble that you will be using - the settings from this ensemble will be saved in a reference run record and then ensemble_id will be replaced with reference_run_id
• new_run: TRUE = create new run, FALSE = use existing run (required, default FALSE)

It is possible to look at more than one benchmark with a particular run. The specific settings related to each benchmark are in a sub section called benchmark

• input_id: the id of the benchmarking data (required)
• variable_id: the id of the variable of interest within the data. If you leave this blank, all variables that are shared between the input and model output will be used.
• metric_id: the id(s) of the metric(s) to be calculated. If you leave this blank, all metrics will be used.

Example: In this example, - we are using a pre-existing run from ensemble_id = 1000010983 (new_run = FALSE) - the output will be compared to data from input_id = 1000013743, specifically two variables of interest: variable_id = 411, variable_id = 18 - for variable_id = 411 we will perform only one metric of comparison metric_id = 1000000001 - for for variable_id = 18 we will perform two metrics of comparison metric_id = 1000000001, metric_id = 1000000002

<benchmarking>
<ensemble_id>1000010983</ensemble_id>
<new_run>FALSE</new_run>
<benchmark>
<input_id>1000013743</input_id>
<variable_id>411</variable_id>
<site_id>853</site_id>
<metrics>
<metric_id>1000000001</metric_id>
</metrics>
</benchmark>
<benchmark>
<input_id>1000013743</input_id>
<variable_id>18</variable_id>
<site_id>853</site_id>
<metrics>
<metric_id>1000000001</metric_id>
<metric_id>1000000002</metric_id>
</metrics>
</benchmark>
</benchmarking>