Common Analyses with betydata

Author

David LeBauer and Akash B V

Published

March 1, 2026

What you will learn
  • How to extract and summarize yield data for specific genera
  • How to link management practices (fertilization, planting) to yield observations
  • Patterns for site-level aggregation, author-based queries, and variable lookups

Setup

library(betydata)
library(dplyr)

Extracting Yield Data for a Genus

A common starting point is pulling yield observations for a particular genus and summarizing them. The Ayield trait represents annual aboveground yield, recorded in Mg ha-1 yr-1 (megagrams per hectare per year).

miscanthus_yields <- traitsview |>
  filter(
    genus == "Miscanthus",
    trait == "Ayield"
  ) |>
  select(id, mean, date, sitename, scientificname)

miscanthus_yields
nrow(miscanthus_yields)
[1] 1021

Tibble Printing

All tables are tibbles, so printing one shows only the first 10 rows and as many columns as fit the console width. Because the key columns (trait, mean, units, scientificname, genus) are ordered first, the default printout is informative without calling head() or subsetting columns.
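
Returning to the Miscanthus yields extracted above, the summary step itself can be sketched as follows. This is a minimal example, not the package's own method: it runs on a small made-up tibble (toy_yields, a hypothetical stand-in for miscanthus_yields) so that it is self-contained, and the output column names (n_obs, mean_yield, and so on) are illustrative choices.

```r
library(dplyr)

# Hypothetical stand-in for miscanthus_yields; all values are made up.
toy_yields <- tibble(
  scientificname = c(
    "Miscanthus x giganteus", "Miscanthus x giganteus",
    "Miscanthus sinensis", "Miscanthus sinensis", "Miscanthus sinensis"
  ),
  mean = c(20, 30, 10, 14, NA)
)

# Per-species summary of the yield values stored in the `mean` column
yield_summary <- toy_yields |>
  summarise(
    n_obs      = sum(!is.na(mean)),        # count non-missing yields only
    mean_yield = mean(mean, na.rm = TRUE),
    sd_yield   = sd(mean, na.rm = TRUE),
    min_yield  = min(mean, na.rm = TRUE),
    max_yield  = max(mean, na.rm = TRUE),
    .by = scientificname
  ) |>
  arrange(desc(n_obs))

yield_summary
```

Replacing toy_yields with miscanthus_yields should produce the same kind of table for the real data, since both the mean and scientificname columns are kept by the select() call above.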

Working with Management Practices

Management practices (planting dates, fertilization rates, harvest methods) are stored in the managements table and linked to experimental treatments through the managements_treatments junction table. Joining on treatment_id then attaches these management details to the yield observations in traitsview.

mgmt_treat <- managements_treatments |>
  left_join(
    managements |> select(id, mgmttype, level, units, date),
    by = c("management_id" = "id")
  )

grass_yields <- traitsview |>
  filter(
    genus %in% c("Miscanthus", "Panicum"),
    trait == "Ayield"
  ) |>
  left_join(mgmt_treat, by = "treatment_id", relationship = "many-to-many")

grass_yields |>
  filter(!is.na(mgmttype)) |>
  count(genus, mgmttype, sort = TRUE)
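
The relationship = "many-to-many" argument in the join above matters when interpreting counts: a treatment can have several management records, so each yield record is repeated once per matching record. The sketch below illustrates this on made-up tables (toy_yields and toy_mgmt are hypothetical, as are all ids and values) and shows one way to count distinct yield records instead of joined rows.

```r
library(dplyr)

# Hypothetical toy tables; ids and values are made up for illustration.
toy_yields <- tibble(
  id = c(1, 2, 3),
  treatment_id = c(10, 10, 20),
  mean = c(12.5, 14.0, 9.0)
)
toy_mgmt <- tibble(
  treatment_id = c(10, 10, 20),
  mgmttype = c("planting", "fertilizer_N", "harvest")
)

joined <- toy_yields |>
  left_join(toy_mgmt, by = "treatment_id", relationship = "many-to-many")

nrow(toy_yields)       # 3 yield records before the join
nrow(joined)           # 5 rows after: records 1 and 2 each match two managements
n_distinct(joined$id)  # still 3 distinct yield records

# Count distinct yield records per management type, not joined rows
records_per_type <- joined |>
  summarise(n_records = n_distinct(id), .by = mgmttype)

records_per_type
```

With the real tables, the same n_distinct(id) pattern could replace count() when the number of underlying yield records matters more than the number of record-management pairs.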

Nitrogen Fertilization Rates

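Extracting nitrogen application rates and joining them to yield data makes it possible to explore yield–nitrogen relationships. Nitrogen management appears in the mgmttype column as either fertilizer_N or fertilizer_N_rate, so the filter below matches both. Because the join is many-to-many, a yield record linked to several nitrogen entries appears once per entry, so the counts in the summary are joined rows rather than distinct yield records. The summary also averages nrate without consulting the units column, so the mean N rate is only meaningful if the matched entries share one unit.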

nitrogen_rates <- managements |>
  filter(mgmttype %in% c("fertilizer_N", "fertilizer_N_rate")) |>
  left_join(
    managements_treatments |> select(management_id, treatment_id),
    by = c("id" = "management_id")
  ) |>
  select(treatment_id, nrate = level, units)

yields_with_n <- traitsview |>
  filter(
    trait == "Ayield",
    genus %in% c("Miscanthus", "Panicum")
  ) |>
  left_join(nitrogen_rates, by = "treatment_id", relationship = "many-to-many")

yields_with_n |>
  filter(!is.na(nrate)) |>
  summarise(
    n = n(),
    mean_N = round(mean(nrate, na.rm = TRUE), 1),
    mean_yield = round(mean(mean, na.rm = TRUE), 1),
    .by = genus
  ) |>
  knitr::kable(col.names = c("Genus", "N obs", "Mean N rate", "Mean Yield (Mg/ha)"))
Genus       N obs  Mean N rate  Mean Yield (Mg/ha)
Panicum     4385   105.1        10.7
Miscanthus  1536   71.6         11.7
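
One way to take the yield-nitrogen exploration a step further is to reduce the nitrogen table to one value per treatment before joining, and then compare yields across N-rate classes. The sketch below uses made-up data (toy_n and toy_yields are hypothetical). Summing multiple entries per treatment is an assumption about what repeated rows mean, and the class breaks at 0 and 100 are arbitrary illustrative choices.

```r
library(dplyr)

# Hypothetical toy data; ids, rates, and yields are made up.
toy_n <- tibble(
  treatment_id = c(10, 10, 20, 30),
  nrate = c(50, 50, 0, 120)  # treatment 10 has two nitrogen entries
)
toy_yields <- tibble(
  id = 1:4,
  treatment_id = c(10, 10, 20, 30),
  mean = c(12, 14, 8, 16)
)

# One row per treatment. Summing assumes repeated entries are split
# applications; this may not hold for every dataset.
n_per_treatment <- toy_n |>
  summarise(total_n = sum(nrate, na.rm = TRUE), .by = treatment_id)

yield_by_n_class <- toy_yields |>
  # many-to-one: each yield record now matches at most one N value
  left_join(n_per_treatment, by = "treatment_id", relationship = "many-to-one") |>
  mutate(
    n_class = cut(
      total_n,
      breaks = c(-Inf, 0, 100, Inf),
      labels = c("none", "1-100", ">100")
    )
  ) |>
  summarise(n = n(), mean_yield = mean(mean, na.rm = TRUE), .by = n_class)

yield_by_n_class
```

Because each yield record appears exactly once after this join, n counts yield records directly, which is not the case for the many-to-many join used earlier in this section.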

Site-Level Aggregation

Aggregating trait data by site is useful for spatial analysis and mapping data density across research locations.

site_summary <- traitsview |>
  filter(!is.na(lat), !is.na(lon)) |>
  summarise(
    n_records = n(),
    n_traits = n_distinct(trait),
    n_species = n_distinct(species_id),
    .by = c(site_id, sitename, lat, lon)
  )

site_summary |>
  arrange(desc(n_records)) |>
  head(15) |>
  knitr::kable()
Table 1: Top research sites by number of records
site_id    sitename                                        lat         lon         n_records  n_traits  n_species
2.000e+09  Barrow Environmental Observatory (NGEE-Arctic)  71.279875   -156.60848  6723       8         12
7.600e+01  EBI Energy farm                                 40.063700   -88.20200   2357       12        32
2.740e+02  Ihinger Hof                                     48.740000   8.92400     1253       11        7
2.000e+09  Kougarok (NGEE-Arctic)                          65.163451   -164.81695  745        5         5
1.226e+03  Luquillo                                        18.310000   -65.74000   715        5         142
3.900e+02  Macknade Research Station                       -18.700000  146.20000   640        3         1
5.110e+02  Bambaroo                                        -18.858889  146.19139   630        25        1
2.000e+09  Santa Cruz Experimental Field Facility          9.119870    -79.70506   600        10        4
1.038e+03  Harwood Mill Farm - Mizer                       -29.426000  153.24100   514        1         1
2.000e+09  PA-PNM                                          8.994504    -79.54296   486        7         36
2.760e+02  Gutenzell-duplicate                             48.700000   9.20000     482        21        1
5.830e+02  G. and R. Zanetti’s Farm                        -19.500000  147.30000   476        8         1
8.520e+02  AspenFACE                                       45.667000   -89.61670   417        18        3
1.041e+03  La Mercy                                        -29.350000  31.07000    410        1         1
6.940e+02  Centro de Tecnologia Canavieira                 -22.700000  -47.55000   402        9         1

Geographic Data

Site coordinates are stored in the lat and lon columns of both traitsview and the sites table. The sites table additionally contains mat (mean annual temperature) and map (mean annual precipitation) for sites where climate data are available.
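
In Table 1 the site_id column is printed in scientific notation, which makes several large ids indistinguishable (four rows show 2.000e+09). One possible workaround, assuming site_id is stored as an ordinary numeric (double) column, is to convert it to a fixed-notation string before printing. The ids below are made up for illustration and toy_sites is a hypothetical stand-in for site_summary.

```r
library(dplyr)

# Hypothetical ids standing in for site_id; the large values are made up.
toy_sites <- tibble(
  site_id = c(76, 274, 2000000001, 2000000002),
  n_records = c(2357, 1253, 6723, 745)
)

# Convert ids to fixed-notation strings so every digit is displayed.
# Assumes site_id is a double; other storage types may need another approach.
toy_sites_fmt <- toy_sites |>
  mutate(site_id = sprintf("%.0f", site_id))

toy_sites_fmt$site_id
```

Once the ids are character strings, a table function such as knitr::kable() prints them as-is rather than applying numeric formatting.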

Finding Data by Author

lebauer_data <- traitsview |>
  filter(grepl("LeBauer", author, ignore.case = TRUE))

lebauer_data |>
  count(trait, author, citation_year, sort = TRUE)

Most Data-Rich Citations

traitsview |>
  count(citation_id, author, citation_year, sort = TRUE) |>
  head(10) |>
  knitr::kable()
Table 2: Top 10 citations by number of records
citation_id  author             citation_year  n
2.00e+09     Alistair Rogers    2017           6723
7.61e+02     Laredo             2003           3461
1.99e+02     Feng               2010           2742
1.89e+02     Clifton-Brown      2002           1232
2.00e+09     Serbin and Rogers  2016           835
2.00e+09     Lianhong Gu        2016           816
7.82e+02     Xiaohui Feng       2015           715
2.00e+09     Slot and Winter    2017           600
3.25e+02     Inman-Bamber       2000           589
1.57e+02     Lewandowski        1998           549

Variable and Trait Lookups

The variables table provides units, descriptions, and valid ranges for each measured trait. This is useful for understanding what a trait measures and checking whether observed values are within expected bounds.

variables |>
  filter(name %in% c("SLA", "Vcmax", "leaf_respiration_rate_m2", "Ayield")) |>
  select(name, units, description, min, max) |>
  knitr::kable()
name                      units               description                             min  max
SLA                       m2 kg-1             Specific Leaf Area                      0.1  100
Ayield                    Mg ha-1 yr-1        annual aboveground yield                0    Infinity
Vcmax                     umol [CO2] m-2 s-1  maximum rubisco carboxylation capacity  0    500
leaf_respiration_rate_m2  umol [CO2] m-2 s-1  Rd; leaf dark respiration               0    500
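
The bounds check mentioned above can be sketched as a join between observations and the variables table. The example below is self-contained and uses made-up observations (toy_traits and toy_variables are hypothetical). It assumes the bounds may be stored as text, since the upper bound for Ayield prints as "Infinity", and the helper parse_bound() is an illustrative function, not part of betydata.

```r
library(dplyr)

# Hypothetical toy tables. The bounds mirror the table above; the
# observations are made up. Bounds are stored as text (an assumption).
toy_variables <- tibble(
  name = c("SLA", "Ayield"),
  min = c("0.1", "0"),
  max = c("100", "Infinity")
)
toy_traits <- tibble(
  id = 1:4,
  trait = c("SLA", "SLA", "Ayield", "Ayield"),
  mean = c(15, 250, 22, -3)
)

# Illustrative helper: convert bounds to numeric and map the "Infinity"
# label to Inf explicitly instead of relying on string parsing.
parse_bound <- function(x) {
  x <- as.character(x)
  out <- suppressWarnings(as.numeric(x))
  out[x %in% "Infinity"] <- Inf
  out[x %in% "-Infinity"] <- -Inf
  out
}

# Rename bounds to lower/upper so they cannot clash with other columns
bounds <- toy_variables |>
  mutate(lower = parse_bound(min), upper = parse_bound(max)) |>
  select(trait = name, lower, upper)

out_of_range <- toy_traits |>
  left_join(bounds, by = "trait", relationship = "many-to-one") |>
  filter(mean < lower | mean > upper)

out_of_range
```

Here records 2 and 4 are flagged: an SLA of 250 exceeds the upper bound of 100, and a yield of -3 falls below the lower bound of 0. Rows whose trait has no entry in the bounds table get NA bounds and are dropped by filter(), so unmatched traits would need a separate check.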

Performance

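Because every table is loaded into memory as an R data frame, filtering and joining involve no database connection or network round trip. The filter-and-summarise query below completed in about 6 ms when this page was rendered; timings will vary by machine.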

system.time({
  result <- traitsview |>
    filter(
      genus %in% c("Miscanthus", "Panicum", "Populus"),
      trait %in% c("SLA", "Vcmax", "Ayield"),
      checked == 1
    ) |>
    summarise(
      n = n(),
      mean = mean(mean, na.rm = TRUE),
      .by = c(genus, trait)
    )
})
   user  system elapsed 
  0.006   0.000   0.006 

References

  • LeBauer, D. S., et al. (2018). BETYdb: a yield, trait, and ecosystem service database applied to second-generation bioenergy feedstock production. GCB Bioenergy. doi:10.1111/gcbb.12420