
GSoC - PEcAn Project Ideas

Ecosystem science has many components, and so does PEcAn! Some of the areas where you can contribute include, but are not limited to:

  • High performance computing – PEcAn takes a Bayesian approach to bringing ecosystem models and data together. Help us implement and optimize our algorithms for high performance computing (HPC) environments.
  • Database Optimization – PEcAn preserves a huge range of information to make any science done with PEcAn reproducible. Help us optimize our database and our database management.
  • Scientific Visualization – Help us improve the stability of our current visualizations, and find creative new ways of interactively exploring data and results.
  • BETY port from Ruby to Python – The current version of the BETY website is built with Ruby on Rails; since more people are becoming familiar with Python, we would like to port it to Python.
  • Remote Execution – The current version of PEcAn uses qsub and ssh to run models on remote hosts; in the future we want to use Docker and Singularity (and possibly GNU Parallel) instead.
  • Distributed Computing – Solidify PEcAn's distributed computing capabilities.
  • Linking Databases and data types – Leverage PEcAn's ability to store metadata and process data on the fly in order to solidify a common ontology of ecological data our community can use.
    High performance computing

    PEcAn provides a Bayesian data assimilation framework to synthesize data with models. However, such workflows can be computationally very expensive due to the complexity of both the models and the Bayesian methods. PEcAn already has a data assimilation module implemented in R that works locally. Our goal is to 1) make this workflow compatible with HPC environments (e.g. remote execution, parallelization) and 2) implement massively parallelizable particle filter methods using Graphics Processing Units (GPUs).
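
    To make the parallelization target concrete, here is a minimal sketch of a bootstrap particle filter in Python/NumPy. This is an illustrative toy (a 1-D random-walk state; all names here are assumptions, and PEcAn's actual module is written in R), but it shows why the method is GPU-friendly: the propagate and weight steps are independent across particles, while resampling is the usual serial bottleneck that parallel resampling schemes address on GPUs.

        import numpy as np

        def bootstrap_particle_filter(observations, n_particles=10_000,
                                      process_sd=0.1, obs_sd=0.5, rng=None):
            """Toy bootstrap particle filter for a 1-D random-walk state."""
            rng = rng or np.random.default_rng(42)
            particles = rng.normal(0.0, 1.0, n_particles)  # initial ensemble
            means = []
            for y in observations:
                # Propagate: one independent draw per particle (GPU-parallel).
                particles = particles + rng.normal(0.0, process_sd, n_particles)
                # Weight: pointwise Gaussian log-likelihood (GPU-parallel).
                log_w = -0.5 * ((y - particles) / obs_sd) ** 2
                w = np.exp(log_w - log_w.max())
                w /= w.sum()
                # Resample: multinomial resampling is the serial bottleneck;
                # GPU implementations typically use systematic resampling.
                particles = particles[rng.choice(n_particles, n_particles, p=w)]
                means.append(particles.mean())
            return np.array(means)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            truth = np.cumsum(rng.normal(0.0, 0.1, 50))
            obs = truth + rng.normal(0.0, 0.5, 50)
            print(bootstrap_particle_filter(obs)[:5])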

    Expected outcome: Implementation of the data assimilation module in HPC environments, and application of massively parallelizable particle filter algorithms in the data assimilation module using GPU programming.

    Prerequisites: R and C/C++ experience is required, as well as experience with or willingness to learn GPU programming. A Bayesian statistics background is not required, but candidates with one will be preferred.

    Contact person: Istem Fer, istfer[at]bu.edu

    Database Optimization

    At PEcAn, we want to build tools that make science more reproducible. A huge part of reproducibility is archiving information: we archive data, model configurations, results - everything you'd need to recreate the experiment, and then some. This means that our database needs to be efficient, manageable, and well-curated. Right now, our database errs towards archival, and creates redundant or unused records. With your help, we can create smarter archival and creation standards.
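
    For a flavor of what such curation tooling might look like, here is a hedged Python/SQL sketch that flags input records never referenced by any run. The table and column names (inputs, run_inputs) are hypothetical stand-ins rather than the actual BETY schema, and a real tool would report candidates for human review rather than delete anything.

        import psycopg2

        # Table and column names below (inputs, run_inputs) are hypothetical
        # stand-ins for illustration, not the actual BETY schema.
        FIND_ORPHANS = """
        SELECT i.id, i.created_at
        FROM inputs i
        LEFT JOIN run_inputs ri ON ri.input_id = i.id
        WHERE ri.input_id IS NULL            -- never referenced by any run
          AND i.created_at < now() - interval '1 year';
        """

        def list_orphan_inputs(dsn="dbname=bety"):
            """Return candidate records for human review (never auto-delete)."""
            with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
                cur.execute(FIND_ORPHANS)
                return cur.fetchall()

        if __name__ == "__main__":
            for record in list_orphan_inputs():
                print(record)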

    Expected outcome: A set of database management tools for pre-release database curation.

    Prerequisites: Proficiency in R and SQL (PostgreSQL preferred). Interest in data provenance.

    Related GitHub issues: #1496, #1630, #245, #230

    Contact person: Tess McCabe, tmccabe[at]bu.edu

    Scientific Visualization

    Our mission is to create an ecosystem modeling toolbox that is accessible to a non-technical audience (e.g., a high school ecology classroom) while retaining sufficient power and versatility to be valuable to scientific programmers (e.g. ecosystem model developers). However, the diversity of ecosystem models and associated analyses supported by PEcAn poses logistical challenges for presentation of results, especially given the wide range of targeted users. Web-based interactive visualizations can be a powerful tool for exploring model outputs and data as well as a fun learning tool in educational environments.

    Currently, PEcAn has basic support for interactive visualizations of outputs using R Shiny. We are looking for a student interested in addressing any of the following areas:

  • Improving Shiny application stability and performance, for instance through more efficient caching or lazy-loading of large outputs and data (see the sketch after this list), or leveraging more efficient interactive visualization frameworks.
  • Enhancing the visual elements of our interface for starting model runs, including visualization of existing sites and input data and better UI elements for setting run options.
  • Developing novel interactive visualization tools that leverage more advanced statistical techniques, such as visualizing and applying machine learning algorithms to outputs and model-data residuals, or exploring results in multivariate space.

    Expected outcome: A more robust set of web-based interactive visualization tools for model simulations and user-provided data.
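
    As a concrete illustration of the caching and lazy-loading idea in the first bullet, here is a hedged Python sketch of the general pattern (PEcAn's Shiny apps are written in R, where the same idea applies; the file path and variable name below are illustrative assumptions): open the NetCDF output lazily, read only the variable the user asked for, and memoize the result so repeated UI requests hit memory instead of disk.

        from functools import lru_cache
        import xarray as xr

        @lru_cache(maxsize=32)
        def load_variable(path: str, var: str):
            """Read a single variable from a (large) NetCDF output file.

            xr.open_dataset is lazy, so only the requested variable is
            actually read; lru_cache keeps recent results in memory.
            """
            with xr.open_dataset(path) as ds:
                return ds[var].load()

        # A plot callback would then request only what the user selected:
        # series = load_variable("out/ENS-00001/2004.nc", "NPP")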

    Prerequisites: Familiarity with R Shiny, including the ability to work with and debug Shiny apps in a remote, Unix-based CLI environment, is required. Demonstrated proficiency with SQL (especially PostgreSQL), HDF/NetCDF formats, and/or advanced statistics (e.g. multivariate regression, time series analysis, information theory) is preferred. Experience with other web-based interactive visualization frameworks, such as JavaScript's D3, is a plus.

    Contact person: Alexey Shiklomanov, ashiklom[at]bu.edu

    BETY port from Ruby to Python

    The BETY database is used by multiple projects, such as PEcAn and TERRA-REF. The database schema is still good to use; the front end, however, is written in Ruby on Rails. The front end has received incremental improvements and software updates, but has remained largely the same for the last six years. Currently the group does not have any full-time Ruby developers, but we do have a lot of Python development expertise. The goal is to take the Ruby front end and convert it to a different language, such as Python.
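
    For a sense of what the port could look like, here is a hypothetical sketch of one table exposed through Django's ORM. The field names are illustrative assumptions, not the real BETY schema; the key idea is that managed = False lets Django sit on top of the existing PostgreSQL tables without recreating them, so the schema can be reused unchanged.

        from django.db import models

        class Site(models.Model):
            """Hypothetical mapping of an existing BETY table."""
            sitename = models.CharField(max_length=255)
            lat = models.FloatField(null=True)
            lon = models.FloatField(null=True)

            class Meta:
                managed = False      # reuse the existing table as-is
                db_table = "sites"   # assumed existing table name

            def __str__(self):
                return self.sitename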

    Expected outcome: The BETY front end is ported from Ruby to another language, such as Python, with a refreshed design. The front end should be configurable to match different projects and should be skinnable.

    Prerequisites: Experience with Python and an ORM framework such as Django; Ruby experience is helpful.

    Related GitHub issues: see the BETY GitHub issues

    Contact person: Rob Kooper, kooper[at]illinois.edu

    Remote Execution

    We are currently in the process of converting PEcAn to use Docker containers. We would like to use either Kubernetes or HPC resources to run and scale multiple instances of these containers. To run on HPC resources, we will need to either use Singularity or find resources that allow running Docker containers, such as Google App Engine. To do this, we need to securely set up a pipeline that allows us to start and stop the right containers in HPC environments, or an easy way to scale containers on a cloud computing platform.
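
    Here is a minimal sketch of the HPC half of that pipeline, assuming Singularity plus a PBS-style scheduler: generate a batch script that runs the model inside a container and hand it to qsub. The image name, bind paths, and resource flags are illustrative assumptions; a real implementation would also track job state and retrieve outputs.

        import subprocess
        import textwrap

        def submit_singularity_job(image="pecan-sipnet.sif",
                                   workdir="/scratch/run42"):
            """Submit one containerized model run via qsub (PBS-style)."""
            script = textwrap.dedent(f"""\
                #!/bin/bash
                #PBS -N pecan-model-run
                #PBS -l nodes=1:ppn=1,walltime=01:00:00
                singularity exec --bind {workdir}:/data {image} /data/job.sh
            """)
            # qsub reads the job script from stdin and prints the job id.
            result = subprocess.run(["qsub"], input=script, text=True,
                                    capture_output=True, check=True)
            return result.stdout.strip()

        if __name__ == "__main__":
            print("submitted:", submit_singularity_job())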

    Expected outcome: The user can launch an ensemble of 1000 runs, and the framework will automatically either scale the cloud platform to handle that many requests (and scale down afterwards) or use Singularity to submit a job on an HPC site.

    Prerequisites: Experience with Docker; experience with Google App Engine and/or Singularity + HPC is a plus.

    Related GitHub issues: #1391

    Contact person: Rob Kooper, kooper[at]illinois.edu ; Alexey Shiklomanov, ashiklom[at]bu.edu

    Distributed Computing

    Currently the database that sits underneath PEcAn is built on top of PostgreSQL. Recently, we have built the capability to sync databases on multiple machines within our network using a THREDDS API. We wish to solidify this infrastructure within the context of a multi-model uncertainty analysis project that is currently underway. The goal is to run multiple models, with different sets of data from different sites, on multiple machines, and to leverage the distributed network to pull and push files between machines at different institutions. Sharing files across a network of machines is a critical development goal on the way to a true distributed network.

    Beyond solidifying the existing infrastructure, we wish to add the ability to process files and share intermediary files as well. Pulling and pushing subsets of files to grab only specific variables from result files, and requesting that calculations be performed on files on remote machines, are a few of the features we wish to add.
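
    A hedged sketch of the "pull only what you need" idea, assuming the remote THREDDS server exposes result files over OPeNDAP (the hostname, catalog path, and variable name below are placeholders): open the remote dataset lazily and transfer only the requested slice instead of copying the whole file.

        import xarray as xr

        # Hostname, catalog path, and variable name are placeholders.
        URL = ("http://thredds.example.edu/thredds/dodsC/"
               "pecan/runs/ENS-00001/2004.nc")

        # OPeNDAP access is lazy: only the slice selected below actually
        # crosses the network, not the whole result file.
        ds = xr.open_dataset(URL)
        npp_summer = ds["NPP"].sel(time=slice("2004-06-01", "2004-08-31")).load()
        print(float(npp_summer.mean()))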

    Expected outcome: Execute model runs on multiple machines, use the THREDDS API to pull the result files onto one machine, and perform calculations on those files.

    Prerequisites: Experience with databases (PostgreSQL) and R; experience with git is a plus.

    Contact person: Tony Gardella, tonygard[at]bu.edu

    Linking databases and data types

    Ecosystem modeling relies heavily on fusing data from multiple sources. Whether the data are used to calibrate a model or to benchmark a model result, they come from different sources that vary in their formats and naming conventions. These differences in semantics create a bottleneck, as no central ontology exists to translate and relate measurements from different sites, experiments, and/or databases. To alleviate this issue, this project's goal is to systematize PEcAn's framework for synthesizing data with ecosystem models. In this project we want to cross-link to databases such as DataONE, NEOTOMA, PANGAEA, and the International Tree-Ring Data Bank to be able to pull in ecological data, and to leverage PEcAn's ability to store metadata and process data on the fly in order to solidify a common ontology of ecological data our community can use.
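
    The translation layer might look something like the hedged Python sketch below: fetch records from an external repository over HTTP and rename its fields into one shared vocabulary before they enter the PEcAn workflow. The endpoint URL and the external field names are hypothetical placeholders, not a real API contract.

        import requests

        # Tiny "ontology": external field name -> shared PEcAn-style name.
        # Both sides of this mapping are illustrative assumptions.
        VARIABLE_MAP = {
            "agb_Mg_ha": "AbvGrndWood",
            "soil_c_kg_m2": "TotSoilCarb",
        }

        def fetch_and_harmonize(site_id: int):
            """Fetch external records and rename the fields we can translate."""
            resp = requests.get(
                "https://data.example.org/api/records",  # placeholder endpoint
                params={"site": site_id}, timeout=30)
            resp.raise_for_status()
            return [
                {VARIABLE_MAP[k]: v for k, v in rec.items() if k in VARIABLE_MAP}
                for rec in resp.json()
            ]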

    Expected outcome: A framework supporting external database connections that pull ecological data into the PEcAn workflow for calibration and validation of model runs.

    Prerequisites: Experience with database management systems and R is required.

    Contact person: Tony Gardella, tonygard[at]bu.edu