16.6 Database maintentance

All databases need maintenance performed on them. Depending upon the database type this can happen automatically, or it needs to be run through a scheduler or manually. The BETYdb database is Postgresql and it needs to be reindexed and vacuumed on a regular basis. Reindexing introduces efficiencies back into the database by reorganizing the indexes. Vacuuming the database frees up resources to the database by rearranging and compacting the database. Both of these operations are necessary and safe. As always if there’s a concern, a backup of the database should be made ahead of time. While running the reindexing and vacuuming commands, users will notice a slowdown at times. Therefore it’s better to run these maintenance tasks during off hours.

16.6.1 Reindexing the database

As mentioned above, reindexing allows the database to become more efficient. Over time as data gets updated and deleted, the indexes become less efficient. This has a negative inpact on executed statements. Reindexing makes the indexes efficient again (at least for a while) allowing faster statement execution and reducing the overall load on the database.

The reindex.bety.sh script is provided to simplify reindexing the database.

reindex.bety.sh -h
./reindex.bety.sh [-c datalog] [-d database] [-h] [-i table names] [-p psql options] [-q] [-s] [-t tablename]
 -c catalog, database catalog name used to search for tables, default is bety
 -d database, default is bety
 -h this help page
 -i table names, list of space-separated table names to skip over when reindexing
 -p additional psql command line options, default is -U bety
 -q the reindexing should be quiet
 -s reindex the database after reindexing the tables (this should be done sparingly)
 -t tablename, the name of the one table to reindex

If the database is small enough it’s reasonable to reindex the entire database at one time. To do this manually run or schedule the REINDEX statement. For example:

reindex.bety.sh -s

For larger databases it may be desireable to reindex entire tables at a time. An efficient way to do this is to reindex the larger tables and then the entire database. For example:

reindex.bety.sh -t traits; reindex.bety.sh -t yields;
reindex.bety.sh -s

For very large databases it may be desirable to reindex one or more individual indexes before reindexing tables and the databases. In this case running specific psql commands to reindex those specific indexes, followed by reindexing the table is a possible approach. For example:

psql -U bety -c "REINDEX INDEX index_yields_on_citation_id; REINDEX INDEX index_yields_on_cultivar_id;"
reindex.bety.sh -t yields;

Splitting up the indexing commands over time allows the database to operate efficiently with minimal impact on users. One approach is to schedule the reindexing of large, complex tables at a spcific off-time during the week, followed by a general reindexing and excluding those large tables on a weekend night.

Please refere to the Automation section above for information on using cron to schedule reindexing commands.

16.6.2 Vacuuming the database

Vacuuming the BETYdb Postgresql database reduces the amount of resources it uses and introduces its own efficiencies.

Over time, modified and deleted records leave ‘holes’ in the storage of the database. This is a common feature for most databases. Each database has its own way of handing this, in Postgresql it’s the VACUUM command. The VACUUM command performs two main operations: cleaning up tables to make memory use more efficient, and analyze tables for optimum statement execution. The use of the keyword ANALYZE indicates the second operation should take place.

The vacuum.bety.sh script is provided to simplify vacuuming the database.

vacuum.bety.db -h
./vacuum.bety.sh [-c datalog] [-d database] [-f] [-h] [-i table names] [-n] [-p psql options] [-q] [-s] [-t tablename] [-z]
 -c catalog, database catalog name used to search for tables, default is bety
 -d database, default is bety
 -f perform a full vacuum to return resources to the system. Specify rarely, if ever
 -h this help page
 -i table names, list of space-separated table names to skip over when vacuuming
 -n only vacuum the tables and do not analyze, default is to first vacuum and then analyze
 -p additional psql command line options, default is -U bety
 -q the export should be quiet
 -s skip vacuuming the database after vacuuming the tables
 -t tablename, the name of the one table to vacuum
 -z only perform analyze, do not perform a regular vacuum, overrides -n and -f, sets -s

For small databases with light loads it may be possible to set aside a time for a complete vacuum. During this time, commands executed against the database might fail (a temporary condition as the database gets cleaned up). The following commands can be used to perform all the vaccum operations in one go.

vacuum.bety.sh -f

Generally it’s not desireable to have down time. If the system running the database doesn’t need resources that the database is using returned to it, a FULL vacuum can be avoided. This is the default behavior of the script

vacuum.bety.sh

In larger databases, vacuuming the entire database can take a long time causing a negative impact on users. This means that individual tables need to be vacuumed. How often a vacuum needs to be performed is dependent upon a table’s activity. The more frequently updates and deletes occur on a table, the more frequent the vaccum should be. For large tables it may be desireable to separate the table cleanup from the analysis. An example for completely vacuuming and analyzing a table is:

psql -U bety -c "VACUUM traits; VACUUM ANALYZE traits;"

Similar to indexes, vacuuming the most active tables followed by general database vacuuming and vacuum analyze may be a desireable approach.

Also note that it isn’t necessary to run VACUUM ANALYZE for each vacuum performed. Separating the commands and performing a VACUUM ANALYZE after several regular vacuums may be sufficient, with less load on the database.

If the BETYdb database is running on a system with limited resources, or with resources that have become limited, the VACCUM command can return resources to the system from the database. The normal vacuuming process releases resources back to the database for reuse, but not to the system; generally this isn’t a problem. Postgresql has a VACUUM keyword FULL that returns resources back to the system. Requesting a FULL vacuum will lock the table being vacuumed while it is being re-written preventing any statements from being executed against it. If performing VECUUM FULL against the entire database, only the table being actively worked on is locked.

To minimize the impact a VACUUM FULL has on users, it’s best to perform a normal vacuum before a FULL vacuum. If this approach is taken, there sould be a minimal time gap between the normal VACUUM and the VACUUM FULL commands. A normal vacuum allows changes to be made thus requiring the full vacuum to handle those changes, extending it’s run time. Reducing the time between the two commands lessens the work VACUUM FULL needs to do.

psql -U bety -c "VACUUM yields; VACUUM FULL yields; VACUUM ANALYZE yields;"

Give its impact, it’s typically not desireable to perform a VACUUM FULL after every normal vacuum; it should be done on an “as needed” basis or infrequently.