Catalog maker

A package to build intake catalogs for CMIP5, CMIP6 and CORDEX data holdings.

  • Free software: BSD - see LICENSE file in top-level package directory

Installing

Create a clone of the repository:

git clone https://github.com/roocs/catalog-maker.git

Go into the catalog-maker directory:

cd catalog-maker

Install the package into your virtual environment:

pip install -e .

Creating an Intake Catalog

Catalog maker provides tools for writing data catalogs of the known data holdings in a csv format, described by a YAML file.

Each project in catalog_maker/etc/roocs.ini has options that set the file paths for the inputs and outputs of the catalog maker. A list of the datasets to include must be provided, and the path to this list is set per project in catalog_maker/etc/roocs.ini. The dataset ids in this list are used as-is in the ds_id column of the csv file, so write them exactly as they should appear there.

If creating a c3s-cmip6 catalog, make sure the dataset ids start with ‘c3s-cmip6’ instead of ‘CMIP6’.
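As an illustration of the prefix rule above, here is a minimal sketch that rewrites a leading CMIP6 facet. It assumes dataset ids are dot-separated facets with the project first; the function name and the example id are hypothetical and not part of catalog-maker.

```python
def to_c3s_cmip6_id(ds_id: str) -> str:
    """Return ds_id with a leading 'CMIP6' facet replaced by 'c3s-cmip6'.

    Assumes the id is a dot-separated string whose first facet is the project.
    Ids that already use another prefix are returned unchanged.
    """
    project, sep, rest = ds_id.partition(".")
    if project == "CMIP6":
        return "c3s-cmip6" + sep + rest
    return ds_id


# Hypothetical dataset id, used only for illustration.
example = "CMIP6.CMIP.MOHC.UKESM1-0-LL.historical.r1i1p1f2.Amon.tas.gn.v20190406"
print(to_c3s_cmip6_id(example))
```

Running something like this over the dataset list before creating batches keeps the ds_id column consistent.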

The data catalog is created using a database backend that stores the results of the scans; the csv and YAML files are then created from it. A PostgreSQL database is required for this. Once you have a database, export an environment variable called ABCUNIT_DB_SETTINGS:

$ export ABCUNIT_DB_SETTINGS="dbname=<name> user=<user> host=<host> password=<pwd>"
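Before starting a long run it can help to check that the variable is set and complete. The sketch below is not part of catalog-maker: the helper names are hypothetical, and it assumes the four keys shown in the example above are all expected and that values contain no spaces.

```python
import os

# Keys taken from the example export line above (assumed to all be required).
REQUIRED_KEYS = {"dbname", "user", "host", "password"}


def parse_db_settings(settings: str) -> dict:
    """Split a 'key=value key=value ...' string into a dict.

    Tokens without '=' are ignored; values containing spaces are not supported.
    """
    parsed = {}
    for item in settings.split():
        key, sep, value = item.partition("=")
        if sep:
            parsed[key] = value
    return parsed


def missing_keys(settings: str) -> set:
    """Return the required keys that are absent from the settings string."""
    return REQUIRED_KEYS - set(parse_db_settings(settings))


settings = os.environ.get("ABCUNIT_DB_SETTINGS", "")
missing = missing_keys(settings)
print("missing keys:", sorted(missing) if missing else "none")
```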

The table created is named after the project you are creating a catalog for, in the format <project_name>_catalog_results, e.g. c3s_cmip6_catalog_results.
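The naming rule can be sketched as follows. The function is hypothetical; replacing hyphens with underscores is an assumption inferred from the c3s-cmip6 example above, not a documented rule.

```python
def results_table_name(project: str) -> str:
    """Build the results table name for a project.

    Assumes hyphens in the project name become underscores, as the
    'c3s_cmip6_catalog_results' example suggests.
    """
    return project.replace("-", "_") + "_catalog_results"


print(results_table_name("c3s-cmip6"))  # c3s_cmip6_catalog_results
```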

You must be in the catalog-maker directory to run the commands below.

Creating batches

Once the list of datasets has been collated, it must be split into batches:

$ python catalog_maker/cli.py create-batches -p c3s-cmip6

The option -p is required to specify the project.
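Conceptually, batching splits the dataset list into fixed-size chunks. The sketch below shows only that idea and is not the create-batches implementation; the function name, the batch_size parameter and the dataset ids are hypothetical.

```python
from typing import List


def create_batches(ds_ids: List[str], batch_size: int) -> List[List[str]]:
    """Split ds_ids into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [ds_ids[i:i + batch_size] for i in range(0, len(ds_ids), batch_size)]


# Hypothetical dataset ids, used only for illustration.
ds_ids = [f"c3s-cmip6.example.ds{n}" for n in range(1, 6)]

# Batches are numbered from 1 here, matching the '-b 1' usage in the commands.
for number, batch in enumerate(create_batches(ds_ids, batch_size=2), start=1):
    print(number, batch)
```

The last batch may be shorter than the others when the list length is not a multiple of the batch size.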

Creating catalog entries

Once the batches are created, the catalog maker can be run, either locally or on lotus. The number of datasets included in each batch and the maximum duration of each job on lotus can also be changed in catalog_maker/etc/roocs.ini.

Each batch can be run independently, e.g. running batch 1 locally:

$ python catalog_maker/cli.py run -p c3s-cmip6 -b 1 -r local

or running all batches on lotus:

$ python catalog_maker/cli.py run -p c3s-cmip6 -r lotus

This creates a table in the database containing, for each file in each dataset, an ordered dictionary of the entry if the scan succeeded, or the error traceback if an exception was raised.

It is also possible to force a rescan of datasets that have already been scanned. To do this, use the -f flag, e.g.:

$ python catalog_maker/cli.py run -p c3s-cmip6 -r lotus -f

Viewing entries and errors

To view the records:

$ python catalog_maker/cli.py list -p c3s-cmip6

With many entries, this may take a while.

To just get a count of how many files have been scanned:

$ python catalog_maker/cli.py list -p c3s-cmip6 -c

To see any errors:

$ python catalog_maker/cli.py show-errors -p c3s-cmip6

To see just a count of errors:

$ python catalog_maker/cli.py show-errors -p c3s-cmip6 -c

Each count will show how many files and how many datasets have been successful/failed.

The list count will also show the total numbers of datasets/files in the database - including errors. The error count will show whether there are any datasets that have files which have succeeded and failed i.e. that are partially scanned. You can then use the delete command explained below to delete the entries for these partially scanned datasets if required.
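The notion of a partially scanned dataset can be expressed as a set intersection. This is a minimal sketch under the assumption that each record can be reduced to a (ds_id, file path) pair; the function name, record layout and sample data are hypothetical and do not reflect the actual database schema.

```python
def partially_scanned(succeeded, failed):
    """Return the dataset ids that have both successful and failed files.

    succeeded / failed: iterables of (ds_id, file_path) pairs.
    """
    ok_ids = {ds_id for ds_id, _ in succeeded}
    failed_ids = {ds_id for ds_id, _ in failed}
    return sorted(ok_ids & failed_ids)


# Hypothetical records, used only for illustration.
succeeded = [("ds.one", "a.nc"), ("ds.one", "b.nc"), ("ds.two", "c.nc")]
failed = [("ds.one", "d.nc"), ("ds.three", "e.nc")]

print(partially_scanned(succeeded, failed))  # ['ds.one']
```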

Deleting entries

It is possible to delete entries by dataset id:

$ python catalog_maker/cli.py delete -p c3s-cmip6 -d <ds_id>

You can also provide a list of dataset ids to the -d option. This command only deletes successful entries and will leave errors for the datasets specified in the database.

To delete all entries, including errors, for specific dataset ids, use the command:

$ python catalog_maker/cli.py delete -p c3s-cmip6 -d <ds_id> -e

Writing to CSV

The final command writes the entries to a csv file:

$ python catalog_maker/cli.py write -p c3s-cmip6

The -c flag compresses the output csv file, e.g.:

$ python catalog_maker/cli.py write -p c3s-cmip6 -c

The csv file will be generated in the csv_dir specified in catalog_maker/etc/roocs.ini. It will be named “{project}_{version_stamp}.csv.gz” if compressed, e.g. c3s-cmip6_v20210414.csv.gz, or “{project}_{version_stamp}.csv” if not compressed.
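The naming pattern can be sketched as below. Both helpers are hypothetical, and the assumption that the version stamp is "v" followed by a YYYYMMDD date is inferred only from the v20210414 example above.

```python
from datetime import date


def version_stamp_for(day: date) -> str:
    """Format a date as a version stamp, assuming the 'vYYYYMMDD' form."""
    return day.strftime("v%Y%m%d")


def csv_file_name(project: str, version_stamp: str, compressed: bool) -> str:
    """Build '{project}_{version_stamp}.csv', adding '.gz' when compressed."""
    name = f"{project}_{version_stamp}.csv"
    return name + ".gz" if compressed else name


stamp = version_stamp_for(date(2021, 4, 14))
print(csv_file_name("c3s-cmip6", stamp, compressed=True))   # c3s-cmip6_v20210414.csv.gz
print(csv_file_name("c3s-cmip6", stamp, compressed=False))  # c3s-cmip6_v20210414.csv
```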

A YAML file will be created in the catalog_dir specified in catalog_maker/etc/roocs.ini. It will be named c3s.yml and will contain the entry below for each scanned project that uses the same catalog_dir:

sources:
  c3s-cmip6:
    args:
      urlpath:
    csv_kwargs:
      blocksize: null
      compression: gzip
      dtype:
        level: object
    description: c3s-cmip6 datasets
    driver: intake.source.csv.CSVSource
    metadata:
      last_updated:

urlpath and last_updated for a project will be updated every time the csv file is written for the project.
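The compressed csv file can also be inspected directly, without going through the catalog. The sketch below uses only the Python standard library to count rows per dataset. The ds_id column comes from the description above; the path column, the file name and the sample rows are hypothetical stand-ins written to a temporary directory.

```python
import csv
import gzip
import os
import tempfile
from collections import Counter


def files_per_dataset(csv_gz_path: str) -> Counter:
    """Count the rows (one per file) for each ds_id in a gzip-compressed csv."""
    with gzip.open(csv_gz_path, "rt", newline="") as handle:
        return Counter(row["ds_id"] for row in csv.DictReader(handle))


# Write a tiny stand-in file; a real catalog csv has more columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ds_id", "path"])
        writer.writerow(["c3s-cmip6.example.one", "a.nc"])
        writer.writerow(["c3s-cmip6.example.one", "b.nc"])
        writer.writerow(["c3s-cmip6.example.two", "c.nc"])
    counts = files_per_dataset(path)

print(dict(counts))
```

To check a real file, pass the path of the csv written to csv_dir instead of the temporary one.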

Deleting the table of results

In order to delete all entries in the table of results:

$ python catalog_maker/cli.py clean -p c3s-cmip6

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.