Registering a dataset using the API

Who should register datasets in this way?

Registering a dataset using the API may be appropriate for an institution with many datasets and an existing software system to manage them.

If this does not apply to you, see the other dataset registration options for alternatives.

Permissions

You will need a user account on GBIF.org to handle the registrations. Ideally, this should be an account for your institution or software system, rather than a personal account.

You should also create an account on GBIF-UAT.org which you can use for testing. Please do not create test datasets on GBIF.org! They will be assigned DOIs, which can never be deleted.

Once you have created the accounts, contact helpdesk@gbif.org to ask for editor_rights permissions for your organization.

You can also test on GBIF-UAT.org using the username ws_client_demo and password Demo123. This has permission to create datasets owned by Test Organization #1, which has a Test HTTP installation.

Process

Registration requires two REST calls: the first creates a new dataset, and the second adds an endpoint (an HTTP or HTTPS location) which GBIF will use to access the dataset.
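The whole flow can be sketched as two small shell functions (an outline only; create_dataset and add_endpoint are hypothetical names, and GBIF_USER and GBIF_PASSWORD stand for your own credentials):

```shell
# Outline of the two-step registration flow against the UAT registry.
# GBIF_USER and GBIF_PASSWORD are placeholders for your own credentials.
BASE=https://api.gbif-uat.org/v1

create_dataset() {  # POST the metadata JSON ($1), print the new dataset's UUID
    curl -Ssf --user "$GBIF_USER:$GBIF_PASSWORD" -H "Content-Type: application/json" \
        -X POST --data @"$1" "$BASE/dataset" | tr -d '"'
}

add_endpoint() {    # POST the endpoint JSON ($2) to an existing dataset UUID ($1)
    curl -Ssf --user "$GBIF_USER:$GBIF_PASSWORD" -H "Content-Type: application/json" \
        -X POST --data @"$2" "$BASE/dataset/$1/endpoint"
}
```

The sections below walk through the same two calls one at a time.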

First, record the mandatory dataset metadata in a JSON file, dataset.json.

Mandatory dataset metadata
{
  "publishingOrganizationKey": "0a16da09-7719-40de-8d4f-56a15ed52fb6", (1)
  "installationKey": "92d76df5-3de1-4c89-be03-7a17abad962a", (1)
  "type": "METADATA", (2)
  "title": "Example dataset registration",
  "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
  "language": "eng",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode" (3)
}
1 The publishing organization and installation must already exist in the GBIF Registry.
2 See the enumeration API for the accepted values for the type (DatasetType)…
…and the license.
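Before POSTing, it is worth checking that dataset.json (written without the callout markers above) parses as valid JSON; a trailing comma or stray quote will make the registry reject the request. A minimal check, assuming python3 is on the PATH:

```shell
# Write the minimal metadata, then fail fast on malformed JSON
# before involving the registry at all.
cat > dataset.json <<'EOF'
{
  "publishingOrganizationKey": "0a16da09-7719-40de-8d4f-56a15ed52fb6",
  "installationKey": "92d76df5-3de1-4c89-be03-7a17abad962a",
  "type": "METADATA",
  "title": "Example dataset registration",
  "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
  "language": "eng",
  "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
}
EOF
python3 -m json.tool dataset.json > /dev/null && echo "dataset.json is valid JSON"
```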

POST this JSON to GBIF using the Registry API:

curl -Ssf --user ws_client_demo:Demo123 -H "Content-Type: application/json" -X POST --data @dataset.json https://api.gbif-uat.org/v1/dataset | tr -d '"' | tee dataset.registration

dataset=$(cat dataset.registration)

Notice that the API returns the new dataset’s UUID, which we record (with its quotes stripped by tr) in the file dataset.registration, and then store in the dataset variable.
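If the POST fails part-way (wrong credentials, missing editor rights), dataset.registration can end up holding an error message instead of a key. A quick sanity check on the recorded value (a sketch; the is_uuid helper is a hypothetical name):

```shell
# GBIF dataset keys are lowercase UUIDs (five hex groups: 8-4-4-4-12 characters);
# anything else means the registration call did not succeed.
is_uuid() {
    echo "$1" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

dataset=$(cat dataset.registration 2>/dev/null || true)
if ! is_uuid "$dataset"; then
    echo "dataset.registration does not hold a UUID: '$dataset'" >&2
fi
```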

Next define the endpoint in endpoint.json:

Endpoint definition
{
  "type": "EML", (1)
  "url": "https://techdocs.gbif.org/en/data-publishing/_attachments/test-dataset.eml"
}
1 See other values for type (EndpointType); this will be DWC_ARCHIVE for normal occurrence, checklist or sampling event datasets.

Add this endpoint to the dataset:

curl -Ssf --user ws_client_demo:Demo123 -H "Content-Type: application/json" -X POST --data @endpoint.json https://api.gbif-uat.org/v1/dataset/$dataset/endpoint

Result

The dataset should be visible in the GBIF Registry:

firefox https://registry.gbif-uat.org/dataset/$dataset

After 1-2 minutes the dataset metadata will be updated. After a further 1-60 minutes, depending on the size of the dataset and the number of other datasets being processed, occurrence and/or checklist data should be retrieved from your system and shown on GBIF’s system. You can follow the progress under "Crawling history" and "Ingestion history" for your dataset, and see the length of the queue at "Running crawls" (UAT version) and "Running ingestions" (UAT version). Here, "crawling" refers to GBIF’s system downloading data from your server, and also tracks processing metadata and checklists. "Ingestion" handles occurrence data.
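You can also watch progress from the command line rather than the registry pages. This sketch polls the standard GET /v1/dataset/{key} resource once; jq is an assumed dependency, and watch_dataset is a hypothetical helper:

```shell
# The dataset's "modified" timestamp (and usually its title) changes once
# GBIF has read the full metadata from the registered endpoint.
watch_dataset() {
    curl -Ss "https://api.gbif-uat.org/v1/dataset/$1" \
        | jq -r '.title + " (modified " + .modified + ")"'
}
```

Run it as watch_dataset $dataset a minute or two after registration to see whether the minimal metadata has been replaced.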

Shell script

This is a very minimal procedure: there is no error checking, and the dataset UUID is recorded only in a plain file next to each dataset.

Shell script (click to expand)
#!/bin/bash -eu
#
# This is an outline Bash script to show registering a set of datasets with GBIF.org.
#
# The script isn't recommended for production use; it shows the steps involved, but does not have any error checking.
#
# Using a process like this, you should make sure you store the UUID GBIF assigns to your dataset, so you don't accidentally
# re-register existing datasets as new ones.

# Starting point:
# - One or more datasets as Darwin Core Archives or EML files in the current directory.
#   - This directory has an EML file, which is sufficient for a metadata-only dataset.
# - A web server exposing these files
#   - We can use the GitHub view of this directory for that:
ACCESS_ENDPOINT=https://techdocs.gbif.org/en/data-publishing/_attachments
# - An organization registered in GBIF
# - An installation registered in GBIF (represents the server this script runs on)
# - A GBIF.org user account with publishing rights for the registered organization.
#   - These are the test values available for use on GBIF-UAT.org
ORGANIZATION=0a16da09-7719-40de-8d4f-56a15ed52fb6
INSTALLATION=92d76df5-3de1-4c89-be03-7a17abad962a
GBIF_USER=ws_client_demo
GBIF_PASSWORD=Demo123

# Loop through all the DWCA and EML files:

shopt -s extglob
for dataset_file in *.@(zip|eml) ; do

    # Guess dataset type (script doesn't handle checklists or sampling event datasets)
    case $dataset_file in
        *.eml)
            dataset_type=METADATA
            endpoint_type=EML
            ;;
        *)
            dataset_type=OCCURRENCE
            endpoint_type=DWC_ARCHIVE
            ;;
    esac

    # Check if the dataset is already registered -- we have a local file recording the UUID if that is the case.
    if [[ -e "$dataset_file.registration" ]]; then
        dataset=$(cat "$dataset_file.registration")
        echo "Dataset $dataset_file is already registered with UUID $dataset"
    else

        echo "Registering dataset $dataset_file"

        # Make a JSON object representing the minimum metadata necessary to register a dataset.  The rest of the metadata will
        # be added when GBIF.org retrieves the dataset for indexing.

        # The license isn't essential, but is very helpful to us if you can provide it (correctly) in the initial registration.

        # If using your own DOIs, add "doi": "10.xxxx/xxxx" to this JSON object.

        cat > "$dataset_file.registration_json" <<EOF
{
    "publishingOrganizationKey": "$ORGANIZATION",
    "installationKey": "$INSTALLATION",
    "type": "$dataset_type",
    "title": "Example dataset registration",
    "description": "The dataset is registered with minimal metadata, which is overwritten once GBIF can access the file.",
    "language": "eng",
    "license": "http://creativecommons.org/publicdomain/zero/1.0/legalcode"
}
EOF

        # Send the request by HTTP:
        curl -Ssf --user $GBIF_USER:$GBIF_PASSWORD -H "Content-Type: application/json" -X POST --data @"$dataset_file.registration_json" https://api.gbif-uat.org/v1/dataset | tr -d '"' > "$dataset_file.registration"
        dataset=$(cat "$dataset_file.registration")
    fi

    if [[ -e $dataset_file.endpoint ]]; then
        echo "  Endpoint is already set"
    else

        # Add an endpoint, the location GBIF.org will retrieve the archive (or EML) file from:
        cat > "$dataset_file.endpoint_json" <<EOF
{
    "type": "$endpoint_type",
    "url": "$ACCESS_ENDPOINT/$dataset_file"
}
EOF

        curl -Ssf --user $GBIF_USER:$GBIF_PASSWORD -H "Content-Type: application/json" -X POST --data @"$dataset_file.endpoint_json" https://api.gbif-uat.org/v1/dataset/$dataset/endpoint > "$dataset_file.endpoint"

        echo "Dataset registered, see https://registry.gbif-uat.org/dataset/$dataset or https://api.gbif-uat.org/v1/dataset/$dataset"
    fi

done
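After a run, each dataset file has a matching .registration file holding its UUID. A small helper (a sketch; list_registrations is a hypothetical name) makes it easy to see what has been registered so far:

```shell
# Print each recorded registration as "<dataset file> -> <UUID>";
# does nothing (cleanly) if no .registration files exist yet.
list_registrations() {
    for f in *.registration; do
        [ -e "$f" ] || continue    # glob matched nothing
        printf '%s -> %s\n' "${f%.registration}" "$(cat "$f")"
    done
}
```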