Occurrence download formats

Data downloads are available from GBIF in three primary formats:

  • Simple. This format contains a selection of commonly used terms, after the data has been aligned to GBIF’s taxonomic and geographic indices and structured vocabularies

    • Downloads created on www.gbif.org or through the API using the format SIMPLE_CSV are produced in a tab-separated text format, suitable for use with spreadsheets and programming/scripting languages

    • Occurrence data accessed through cloud services, or with the API format SIMPLE_PARQUET, are produced in Apache Parquet format. The fields are the same as for tab-separated text format.

  • Darwin Core Archive (API: DWCA). This is a compressed Zip file, containing data in tab-separated text format, and metadata in XML format.

    • occurrence.txt contains occurrence data after interpretation by GBIF’s systems.

    • multimedia.txt contains information on multimedia (images, audio, video) relating to the occurrences.

    • verbatim.txt contains the original, uninterpreted data, without modifications by GBIF’s systems.

    • optionally, additional verbatim Darwin Core Archive extensions. The data are as-received from the publisher. See GBIF Registered Extensions for documentation of these — note not all of them are maintained by GBIF.

  • Species List (API: SPECIES_LIST). This is a summary format containing the distinct list of species names returned by the filter.

The header row (first row) of all these files contain the short name of the terms they contain. Most of the terms are defined by the Darwin Core standard. For example, the column catalogNumber contains data of the Darwin Core term http://rs.tdwg.org/dwc/terms/catalogNumber.

Simple download – Term definitions

The definitions marked with 24 are from the Darwin Core standard.

The definitions marked with 24 are from GBIF, and may reflect the result of interpretation and data quality procedures applied by GBIF, or they may not be part of Darwin Core.

Unresolved include directive in modules/ROOT/pages/download-formats.adoc - include::partial$download-simple-terms-table.adoc[]

DWCA downloads

Darwin Core Archive downloads from gbif.org contain the following files:

occurrence.txt

Occurrence data after interpretation by GBIF. Described in detail below.

multimedia.txt

Occurrence multimedia data after interpretation by GBIF. Described in detail below.

verbatim.txt

Occurrence data without interpretation by GBIF. Described in detail below.

verbatim/*.txt

Occurrence extension data without interpretation by GBIF. See GBIF Registered Extensions for documentation of these — note not all of them are maintained by GBIF.

meta.xml

The Darwin Core Archive metafile, describing the structure of the archive — the file formats, column names and their terms.

metadata.xml

Metadata about the download in Ecological Metadata Language (EML).

rights.txt

Licence information for all the datasets with occurrences in the download.

citations.txt

Citations for all the datasets with occurrences in the download.

dataset/*.xml

EML metadata for every dataset with occurrences in the download.

The data may be read without any special tools, including by spreadsheets such as Microsoft Excel and LibreOffice Calc (see the FAQ). The .txt files are tab-delimited, and all files are in UTF-8 encoding with Unix-style (\n) line endings.

There are libraries to read Darwin Core Archives in these programming languages:

Interpreted term definitions (occurrence.txt)

This is the Darwin Core Archive core entity, with row type Occurrence. Values are tab-delimited and in UTF-8 encoding.

Unresolved include directive in modules/ROOT/pages/download-formats.adoc - include::partial$download-dwca-interpreted-terms-table.adoc[]

Multimedia term definitions (multimedia.txt)

Unresolved include directive in modules/ROOT/pages/download-formats.adoc - include::partial$download-dwca-multimedia-terms-table.adoc[]

Verbatim term definitions (verbatim.txt)

Data in this table is not modified by GBIF interpretation processes, except for conversion to Unicode and possible changes to whitespace (spaces, tabs, newlines etc).

Unresolved include directive in modules/ROOT/pages/download-formats.adoc - include::partial$download-dwca-verbatim-terms-table.adoc[]

Verbatim extensions (verbatim/*.txt)

Data in these tables is not modified by GBIF interpretation processes, except for conversion to Unicode and possible changes to whitespace (spaces, tabs, newlines etc).

See the GBIF Registered Extensions for documentation of the extensions.

Species list downloads – Term definitions

Species list downloads are a summary format containing the distinct list of species names returned by the filter.

The definitions marked with 24 are from GBIF, and may reflect the result of interpretation and data quality procedures applied by GBIF, or they may not be part of Darwin Core.

Unresolved include directive in modules/ROOT/pages/download-formats.adoc - include::partial$download-species-list-terms-table.adoc[]