Explore a GBIF dataset – PostgreSQL/Grafana

In this article I explain how to set up an open-source visualization stack to explore a GBIF dataset using PostgreSQL and Grafana. The Global Biodiversity Information Facility (GBIF) is an international organisation that focuses on making scientific data on biodiversity available via the Internet using web services. The dataset I use in this demo can be found here.

Dataset characteristics:

  • 287,671 records
  • 249 columns
  • 51% with coordinates

Download this docker-compose project to try this setup on your local computer using Docker. After starting the containers, the dataset is downloaded into the PostgreSQL container and the queries shown below are applied. The end result is a Grafana dashboard that uses PostgreSQL as its backend. This is a screenshot of the dashboard, showing specimens found in Indonesia between 1800 and 1855:

Create queries and tables

Download and unpack the dataset using the following commands:

wget https://api.gbif.org/v1/occurrence/download/request/0107080-200613084148143.zip
unzip 0107080-200613084148143.zip

The file containing the data we are interested in is named occurrence.txt.

Convert column names in header to lowercase:

awk 'NR==1{$0=tolower($0)} 1' occurrence.txt > occurrence.tsv
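You can check the result by listing the first few header fields:

head -n 1 occurrence.tsv | tr '\t' '\n' | head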

Export the first 10 lines as reference records:

head -n 10 occurrence.tsv > occurrence_reference.tsv

Remove header line:

tail -n +2 occurrence.tsv > gbif_raw.tsv

Use csvkit to create a CREATE TABLE statement:

csvsql --tabs --dialect postgresql --no-constraints --db-schema explore --tables gbif_raw occurrence_reference.tsv > create_table.sql

The generated DDL references the explore schema, so create that first, then copy the query from create_table.sql and execute both on your PostgreSQL database instance:

CREATE SCHEMA explore;
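For reference, create_table.sql will look roughly like this; the column types are inferred by csvsql from the ten reference rows, so your output will differ:

CREATE TABLE explore.gbif_raw (
	gbifid DECIMAL,
	datasetkey VARCHAR,
	occurrenceid VARCHAR
	-- ... plus the remaining columns, 249 in total
);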

Set the table to UNLOGGED to avoid generating excessive WAL when running the UPDATE statements below. Run on your PostgreSQL instance:

ALTER TABLE explore.gbif_raw SET UNLOGGED;
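UNLOGGED tables are not crash-safe, but that is acceptable for a staging table like this one. If you want the table to be durable again after the load, you can switch it back:

ALTER TABLE explore.gbif_raw SET LOGGED;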

Generate SQL statements that set every column to the text data type; the types csvsql infers from only ten reference rows are not reliable for the full file. Copy the resulting queries and execute them on your PostgreSQL database instance:

SELECT 'ALTER TABLE ' || table_schema || '.' || table_name ||
       ' ALTER COLUMN "' || column_name || '" TYPE text;'
FROM information_schema.columns
WHERE table_name = 'gbif_raw' AND table_schema = 'explore';
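If you work in psql, you can skip the copy-paste step: end the generating query with the \gexec meta-command and psql executes each resulting row as a statement. The same trick applies to the UPDATE generator further below.

SELECT 'ALTER TABLE ' || table_schema || '.' || table_name ||
       ' ALTER COLUMN "' || column_name || '" TYPE text;'
FROM information_schema.columns
WHERE table_name = 'gbif_raw' AND table_schema = 'explore' \gexec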

Import the data. Run on your PostgreSQL instance:

COPY explore.gbif_raw FROM '/tmp/gbif_raw.tsv' DELIMITER E'\t';
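Note that COPY ... FROM reads the file on the database server (in the docker-compose setup the dataset is downloaded into the PostgreSQL container). If the file lives on your client machine instead, psql's \copy meta-command streams it over the connection; the default text format already uses a tab delimiter:

\copy explore.gbif_raw FROM 'gbif_raw.tsv'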

Generate SQL statements that set empty strings to NULL for the subset of columns we will keep; empty fields would otherwise break the type casts during the insert below. Copy the resulting queries and execute them on your PostgreSQL database instance:

SELECT 'UPDATE ' || table_schema || '.' || table_name ||
       ' SET "' || column_name || '" = null WHERE "' || column_name || '" = '''';'
FROM information_schema.columns
WHERE table_name = 'gbif_raw' AND table_schema = 'explore' AND column_name IN (
 'gbifid', 
 'eventdate',
 'year',
 'month',
 'day',
 'catalognumber', 
 'occurrenceid', 
 'recordedby', 
 'sex', 
 'lifestage', 
 'preparations', 
 'locality', 
 'stateprovince', 
 'countrycode', 
 'higherclassification', 
 'kingdom', 
 'phylum', 
 'class', 
 'order', 
 'family', 
 'genus', 
 'specificepithet',
 'species',
 'genericname',
 'scientificname',
 'decimallatitude',
 'decimallongitude'
 );
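As a quick sanity check after the updates: the dataset characteristics above promise 287,671 records, 51% of them with coordinates. count() over a column only counts non-null values:

SELECT count(*) AS total,
       count(decimallatitude) AS with_coordinates
FROM explore.gbif_raw;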

Create the table and some indexes. Run on your PostgreSQL instance:

CREATE EXTENSION postgis;

CREATE TABLE explore.gbif (
 gbifid bigint PRIMARY KEY,
 eventdate timestamptz,
 year int,
 month int,
 day int,
 catalognumber text,
 occurrenceid text,
 recordedby text,
 sex text,
 lifestage text,
 preparations text,
 locality text,
 stateprovince text,
 countrycode text,
 higherclassification text,
 kingdom text,
 phylum text,
 class text,
 "order" text,
 family text,
 genus text,
 specificepithet text,
 species text,
 genericname text,
 scientificname text,
 decimallatitude float,
 decimallongitude float,
 geom geometry(POINT,4326),
 location text,
 tsv tsvector
);

CREATE INDEX ON explore.gbif USING BTREE (eventdate);
CREATE INDEX ON explore.gbif USING GIST (geom);

Create a function with a trigger to populate the geom column, so that PostGIS queries can be performed. Also create a function with a trigger to populate the location column for use in Elasticsearch. Run on your PostgreSQL instance:

CREATE OR REPLACE FUNCTION explore.update_geom() RETURNS TRIGGER AS 
 $$
   BEGIN 
     NEW.geom := ST_SetSRID(ST_MakePoint(NEW.decimallongitude, NEW.decimallatitude), 4326);
     RETURN NEW;
   END;
 $$ LANGUAGE plpgsql;

CREATE TRIGGER update_geom BEFORE INSERT OR UPDATE ON explore.gbif FOR EACH ROW EXECUTE PROCEDURE explore.update_geom();

CREATE OR REPLACE FUNCTION explore.update_location() RETURNS TRIGGER AS 
 $$
   BEGIN
     IF NEW.decimallatitude IS NOT NULL AND NEW.decimallongitude IS NOT NULL THEN
       NEW.location := concat_ws(',', NEW.decimallatitude, NEW.decimallongitude);
     END IF;
     RETURN NEW;
   END;
 $$ LANGUAGE plpgsql;

CREATE TRIGGER update_location BEFORE INSERT OR UPDATE ON explore.gbif FOR EACH ROW EXECUTE PROCEDURE explore.update_location();
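The tsv column in the table above is not populated by either trigger. If you also want full-text search, a minimal sketch of a third trigger, assuming you index the scientificname and locality columns, could look like this:

CREATE OR REPLACE FUNCTION explore.update_tsv() RETURNS TRIGGER AS
 $$
   BEGIN
     -- Assumption: index the scientific name and locality; adjust the columns to taste.
     NEW.tsv := to_tsvector('simple', concat_ws(' ', NEW.scientificname, NEW.locality));
     RETURN NEW;
   END;
 $$ LANGUAGE plpgsql;

CREATE TRIGGER update_tsv BEFORE INSERT OR UPDATE ON explore.gbif FOR EACH ROW EXECUTE PROCEDURE explore.update_tsv();

A GIN index on the column (CREATE INDEX ON explore.gbif USING GIN (tsv);) would keep those searches fast.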

Insert the data

Insert data into table explore.gbif from table explore.gbif_raw, casting the id, date, and coordinate columns to their target types. Run on your PostgreSQL instance:

INSERT INTO explore.gbif (SELECT
 gbifid::bigint, 
 eventdate::timestamptz,
 year::int,
 month::int,
 day::int,
 catalognumber, 
 occurrenceid, 
 recordedby, 
 sex, 
 lifestage, 
 preparations, 
 locality, 
 stateprovince, 
 countrycode, 
 higherclassification, 
 kingdom, 
 phylum, 
 class, 
 "order", 
 family, 
 genus, 
 specificepithet,
 species,
 genericname,
 scientificname,
 decimallatitude::float,
 decimallongitude::float
 FROM explore.gbif_raw
);
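After the insert you can verify that the triggers fired; the share of rows with a geometry should match the 51% coordinate coverage listed at the top:

SELECT count(*) AS total,
       count(geom) AS with_geom,
       round(100.0 * count(geom) / count(*), 1) AS pct_with_geom
FROM explore.gbif;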

Grafana

The Grafana dashboard consists of three panels: a trackmap panel, a histogram panel, and a table panel that can be used to browse through the data by time and location. I chose the trackmap panel created by the Alexandra Institute over the worldmap panel from Grafana Labs because of its map-bounds functionality.

This map shows clusters of specimens found in Southern Africa using the following query:

WITH a AS (
  SELECT
    eventdate AS time,
    decimallatitude,
    decimallongitude,
    geom,
    COUNT(*) OVER () AS cnt
  FROM explore.gbif
  WHERE eventdate BETWEEN '1780-01-01T23:00:00Z' AND '2020-12-31T23:00:00Z'
    AND decimallatitude >= -45.82879925192133
    AND decimallatitude <= 45.583289756006316
    AND decimallongitude >= -248.20312500000003
    AND decimallongitude <= 248.55468750000003
), b AS (
  SELECT
    time,
    decimallatitude,
    decimallongitude,
    ST_ClusterKMeans(geom, LEAST(cnt::integer, 150)) OVER () AS clusters
  FROM a
)
SELECT
  clusters,
  COUNT(clusters) AS tooltip,
  AVG(decimallatitude) AS latitude,
  AVG(decimallongitude) AS longitude
FROM b
GROUP BY 1
ORDER BY 1;
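In the dashboard itself the hard-coded time range and bounding box above are filled in by Grafana. A minimal sketch of the time filter with Grafana's PostgreSQL data source, using the $__timeFilter() macro (the bounding-box handling by the trackmap panel is omitted here):

SELECT
  eventdate AS time,
  decimallatitude,
  decimallongitude
FROM explore.gbif
WHERE $__timeFilter(eventdate)  -- expanded by Grafana to a BETWEEN on the dashboard's time range
ORDER BY 1;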

Zoom to a specific area and show the records in a table:

The two specimens shown in the table are the oldest in this dataset. Use the link in the occurrenceid column to check the information and see an image of these specimens, collected by François Levaillant in 1781.

Happy exploring!
