Preparing Apache Superset with Real-World Data: Loading the dunnhumby “The Complete Journey” Dataset into PostgreSQL

Posted on 2026-06-26 / 2026-06-30 by Zaien Aji Trahutomo | Posted in Analytics | Tagged superset

TL;DR

When learning Apache Superset, sample datasets are useful for understanding basic features, but they rarely reflect the challenges found in production environments. To fully explore Superset’s capabilities, it is better to work with real-world datasets that contain realistic business transactions, customer behavior, and marketing activities.

One excellent source is the dunnhumby “The Complete Journey” dataset, available from the “Source Files – dunnhumby” page.

This retail dataset includes customer transactions, products, coupons, campaigns, demographics, and promotional data, making it an ideal dataset for building dashboards and analytical reports in Apache Superset.

Loading Data into PostgreSQL

My preferred approach is to load the CSV files into PostgreSQL first and then connect Apache Superset to PostgreSQL.

Most files included in the dataset can be imported directly into PostgreSQL using DBeaver’s CSV Import Wizard without any issues. However, one file requires special handling:

causal_data.csv

Problem When Importing causal_data.csv

When importing causal_data.csv, DBeaver may incorrectly detect the data type of the display column. As a result, the import process can fail due to data type conversion errors.

To avoid this issue, create the table manually before performing the import.

Use the following DDL:

CREATE TABLE public.causal_data (
    "PRODUCT_ID" int4 NULL,
    "STORE_ID" int4 NULL,
    "WEEK_NO" int4 NULL,
    display varchar NULL,
    mailer varchar(50) NULL
);

The key point is defining the display column as VARCHAR.

Importing the CSV with DBeaver

After creating the table, import the data using DBeaver:

Open DBeaver.
Connect to your PostgreSQL database.
Locate the causal_data table.
Right-click the table.
Select Import Data.
Choose CSV as the source.
Select the causal_data.csv file.
Verify the column mapping.
Start the import process.

Because the table already exists and the column types are defined correctly, the import should complete successfully.
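A quick check after the import can confirm that the table kept the intended types and actually received rows. The queries below are a minimal sketch, assuming the table lives in the public schema as in the DDL above; the column aliases are illustrative.

```sql
-- Confirm the column types that the import ended up using.
-- display should be reported as "character varying".
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name   = 'causal_data'
ORDER BY ordinal_position;

-- Confirm that rows were loaded and see how many distinct codes each text column holds.
SELECT
    count(*)                AS total_rows,
    count(DISTINCT display) AS display_codes,
    count(DISTINCT mailer)  AS mailer_codes
FROM public.causal_data;
```

If total_rows is far below the number of lines in causal_data.csv, the import probably stopped partway and is worth repeating.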

Why This Happens

CSV files do not contain schema information. During import, DBeaver attempts to infer data types from the file contents. In the case of causal_data.csv, the display column contains text values that are not always interpreted correctly by automatic type detection.

By creating the table beforehand and explicitly setting the column as VARCHAR, you eliminate the ambiguity and prevent import errors.
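The failure mode can be reproduced in isolation. The sketch below assumes, for illustration only, that display is mostly digits with an occasional letter code such as 'A'; the exact values in your copy of the file may differ.

```sql
-- A digit-only value converts to an integer without trouble.
SELECT '3'::int4;

-- A letter code does not: PostgreSQL rejects it with an
-- "invalid input syntax" error, which is what aborts the import
-- when the column was guessed to be numeric.
-- SELECT 'A'::int4;

-- After importing into the VARCHAR column, list the codes that
-- would have broken a numeric column.
SELECT display, count(*) AS row_count
FROM public.causal_data
WHERE display !~ '^[0-9]+$'
GROUP BY display
ORDER BY display;
```

The second statement is commented out so that the script runs end to end; uncomment it to see the error.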

Other Tables

Fortunately, the rest of the files in the dunnhumby dataset are straightforward to import. Tables such as:

transaction_data
product
campaign_table
campaign_desc
coupon
coupon_redempt
hh_demographic

can typically be imported directly using DBeaver’s default settings.
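Once the remaining files are imported, a single query can show whether every table received data. This is a sketch that assumes the tables were created in the public schema under the names listed above.

```sql
-- Row counts per imported table; a count of 0 points to a failed or skipped import.
SELECT 'transaction_data' AS table_name, count(*) AS row_count FROM public.transaction_data
UNION ALL
SELECT 'product',        count(*) FROM public.product
UNION ALL
SELECT 'campaign_table', count(*) FROM public.campaign_table
UNION ALL
SELECT 'campaign_desc',  count(*) FROM public.campaign_desc
UNION ALL
SELECT 'coupon',         count(*) FROM public.coupon
UNION ALL
SELECT 'coupon_redempt', count(*) FROM public.coupon_redempt
UNION ALL
SELECT 'hh_demographic', count(*) FROM public.hh_demographic;
```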

Enhance the dimensions and metrics for better analytics

Add date dimension and time dimension

Download the date dimension and time dimension CSV files: Google Drive

Add date dimension

Create the dim_date table:

-- public.dim_date definition

-- Drop table

-- DROP TABLE public.dim_date;

CREATE TABLE public.dim_date (
	date_id int4 NULL,
	"date" date NULL,
	"year" int4 NULL,
	"month" int4 NULL,
	"day" int4 NULL,
	day_of_week int4 NULL,
	week_num int4 NULL,
	quarter varchar(50) NULL,
	half_year varchar(50) NULL
);

Import the dim_date CSV using DBeaver.
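If the CSV is not available, a comparable table could be generated in SQL instead of imported. The sketch below is not the author's file: the start date, the 'Q1'/'H1' label formats, and the ISO day and week numbering are all assumptions made for illustration, so the result may differ from the downloadable CSV. Run it only on an empty dim_date table, otherwise rows will be duplicated.

```sql
-- ASSUMPTIONS (not from the original post):
--   day 1 of the dataset maps to 2024-01-01 (hypothetical start date),
--   quarter is labelled 'Q1'..'Q4', half_year is 'H1' or 'H2',
--   day_of_week uses ISO numbering (1 = Monday), week_num is the ISO week.
INSERT INTO public.dim_date
    (date_id, "date", "year", "month", "day", day_of_week, week_num, quarter, half_year)
SELECT
    s.n,
    s.d,
    EXTRACT(YEAR   FROM s.d)::int4,
    EXTRACT(MONTH  FROM s.d)::int4,
    EXTRACT(DAY    FROM s.d)::int4,
    EXTRACT(ISODOW FROM s.d)::int4,
    EXTRACT(WEEK   FROM s.d)::int4,
    'Q' || EXTRACT(QUARTER FROM s.d)::int4::text,
    CASE WHEN EXTRACT(MONTH FROM s.d) <= 6 THEN 'H1' ELSE 'H2' END
FROM (
    -- One row per day number that occurs in the transactions.
    SELECT n, DATE '2024-01-01' + (n - 1) AS d
    FROM generate_series(
        1,
        (SELECT max("DAY")::int4 FROM public.transaction_data)
    ) AS n
) AS s;
```

Because the dataset identifies days by number rather than by calendar date, any start date is a modelling choice; pick one and keep it consistent across dashboards.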

Now you can join transaction_data with dim_date:

select
dd.date as "Transaction Date", dd.year as "Year", dd.month as "Month", dd.day as "Day of Month", dd.day_of_week as "Day of Week", dd.quarter as "Quarter", dd.half_year as "Half Year",
d.classification_1 as "Age Group", d.classification_2 as "Income Tier", d.classification_3  as "Household Size", d.classification_4  as "Children Count", d.classification_5 as "Spending Power Group",
p."MANUFACTURER" as "Manufacturer", p."DEPARTMENT" as "Department", p."BRAND" as "Brand", p."COMMODITY_DESC" as "Category", p."SUB_COMMODITY_DESC" as "Sub Category", p."CURR_SIZE_OF_PRODUCT" as "Unit of Measure",
t."DAY", t."TRANS_TIME" as "Transaction Time", t."WEEK_NO" as "Week No",
t."BASKET_ID" as "Transaction", t.household_key as "Customer", t."STORE_ID" as "Store", 
d."HOMEOWNER_DESC" as "Home Owner", d."KID_CATEGORY_DESC" as "Kid Category",
t."QUANTITY" as "Unit Sold", t."SALES_VALUE" as "Revenue",
t."RETAIL_DISC" as "Retail Discount", t."COUPON_DISC" as "Coupon Discount", t."COUPON_MATCH_DISC" as "Coupon Match Discount"

from transaction_data t
left join product p on p."PRODUCT_ID" = t."PRODUCT_ID"
left join hh_demographic d on d.household_key = t.household_key
left join dim_date dd on dd.date_id = t."DAY"
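Because the query uses left joins, transactions whose "DAY" value has no row in dim_date still appear, just with empty date columns. A small check, sketched here under the same table and column names as the query above, shows whether that happens.

```sql
-- Count transactions whose day number is missing from dim_date.
-- A result of 0 means every transaction gets a calendar date.
SELECT count(*) AS unmatched_rows
FROM public.transaction_data t
LEFT JOIN public.dim_date dd ON dd.date_id = t."DAY"
WHERE dd.date_id IS NULL;
```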

Add time dimension

Create the dim_time table:

-- public.dim_time definition

-- Drop table

-- DROP TABLE public.dim_time;

CREATE TABLE public.dim_time (
	time_id int4 NULL,
	"time" varchar(50) NULL,
	"hour" int4 NULL,
	half_hour int4 NULL
);

Import the dim_time CSV using DBeaver.

Add a time_id column to the transaction_data table:

ALTER TABLE public.transaction_data ADD time_id int NULL;

Populate transaction_data.time_id:

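-- The formula treats "TRANS_TIME" as an HHMM number: pad it to four digits,
-- convert it to minutes since midnight, then map it to a 1-based half-hour slot.
UPDATE transaction_data
SET time_id =
(
    (
        -- hours part, converted to minutes
        CAST(SUBSTRING(LPAD("TRANS_TIME"::text,4,'0'),1,2) AS INTEGER) * 60
        +
        -- minutes part
        CAST(SUBSTRING(LPAD("TRANS_TIME"::text,4,'0'),3,2) AS INTEGER)
    ) / 30
) + 1;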
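To see what the formula produces, it can be run against a few literal values before touching the real table. The sketch below reuses the same expression on sample times; the sample values are made up, and it assumes, as the formula itself does, that "TRANS_TIME" is an HHMM number.

```sql
-- Apply the time_id formula to hand-picked HHMM values.
SELECT
    v.trans_time,
    (
        (
            CAST(SUBSTRING(LPAD(v.trans_time::text, 4, '0'), 1, 2) AS INTEGER) * 60
            +
            CAST(SUBSTRING(LPAD(v.trans_time::text, 4, '0'), 3, 2) AS INTEGER)
        ) / 30
    ) + 1 AS time_id
FROM (VALUES (0), (29), (30), (945), (1631), (2359)) AS v(trans_time);

-- Expected result:
--   trans_time | time_id
--            0 |       1    (00:00 ->    0 min ->  0 + 1)
--           29 |       1    (00:29 ->   29 min ->  0 + 1)
--           30 |       2    (00:30 ->   30 min ->  1 + 1)
--          945 |      20    (09:45 ->  585 min -> 19 + 1)
--         1631 |      34    (16:31 ->  991 min -> 33 + 1)
--         2359 |      48    (23:59 -> 1439 min -> 47 + 1)
```

The division is integer division, so every minute within the same half hour lands in the same slot, and the slots run from 1 to 48.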

Now you can join transaction_data with dim_time:

select
dd.date as "Transaction Date", dd.year as "Year", dd.month as "Month", dd.day as "Day of Month", dd.day_of_week as "Day of Week", dd.quarter as "Quarter", dd.half_year as "Half Year",
dt.hour as "Transaction Hour", dt.half_hour as "Transaction Half Hour",
d.classification_1 as "Age Group", d.classification_2 as "Income Tier", d.classification_3  as "Household Size", d.classification_4  as "Children Count", d.classification_5 as "Spending Power Group",
p."MANUFACTURER" as "Manufacturer", p."DEPARTMENT" as "Department", p."BRAND" as "Brand", p."COMMODITY_DESC" as "Category", p."SUB_COMMODITY_DESC" as "Sub Category", p."CURR_SIZE_OF_PRODUCT" as "Unit of Measure",
t."DAY", t."TRANS_TIME" as "Transaction Time", t."WEEK_NO" as "Week No",
t."BASKET_ID" as "Transaction", t.household_key as "Customer", t."STORE_ID" as "Store", 
d."HOMEOWNER_DESC" as "Home Owner", d."KID_CATEGORY_DESC" as "Kid Category",
t."QUANTITY" as "Unit Sold", t."SALES_VALUE" as "Revenue",
t."RETAIL_DISC" as "Retail Discount", t."COUPON_DISC" as "Coupon Discount", t."COUPON_MATCH_DISC" as "Coupon Match Discount"
from transaction_data t
left join product p on p."PRODUCT_ID" = t."PRODUCT_ID"
left join hh_demographic d on d.household_key = t.household_key
left join dim_date dd on dd.date_id = t."DAY"
left join dim_time dt on dt.time_id = t.time_id
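One way to reuse this join is to save it as a database view, which Superset can then register as a dataset like any other table. The sketch below is a shortened version with only some of the columns; the view name v_transaction_enriched and the snake_case aliases are hypothetical choices, not part of the original setup.

```sql
-- Hypothetical view name and aliases; extend the column list as needed.
CREATE OR REPLACE VIEW public.v_transaction_enriched AS
SELECT
    dd."date"       AS transaction_date,
    dt."hour"       AS transaction_hour,
    dt.half_hour    AS transaction_half_hour,
    p."DEPARTMENT"  AS department,
    p."BRAND"       AS brand,
    t."BASKET_ID"   AS basket_id,
    t.household_key AS household_key,
    t."STORE_ID"    AS store_id,
    t."QUANTITY"    AS units_sold,
    t."SALES_VALUE" AS revenue
FROM public.transaction_data t
LEFT JOIN public.product  p  ON p."PRODUCT_ID" = t."PRODUCT_ID"
LEFT JOIN public.dim_date dd ON dd.date_id = t."DAY"
LEFT JOIN public.dim_time dt ON dt.time_id = t.time_id;
```

Keeping the join in a view means the SQL is maintained in one place and every chart built on it sees the same column names.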

Using the Dataset in Apache Superset

Once all tables are loaded into PostgreSQL, connect the database to Apache Superset and start building datasets, charts, and dashboards.

Some useful analyses include:

Weekly sales trends
Product performance analysis
Store comparison dashboards
Campaign effectiveness tracking
Coupon redemption analytics
Customer purchasing behavior analysis
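As an illustration of the first item, a weekly sales trend can be computed directly from transaction_data. This is a sketch using the column names from the queries above; the output aliases are illustrative, and the same aggregation could equally be defined as metrics inside Superset.

```sql
-- Revenue, units, and basket count per week.
SELECT
    t."WEEK_NO"                   AS week_no,
    sum(t."SALES_VALUE")          AS revenue,
    sum(t."QUANTITY")             AS units_sold,
    count(DISTINCT t."BASKET_ID") AS baskets
FROM public.transaction_data t
GROUP BY t."WEEK_NO"
ORDER BY t."WEEK_NO";
```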

Because the dataset contains multiple related business entities, it provides a realistic environment for learning data modeling, SQL exploration, and dashboard development in Apache Superset.

Conclusion

The dunnhumby “The Complete Journey” dataset is an excellent real-world dataset for evaluating and demonstrating the capabilities of Apache Superset. Most of the CSV files can be imported directly into PostgreSQL without modification. However, the causal_data.csv file requires a small adjustment: the table should be created manually, with the display column defined as VARCHAR before importing the data.

After this one-time setup, the dataset loads successfully into PostgreSQL and provides a rich foundation for building insightful visualizations and interactive dashboards in Apache Superset. To further enhance data exploration and analytical capabilities, additional dim_date and dim_time dimension tables were introduced. These dimensions enable more flexible time-based analysis, allowing users to easily explore trends, seasonality, promotional impacts, and customer purchasing behavior across different dates and time periods.

Overall, the combination of the dunnhumby dataset, PostgreSQL, Apache Superset, and supplementary date/time dimensions creates a robust environment for retail analytics, business intelligence, and data visualization experimentation.