TL;DR
When learning Apache Superset, sample datasets are useful for understanding basic features, but they rarely reflect the challenges found in production environments. To fully explore Superset’s capabilities, it is better to work with real-world datasets that contain realistic business transactions, customer behavior, and marketing activities.
One excellent source is the dunnhumby “The Complete Journey” dataset, available from the Source Files page on the dunnhumby website.
This retail dataset includes customer transactions, products, coupons, campaigns, demographics, and promotional data, which makes it well suited to building dashboards and analytical reports in Apache Superset.
Loading Data into PostgreSQL
My preferred approach is to load the CSV files into PostgreSQL first and then connect Apache Superset to PostgreSQL.
Most files included in the dataset can be imported directly into PostgreSQL using DBeaver’s CSV Import Wizard without any issues. However, one file requires special handling:
causal_data.csv
Problem with causal_data.csv
When importing causal_data.csv, DBeaver may incorrectly detect the data type of the display column. As a result, the import process can fail due to data type conversion errors.
To avoid this issue, create the table manually before performing the import.
Use the following DDL:
CREATE TABLE public.causal_data (
    "PRODUCT_ID" int4 NULL,
    "STORE_ID" int4 NULL,
    "WEEK_NO" int4 NULL,
    display varchar NULL,
    mailer varchar(50) NULL
);
The key point is defining the display column as VARCHAR.
Importing the CSV with DBeaver
After creating the table, import the data using DBeaver:
- Open DBeaver.
- Connect to your PostgreSQL database.
- Locate the causal_data table.
- Right-click the table.
- Select Import Data.
- Choose CSV as the source.
- Select the causal_data.csv file.
- Verify the column mapping.
- Start the import process.
Because the table already exists and the column types are defined correctly, the import should complete successfully.
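For readers who prefer a scripted load over the wizard, the same import can be sketched in Python. This is only an illustration under several assumptions: the psycopg2 driver is installed, the table was created with the DDL above, and the CSV header row lists the columns in the same order as the table. The helper names build_copy_sql and load_csv, the dsn string, and the file path are hypothetical.

```python
def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement with every column double-quoted."""
    # Quoting preserves the upper-case names used in the DDL ("PRODUCT_ID", ...).
    column_list = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    return f"COPY {table} ({column_list}) FROM STDIN WITH (FORMAT csv, HEADER true)"


# Column order assumed to match the CSV header (an assumption, check your file).
CAUSAL_COLUMNS = ["PRODUCT_ID", "STORE_ID", "WEEK_NO", "display", "mailer"]
COPY_SQL = build_copy_sql("public.causal_data", CAUSAL_COLUMNS)


def load_csv(dsn, csv_path):
    """Stream the CSV file into the pre-created table using COPY."""
    # Imported here so the statement builder works without the driver installed.
    import psycopg2

    conn = psycopg2.connect(dsn)
    try:
        # "with conn" commits on success and rolls back on error.
        with conn, conn.cursor() as cur, open(csv_path, newline="") as f:
            cur.copy_expert(COPY_SQL, f)
    finally:
        conn.close()


print(COPY_SQL)
```

A call such as load_csv("dbname=dunnhumby user=postgres", "causal_data.csv") would then run the load; both arguments are placeholders. Because the column types come from the existing table, no type detection is involved.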
Why This Happens
CSV files do not contain schema information. During import, DBeaver attempts to infer data types from the file contents. In the case of causal_data.csv, the display column contains text values that are not always interpreted correctly by automatic type detection.
By creating the table beforehand and explicitly setting the column as VARCHAR, you eliminate the ambiguity and prevent import errors.
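The failure mode can be illustrated with a small Python sketch. It is not DBeaver's actual detection algorithm, only a simplified stand-in that guesses a type from a sample of rows. The CSV rows are hypothetical and assume the display column mixes digit-like codes with letter codes.

```python
import csv
import io


def infer_type(values):
    """Return 'integer' if every value parses as an int, else 'varchar'."""
    for value in values:
        try:
            int(value)
        except ValueError:
            return "varchar"
    return "integer"


# Hypothetical rows: early display values look numeric, a later one is a letter.
SAMPLE_CSV = """PRODUCT_ID,STORE_ID,WEEK_NO,display,mailer
26190,286,70,0,A
26190,288,70,0,A
26190,289,70,0,A
26355,292,70,A,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
display_values = [row["display"] for row in rows]

# Inspecting only the first rows suggests a numeric column...
guess_from_sample = infer_type(display_values[:3])
# ...while the full column shows that it has to be text.
guess_from_all = infer_type(display_values)

print(guess_from_sample, guess_from_all)  # prints: integer varchar
```

If a tool commits to the sampled guess, the first non-numeric value raises a conversion error during import, which is why declaring the column as VARCHAR up front is the safer choice.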
Other Tables
Fortunately, the rest of the files in the dunnhumby dataset are straightforward to import. Tables such as:
- transaction_data
- product
- campaign_table
- campaign_desc
- coupon
- coupon_redempt
- hh_demographic
can typically be imported directly using DBeaver’s default settings.
Using the Dataset in Apache Superset
Once all tables are loaded into PostgreSQL, connect the database to Apache Superset and start building datasets, charts, and dashboards.
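Superset connects to databases through a SQLAlchemy URI, which for PostgreSQL has the form postgresql://user:password@host:port/database. The sketch below builds such a URI and percent-encodes the credentials so that characters like @ or / do not break it. The function name, user, password, host, and database name are all hypothetical placeholders.

```python
from urllib.parse import quote


def superset_postgres_uri(user, password, host, port, database):
    """Build a SQLAlchemy URI for PostgreSQL with percent-encoded credentials."""
    # safe="" makes quote() also encode "/" so it cannot be read as a path separator.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values for illustration only.
uri = superset_postgres_uri("superset_reader", "p@ss/word", "localhost", 5432, "dunnhumby")
print(uri)  # prints: postgresql://superset_reader:p%40ss%2Fword@localhost:5432/dunnhumby
```

The resulting string is what goes into the SQLAlchemy URI field when adding the database in Superset.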
Some useful analyses include:
- Weekly sales trends
- Product performance analysis
- Store comparison dashboards
- Campaign effectiveness tracking
- Coupon redemption analytics
- Customer purchasing behavior analysis
Because the dataset contains multiple related business entities, it provides a realistic environment for learning data modeling, SQL exploration, and dashboard development in Apache Superset.
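As a starting point for the first analysis in the list, weekly sales trends, here is a minimal sketch of the kind of query that could back a Superset line chart. It assumes transaction_data has WEEK_NO and SALES_VALUE columns with upper-case names; check your imported table before reusing it. An in-memory SQLite database with three made-up rows stands in for PostgreSQL so the example runs on its own.

```python
import sqlite3

# Aggregate sales per week; column names are assumptions about the imported table.
WEEKLY_SALES_SQL = """
SELECT "WEEK_NO" AS week_no,
       SUM("SALES_VALUE") AS total_sales
FROM transaction_data
GROUP BY "WEEK_NO"
ORDER BY "WEEK_NO"
"""

# SQLite is only a stand-in here; the rows below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE transaction_data ("WEEK_NO" INTEGER, "STORE_ID" INTEGER, "SALES_VALUE" REAL)'
)
conn.executemany(
    "INSERT INTO transaction_data VALUES (?, ?, ?)",
    [(1, 364, 1.5), (1, 364, 2.5), (2, 31742, 3.0)],
)

weekly_sales = conn.execute(WEEKLY_SALES_SQL).fetchall()
conn.close()

print(weekly_sales)  # prints: [(1, 4.0), (2, 3.0)]
```

The same SELECT statement can be pasted into SQL Lab in Superset and saved as a virtual dataset, or the aggregation can be built in the chart editor with WEEK_NO as the dimension and a SUM metric on SALES_VALUE.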
Conclusion
The dunnhumby “The Complete Journey” dataset is an excellent real-world dataset for evaluating Apache Superset. While most CSV files can be imported directly into PostgreSQL, the causal_data.csv file requires a small adjustment: create the table manually and define the display column as VARCHAR before importing.
After this one-time setup, the dataset loads successfully and becomes a rich source of data for creating meaningful visualizations and dashboards in Apache Superset.
