Skip to content
Transformations

DuckDB transformation reference

Lookup reference for DuckDB SQL transformations in Keboola — configuration settings, backend sizes, versions, sync actions, block orchestration, Parquet and type inference, case sensitivity, and SQL extensions.

Reference material for DuckDB SQL transformations. To create one, see the how-to; for when to choose DuckDB, see the explanation.

Set these on the right-side panel of the transformation configuration:

SettingDescriptionDefault
TimeoutMaximum execution time.1 hour
Backend sizeMemory allocated (see Backend sizes).Small
DuckDB versionWhich DuckDB version runs the transformation (see DuckDB version).latest
Automatic data typesAutomatically assign data types to output tables.
Use parquet for input tablesLoad inputs as Parquet instead of CSV (see Parquet format).Off
Infer input table data typesInfer types from input tables (see Infer input table data types).Off
Debug modeEnable debug logging for troubleshooting.Off

Select the DuckDB version used to run the transformation. Use latest (default) to always run on the most recent supported version, or pin a specific supported version — 1.5.2 or 1.4.4 — for stability. Each supported version runs in its own isolated environment.

A larger backend allocates more memory. See the how-to for how to change it.

Backend sizeMemoryRecommended for
XSmall8 GBSmall datasets, testing
Small (default)16 GBMost use cases
Medium32 GBLarge datasets (5 GB+)
Large113.6 GBVery large datasets (10 GB+)

Dynamic backends are not available on the Free Plan (Pay As You Go).

DuckDB automatically detects available CPU and memory. You can also set resource limits manually with the threads and max_memory_mb parameters in the transformation configuration.

DuckDB transformations organize and execute SQL with block-based orchestration:

  • Blocks run sequentially (one after another).
  • Scripts (code pieces) within a block run in parallel when they have no dependencies on each other.
  • The system uses SQLGlot to analyze SQL and build a DAG of dependencies, then optimizes execution order automatically.

Four sync actions help you debug and visualize without running the full transformation, available from the configuration page:

ActionNameWhat it does
Syntax checksyntax_checkValidates SQL syntax without executing queries.
Lineage visualizationlineage_visualizationMarkdown diagram of data dependencies (how tables flow through).
Execution plan visualizationexecution_plan_visualizationShows the planned execution order (blocks and batches).
Expected input tablesexpected_input_tablesLists the input tables the transformation expects, based on SQL analysis.

By default, input tables are loaded as CSV. Enabling Use parquet for input tables loads them as Parquet, which is much faster, uses less memory, and is columnar (optimized for analytics). Recommended for datasets larger than 1 GB.

Keboola Storage tables can be non-typed (all columns VARCHAR). With type inference off, every input value is a string, so functions like SUM() fail because they expect numeric types.

Enable Infer input table data types to have DuckDB detect the real types (for example INTEGER, FLOAT, DATE) so aggregate and type-specific operations work and output columns are properly typed.

Each SQL statement must end with a semicolon (;). Separate multiple statements in one script:

-- Correct: each statement ends with a semicolon
CREATE TABLE "output_a" AS SELECT * FROM "input_a";
CREATE TABLE "output_b" AS SELECT * FROM "input_b";

Missing semicolons cause syntax errors.

DuckDB handles case differently from Snowflake:

  • Unquoted table names are folded to lowercase (SELECT * FROM MyTable references mytable).
  • Quoted table names are case-sensitive (SELECT * FROM "MyTable" references exactly MyTable).
  • Columns are always case-sensitive, regardless of quoting (columnName and ColumnName are different columns).

Use consistent casing, and quote names with mixed case or special characters: "TaBlE-stage". Input table names are typically lowercase unless quoted.

With Infer input table data types disabled, all input columns load as VARCHAR and you must cast explicitly:

CREATE TABLE "result" AS
SELECT
CAST("amount" AS DECIMAL) AS "amount",
CAST("created_at" AS TIMESTAMP) AS "created_at"
FROM "source";

With inference enabled, DuckDB assigns the correct types and you can use them directly.

DuckDB adds quality-of-life SQL extensions:

-- GROUP BY ALL: group by all non-aggregated columns
SELECT product, category, SUM(sales) FROM orders GROUP BY ALL;
-- EXCLUDE: select all columns except some
SELECT * EXCLUDE (password, ssn, credit_card) FROM users;
-- ASOF JOIN: match nearest (e.g. time-series where timestamps don't align)
SELECT s.player_id, s.score, w.temperature
FROM scores s
ASOF JOIN weather w ON s.score_time >= w.timestamp;
-- SUMMARIZE: quick profiling (min, max, null %, unique counts)
SUMMARIZE SELECT * FROM my_table;
  • Filter and project early — apply WHERE at the source and select only the columns you need, to reduce scanned data.
  • Use EXPLAIN — prefix a query with EXPLAIN to see the execution plan and find expensive operations.
EXPLAIN SELECT product_category, SUM(price) AS total_revenue
FROM sales
WHERE sale_date >= '2023-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC;

For datasets larger than 10 GB, configure on-disk processing with PRAGMA settings:

PRAGMA memory_limit='8GB';
PRAGMA temp_directory='/tmp/duckdb_temp';
PRAGMA threads=4;
PRAGMA enable_object_cache;
Ask Kai

Ask anything about Keboola — I'll search the docs and cite the pages I use.