DuckDB transformation reference
Lookup reference for DuckDB SQL transformations in Keboola — configuration settings, backend sizes, versions, sync actions, block orchestration, Parquet and type inference, case sensitivity, and SQL extensions.
Reference material for DuckDB SQL transformations. To create one, see the how-to; for when to choose DuckDB, see the explanation.
Configuration settings
Section titled “Configuration settings”Set these on the right-side panel of the transformation configuration:
| Setting | Description | Default |
|---|---|---|
| Timeout | Maximum execution time. | 1 hour |
| Backend size | Memory allocated (see Backend sizes). | Small |
| DuckDB version | Which DuckDB version runs the transformation (see DuckDB version). | latest |
| Automatic data types | Automatically assign data types to output tables. | |
| Use parquet for input tables | Load inputs as Parquet instead of CSV (see Parquet format). | Off |
| Infer input table data types | Infer types from input tables (see Infer input table data types). | Off |
| Debug mode | Enable debug logging for troubleshooting. | Off |
DuckDB version
Section titled “DuckDB version”Select the DuckDB version used to run the transformation. Use latest (default) to always run on the most recent supported version, or pin a specific supported version — 1.5.2 or 1.4.4 — for stability. Each supported version runs in its own isolated environment.
Backend sizes
Section titled “Backend sizes”A larger backend allocates more memory. See the how-to for how to change it.
| Backend size | Memory | Recommended for |
|---|---|---|
| XSmall | 8 GB | Small datasets, testing |
| Small (default) | 16 GB | Most use cases |
| Medium | 32 GB | Large datasets (5 GB+) |
| Large | 113.6 GB | Very large datasets (10 GB+) |
Dynamic backends are not available on the Free Plan (Pay As You Go).
Auto-resource detection
Section titled “Auto-resource detection”DuckDB automatically detects available CPU and memory. You can also set resource limits manually with the threads and max_memory_mb parameters in the transformation configuration.
Block-based orchestration
Section titled “Block-based orchestration”DuckDB transformations organize and execute SQL with block-based orchestration:
- Blocks run sequentially (one after another).
- Scripts (code pieces) within a block run in parallel when they have no dependencies on each other.
- The system uses SQLGlot to analyze SQL and build a DAG of dependencies, then optimizes execution order automatically.
Sync actions
Section titled “Sync actions”Four sync actions help you debug and visualize without running the full transformation, available from the configuration page:
| Action | Name | What it does |
|---|---|---|
| Syntax check | syntax_check | Validates SQL syntax without executing queries. |
| Lineage visualization | lineage_visualization | Markdown diagram of data dependencies (how tables flow through). |
| Execution plan visualization | execution_plan_visualization | Shows the planned execution order (blocks and batches). |
| Expected input tables | expected_input_tables | Lists the input tables the transformation expects, based on SQL analysis. |
Parquet format
Section titled “Parquet format”By default, input tables are loaded as CSV. Enabling Use parquet for input tables loads them as Parquet, which is much faster, uses less memory, and is columnar (optimized for analytics). Recommended for datasets larger than 1 GB.
Infer input table data types
Section titled “Infer input table data types”Keboola Storage tables can be non-typed (all columns VARCHAR). With type inference off, every input value is a string, so functions like SUM() fail because they expect numeric types.
Enable Infer input table data types to have DuckDB detect the real types (for example INTEGER, FLOAT, DATE) so aggregate and type-specific operations work and output columns are properly typed.
Semicolons between statements
Section titled “Semicolons between statements”Each SQL statement must end with a semicolon (;). Separate multiple statements in one script:
-- Correct: each statement ends with a semicolonCREATE TABLE "output_a" AS SELECT * FROM "input_a";
CREATE TABLE "output_b" AS SELECT * FROM "input_b";Missing semicolons cause syntax errors.
Identifier case sensitivity
Section titled “Identifier case sensitivity”DuckDB handles case differently from Snowflake:
- Unquoted table names are folded to lowercase (
SELECT * FROM MyTablereferencesmytable). - Quoted table names are case-sensitive (
SELECT * FROM "MyTable"references exactlyMyTable). - Columns are always case-sensitive, regardless of quoting (
columnNameandColumnNameare different columns).
Use consistent casing, and quote names with mixed case or special characters: "TaBlE-stage". Input table names are typically lowercase unless quoted.
Working with data types
Section titled “Working with data types”With Infer input table data types disabled, all input columns load as VARCHAR and you must cast explicitly:
CREATE TABLE "result" ASSELECT CAST("amount" AS DECIMAL) AS "amount", CAST("created_at" AS TIMESTAMP) AS "created_at"FROM "source";With inference enabled, DuckDB assigns the correct types and you can use them directly.
SQL extensions
Section titled “SQL extensions”DuckDB adds quality-of-life SQL extensions:
-- GROUP BY ALL: group by all non-aggregated columnsSELECT product, category, SUM(sales) FROM orders GROUP BY ALL;
-- EXCLUDE: select all columns except someSELECT * EXCLUDE (password, ssn, credit_card) FROM users;
-- ASOF JOIN: match nearest (e.g. time-series where timestamps don't align)SELECT s.player_id, s.score, w.temperatureFROM scores sASOF JOIN weather w ON s.score_time >= w.timestamp;
-- SUMMARIZE: quick profiling (min, max, null %, unique counts)SUMMARIZE SELECT * FROM my_table;Query optimization
Section titled “Query optimization”- Filter and project early — apply
WHEREat the source and select only the columns you need, to reduce scanned data. - Use
EXPLAIN— prefix a query withEXPLAINto see the execution plan and find expensive operations.
EXPLAIN SELECT product_category, SUM(price) AS total_revenueFROM salesWHERE sale_date >= '2023-01-01'GROUP BY product_categoryORDER BY total_revenue DESC;Memory management for large datasets
Section titled “Memory management for large datasets”For datasets larger than 10 GB, configure on-disk processing with PRAGMA settings:
PRAGMA memory_limit='8GB';PRAGMA temp_directory='/tmp/duckdb_temp';PRAGMA threads=4;PRAGMA enable_object_cache;