Introduction
As a data consultant working with small to mid-size companies, I’m always searching for tools that strike the right balance between power and practicality. While I love the robust capabilities of my usual stack (Microsoft Azure, Confluent Cloud, and Databricks), it can sometimes feel like overkill for smaller projects. That’s where DuckDB enters the picture. This in-process database promises a compelling blend of lightweight efficiency and analytical power, all while maintaining compatibility with our preferred data formats like Delta Lake. In this blog article, I’ll take DuckDB for a test drive, investigating its features and assessing whether it truly has the potential to fill a crucial gap in my data toolkit.
Use Cases and Tech Stack of this Article
My focus for this exploration isn’t about replacing my existing persistence layers or squeezing out every last ounce of performance. Instead, I’m interested in how seamlessly DuckDB integrates with my preferred open data ecosystem, particularly within the Azure environment.
Think of it as a “getting to know you” phase with DuckDB. I want to understand how effectively it can interact with common file formats like Parquet and Delta Lake, especially when those files reside in Azure Blob Storage.
Here’s the roadmap for my investigation:
- Local Exploration: I’ll start with the basics, testing DuckDB’s ability to read CSV, Parquet, and Delta files stored locally.
- Azure Blob Storage Integration: Next, I’ll explore how smoothly DuckDB can access Parquet and Delta files directly from Azure Blob Storage.
- Streamlit Synergy: I’ll experiment with combining DuckDB and Streamlit to create a simple application that reads Delta files from Azure Blob Storage. This will help assess its potential for building lightweight interactive dashboards.
- Unity Catalog Compatibility: Finally, I’ll delve into whether DuckDB can tap into Databricks’ Unity Catalog, leveraging its governance and data management features.
Python will be the programming language for every use case.
Local Exploration
Accessing and querying local files with DuckDB is refreshingly straightforward, as this simple Python code shows:
from pathlib import Path

import duckdb


def read_csv(path: Path) -> duckdb.DuckDBPyRelation:
    query = f"SELECT * FROM '{path.as_posix()}'"
    return duckdb.sql(query)


def read_parquet(path: Path) -> duckdb.DuckDBPyRelation:
    query = f"SELECT * FROM '{path.as_posix()}'"
    return duckdb.sql(query)


def read_delta(path: Path) -> duckdb.DuckDBPyRelation:
    query = f"SELECT * FROM delta_scan('file:///{path.as_posix()}')"
    return duckdb.sql(query)
Reading CSV and Parquet files is very intuitive. For Delta Lake, DuckDB leverages the delta extension’s delta_scan function. While still experimental at the time of writing, the extension is automatically downloaded from the official repository, making the process seamless.
Of course, these are just basic examples, and queries can be customized to fit specific needs. But the key takeaway here is the simplicity and ease of use that DuckDB offers right out of the box.
DuckDB also offers a PySpark API! While still experimental, this feature is very promising, especially for me, since I am primarily working with PySpark. Imagine the possibilities:
- Reduced learning curve: I can leverage my PySpark knowledge, minimizing the need to learn a completely new DuckDB API.
- Simplified migrations: Switching between DuckDB and Spark becomes smoother, allowing me to choose the best tool for the job without significant code rewrites.
from pathlib import Path

from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.dataframe import DataFrame


def read_parquet_data(path: Path) -> DataFrame:
    spark = SparkSession.builder.appName("duckdb").getOrCreate()
    return spark.read.parquet(path.as_posix())
This code snippet demonstrates how to read a Parquet file using the familiar PySpark syntax. And the best part? It returns a Spark-compatible DataFrame!
Unfortunately, as of now, writing Delta files using DuckDB’s PySpark API isn’t supported. However, the potential here is immense. DuckDB’s PySpark API could be a game-changer, enabling me to seamlessly integrate this powerful database into my existing Spark workflows.
While DuckDB’s PySpark API doesn’t yet support writing Delta files directly, Python’s deltalake and pandas libraries provide a workaround:
from pathlib import Path

import duckdb
from deltalake import write_deltalake

BASE_DIR = Path(__file__).resolve().parent
DATA_DIR = BASE_DIR / "data"
READ_FILE = DATA_DIR / "data.parquet"
WRITE_PATH = DATA_DIR / "delta" / "data"

df = duckdb.sql(f"SELECT * FROM '{READ_FILE.as_posix()}'").df()
write_deltalake(WRITE_PATH, df)
This sample might feel pointless, since the same file is simply read and written back without any transformation. But putting that aside, let’s focus on these two lines:
df = duckdb.sql(f"SELECT * FROM '{READ_FILE.as_posix()}'").df()
write_deltalake(WRITE_PATH, df)
- duckdb.sql(...).df() converts our DuckDB query result into a familiar Pandas DataFrame.
- write_deltalake(WRITE_PATH, df) uses the deltalake library to effortlessly write the Pandas DataFrame as a Delta table.
This workaround highlights a key advantage: even if DuckDB’s objects feel unfamiliar at first, they can easily be transformed into familiar Pandas or PySpark DataFrames.
Great! Let’s move on to the cloud.
Azure Blob Storage Integration
Let’s dive into DuckDB’s integration with data stored in Azure Blob Storage.
To test this out, I spun up a new Storage Account in the Azure portal, with a container for my DuckDB test files. To grant DuckDB access, I assigned myself the Storage Blob Data Owner role for the test container. While this approach deviates from the principle of least privilege, it was a quick and easy solution for my experimental setup.
import os

import duckdb


def perform_az_blob_storage_auth(storage_account_name: str) -> None:
    duckdb.sql(f"""
        CREATE SECRET az (
            TYPE AZURE,
            PROVIDER CREDENTIAL_CHAIN,
            CHAIN 'cli',
            ACCOUNT_NAME '{storage_account_name}'
        );
    """)


def read_az_blob_parquet(az_path: str) -> duckdb.DuckDBPyRelation:
    query = f"SELECT * FROM read_parquet('az://{az_path}');"
    return duckdb.sql(query)


def read_az_blob_delta(az_path: str) -> duckdb.DuckDBPyRelation:
    query = f"SELECT * FROM delta_scan('az://{az_path}');"
    return duckdb.sql(query)


if __name__ == "__main__":
    perform_az_blob_storage_auth(os.getenv("AZURE_STORAGE_ACCOUNT_NAME"))
    parquet_data = read_az_blob_parquet("duckdbtest/parquet/data.parquet")
    print(parquet_data)
    delta_data = read_az_blob_delta("duckdbtest/delta/data")
    print(delta_data)
DuckDB interacts with Azure Blob Storage through secrets, which are seamlessly
integrated into queries. Notice the AZURE keyword in the CREATE SECRET
statement? This triggers the automatic loading of DuckDB’s Azure extension.
When working with Azure (or any cloud provider), you generally have two authentication options:
- Passwordless Authentication: Leverage roles and permissions to control access.
- Secret-Based Authentication: Use secrets to authenticate.
Personally, I lean towards the first approach. Secrets, while convenient, introduce security risks.
In this code, CREDENTIAL_CHAIN relies on the user currently logged in to Azure
at the system level. On my local machine, it’s my personal Azure user,
authenticated via az login. However, in a server or Azure Web App environment,
CREDENTIAL_CHAIN would utilize a Managed Identity associated with that machine.
DuckDB’s behavior aligns perfectly with my expectations and requirements. Let’s move forward and build a quick Streamlit application using DuckDB and Azure Blob Storage.
Streamlit Synergy
Let’s move on with a very basic Streamlit example:
import os

import duckdb
import streamlit as st


def set_secret() -> None:
    duckdb.sql(f"""
        CREATE SECRET IF NOT EXISTS az (
            TYPE AZURE,
            PROVIDER CREDENTIAL_CHAIN,
            CHAIN 'cli',
            ACCOUNT_NAME '{os.getenv('AZURE_STORAGE_ACCOUNT_NAME')}'
        );
    """)


def get_all_rows() -> duckdb.DuckDBPyRelation:
    query = "SELECT * FROM delta_scan('az://duckdbtest/delta/data') LIMIT 50;"
    return duckdb.sql(query)


def main() -> None:
    st.set_page_config(page_title="stduck", page_icon=":duck:")
    set_secret()
    data = get_all_rows()
    st.table(data.df())


if __name__ == "__main__":
    main()
This Streamlit example demonstrates the simplicity of reading Delta files from Azure Blob Storage and displaying them in a table.
It’s important to highlight that in this example, and throughout this exploration, DuckDB operates “only” as a processing engine. We’re not permanently persisting any data within DuckDB itself. This approach aligns with our goal of using DuckDB for efficient, on-the-fly data analysis.
Another thing worth mentioning is DuckDB’s memory allocation. In my case, the app’s memory footprint kept growing across many reloads. I won’t investigate these potential out-of-memory issues further here.
Now, let’s move on to our final investigation!
Unity Catalog Integration
As I mentioned earlier, Databricks and Azure are core components of my data stack. However, I often encounter scenarios where Databricks’ compute costs become a concern, particularly when dealing with simple data read operations from Delta tables.
This is where I envision DuckDB playing a crucial role: a lightweight processing engine that sits between BI tools (like Streamlit or Power BI) and my Delta Lake data.
To validate this idea, I needed to explore DuckDB’s integration with Unity Catalog. And to my surprise, a solution already exists!
import os

import duckdb


def read_dbx_data() -> duckdb.DuckDBPyRelation:
    duckdb.sql(f"""
        CREATE SECRET (
            TYPE UC,
            TOKEN '{os.getenv("DBX_TOKEN")}',
            ENDPOINT 'https://{os.getenv("DBX_WORKSPACE")}'
        );
    """)
    duckdb.sql(f"""
        CREATE SECRET (
            TYPE AZURE,
            PROVIDER CREDENTIAL_CHAIN,
            CHAIN 'cli',
            ACCOUNT_NAME '{os.getenv('AZURE_STORAGE_ACCOUNT_NAME')}'
        );
    """)
    duckdb.sql("ATTACH 'duckdbtestuc' (TYPE UC_CATALOG);")
    return duckdb.sql("SELECT * FROM duckdbtestuc.testschema.data;")
This script enables DuckDB to read data directly from my Unity Catalog
(duckdbtestuc) without needing to specify the underlying storage location.
This abstraction simplifies data access and leverages the metadata management
capabilities of Unity Catalog.
However, there’s a crucial point to remember: DuckDB still needs proper permissions on the Azure Storage Account to access the data. While Unity Catalog provides valuable metadata, such as the storage location, the actual access control is managed within Azure (or your chosen cloud provider).
Conclusion
I’ve walked you through a range of DuckDB examples, some more relevant to my needs than others. While local file references and explicit storage locations were interesting to explore, my primary focus was on features like Unity Catalog integration, Delta Lake support, and the ability to reference remote data by table names rather than storage locations.
The code snippets I’ve shared might appear deceptively simple, but trust me, getting to this point took considerable effort. I spent hours debugging, troubleshooting, and scouring GitHub issues to overcome challenges posed by the experimental nature of features like the Spark API, Delta integration, and Unity Catalog support.
Despite these hurdles, my exploration has solidified my belief that DuckDB is promising. It addresses my growing need to work with open file formats like Delta Lake and Iceberg, offering a straightforward way to connect to and analyze data without being solely reliant on Python.
To provide a balanced perspective, my next blog post will tackle the same use cases using plain Pandas. This will allow for a direct comparison and highlight the strengths and weaknesses of each approach.
