Tuesday, July 8, 2025

7 DuckDB SQL Queries That Save You Hours of Pandas Work

DuckDB SQL Queries
Image by Author | Canva

The Pandas library has one of the fastest-growing communities. This popularity has opened the door for alternatives, like Polars. In this article, we'll explore one such alternative, DuckDB.

DuckDB is a SQL database that you can run right in your notebook. No setup is required, and no servers are needed. It's easy to install and can work with Pandas in parallel.

Unlike other SQL databases, you don't have to configure a server. It just works with your notebook after installation. That means no local setup headaches; you're writing code immediately. DuckDB handles filtering, joins, and aggregations with clean SQL syntax compared to Pandas, and performs significantly better on large datasets.

So enough with the words, let's get started!

Data Project – Uber Business Modeling

We'll use it with Jupyter Notebook, combining it with Python for data analysis. To make things more exciting, we'll work on a real-life data project. Let's get started!

A Data Project Example for DuckDB SQL Queries

Here is the link to the data project we'll be using in this article. It's a data project from Uber called Partner's Business Modeling.

Uber used this data project in the recruitment process for data science positions, and you are asked to analyze the data for two different scenarios.

  • Scenario 1: Compare the cost of two bonus programs designed to get more drivers online during a busy day.
  • Scenario 2: Calculate and compare the annual net income of a traditional taxi driver vs one who partners with Uber and buys a car.

Loading Dataset

Let's load the dataframe first. This step is needed because we'll register this dataset with DuckDB in the following sections.

import pandas as pd
df = pd.read_csv("dataset_2.csv")

Exploring the Dataset

Here are the first few rows:
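
One quick way to display them (a minimal snippet, using the df we loaded above):

df.head()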

A Data Project Example for DuckDB SQL Queries

Let's see all the columns.
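
A quick way to list them in the notebook (again, just a sketch using the same df):

df.columns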

Here is the output.

A Data Project Example for DuckDB SQL Queries

Connect DuckDB and Register the DataFrame

Good, it's a really simple dataset, but how do we connect DuckDB with it?
First, if you have not installed it yet, install DuckDB.
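
In a terminal or notebook cell, it's a one-liner (the package on PyPI is simply called duckdb):

pip install duckdb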

Connecting with DuckDB is easy. Also, if you want to read the documentation, check it out here.

Now, here is the code to make a connection and register the dataframe.

import duckdb
con = duckdb.connect()

con.register("my_data", df)

Connect DuckDB and Register the DataFrame

Good, let's start exploring seven queries that can save you hours of Pandas work!

1. Multi-Criteria Filtering for Complex Eligibility Rules

One of the most important advantages of SQL is how naturally it handles filtering, especially multi-condition filtering.

Implementation of Multi-Criteria Filtering in DuckDB vs Pandas

DuckDB lets you apply multiple filters using SQL's WHERE clause and AND logic, which scales well as the number of filters grows.

SELECT 
    *
FROM data
WHERE condition_1
  AND condition_2
  AND condition_3
  AND condition_4

Now let's see how we'd write the same logic in Pandas. In Pandas, the same logic is expressed using chained boolean masks with brackets, which can get verbose with many conditions.

filtered_df = df[
    (df["condition_1"]) &
    (df["condition_2"]) &
    (df["condition_3"]) &
    (df["condition_4"])
]

Both approaches are equally readable and applicable for basic use. DuckDB feels more natural and cleaner as the logic gets more complex.

Multi-Criteria Filtering for the Uber Data Project

In this case, we want to find drivers who qualify for a specific Uber bonus program.

According to the rules, the drivers must:

  • Be online for at least 8 hours
  • Complete at least 10 trips
  • Accept at least 90% of trip requests
  • Have a rating of 4.7 or above

Now all we have to do is write a query that applies all these filters. Here is the code.

SELECT 
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7

But to execute this code with Python, we need to wrap it in con.execute(""" """) and call fetchdf(), as shown below:

con.execute("""
SELECT 
    COUNT(*) AS qualified_drivers,
    COUNT(*) * 50 AS total_payout
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
""").fetchdf()

We'll do this throughout the article. Now that you know how to run it in a Jupyter notebook, we'll show only the SQL code from now on, and you'll know how to convert it to the Pythonic version.
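
If you don't want to retype that wrapper every time, a small helper keeps the notebook tidy. This helper is just a convenience sketch of my own, not part of the original project:

def run_sql(query: str):
    # Execute a SQL string on the DuckDB connection and return a pandas DataFrame
    return con.execute(query).fetchdf()

run_sql("SELECT COUNT(*) AS total_drivers FROM my_data")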
Good. Now, remember that the data project wants us to calculate the total payout for Option 1.

Multi-Criteria Filtering

We've counted the qualified drivers, but we should multiply this by $50, because the payout will be $50 for each driver, so we do it with COUNT(*) * 50.

Here is the output.

Multi-Criteria Filtering

2. Fast Aggregation to Estimate Business Incentives

SQL is great for quick aggregations, especially when you need to summarize data across rows.

Implementation of Aggregation in DuckDB vs Pandas

SELECT 
    COUNT(*) AS num_rows,
    SUM(column_name) AS total_value
FROM data
WHERE some_condition

DuckDB lets you aggregate values across rows using SQL functions like SUM and COUNT in a single compact block.

filtered = df[df["some_condition"]]
num_rows = filtered.shape[0]
total_value = filtered["column_name"].sum()

In Pandas, you first have to filter the dataframe, then separately count and sum using chained methods.

DuckDB is more concise and easier to read, and doesn't require managing intermediate variables.

Aggregation in the Uber Data Project

Good, let's move on to the second bonus scheme, Option 2. According to the project description, drivers will receive $4 per trip if they:

  • Complete at least 12 trips
  • Have a rating of 4.7 or greater

SELECT 
    COUNT(*) AS qualified_drivers,
    SUM("Journeys Accomplished") * 4 AS total_payout
FROM information
WHERE "Journeys Accomplished" >= 12
  AND Ranking >= 4.7

This time, instead of just counting the drivers, we need to add up the number of trips they completed, because the bonus is paid per trip, not per person.

The count here tells us how many drivers qualify. However, to calculate the total payout, we sum their trips and multiply by $4, as required by Option 2.

Aggregation in DuckDB

Here is the output.

Aggregation in DuckDB

With DuckDB, we don't have to loop through the rows or build custom aggregations. The SUM function takes care of everything we need.

3. Detect Overlaps and Differences Using Boolean Logic

In SQL, you can easily combine conditions using Boolean logic, such as AND, OR, and NOT.

Implementation of Boolean Logic in DuckDB vs Pandas

SELECT *
FROM data
WHERE condition_a
  AND condition_b
  AND NOT (condition_c)

DuckDB supports Boolean logic natively in the WHERE clause using AND, OR, and NOT.

filtered = df[
    (df["condition_a"]) &
    (df["condition_b"]) &
    ~(df["condition_c"])
]

Pandas requires combining logical operators with masks and parentheses, including the use of "~" for negation.

While both work, DuckDB is easier to reason about when the logic involves exclusions or nested conditions.

Boolean Logic for the Uber Data Project

Now that we have calculated Option 1 and Option 2, what comes next? It's time to do the comparison. Remember our next question: how many drivers qualify for Option 1 but not for Option 2?

Boolean Logic in DuckDB

SELECT COUNT(*) AS only_option1
FROM my_data
WHERE "Supply Hours" >= 8
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) >= 90
  AND "Trips Completed" >= 10
  AND Rating >= 4.7
  AND NOT ("Trips Completed" >= 12 AND Rating >= 4.7)

This is where we can use Boolean logic. We'll use a combination of AND and NOT.

Here is the output.

Boolean Logic in DuckDB

Let's break it down:

  • The first four conditions are there for Option 1.
  • The NOT (...) part excludes drivers who also qualify for Option 2.

It's pretty simple, right?

4. Fast Cohort Sizing with Conditional Filters

Sometimes, you need to understand how big a specific group or cohort is within your data.

Implementation of Conditional Filters in DuckDB vs Pandas

SELECT 
  ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM data), 2) AS percentage
FROM data
WHERE condition_1
  AND condition_2
  AND condition_3

DuckDB handles cohort filtering and percentage calculation with one SQL query, even including subqueries.

filtered = df[
    (df["condition_1"]) &
    (df["condition_2"]) &
    (df["condition_3"])
]
percentage = round(100.0 * len(filtered) / len(df), 2)

Pandas requires filtering, counting, and manual division to calculate percentages.

DuckDB here is cleaner and faster. It minimizes the number of steps and avoids repeated code.

Cohort Sizing for the Uber Data Project

Now we're at the last question of Scenario 1. In this question, Uber wants us to find the drivers who missed some of the targets, like trips completed and acceptance rate, yet still had high ratings. Specifically, the drivers who:

  • Completed fewer than 10 trips
  • Had an acceptance rate lower than 90%
  • Had a rating of 4.7 or higher

SELECT 
  ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM my_data), 2) AS percentage
FROM my_data
WHERE "Trips Completed" < 10
  AND CAST(REPLACE("Accept Rate", '%', '') AS DOUBLE) < 90
  AND Rating >= 4.7

These are three separate filters, and we want the percentage of drivers satisfying all of them.

Here is the output.

Cohort Sizing in DuckDB

Here, we filtered the rows where all three conditions were satisfied, counted them, and divided by the total number of drivers to get a percentage.

5. Basic Arithmetic Queries for Income Modeling

Now, let's say you want to do some basic math. You can write expressions directly in your SELECT statement.

Implementation of Arithmetic in DuckDB vs Pandas

SELECT 
    daily_income * work_days * weeks_per_year AS annual_revenue,
    weekly_cost * weeks_per_year AS total_cost,
    (daily_income * work_days * weeks_per_year) - (weekly_cost * weeks_per_year) AS net_income
FROM data

DuckDB allows arithmetic to be written directly in the SELECT clause, like a calculator.

daily_income = 200
weeks_per_year = 49
work_days = 6
weekly_cost = 500

annual_revenue = daily_income * work_days * weeks_per_year
total_cost = weekly_cost * weeks_per_year
net_income = annual_revenue - total_cost

Pandas requires several intermediate calculations in separate variables for the same result.

DuckDB condenses the math into a readable SQL block, whereas Pandas gets a bit cluttered with variable assignments.

Basic Arithmetic in the Uber Data Project

In Scenario 2, Uber asks us to calculate how much money (after expenses) the driver makes per year without partnering with Uber. There are expenses such as gas, rent, and insurance.

Basic Arithmetic in DuckDB

SELECT 
    200 * 6 * (52 - 3) AS annual_revenue,
    200 * (52 - 3) AS gas_expense,
    500 * (52 - 3) AS rent_expense,
    400 * 12 AS insurance_expense,
    (200 * 6 * (52 - 3)) 
      - (200 * (52 - 3) + 500 * (52 - 3) + 400 * 12) AS net_income

Now let's calculate the annual revenue and subtract the expenses from it.
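
Working through the numbers: annual revenue is 200 * 6 * 49 = 58,800, while gas (200 * 49 = 9,800), rent (500 * 49 = 24,500), and insurance (400 * 12 = 4,800) add up to 39,100 in expenses, leaving a net income of 19,700 per year.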

Here is the output.

Basic Arithmetic in DuckDB

With DuckDB, you can write this like a SQL math block. You don't need Pandas DataFrames or manual looping!

6. Conditional Calculations for Dynamic Expense Planning

What if your cost structure changes based on certain conditions?

Implementation of Conditional Calculations in DuckDB vs Pandas

SELECT 
    original_cost * 1.05 AS increased_cost,
    original_cost * 0.8 AS discounted_cost,
    0 AS removed_cost,
    (original_cost * 1.05 + original_cost * 0.8) AS total_new_cost

DuckDB lets you apply conditional logic using arithmetic adjustments inside your query.

weeks_worked = 49
gas = 200
insurance = 400

gas_expense = gas * 1.05 * weeks_worked
insurance_expense = insurance * 0.8 * 12
rent_expense = 0
total = gas_expense + insurance_expense

Pandas uses the same logic with several math lines and manual updates to variables.

DuckDB turns what would be multi-step logic in Pandas into a single SQL expression.

Conditional Calculations in the Uber Data Project

In this scenario, we model what happens if the driver partners with Uber and buys a car. The expenses change as follows:

  • Gas cost increases by 5%
  • Insurance decreases by 20%
  • No more rent expense

con.execute("""
SELECT 
    200 * 1.05 * 49 AS gas_expense,
    400 * 0.8 * 12 AS insurance_expense,
    0 AS rent_expense,
    (200 * 1.05 * 49) + (400 * 0.8 * 12) AS total_expense
""").fetchdf()


Here is the output.

Conditional Calculations in DuckDB

7. Goal-Driven Math for Revenue Targeting

Sometimes, your analysis is driven by a business goal, like hitting a revenue target or covering a one-time cost.

Implementation of Goal-Driven Math in DuckDB vs Pandas

WITH vars AS (
  SELECT base_income, cost_1, cost_2, target_item
),
calc AS (
  SELECT 
    base_income - (cost_1 + cost_2) AS current_profit,
    cost_1 * 1.1 + cost_2 * 0.8 + target_item AS new_total_expense
  FROM vars
),
remaining AS (
  SELECT 
    current_profit + new_total_expense AS required_revenue,
    required_revenue / 49 AS required_weekly_income
  FROM calc
)
SELECT required_weekly_income FROM remaining

DuckDB handles multi-step logic using CTEs (common table expressions). It makes the query modular and easy to read.

weeks = 49
original_income = 200 * 6 * weeks
original_cost = (200 + 500) * weeks + 400 * 12
net_income = original_income - original_cost

# new expenses + car cost
new_gas = 200 * 1.05 * weeks
new_insurance = 400 * 0.8 * 12
car_cost = 40000

required_revenue = net_income + new_gas + new_insurance + car_cost
required_weekly_income = required_revenue / weeks

Pandas requires nesting calculations and reusing earlier variables to avoid duplication.

DuckDB lets you build the logic pipeline step by step, without cluttering your notebook with scattered code.

Goal-Driven Math in the Uber Data Project

Now that we have modeled the new costs, let's answer the final business question:

How much more does the driver need to earn per week to do both of the following?

  • Pay off a $40,000 car within a year
  • Keep the same yearly net income

WITH vars AS (
  SELECT 
    52 AS total_weeks_per_year,
    3 AS weeks_off,
    6 AS days_per_week,
    200 AS fare_per_day,
    400 AS monthly_insurance,
    200 AS gas_per_week,
    500 AS vehicle_rent,
    40000 AS car_cost
),
base AS (
  SELECT 
    total_weeks_per_year,
    weeks_off,
    days_per_week,
    fare_per_day,
    monthly_insurance,
    gas_per_week,
    vehicle_rent,
    car_cost,
    total_weeks_per_year - weeks_off AS weeks_worked,
    (fare_per_day * days_per_week * (total_weeks_per_year - weeks_off)) AS original_annual_revenue,
    (gas_per_week * (total_weeks_per_year - weeks_off)) AS original_gas,
    (vehicle_rent * (total_weeks_per_year - weeks_off)) AS original_rent,
    (monthly_insurance * 12) AS original_insurance
  FROM vars
),
evaluate AS (
  SELECT *,
    (original_gas + original_rent + original_insurance) AS original_total_expense,
    (original_annual_revenue - (original_gas + original_rent + original_insurance)) AS original_net_income
  FROM base
),
new_costs AS (
  SELECT *,
    gas_per_week * 1.05 * weeks_worked AS new_gas,
    monthly_insurance * 0.8 * 12 AS new_insurance
  FROM evaluate
),
remaining AS (
  SELECT *,
    new_gas + new_insurance + car_cost AS new_total_expense,
    original_net_income + new_gas + new_insurance + car_cost AS required_revenue,
    required_revenue / weeks_worked AS required_weekly_revenue,
    original_annual_revenue / weeks_worked AS original_weekly_revenue
  FROM new_costs
)
SELECT 
  ROUND(required_weekly_revenue, 2) AS required_weekly_revenue,
  ROUND(required_weekly_revenue - original_weekly_revenue, 2) AS weekly_uplift
FROM remaining

To run it, wrap the query in con.execute(""" """).fetchdf(), the same way as before.
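
Working through the constants in the query: the original net income is 19,700, and the new gas (10,290), new insurance (3,840), and the 40,000 car add up to a required revenue of 73,830. Spread over 49 working weeks, that is about 1,506.73 per week, roughly 306.73 more than the original weekly revenue of 1,200.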

Here is the output.

Goal-Driven Math in DuckDB

Final Thoughts
In this article, we explored how to connect with DuckDB and analyze data. Instead of writing long Pandas functions, we used SQL queries. We also did this using a real-life data project that Uber asked in its data scientist recruitment process.
For data scientists working on analysis-heavy tasks, DuckDB is a lightweight but powerful alternative to Pandas. Try using it in your next project, especially when SQL logic fits the problem better.

Nate Rosidi

is a data scientist and in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
