Running SQL Queries in the Browser with Arrow and DuckDB
What if you could run complex SQL queries on large datasets without a server? Thanks to Apache Arrow and technologies like DuckDB compiled to WebAssembly, this is now a reality.
The Evolution of In-Browser Data Processing
The Old Way
Traditionally, data analysis required:
- Upload data to a server
- Wait for processing
- Download results
- Repeat for each query
This was slow, expensive, and raised privacy concerns.
The New Way
Modern in-browser databases let you:
- Load data once into your browser
- Run unlimited queries locally
- Get instant results
- Keep your data private
Apache Arrow: The Foundation
Apache Arrow is a columnar memory format that enables efficient data processing. It's designed for:
Zero-Copy Reads
Traditional Format         Arrow Format
┌─────────────┐            ┌──────────┐
│    Parse    │            │  Memory  │
│    Copy     │     vs     │   Map    │
│  Transform  │            │ (instant)│
└─────────────┘            └──────────┘
      Slow                     Fast
Language Interoperability
Arrow data structures work seamlessly across:
- Python (pandas, PyArrow)
- R (arrow)
- JavaScript (Arrow JS)
- Rust (arrow-rs)
- Java (Arrow Java)
Efficient Compression
Arrow supports compression schemes optimized for columnar data:
// Reading compressed Parquet as Arrow (Rust, using the `parquet` crate)
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReader;

let file = File::open("data.parquet")?;
let reader = ParquetRecordBatchReader::try_new(file, 1024)?; // 1024-row batches
for batch in reader {
    // Each item is a Result<RecordBatch>; process it in Arrow form
    process_batch(batch?)?;
}
SQL in the Browser: How It Works
Architecture Overview
User Query
↓
SQL Parser (WASM)
↓
Query Optimizer
↓
Arrow Data
↓
Result Set
Memory Management
The key to performance is keeping data in Arrow format:
- Load: Parquet → Arrow RecordBatches
- Query: Arrow → SQL Engine → Arrow
- Display: Arrow → JavaScript arrays (only visible rows)
This minimizes copying and transformation overhead.
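The display step can be sketched in plain JavaScript: a stand-in for an Arrow batch keeps its columns as arrays, and only the rows currently on screen are materialized as JS objects (the column names and windowing helper here are illustrative, not a real Arrow API):

```javascript
// Sketch only: a toy columnar batch standing in for an Arrow RecordBatch.
const batch = {
  names: ['Ada', 'Ben', 'Cy', 'Dee', 'Eli'],
  salaries: new Float64Array([95000, 72000, 88000, 61000, 79000]),
};

// Materialize only the visible window as row objects for the UI;
// everything outside the window stays in columnar form.
function visibleRows(batch, offset, limit) {
  const rows = [];
  const end = Math.min(offset + limit, batch.names.length);
  for (let i = offset; i < end; i++) {
    rows.push({ name: batch.names[i], salary: batch.salaries[i] });
  }
  return rows;
}
```

Scrolling then just calls `visibleRows` with a new offset instead of converting the whole dataset up front.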
Practical SQL Examples
Basic SELECT
SELECT name, age, salary
FROM employees
WHERE age > 30
ORDER BY salary DESC
LIMIT 10;
In traditional databases, this might scan the entire table. With columnar storage:
- Only read the name, age, and salary columns
- Push the age > 30 filter down to the scan
- Stop after finding 10 results
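These optimizations can be sketched over a toy columnar table in JavaScript; the table layout and names are illustrative, and a real engine would use a top-k heap instead of a full sort:

```javascript
// Toy columnar table: each column is its own array, as in Arrow.
// Columns the query never mentions are simply never touched.
const employees = {
  name: ['Ann', 'Bob', 'Cat', 'Dan', 'Eve'],
  age: new Int32Array([45, 28, 39, 51, 33]),
  salary: new Float64Array([120000, 65000, 98000, 135000, 87000]),
};

function topEarnersOver30(table, limit) {
  const hits = [];
  for (let i = 0; i < table.age.length; i++) {
    if (table.age[i] > 30) {      // pushdown: filter while scanning the age column
      hits.push({ name: table.name[i], age: table.age[i], salary: table.salary[i] });
    }
  }
  hits.sort((a, b) => b.salary - a.salary);  // ORDER BY salary DESC
  return hits.slice(0, limit);               // LIMIT
}
```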
Aggregations
SELECT
department,
COUNT(*) as employee_count,
AVG(salary) as avg_salary,
MAX(salary) as max_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC;
Columnar format shines for aggregations:
- SIMD operations for fast AVG/MAX calculations
- Efficient GROUP BY with dictionary encoding
- Minimal memory overhead
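The dictionary-encoding point can be sketched in JavaScript: the department column stores small integer codes into a dictionary, so the GROUP BY aggregates into a flat array indexed by code and only decodes names at the end (data and names here are made up for illustration):

```javascript
// Dictionary-encoded department column: integer codes index a dictionary.
const dict = ['Sales', 'Eng', 'HR'];
const deptCodes = new Int32Array([0, 1, 1, 0, 2, 1]);   // one code per row
const salaries = new Float64Array([60, 90, 85, 65, 55, 95]);

function avgSalaryByDept(codes, values, dictionary) {
  const sums = new Float64Array(dictionary.length);
  const counts = new Int32Array(dictionary.length);
  for (let i = 0; i < codes.length; i++) {
    sums[codes[i]] += values[i];   // aggregate by code, no string hashing per row
    counts[codes[i]] += 1;
  }
  // Decode back to department names only once, at output time.
  return dictionary.map((dept, code) => ({ dept, avg: sums[code] / counts[code] }));
}
```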
Joins
SELECT
e.name,
e.salary,
d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.budget > 1000000;
Arrow's columnar layout makes joins efficient:
- Build hash tables from compressed columns
- Filter early using Parquet statistics
- Materialize only needed columns
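A minimal hash-join sketch in JavaScript, under the same assumptions (toy columnar tables, illustrative names): build a hash table on the filtered build side, then probe with the larger side, materializing only the selected columns:

```javascript
// Toy columnar tables for the join sketch.
const departments = {
  id: [10, 20],
  department_name: ['Eng', 'Sales'],
  budget: [2000000, 500000],
};
const employees = {
  name: ['Ann', 'Bob', 'Cat'],
  salary: [120000, 65000, 98000],
  dept_id: [10, 20, 10],
};

function joinHighBudget(emp, dept, minBudget) {
  // Build phase: hash table over the smaller side, filtering before the join.
  const byId = new Map();
  for (let i = 0; i < dept.id.length; i++) {
    if (dept.budget[i] > minBudget) byId.set(dept.id[i], dept.department_name[i]);
  }
  // Probe phase: stream employee rows, keeping only the needed columns.
  const out = [];
  for (let i = 0; i < emp.name.length; i++) {
    const deptName = byId.get(emp.dept_id[i]);
    if (deptName !== undefined) {
      out.push({ name: emp.name[i], salary: emp.salary[i], department_name: deptName });
    }
  }
  return out;
}
```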
DuckDB: SQL Analytics in WASM
DuckDB is an embeddable analytical database that compiles to WebAssembly. It's perfect for browser-based analytics:
Features
- Full SQL support: Window functions, CTEs, subqueries
- Vectorized execution: SIMD-optimized operations
- Smart query planning: Automatic optimization
- Arrow integration: Native support for Arrow format
Example: DuckDB in the Browser
import * as duckdb from '@duckdb/duckdb-wasm';
// Initialize DuckDB
const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);
const worker = new Worker(bundle.mainWorker);
const db = new duckdb.AsyncDuckDB(worker);
await db.instantiate(bundle.mainModule);
// Create a connection
const conn = await db.connect();
// Register the Parquet file buffer with DuckDB's virtual filesystem
await db.registerFileBuffer('employees.parquet', buffer);
// Run SQL query
const result = await conn.query(`
SELECT department, COUNT(*) as count
FROM employees
GROUP BY department
`);
// conn.query() already returns its results as an Apache Arrow table
const arrow_result = result;
Performance Optimizations
1. Lazy Loading
Don't load all data at once:
-- Use pagination
SELECT * FROM large_table
LIMIT 100 OFFSET 0;
-- Next page
SELECT * FROM large_table
LIMIT 100 OFFSET 100;
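A small helper can generate these paged queries; note that OFFSET still skips over earlier rows, so very deep pages get slower (the table and helper names are illustrative):

```javascript
// Build a LIMIT/OFFSET query for a given page of results.
function pageQuery(table, pageSize, pageIndex) {
  const offset = pageSize * pageIndex;
  return `SELECT * FROM ${table} LIMIT ${pageSize} OFFSET ${offset}`;
}
```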
2. Column Pruning
Only select needed columns:
-- Good: Only needed columns
SELECT name, salary FROM employees;
-- Bad: Loads everything
SELECT * FROM employees;
3. Predicate Pushdown
Filter as early as possible:
-- Efficient: Filter during Parquet scan
SELECT * FROM orders
WHERE date >= '2024-01-01'
AND status = 'completed';
4. Use Appropriate Data Types
-- Efficient
CREATE TABLE sales (
id INTEGER,
amount DECIMAL(10,2),
date DATE
);
-- Inefficient
CREATE TABLE sales (
id VARCHAR, -- Wasteful for numbers
amount VARCHAR, -- Loses numeric properties
date VARCHAR -- Can't use date optimizations
);
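The size difference is easy to estimate in JavaScript. Assuming a million ids, fixed-width 32-bit integers take 4 bytes each, while JS strings cost two bytes per UTF-16 code unit in the language model (engines may optimize narrower):

```javascript
// Rough memory comparison: one million ids as 32-bit ints vs. decimal strings.
const n = 1_000_000;
const asInts = new Int32Array(n).byteLength;   // 4 bytes per id
const perString = '123456'.length * 2;         // a 6-digit id as a UTF-16 string
const asStrings = n * perString;               // triple the footprint, before overhead
```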
Real-World Use Cases
1. Data Exploration
-- Quick data profiling
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT customer_id) as unique_customers,
MIN(order_date) as first_order,
MAX(order_date) as last_order,
SUM(amount) as total_revenue
FROM orders;
2. Time Series Analysis
-- Daily revenue trends
SELECT
DATE_TRUNC('day', order_timestamp) as day,
COUNT(*) as order_count,
SUM(amount) as revenue
FROM orders
WHERE order_timestamp >= NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;
3. Complex Analytics
-- Running totals with window functions
SELECT
date,
revenue,
SUM(revenue) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_revenue
FROM daily_sales
ORDER BY date;
Limitations and Considerations
Memory Constraints
Browsers limit available memory:
- Chrome: ~2-4GB per tab
- Firefox: ~2-4GB per tab
- Safari: More restrictive
Solution: Process data in chunks or use streaming queries.
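Chunked processing can be sketched as a streaming aggregation: each batch is folded into a running total and then discarded, so only one chunk is held in memory at a time (the batch generator here synthesizes values as a stand-in for loading real RecordBatches):

```javascript
// Stand-in for "load the next RecordBatch": yields chunks of synthetic values.
function* batches(total, batchSize) {
  for (let start = 0; start < total; start += batchSize) {
    const size = Math.min(batchSize, total - start);
    yield Float64Array.from({ length: size }, (_, i) => start + i);
  }
}

// Fold each chunk into a running total; no chunk outlives its iteration.
function streamingSum(total, batchSize) {
  let sum = 0;
  for (const batch of batches(total, batchSize)) {
    for (const v of batch) sum += v;
  }
  return sum;
}
```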
File Size Limits
Loading huge files can be slow:
- Reading 1GB+ files takes time
- Consider splitting data into smaller partitions
Solution: Use partitioned Parquet files and load only needed partitions.
Browser Compatibility
WASM and SharedArrayBuffer requirements:
- Modern browsers: Full support
- Older browsers: Limited or no support
Solution: Provide fallback or progressive enhancement.
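Progressive enhancement starts with feature detection; a minimal check for WebAssembly support before attempting to load the engine might look like this:

```javascript
// Detect WebAssembly support before loading a WASM-backed database.
function wasmSupported() {
  return typeof WebAssembly === 'object' &&
         typeof WebAssembly.instantiate === 'function';
}
```

If the check fails, the app can fall back to a server-side endpoint or show a static view instead of failing at engine load time.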
Best Practices
1. Design for the Client
Structure your queries for browser execution:
-- Good: Filters early, limits results
SELECT name, amount
FROM transactions
WHERE date = '2024-11-10'
LIMIT 1000;
-- Risky: Could return millions of rows
SELECT * FROM transactions;
2. Provide Feedback
Long-running queries need progress indication:
// Show loading state
setLoading(true);
try {
const result = await runQuery(sql);
displayResults(result);
} finally {
setLoading(false);
}
3. Handle Errors Gracefully
try {
const result = await conn.query(userSQL);
return result;
} catch (error) {
if (error.message.includes('syntax')) {
showError('Invalid SQL syntax');
} else if (error.message.includes('memory')) {
showError('Query too large for available memory');
} else {
showError('Query failed: ' + error.message);
}
}
The Future
Exciting developments ahead:
Improved Performance
- Better SIMD support in browsers
- More efficient WASM memory management
- Streaming query execution
Enhanced Features
- More SQL functions and operators
- Better join algorithms
- Advanced analytics functions
Easier Integration
- Simplified APIs
- Better TypeScript support
- Framework-specific packages (React, Vue, etc.)
Conclusion
Running SQL queries in the browser opens up new possibilities for data analysis tools. With Apache Arrow's efficient memory format and databases like DuckDB compiled to WebAssembly, we can build powerful, privacy-preserving analytics applications that run entirely client-side.
Try it yourself with Parquet Tools – load your Parquet files and start querying immediately, no server required!