Running SQL Queries in the Browser with Arrow and DuckDB
What if you could run complex SQL queries on large datasets without a server? Thanks to Apache Arrow and technologies like DuckDB compiled to WebAssembly, this is now a reality.
The Evolution of In-Browser Data Processing
The Old Way
Traditionally, data analysis required:
- Upload data to a server
- Wait for processing
- Download results
- Repeat for each query
This was slow, expensive, and raised privacy concerns.
The New Way
Modern in-browser databases let you:
- Load data once into your browser
- Run unlimited queries locally
- Get instant results
- Keep your data private
Apache Arrow: The Foundation
Apache Arrow is a columnar memory format that enables efficient data processing. It's designed for:
Zero-Copy Reads
Traditional Format         Arrow Format
┌─────────────┐            ┌──────────┐
│    Parse    │            │  Memory  │
│    Copy     │     vs     │   Map    │
│  Transform  │            │ (instant)│
└─────────────┘            └──────────┘
      Slow                     Fast
Language Interoperability
Arrow data structures work seamlessly across:
- Python (pandas, PyArrow)
- R (arrow)
- JavaScript (Arrow JS)
- Rust (arrow-rs)
- Java (Arrow Java)
Efficient Compression
Arrow supports compression schemes optimized for columnar data:
// Reading compressed Parquet as Arrow (Rust, using the `parquet` crate)
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReader;

let file = File::open("data.parquet")?;
let reader = ParquetRecordBatchReader::try_new(file, 1024)?; // 1024-row batches
for batch in reader {
    // Each item is a Result<RecordBatch>; process it in Arrow form
    process_batch(batch?)?;
}
SQL in the Browser: How It Works
Architecture Overview
User Query
↓
SQL Parser (WASM)
↓
Query Optimizer
↓
Arrow Data
↓
Result Set
Memory Management
The key to performance is keeping data in Arrow format:
- Load: Parquet → Arrow RecordBatches
- Query: Arrow → SQL Engine → Arrow
- Display: Arrow → JavaScript arrays (only visible rows)
This minimizes copying and transformation overhead.
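The display step can be sketched in plain JavaScript: a stand-in for an Arrow batch keeps its columns as arrays, and only the rows currently on screen are materialized as JS objects (the column names and windowing helper here are illustrative, not a real Arrow API):

```javascript
// Sketch only: a toy columnar batch standing in for an Arrow RecordBatch.
const batch = {
  names: ['Ada', 'Ben', 'Cy', 'Dee', 'Eli'],
  salaries: new Float64Array([95000, 72000, 88000, 61000, 79000]),
};

// Materialize only the visible window as row objects for the UI;
// everything outside the window stays in columnar form.
function visibleRows(batch, offset, limit) {
  const rows = [];
  const end = Math.min(offset + limit, batch.names.length);
  for (let i = offset; i < end; i++) {
    rows.push({ name: batch.names[i], salary: batch.salaries[i] });
  }
  return rows;
}
```

Scrolling then just calls `visibleRows` with a new offset instead of converting the whole dataset up front.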
Practical SQL Examples
Basic SELECT
SELECT name, age, salary
FROM employees
WHERE age > 30
ORDER BY salary DESC
LIMIT 10;
In traditional databases, this might scan the entire table. With columnar storage:
- Only read the name, age, and salary columns
- Push the age > 30 filter down to the scan
- Stop after finding 10 results
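These optimizations can be sketched over a toy columnar table in JavaScript; the table layout and names are illustrative, and a real engine would use a top-k heap instead of a full sort:

```javascript
// Toy columnar table: each column is its own array, as in Arrow.
// Columns the query never mentions are simply never touched.
const employees = {
  name: ['Ann', 'Bob', 'Cat', 'Dan', 'Eve'],
  age: new Int32Array([45, 28, 39, 51, 33]),
  salary: new Float64Array([120000, 65000, 98000, 135000, 87000]),
};

function topEarnersOver30(table, limit) {
  const hits = [];
  for (let i = 0; i < table.age.length; i++) {
    if (table.age[i] > 30) {      // pushdown: filter while scanning the age column
      hits.push({ name: table.name[i], age: table.age[i], salary: table.salary[i] });
    }
  }
  hits.sort((a, b) => b.salary - a.salary);  // ORDER BY salary DESC
  return hits.slice(0, limit);               // LIMIT
}
```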
Aggregations
SELECT
department,
COUNT(*) as employee_count,
AVG(salary) as avg_salary,
MAX(salary) as max_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC;
Columnar format shines for aggregations:
- SIMD operations for fast AVG/MAX calculations
- Efficient GROUP BY with dictionary encoding
- Minimal memory overhead
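The dictionary-encoding point can be sketched in JavaScript: the department column stores small integer codes into a dictionary, so the GROUP BY aggregates into a flat array indexed by code and only decodes names at the end (data and names here are made up for illustration):

```javascript
// Dictionary-encoded department column: integer codes index a dictionary.
const dict = ['Sales', 'Eng', 'HR'];
const deptCodes = new Int32Array([0, 1, 1, 0, 2, 1]);   // one code per row
const salaries = new Float64Array([60, 90, 85, 65, 55, 95]);

function avgSalaryByDept(codes, values, dictionary) {
  const sums = new Float64Array(dictionary.length);
  const counts = new Int32Array(dictionary.length);
  for (let i = 0; i < codes.length; i++) {
    sums[codes[i]] += values[i];   // aggregate by code, no string hashing per row
    counts[codes[i]] += 1;
  }
  // Decode back to department names only once, at output time.
  return dictionary.map((dept, code) => ({ dept, avg: sums[code] / counts[code] }));
}
```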
Joins
SELECT
e.name,
e.salary,
d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.budget > 1000000;
Arrow's columnar layout makes joins efficient:
- Build hash tables from compressed columns
- Filter early using Parquet statistics
- Materialize only needed columns
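A minimal hash-join sketch in JavaScript, under the same assumptions (toy columnar tables, illustrative names): build a hash table on the filtered build side, then probe with the larger side, materializing only the selected columns:

```javascript
// Toy columnar tables for the join sketch.
const departments = {
  id: [10, 20],
  department_name: ['Eng', 'Sales'],
  budget: [2000000, 500000],
};
const employees = {
  name: ['Ann', 'Bob', 'Cat'],
  salary: [120000, 65000, 98000],
  dept_id: [10, 20, 10],
};

function joinHighBudget(emp, dept, minBudget) {
  // Build phase: hash table over the smaller side, filtering before the join.
  const byId = new Map();
  for (let i = 0; i < dept.id.length; i++) {
    if (dept.budget[i] > minBudget) byId.set(dept.id[i], dept.department_name[i]);
  }
  // Probe phase: stream employee rows, keeping only the needed columns.
  const out = [];
  for (let i = 0; i < emp.name.length; i++) {
    const deptName = byId.get(emp.dept_id[i]);
    if (deptName !== undefined) {
      out.push({ name: emp.name[i], salary: emp.salary[i], department_name: deptName });
    }
  }
  return out;
}
```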
DuckDB: SQL Analytics in WASM
DuckDB is an embeddable analytical database that compiles to WebAssembly. It's perfect for browser-based analytics:
Features
- Full SQL support: Window functions, CTEs, subqueries
- Vectorized execution: SIMD-optimized operations
- Smart query planning: Automatic optimization
- Arrow integration: Native support for Arrow format
Example: DuckDB in the Browser
import * as duckdb from '@duckdb/duckdb-wasm';
// Initialize DuckDB
const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);
const worker = new Worker(bundle.mainWorker);
const db = new duckdb.AsyncDuckDB(worker);
await db.instantiate(bundle.mainModule);
// Create a connection
const conn = await db.connect();
// Register the Parquet file buffer with DuckDB's virtual filesystem
await db.registerFileBuffer('employees.parquet', buffer);
// Run SQL query
const result = await conn.query(`
SELECT department, COUNT(*) as count
FROM employees
GROUP BY department
`);
// conn.query() already returns its results as an Apache Arrow table
const arrow_result = result;
Performance Optimizations
1. Lazy Loading
Don't load all data at once:
-- Use pagination
SELECT * FROM large_table
LIMIT 100 OFFSET 0;
-- Next page
SELECT * FROM large_table
LIMIT 100 OFFSET 100;
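A small helper can generate these paged queries; note that OFFSET still skips over earlier rows, so very deep pages get slower (the table and helper names are illustrative):

```javascript
// Build a LIMIT/OFFSET query for a given page of results.
function pageQuery(table, pageSize, pageIndex) {
  const offset = pageSize * pageIndex;
  return `SELECT * FROM ${table} LIMIT ${pageSize} OFFSET ${offset}`;
}
```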
2. Column Pruning
Only select needed columns:
-- Good: Only needed columns
SELECT name, salary FROM employees;
-- Bad: Loads everything
SELECT * FROM employees;
3. Predicate Pushdown
Filter as early as possible:
-- Efficient: Filter during Parquet scan
SELECT * FROM orders
WHERE date >= '2024-01-01'
AND status = 'completed';
4. Use Appropriate Data Types
-- Efficient
CREATE TABLE sales (
id INTEGER,
amount DECIMAL(10,2),
date DATE
);
-- Inefficient
CREATE TABLE sales (
id VARCHAR, -- Wasteful for numbers
amount VARCHAR, -- Loses numeric properties
date VARCHAR -- Can't use date optimizations
);
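The size difference is easy to estimate in JavaScript. Assuming a million ids, fixed-width 32-bit integers take 4 bytes each, while JS strings cost two bytes per UTF-16 code unit in the language model (engines may optimize narrower):

```javascript
// Rough memory comparison: one million ids as 32-bit ints vs. decimal strings.
const n = 1_000_000;
const asInts = new Int32Array(n).byteLength;   // 4 bytes per id
const perString = '123456'.length * 2;         // a 6-digit id as a UTF-16 string
const asStrings = n * perString;               // triple the footprint, before overhead
```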
Real-World Use Cases
1. Data Exploration
-- Quick data profiling
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT customer_id) as unique_customers,
MIN(order_date) as first_order,
MAX(order_date) as last_order,
SUM(amount) as total_revenue
FROM orders;
2. Time Series Analysis
-- Daily revenue trends
SELECT
DATE_TRUNC('day', order_timestamp) as day,
COUNT(*) as order_count,
SUM(amount) as revenue
FROM orders
WHERE order_timestamp >= NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;
3. Complex Analytics
-- Running totals with window functions
SELECT
date,
revenue,
SUM(revenue) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_revenue
FROM daily_sales
ORDER BY date;
Limitations and Considerations
Memory Constraints
Browsers limit available memory:
- Chrome: ~2-4GB per tab
- Firefox: ~2-4GB per tab
- Safari: More restrictive
Solution: Process data in chunks or use streaming queries.
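Chunked processing can be sketched as a streaming aggregation: each batch is folded into a running total and then discarded, so only one chunk is held in memory at a time (the batch generator here synthesizes values as a stand-in for loading real RecordBatches):

```javascript
// Stand-in for "load the next RecordBatch": yields chunks of synthetic values.
function* batches(total, batchSize) {
  for (let start = 0; start < total; start += batchSize) {
    const size = Math.min(batchSize, total - start);
    yield Float64Array.from({ length: size }, (_, i) => start + i);
  }
}

// Fold each chunk into a running total; no chunk outlives its iteration.
function streamingSum(total, batchSize) {
  let sum = 0;
  for (const batch of batches(total, batchSize)) {
    for (const v of batch) sum += v;
  }
  return sum;
}
```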
File Size Limits
Loading huge files can be slow:
- Reading 1GB+ files takes time
- Consider splitting data into smaller partitions
Solution: Use partitioned Parquet files and load only needed partitions.
Browser Compatibility
WASM and SharedArrayBuffer requirements:
- Modern browsers: Full support
- Older browsers: Limited or no support
Solution: Provide fallback or progressive enhancement.
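Progressive enhancement starts with feature detection; a minimal check for WebAssembly support before attempting to load the engine might look like this:

```javascript
// Detect WebAssembly support before loading a WASM-backed database.
function wasmSupported() {
  return typeof WebAssembly === 'object' &&
         typeof WebAssembly.instantiate === 'function';
}
```

If the check fails, the app can fall back to a server-side endpoint or show a static view instead of failing at engine load time.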
Best Practices
1. Design for the Client
Structure your queries for browser execution:
-- Good: Filters early, limits results
SELECT name, amount
FROM transactions
WHERE date = '2024-11-10'
LIMIT 1000;
-- Risky: Could return millions of rows
SELECT * FROM transactions;
2. Provide Feedback
Long-running queries need progress indication:
// Show loading state
setLoading(true);
try {
const result = await runQuery(sql);
displayResults(result);
} finally {
setLoading(false);
}
3. Handle Errors Gracefully
try {
const result = await conn.query(userSQL);
return result;
} catch (error) {
if (error.message.includes('syntax')) {
showError('Invalid SQL syntax');
} else if (error.message.includes('memory')) {
showError('Query too large for available memory');
} else {
showError('Query failed: ' + error.message);
}
}
The Future
Exciting developments ahead:
Improved Performance
- Better SIMD support in browsers
- More efficient WASM memory management
- Streaming query execution
Enhanced Features
- More SQL functions and operators
- Better join algorithms
- Advanced analytics functions
Easier Integration
- Simplified APIs
- Better TypeScript support
- Framework-specific packages (React, Vue, etc.)
Conclusion
Running SQL queries in the browser opens up new possibilities for data analysis tools. With Apache Arrow's efficient memory format and databases like DuckDB compiled to WebAssembly, we can build powerful, privacy-preserving analytics applications that run entirely client-side.
Try it yourself with Parquet Tools – load your Parquet files and start querying immediately, no server required!