Query any data file using SQL. One command, instant results.
Quick Start • Features • Installation • Usage • LLM Integration • Documentation
DataQL is a CLI tool developed in Go that allows you to query and manipulate data files using SQL statements. It loads data into a DuckDB database (in-memory or file-based) with automatic type inference, enabling powerful SQL operations optimized for analytical queries.
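"Automatic type inference" here means sampling values and picking the narrowest SQL type that fits all of them. A simplified sketch of the idea in Python (illustrative only — DuckDB's real sniffer also handles dates, booleans, nulls, and much more):

```python
def infer_type(values: list[str]) -> str:
    """Pick the narrowest SQL type that every sampled value fits."""
    for caster, sql_type in ((int, "BIGINT"), (float, "DOUBLE")):
        try:
            for v in values:
                caster(v)
            return sql_type
        except ValueError:
            continue  # at least one value didn't fit; try a wider type
    return "VARCHAR"  # fallback: anything is representable as text

print(infer_type(["1", "2", "3"]))   # BIGINT
print(infer_type(["1.5", "2"]))      # DOUBLE
print(infer_type(["abc", "2"]))      # VARCHAR
```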
Working with data files has always been tedious. You either write throwaway scripts, load everything into pandas, or copy-paste into spreadsheets. With LLMs entering the workflow, a new problem emerged: how do you analyze a 10MB CSV without burning through your entire context window?
Traditional approaches fail:
- Send file to LLM context: 10MB CSV = ~100,000+ tokens. Expensive, slow, often impossible.
- Write a script: Context switch, setup overhead, not conversational.
- Use pandas/Excel: Great for humans, useless for LLM automation.
DataQL lets you query any data file using SQL. One command, instant results:
# Instead of sending 50,000 rows to an LLM...
dataql run -f sales.csv -q "SELECT region, SUM(revenue) FROM sales GROUP BY region"
# You get just what you need:
# region | SUM(revenue)
# North | 1,234,567
# South  | 987,654

| Scenario | Without DataQL | With DataQL |
|---|---|---|
| Analyze 10MB CSV with LLM | ~100,000 tokens ($3+) | ~500 tokens ($0.01) |
| Query data from S3 | Download → Script → Parse | One command |
| Join CSV + JSON + Database | Custom ETL pipeline | Single SQL query |
| Automate data reports | Complex scripts | Simple CLI + cron |
| LLM data analysis | Context overflow | No size limit |
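The rough arithmetic behind the token numbers above, using the common ~4-characters-per-token rule of thumb (an approximation; real tokenizers vary by model):

```python
def estimate_tokens(text: str) -> int:
    # Rule-of-thumb estimate: roughly 4 characters per token.
    return max(1, len(text) // 4)

# A 50,000-row CSV versus the aggregated result an LLM actually needs.
raw_csv = "region,revenue\n" + "\n".join(f"North,{i}" for i in range(50_000))
query_result = "region | SUM(revenue)\nNorth | 1,234,567\nSouth | 987,654\n"

print(f"raw file:     ~{estimate_tokens(raw_csv):,} tokens")
print(f"query result: ~{estimate_tokens(query_result):,} tokens")
```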
- Token Efficient: LLMs get query results, not raw data. 99% reduction in token usage.
- Universal Format Support: CSV, JSON, Parquet, Excel, XML, YAML, Avro, ORC - all queryable with SQL.
- Any Data Source: Local files, URLs, S3, GCS, Azure, PostgreSQL, MySQL, MongoDB.
- LLM-Native: Built-in MCP server for Claude, Codex, Gemini. Skills for Claude Code.
- Zero Setup: Single binary, no dependencies, no configuration files.
- Familiar Syntax: If you know SQL, you know DataQL.
# Install DataQL
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash
# Query a CSV file
dataql run -f data.csv -q "SELECT * FROM data WHERE amount > 100"
# Query JSON from a URL
dataql run -f "https://api.example.com/data.json" -q "SELECT name, value FROM data"
# Query data from S3
dataql run -f "s3://bucket/data.parquet" -q "SELECT * FROM data LIMIT 10"
# Export results to JSON
dataql run -f data.csv -q "SELECT * FROM data" -e output.json -t json
# Interactive REPL mode
dataql run -f data.csv
# dataql> SELECT COUNT(*) FROM data;
# dataql> .tables
# dataql> .exit

Supported File Formats:
- CSV (with configurable delimiter)
- JSON (arrays or single objects)
- JSONL/NDJSON (newline-delimited JSON)
- XML
- YAML
- Parquet
- Excel (.xlsx, .xls)
- Avro
- ORC
Data Sources:
- Local files
- HTTP/HTTPS URLs
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
- Standard input (stdin)
- Message Queues (SQS, Kafka, RabbitMQ - peek without consuming)
Database Connectors:
- PostgreSQL
- MySQL
- DuckDB
- MongoDB
- DynamoDB
Key Capabilities:
- Execute SQL queries using DuckDB syntax (OLAP-optimized)
- Export results to CSV, JSONL, JSON, Excel, Parquet, XML, YAML formats
- Interactive REPL mode with command history
- Progress bar for large file operations
- Parallel file processing for multiple inputs
- Automatic flattening of nested JSON objects
- Join data from multiple sources
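The "automatic flattening of nested JSON objects" capability joins nested keys with underscores, as the JSON examples later in this README show (`{"user": {"address": {"city": ...}}}` becomes the column `user_address_city`). A minimal Python sketch of that idea — not DataQL's actual implementation:

```python
def flatten(obj: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested dicts into underscore-joined column names."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))  # recurse into nesting
        else:
            items[new_key] = value
    return items

row = {"user": {"name": "John", "address": {"city": "NYC"}}}
print(flatten(row))  # {'user_name': 'John', 'user_address_city': 'NYC'}
```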
LLM Integration:
- MCP Server for Claude Code, OpenAI Codex, Google Gemini
- Auto-activating Claude Code Skills
- Token-efficient data processing for AI assistants
DataQL is designed for efficient use with Large Language Models, enabling AI assistants to query large datasets without loading entire files into context.
# Install skills for Claude Code
dataql skills install
# Or start MCP server for any LLM
dataql mcp serve

Why use DataQL with LLMs?
| Traditional Approach | With DataQL |
|---|---|
| Send 10MB CSV to context | Run SQL query |
| ~100,000+ tokens | ~500 tokens |
| Limited by context window | No file size limit |
See LLM Integration Guide for complete documentation.
Linux / macOS:
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash

Windows (PowerShell):
irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.ps1 | iex

Specific version:
Linux / macOS:
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --version v1.0.0

Windows (PowerShell):
$env:DATAQL_VERSION="v1.0.0"; irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.ps1 | iex

User installation (no sudo/admin required):
Linux / macOS:
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --local

Windows (PowerShell):
$env:DATAQL_USER_INSTALL="true"; irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.ps1 | iex

Build from source:

# Clone the repository
git clone https://github.com/adrianolaselva/dataql.git
cd dataql
# Build and install
make build
make install # requires sudo
# or
make install-local # installs to ~/.local/bin

Verify the installation:

dataql --version

Upgrade to latest version:
Linux / macOS:
# Only upgrades if a newer version is available
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --upgrade
# Force reinstall (same or different version)
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --force
# Clean install (remove all versions first, then install)
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --clean --force

Windows (PowerShell):
# Force reinstall
$env:DATAQL_FORCE="true"; irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.ps1 | iex

Uninstall:

Linux / macOS:
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/uninstall.sh | bash

Windows (PowerShell):
irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/uninstall.ps1 | iex

Load a data file and start interactive mode (format is auto-detected):
# CSV file
dataql run -f data.csv -d ","
# JSON file (array or single object)
dataql run -f data.json
# JSONL/NDJSON file (one JSON per line)
dataql run -f data.jsonl

| Format | Extensions | Description |
|---|---|---|
| CSV | `.csv` | Comma-separated values with configurable delimiter |
| JSON | `.json` | JSON arrays or single objects |
| JSONL | `.jsonl`, `.ndjson` | Newline-delimited JSON (streaming) |
| XML | `.xml` | XML documents |
| YAML | `.yaml`, `.yml` | YAML documents |
| Parquet | `.parquet` | Apache Parquet columnar format |
| Excel | `.xlsx`, `.xls` | Microsoft Excel spreadsheets |
| Avro | `.avro` | Apache Avro format |
| ORC | `.orc` | Apache ORC format |
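Format auto-detection presumably keys on the file extension. A hypothetical dispatch table covering the formats above — DataQL's real detection logic may differ (for example, by also sniffing file content):

```python
from pathlib import Path

# Hypothetical extension-to-format mapping, mirroring the table above.
EXTENSION_FORMATS = {
    ".csv": "csv", ".json": "json", ".jsonl": "jsonl", ".ndjson": "jsonl",
    ".xml": "xml", ".yaml": "yaml", ".yml": "yaml", ".parquet": "parquet",
    ".xlsx": "excel", ".xls": "excel", ".avro": "avro", ".orc": "orc",
}

def detect_format(path: str) -> str:
    ext = Path(path).suffix.lower()
    try:
        return EXTENSION_FORMATS[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext!r}")

print(detect_format("data.csv"))                    # csv
print(detect_format("s3://bucket/events.ndjson"))   # jsonl
```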
| Source | Format | Example |
|---|---|---|
| Local file | Path | `-f data.csv` |
| HTTP/HTTPS | URL | `-f "https://example.com/data.csv"` |
| Amazon S3 | `s3://` | `-f "s3://bucket/path/data.csv"` |
| Google Cloud Storage | `gs://` | `-f "gs://bucket/path/data.json"` |
| Azure Blob | `az://` | `-f "az://container/path/data.parquet"` |
| Standard input | `-` | `cat data.csv \| dataql run -f -` |
| PostgreSQL | `postgres://` | `-f "postgres://user:pass@host/db?table=t"` |
| MySQL | `mysql://` | `-f "mysql://user:pass@host/db?table=t"` |
| DuckDB | `duckdb://` | `-f "duckdb:///path/db.db?table=t"` |
| MongoDB | `mongodb://` | `-f "mongodb://host/db?collection=c"` |
| DynamoDB | `dynamodb://` | `-f "dynamodb://region/table-name"` |
| Flag | Short | Description | Default |
|---|---|---|---|
| `--file` | `-f` | Input file, URL, or database connection | Required |
| `--delimiter` | `-d` | CSV delimiter (only for CSV files) | `,` |
| `--query` | `-q` | SQL query to execute | - |
| `--export` | `-e` | Export path | - |
| `--type` | `-t` | Export format (csv, jsonl, json, excel, parquet, xml, yaml) | - |
| `--storage` | `-s` | DuckDB file path (for persistence) | In-memory |
| `--lines` | `-l` | Limit number of lines/records to read | All |
| `--collection` | `-c` | Custom table name | Filename |
Interactive Mode:
dataql run -f sales.csv -d ";"

dataql> SELECT product, SUM(amount) as total FROM sales GROUP BY product ORDER BY total DESC LIMIT 10;
product total
Widget Pro 125430.50
Gadget Plus 98210.00
...
Execute Query and Display Results:
dataql run -f data.csv -d "," -q "SELECT * FROM data WHERE amount > 100 LIMIT 10"

Export to JSONL:
dataql run -f input.csv -d "," \
-q "SELECT id, name, value FROM input WHERE status = 'active'" \
  -e output.jsonl -t jsonl

Export to CSV:
dataql run -f input.csv -d "," \
-q "SELECT * FROM input" \
  -e output.csv -t csv

Multiple Input Files:
dataql run -f file1.csv -f file2.csv -d "," \
  -q "SELECT a.*, b.extra FROM file1 a JOIN file2 b ON a.id = b.id"

Query JSON Files:
# JSON array
dataql run -f users.json -q "SELECT name, email FROM users WHERE status = 'active'"
# JSON with nested objects (automatically flattened)
# {"user": {"name": "John", "address": {"city": "NYC"}}}
# becomes columns: user_name, user_address_city
dataql run -f data.json -q "SELECT user_name, user_address_city FROM data"

Query JSONL/NDJSON Files:
# JSONL is ideal for large datasets (streaming, low memory)
dataql run -f logs.jsonl -q "SELECT level, message, timestamp FROM logs WHERE level = 'ERROR'"
# Works with .ndjson extension too
dataql run -f events.ndjson -q "SELECT COUNT(*) as total FROM events"

Custom Table Name:
# Use --collection to specify a custom table name
dataql run -f data.json -c my_table -q "SELECT * FROM my_table"

Persist to DuckDB File:
dataql run -f data.csv -d "," -s ./database.duckdb

Query from URL:
dataql run -f "https://raw.githubusercontent.com/datasets/population/main/data/population.csv" \
  -q "SELECT Country_Name, Value FROM population WHERE Year = 2020 LIMIT 10"

Query from S3:
dataql run -f "s3://my-bucket/data/sales.csv" \
  -q "SELECT product, SUM(amount) as total FROM sales GROUP BY product"

Query from PostgreSQL:
dataql run -f "postgres://user:pass@localhost:5432/mydb?table=orders" \
  -q "SELECT * FROM orders WHERE status = 'completed'"

Peek at SQS messages (without consuming):
dataql run -f "sqs://my-events-queue?region=us-east-1" \
  -q "SELECT message_id, body_event_type, timestamp FROM my_events_queue"

Read from stdin:
cat data.csv | dataql run -f - -q "SELECT * FROM stdin_data WHERE value > 100"

# Download sample data
wget https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2021-financial-year-provisional/Download-data/annual-enterprise-survey-2021-financial-year-provisional-csv.csv -O survey.csv
# Query and export
dataql run -f survey.csv -d "," \
-q "SELECT Year, Industry_aggregation_NZSIOC as industry, Variable_name as metric, Value as amount FROM survey WHERE Value > 1000" \
  -e analysis.jsonl -t jsonl

DataQL uses DuckDB under the hood, supporting standard SQL syntax optimized for analytical queries:
-- Basic SELECT
SELECT column1, column2 FROM tablename;
-- Filtering
SELECT * FROM data WHERE amount > 100 AND status = 'active';
-- Aggregation
SELECT category, COUNT(*), SUM(value) FROM data GROUP BY category;
-- Joins (multiple files)
SELECT a.*, b.extra FROM file1 a JOIN file2 b ON a.id = b.id;
-- Ordering and Limiting
SELECT * FROM data ORDER BY created_at DESC LIMIT 100;

Note: Table names are derived from filenames (without extension). For `sales.csv`, `sales.json`, or `sales.jsonl`, use `SELECT * FROM sales`. Use the `--collection` flag to specify a custom table name.
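The derivation rule described in the note — filename minus directory and extension — can be sketched in a couple of lines of Python (illustrative only, not DataQL's actual code):

```python
from pathlib import Path

def default_table_name(source: str) -> str:
    # Filename without directory or extension: "sales.csv" -> "sales".
    return Path(source).stem

print(default_table_name("/exports/sales.csv"))             # sales
print(default_table_name("s3://bucket/path/events.jsonl"))  # events
```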
For detailed documentation, see:
- Getting Started - Installation and Hello World examples
- Architecture - System architecture and design diagrams
- CLI Reference - Complete command-line reference
- Data Sources - Working with S3, GCS, Azure, URLs, and stdin
- Database Connections - Connect to PostgreSQL, MySQL, DuckDB, MongoDB
- LLM Integration - Use DataQL with Claude, Codex, Gemini
- MCP Setup - Configure MCP server for LLM integration
- Examples - Real-world usage examples and automation scripts
- Go 1.24 or higher
- GCC (for CGO compilation - required for DuckDB)
- Docker and Docker Compose (for E2E tests)
make build

# Unit tests
make test
# E2E tests (requires Docker)
make e2e-up # Start infrastructure (PostgreSQL, MySQL, MongoDB, Kafka, LocalStack)
make e2e-wait # Wait for services to be healthy
make e2e-test-scripts # Run all E2E tests
make e2e-down # Stop infrastructure

DataQL includes comprehensive E2E tests for all data sources:
| Data Source | Tests | Coverage |
|---|---|---|
| PostgreSQL | 26 | SELECT, WHERE, ORDER BY, LIMIT, aggregates, exports |
| MySQL | 26 | SELECT, WHERE, ORDER BY, LIMIT, aggregates, exports |
| MongoDB | 20+ | Collections, queries, filters, exports |
| Kafka | 10+ | Peek mode, message parsing, exports |
| S3 (LocalStack) | 13 | CSV, JSON, JSONL file reading, queries, exports |
| SQS (LocalStack) | 16 | Message reading, filtering, aggregation, exports |
See e2e/README.md for detailed E2E testing documentation.
make lint

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
This is a rewrite of csvql, an earlier experiment I did back in 2019. The original was simple and limited. This version? Built entirely with AI assistance (Claude Code). I wanted to see how far AI-assisted development could go, and honestly, it went pretty far. The code, docs, tests - all of it came from conversations with an AI. Make of that what you will.
This project is licensed under the MIT License - see the LICENSE file for details.
- csvql - The original project that inspired this rewrite
- Claude Code - AI assistant that helped build this entire project
- DuckDB - Embedded analytical database engine
- Cobra - CLI framework
- go-duckdb - DuckDB driver for Go
