dt-cli-tools

CLI tools for viewing, filtering, and comparing tabular data files
Log | Files | Refs | README | LICENSE

commit a0e0a2972a6c6acac7789b432d28cc058f411d22
Author: Erik Loualiche <eloualic@umn.edu>
Date:   Mon, 30 Mar 2026 22:05:02 -0500

docs: add dt-cli-tools design spec

Multi-format data CLI suite (CSV, Parquet, Arrow, JSON, Excel, DTA).
v0.1 ships dtcat, dtfilter, dtdiff as read-only tools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
Adocs/superpowers/specs/2026-03-30-dt-cli-tools-design.md | 177+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 177 insertions(+), 0 deletions(-)

diff --git a/docs/superpowers/specs/2026-03-30-dt-cli-tools-design.md b/docs/superpowers/specs/2026-03-30-dt-cli-tools-design.md @@ -0,0 +1,177 @@ +# dt-cli-tools Design Spec + +**Date:** 2026-03-30 +**Status:** Approved + +## Summary + +A new Rust CLI tool suite for inspecting, querying, and comparing tabular data files across formats. Three read-only tools ship in v0.1: `dtcat`, `dtfilter`, `dtdiff`. A write tool (`dtset`) is planned for v1.0. + +The project reuses the format-agnostic modules from xl-cli-tools (formatter, filter, diff) and adds a multi-format reader layer with automatic format detection. + +## Supported Formats + +| Format | Extensions | Magic Bytes | Crate | Feature Flag | +|--------|-----------|-------------|-------|-------------| +| CSV | `.csv`, `.tsv`, `.tab` | Heuristic (text + delimiters) | polars | default | +| Parquet | `.parquet`, `.pq` | `PAR1` (4 bytes) | polars | default | +| Arrow/Feather | `.arrow`, `.feather`, `.ipc` | `ARROW1` (6 bytes) | polars | default | +| JSON/NDJSON | `.json`, `.ndjson`, `.jsonl` | `[` or `{` at start | polars | default | +| Excel | `.xlsx`, `.xls`, `.xlsb`, `.ods` | ZIP (`PK`) or OLE (`D0 CF`) | calamine | default | +| DTA (Stata) | `.dta` | Version byte (`0x71`-`0x77`, `0x117`-`0x119`) | readstat | `dta` | + +Detection priority: `--format` flag > magic bytes > extension > error. + +CSV delimiter detection: auto-detect comma vs tab vs semicolon by sampling the first few lines. + +## Project Structure + +``` +dt-cli-tools/ + Cargo.toml + src/ + lib.rs + format.rs # format detection (magic bytes + extension) + reader.rs # Format enum, ReadOptions, read_file dispatch + formatter.rs # ported from xl-cli-tools + filter.rs # ported from xl-cli-tools (letter-based column resolution removed) + diff.rs # ported from xl-cli-tools + metadata.rs # format-aware file inspection + readers/ + csv.rs + parquet.rs + arrow.rs + json.rs + excel.rs + dta.rs # behind feature flag + src/bin/ + dtcat.rs + dtfilter.rs + dtdiff.rs + tests/ + integration/ + demo/ +``` + +Library crate name: `dtcore`. + +## Format Detection and Reading + +```rust +pub enum Format { + Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel, Dta, +} + +pub struct ReadOptions { + pub sheet: Option<String>, // Excel only + pub skip_rows: Option<usize>, + pub separator: Option<u8>, // CSV override +} + +pub fn detect_format(path: &Path, override_fmt: Option<&str>) -> Result<Format>; +pub fn read_file(path: &Path, format: Format, opts: &ReadOptions) -> Result<DataFrame>; +``` + +Format dispatch uses a match on `Format` rather than a trait with dynamic dispatch. Each reader module exposes a single `read(path, opts) -> Result<DataFrame>` function. + +## Binary Interfaces + +### dtcat + +``` +dtcat <FILE> [OPTIONS] + +Options: + --format <FMT> Override format detection + --sheet <NAME|INDEX> Select sheet (Excel only) + --skip <N> Skip first N rows + --schema Show column names and types only + --describe Show summary statistics + --head <N> Show first N rows (default: 50) + --tail <N> Show last N rows + --csv Output as CSV instead of markdown table + --info Show file metadata (size, format, shape, sheets) +``` + +### dtfilter + +``` +dtfilter <FILE> [OPTIONS] + +Options: + --format <FMT> Override format detection + --sheet <NAME|INDEX> Select sheet (Excel only) + --skip <N> Skip first N rows + --filter <EXPR>... Filter expressions (Amount>1000, Name~john) + --sort <SPEC>... Sort specifications (Amount:desc, Name:asc) + --columns <COLS> Select columns by name (comma-separated) + --head <N> First N rows (applied before filter) + --tail <N> Last N rows (applied before filter) + --limit <N> Max rows in output (applied after filter) + --csv Output as CSV +``` + +Column selection by name only. No letter-based addressing. + +### dtdiff + +``` +dtdiff <FILE_A> <FILE_B> [OPTIONS] + +Options: + --format <FMT> Override format detection (both files must match format) + --sheet <NAME|INDEX> Select sheet (Excel only) + --key <COL>... Key columns for matched comparison + --tolerance <N> Float comparison tolerance (default: 1e-10) + --json Output as JSON + --csv Output as CSV +``` + +Same-format only: errors if the detected formats of FILE_A and FILE_B differ. CSV and TSV are treated as the same format family (delimited text) and can be compared. + +## Exit Codes + +All tools: 0 = success, 1 = runtime error, 2 = invalid arguments. + +Exception: `dtdiff` uses diff(1) convention: 0 = no differences, 1 = differences found, 2 = error. + +## Code Reuse from xl-cli-tools + +**Ported verbatim:** +- `formatter.rs` — pure Polars DataFrame formatting, no changes needed +- `filter.rs` — remove letter-based column resolution, keep everything else +- `diff.rs` — pure Polars DataFrame comparison, no changes needed + +**Written fresh:** +- `format.rs` — magic byte reading, extension matching, format enum +- `reader.rs` — dispatch function and ReadOptions struct +- `readers/*.rs` — thin wrappers around Polars readers (~30-50 lines each) +- `readers/excel.rs` — adapted from xl-cli-tools reader.rs +- `readers/dta.rs` — readstat FFI binding, behind feature flag +- `metadata.rs` — format-aware file inspection +- `dtcat.rs`, `dtfilter.rs`, `dtdiff.rs` — adapted from xl-cli-tools binaries + +Roughly 60% ported, 40% new. The new code is mostly thin plumbing. The complex logic (filtering pipeline, diff algorithm, table formatting) transfers from xl-cli-tools. + +## Dependencies + +| Crate | Purpose | Feature Flag | +|-------|---------|-------------| +| polars | DataFrame engine, CSV/Parquet/Arrow/JSON readers | default | +| calamine | Excel reading (.xlsx, .xls, .xlsb, .ods) | default | +| clap | CLI argument parsing (derive) | default | +| anyhow | Error handling | default | +| serde_json | JSON output for dtdiff | default | +| readstat-rs | DTA (Stata) reading | `dta` | + +## Testing + +- Unit tests port with their modules (formatter, filter, diff) +- New unit tests for format detection (magic bytes, extensions, ambiguous cases) +- Integration tests per binary, per format: a matrix of (tool x format) using fixture files in `demo/` +- DTA tests gated behind `#[cfg(feature = "dta")]` + +## Milestones + +- **v0.1:** `dtcat`, `dtfilter`, `dtdiff` — read-only tools, all default formats +- **v1.0:** Add `dtset` — write/edit support for formats where it makes sense