docs: add dt-cli-tools design spec - dt-cli-tools - CLI tools for viewing, filtering, and comparing tabular data files

commit a0e0a2972a6c6acac7789b432d28cc058f411d22
Author: Erik Loualiche <eloualic@umn.edu>
Date:   Mon, 30 Mar 2026 22:05:02 -0500

docs: add dt-cli-tools design spec

Multi-format data CLI suite (CSV, Parquet, Arrow, JSON, Excel, DTA).
v0.1 ships dtcat, dtfilter, dtdiff as read-only tools.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
A docs/superpowers/specs/2026-03-30-dt-cli-tools-design.md  | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 file changed, 177 insertions(+), 0 deletions(-)
diff --git a/docs/superpowers/specs/2026-03-30-dt-cli-tools-design.md b/docs/superpowers/specs/2026-03-30-dt-cli-tools-design.md
@@ -0,0 +1,177 @@
+# dt-cli-tools Design Spec
+
+**Date:** 2026-03-30
+**Status:** Approved
+
+## Summary
+
+A new Rust CLI tool suite for inspecting, querying, and comparing tabular data files across formats. Three read-only tools ship in v0.1: `dtcat`, `dtfilter`, `dtdiff`. A write tool (`dtset`) is planned for v1.0.
+
+The project reuses the format-agnostic modules from xl-cli-tools (formatter, filter, diff) and adds a multi-format reader layer with automatic format detection.
+
+## Supported Formats
+
+| Format | Extensions | Magic Bytes | Crate | Feature Flag |
+|--------|-----------|-------------|-------|-------------|
+| CSV | `.csv`, `.tsv`, `.tab` | Heuristic (text + delimiters) | polars | default |
+| Parquet | `.parquet`, `.pq` | `PAR1` (4 bytes) | polars | default |
+| Arrow/Feather | `.arrow`, `.feather`, `.ipc` | `ARROW1` (6 bytes) | polars | default |
+| JSON/NDJSON | `.json`, `.ndjson`, `.jsonl` | `[` or `{` at start | polars | default |
+| Excel | `.xlsx`, `.xls`, `.xlsb`, `.ods` | ZIP (`PK`) or OLE (`D0 CF`) | calamine | default |
+| DTA (Stata) | `.dta` | Version byte (`0x71`-`0x73`, formats 113-115) or `<stata_dta>` header (formats 117-119) | readstat | `dta` |
+
+Detection priority: `--format` flag > magic bytes > extension > error.
+
+CSV delimiter detection: auto-detect comma vs tab vs semicolon by sampling the first few lines.
+
+## Project Structure
+
+```
+dt-cli-tools/
+  Cargo.toml
+  src/
+    lib.rs
+    format.rs              # format detection (magic bytes + extension)
+    reader.rs              # Format enum, ReadOptions, read_file dispatch
+    formatter.rs           # ported from xl-cli-tools
+    filter.rs              # ported from xl-cli-tools (letter-based column resolution removed)
+    diff.rs                # ported from xl-cli-tools
+    metadata.rs            # format-aware file inspection
+    readers/
+      csv.rs
+      parquet.rs
+      arrow.rs
+      json.rs
+      excel.rs
+      dta.rs               # behind feature flag
+  src/bin/
+    dtcat.rs
+    dtfilter.rs
+    dtdiff.rs
+  tests/
+    integration/
+  demo/
+```
+
+Library crate name: `dtcore`.
+
+## Format Detection and Reading
+
+```rust
+pub enum Format {
+    Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel, Dta,
+}
+
+pub struct ReadOptions {
+    pub sheet: Option<String>,    // Excel only
+    pub skip_rows: Option<usize>,
+    pub separator: Option<u8>,    // CSV override
+}
+
+pub fn detect_format(path: &Path, override_fmt: Option<&str>) -> Result<Format>;
+pub fn read_file(path: &Path, format: Format, opts: &ReadOptions) -> Result<DataFrame>;
+```
+
+Format dispatch uses a match on `Format` rather than a trait with dynamic dispatch. Each reader module exposes a single `read(path, opts) -> Result<DataFrame>` function.
+
+## Binary Interfaces
+
+### dtcat
+
+```
+dtcat <FILE> [OPTIONS]
+
+Options:
+  --format <FMT>        Override format detection
+  --sheet <NAME|INDEX>  Select sheet (Excel only)
+  --skip <N>            Skip first N rows
+  --schema              Show column names and types only
+  --describe            Show summary statistics
+  --head <N>            Show first N rows (default: 50)
+  --tail <N>            Show last N rows
+  --csv                 Output as CSV instead of markdown table
+  --info                Show file metadata (size, format, shape, sheets)
+```
+
+### dtfilter
+
+```
+dtfilter <FILE> [OPTIONS]
+
+Options:
+  --format <FMT>        Override format detection
+  --sheet <NAME|INDEX>  Select sheet (Excel only)
+  --skip <N>            Skip first N rows
+  --filter <EXPR>...    Filter expressions (Amount>1000, Name~john)
+  --sort <SPEC>...      Sort specifications (Amount:desc, Name:asc)
+  --columns <COLS>      Select columns by name (comma-separated)
+  --head <N>            First N rows (applied before filter)
+  --tail <N>            Last N rows (applied before filter)
+  --limit <N>           Max rows in output (applied after filter)
+  --csv                 Output as CSV
+```
+
+Column selection by name only. No letter-based addressing.
+
+### dtdiff
+
+```
+dtdiff <FILE_A> <FILE_B> [OPTIONS]
+
+Options:
+  --format <FMT>        Override format detection (both files must match format)
+  --sheet <NAME|INDEX>  Select sheet (Excel only)
+  --key <COL>...        Key columns for matched comparison
+  --tolerance <N>       Float comparison tolerance (default: 1e-10)
+  --json                Output as JSON
+  --csv                 Output as CSV
+```
+
+Same-format only: errors if the detected formats of FILE_A and FILE_B differ. CSV and TSV are treated as the same format family (delimited text) and can be compared.
+
+## Exit Codes
+
+All tools: 0 = success, 1 = runtime error, 2 = invalid arguments.
+
+Exception: `dtdiff` uses diff(1) convention: 0 = no differences, 1 = differences found, 2 = error.
+
+## Code Reuse from xl-cli-tools
+
+**Ported (verbatim unless noted):**
+- `formatter.rs` — pure Polars DataFrame formatting, no changes needed
+- `filter.rs` — remove letter-based column resolution, keep everything else
+- `diff.rs` — pure Polars DataFrame comparison, no changes needed
+
+**Written fresh or adapted:**
+- `format.rs` — magic byte reading, extension matching, format enum
+- `reader.rs` — dispatch function and ReadOptions struct
+- `readers/*.rs` — thin wrappers around Polars readers (~30-50 lines each)
+- `readers/excel.rs` — adapted from xl-cli-tools reader.rs
+- `readers/dta.rs` — readstat FFI binding, behind feature flag
+- `metadata.rs` — format-aware file inspection
+- `dtcat.rs`, `dtfilter.rs`, `dtdiff.rs` — adapted from xl-cli-tools binaries
+
+Roughly 60% ported, 40% new. The new code is mostly thin plumbing. The complex logic (filtering pipeline, diff algorithm, table formatting) transfers from xl-cli-tools.
+
+## Dependencies
+
+| Crate | Purpose | Feature Flag |
+|-------|---------|-------------|
+| polars | DataFrame engine, CSV/Parquet/Arrow/JSON readers | default |
+| calamine | Excel reading (.xlsx, .xls, .xlsb, .ods) | default |
+| clap | CLI argument parsing (derive) | default |
+| anyhow | Error handling | default |
+| serde_json | JSON output for dtdiff | default |
+| readstat-rs | DTA (Stata) reading | `dta` |
+
+## Testing
+
+- Unit tests port with their modules (formatter, filter, diff)
+- New unit tests for format detection (magic bytes, extensions, ambiguous cases)
+- Integration tests per binary, per format: a matrix of (tool x format) using fixture files in `demo/`
+- DTA tests gated behind `#[cfg(feature = "dta")]`
+
+## Milestones
+
+- **v0.1:** `dtcat`, `dtfilter`, `dtdiff` — read-only tools, all default formats
+- **v1.0:** Add `dtset` — write/edit support for formats where it makes sense
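
Note on the spec's detection priority (`--format` flag > magic bytes > extension > error): the ordering could be sketched roughly as below. This is an illustration, not the committed implementation. The function names `detect_from_head`, `from_name`, and `from_magic` are hypothetical, the accepted `--format` values are assumed to mirror the extensions in the table, and the caller is assumed to have already read the first bytes of the file into `head`. Only the unambiguous binary signatures from the table are checked; the text heuristics for CSV and JSON are left to the extension step here, and any ZIP or OLE container is treated as Excel, which is coarse.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel, Dta,
}

/// Map a `--format` value or a file extension to a Format.
/// The accepted names are an assumption based on the extension table.
fn from_name(name: &str) -> Option<Format> {
    match name.to_ascii_lowercase().as_str() {
        "csv" => Some(Format::Csv),
        "tsv" | "tab" => Some(Format::Tsv),
        "parquet" | "pq" => Some(Format::Parquet),
        "arrow" | "feather" | "ipc" => Some(Format::Arrow),
        "json" => Some(Format::Json),
        "ndjson" | "jsonl" => Some(Format::Ndjson),
        "xlsx" | "xls" | "xlsb" | "ods" | "excel" => Some(Format::Excel),
        "dta" => Some(Format::Dta),
        _ => None,
    }
}

/// Check only the unambiguous binary signatures listed in the spec.
fn from_magic(head: &[u8]) -> Option<Format> {
    if head.starts_with(b"PAR1") {
        Some(Format::Parquet)
    } else if head.starts_with(b"ARROW1") {
        Some(Format::Arrow)
    } else if head.starts_with(b"PK") || head.starts_with(&[0xD0u8, 0xCF]) {
        Some(Format::Excel)
    } else {
        None
    }
}

/// Priority: explicit override, then magic bytes, then extension, then error.
pub fn detect_from_head(
    path: &Path,
    head: &[u8],
    override_fmt: Option<&str>,
) -> Result<Format, String> {
    if let Some(name) = override_fmt {
        return from_name(name).ok_or_else(|| format!("unknown format: {name}"));
    }
    if let Some(fmt) = from_magic(head) {
        return Ok(fmt);
    }
    path.extension()
        .and_then(|e| e.to_str())
        .and_then(from_name)
        .ok_or_else(|| format!("cannot detect format of {}", path.display()))
}
```

With this ordering a Parquet file misnamed `data.csv` is still read as Parquet, while an explicit `--format csv` wins over both the signature and the extension.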

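Note on the spec's CSV delimiter detection ("auto-detect comma vs tab vs semicolon by sampling the first few lines"): the spec does not say how the sample is scored, so the following is only one plausible heuristic. The function name `sniff_delimiter` and the scoring rule are assumptions: a candidate qualifies if it appears the same non-zero number of times on every sampled line, and among qualifying candidates the one producing the most fields wins. Quoted fields that contain a delimiter are not handled.

```rust
/// Guess a CSV delimiter from the first `sample_lines` non-empty lines.
/// Returns None when no candidate splits every sampled line consistently.
pub fn sniff_delimiter(text: &str, sample_lines: usize) -> Option<u8> {
    let lines: Vec<&str> = text
        .lines()
        .filter(|l| !l.is_empty())
        .take(sample_lines)
        .collect();
    let first = *lines.first()?;

    // (occurrences per line, delimiter) of the best candidate so far.
    let mut best: Option<(usize, u8)> = None;

    for &cand in &[b',', b'\t', b';'] {
        let count = |l: &str| l.bytes().filter(|&b| b == cand).count();
        let n = count(first);
        // Reject candidates that are absent or inconsistent across lines.
        if n == 0 || lines.iter().any(|&l| count(l) != n) {
            continue;
        }
        // Prefer the candidate that yields the most fields.
        if best.map_or(true, |(m, _)| n > m) {
            best = Some((n, cand));
        }
    }
    best.map(|(_, d)| d)
}
```

The consistency check is what keeps decimal commas from being mistaken for delimiters: in a semicolon-separated file the comma count usually differs between the header and the data rows. An explicit `separator` in `ReadOptions` would bypass the guess entirely.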