# dt-cli-tools Design Spec

**Date:** 2026-03-30
**Status:** Approved

## Summary

A new Rust CLI tool suite for inspecting, querying, and comparing tabular data files across formats. Three read-only tools ship in v0.1: `dtcat`, `dtfilter`, `dtdiff`. A write tool (`dtset`) is planned for v1.0.

The project reuses the format-agnostic modules from xl-cli-tools (formatter, filter, diff) and adds a multi-format reader layer with automatic format detection.

## Supported Formats

| Format | Extensions | Magic Bytes | Crate | Feature Flag |
|--------|-----------|-------------|-------|-------------|
| CSV | `.csv`, `.tsv`, `.tab` | Heuristic (text + delimiters) | polars | default |
| Parquet | `.parquet`, `.pq` | `PAR1` (4 bytes) | polars | default |
| Arrow/Feather | `.arrow`, `.feather`, `.ipc` | `ARROW1` (6 bytes) | polars | default |
| JSON/NDJSON | `.json`, `.ndjson`, `.jsonl` | `[` or `{` at start | polars | default |
| Excel | `.xlsx`, `.xls`, `.xlsb`, `.ods` | ZIP (`PK`) or OLE (`D0 CF`) | calamine | default |
| DTA (Stata) | `.dta` | Version byte (`0x71`-`0x73`, releases 113-115) or `<stata_dta>` header (releases 117-119) | readstat | `dta` |

Detection priority: `--format` flag > magic bytes > extension. If none of these identifies the format, the tool exits with an error.

CSV delimiter detection: auto-detect comma vs tab vs semicolon by sampling the first few lines.

## Project Structure

```
dt-cli-tools/
  Cargo.toml
  src/
    lib.rs
    format.rs       # format detection (magic bytes + extension)
    reader.rs       # Format enum, ReadOptions, read_file dispatch
    formatter.rs    # ported from xl-cli-tools
    filter.rs       # ported from xl-cli-tools (letter-based column resolution removed)
    diff.rs         # ported from xl-cli-tools
    metadata.rs     # format-aware file inspection
    readers/
      csv.rs
      parquet.rs
      arrow.rs
      json.rs
      excel.rs
      dta.rs        # behind feature flag
  src/bin/
    dtcat.rs
    dtfilter.rs
    dtdiff.rs
  tests/
    integration/
  demo/
```

Library crate name: `dtcore`.
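
The detection priority listed under Supported Formats (`--format` flag, then magic bytes, then extension, then error) could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned contents of `format.rs`: the helper names (`from_name`, `from_magic`, `detect_sketch`) are hypothetical, the set of values accepted by `--format` is assumed to mirror the extension names, DTA and the CSV heuristic are left out, and JSON versus NDJSON is not disambiguated by content.

```rust
use std::path::Path;

// Local stand-in for the spec's `Format` enum (DTA omitted for brevity).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format { Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel }

// Hypothetical helper: map an extension or a `--format` value to a Format.
// Assumption: `--format` accepts the same names as the file extensions.
fn from_name(name: &str) -> Option<Format> {
    match name.to_ascii_lowercase().as_str() {
        "csv" => Some(Format::Csv),
        "tsv" | "tab" => Some(Format::Tsv),
        "parquet" | "pq" => Some(Format::Parquet),
        "arrow" | "feather" | "ipc" => Some(Format::Arrow),
        "json" => Some(Format::Json),
        "ndjson" | "jsonl" => Some(Format::Ndjson),
        "xlsx" | "xls" | "xlsb" | "ods" => Some(Format::Excel),
        _ => None,
    }
}

// Hypothetical helper: inspect the first bytes of the file.
fn from_magic(head: &[u8]) -> Option<Format> {
    if head.starts_with(b"PAR1") {
        return Some(Format::Parquet);
    }
    if head.starts_with(b"ARROW1") {
        return Some(Format::Arrow);
    }
    // ZIP container (xlsx/xlsb/ods) or OLE compound file (xls).
    if head.starts_with(b"PK") || head.starts_with(&[0xD0, 0xCF]) {
        return Some(Format::Excel);
    }
    // Simplification: any leading `[` or `{` is reported as Json.
    match head.iter().copied().find(|b| !b.is_ascii_whitespace()) {
        Some(b'[') | Some(b'{') => Some(Format::Json),
        _ => None,
    }
}

// Priority: explicit override > magic bytes > extension > error.
pub fn detect_sketch(
    path: &Path,
    head: &[u8],
    override_fmt: Option<&str>,
) -> Result<Format, String> {
    if let Some(name) = override_fmt {
        return from_name(name).ok_or_else(|| format!("unknown format: {}", name));
    }
    from_magic(head)
        .or_else(|| path.extension().and_then(|e| e.to_str()).and_then(from_name))
        .ok_or_else(|| format!("cannot detect format of {}", path.display()))
}
```

In this sketch a file named `data.csv` whose first bytes are `PAR1` is reported as Parquet, because magic bytes outrank the extension; passing `Some("csv")` as the override forces CSV regardless of content.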

## Format Detection and Reading

```rust
use std::path::Path;

use anyhow::Result;              // assumed: anyhow is the error-handling crate (see Dependencies)
use polars::prelude::DataFrame;

pub enum Format {
    Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel, Dta,
}

pub struct ReadOptions {
    pub sheet: Option<String>,    // Excel only
    pub skip_rows: Option<usize>,
    pub separator: Option<u8>,    // CSV override
}

// Signatures only; bodies live in format.rs and reader.rs.
pub fn detect_format(path: &Path, override_fmt: Option<&str>) -> Result<Format>;
pub fn read_file(path: &Path, format: Format, opts: &ReadOptions) -> Result<DataFrame>;
```

Format dispatch uses a match on `Format` rather than a trait with dynamic dispatch. Each reader module exposes a single `read(path, opts) -> Result<DataFrame>` function.

## Binary Interfaces

### dtcat

```
dtcat <FILE> [OPTIONS]

Options:
  --format <FMT>         Override format detection
  --sheet <NAME|INDEX>   Select sheet (Excel only)
  --skip <N>             Skip first N rows
  --schema               Show column names and types only
  --describe             Show summary statistics
  --head <N>             Show first N rows (default: 50)
  --tail <N>             Show last N rows
  --csv                  Output as CSV instead of markdown table
  --info                 Show file metadata (size, format, shape, sheets)
```

### dtfilter

```
dtfilter <FILE> [OPTIONS]

Options:
  --format <FMT>         Override format detection
  --sheet <NAME|INDEX>   Select sheet (Excel only)
  --skip <N>             Skip first N rows
  --filter <EXPR>...     Filter expressions (Amount>1000, Name~john)
  --sort <SPEC>...       Sort specifications (Amount:desc, Name:asc)
  --columns <COLS>       Select columns by name (comma-separated)
  --head <N>             First N rows (applied before filter)
  --tail <N>             Last N rows (applied before filter)
  --limit <N>            Max rows in output (applied after filter)
  --csv                  Output as CSV
```

Column selection by name only. No letter-based addressing.
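
To make the `--filter` and `--sort` argument shapes above concrete, here is a minimal parsing sketch. It is not the ported `filter.rs`: the type and function names are hypothetical, the operator set is assumed (only `>` and `~` appear in the examples above; `<` and `=` are guesses), and treating a sort spec without a suffix as ascending is also an assumption.

```rust
// Hypothetical operator set; only `>` and `~` are confirmed by the spec's examples.
#[derive(Debug, PartialEq)]
pub enum Op { Gt, Lt, Eq, Contains }

#[derive(Debug, PartialEq)]
pub struct FilterExpr {
    pub column: String,
    pub op: Op,
    pub value: String,
}

// Split an expression such as `Amount>1000` at its first operator character.
pub fn parse_filter(expr: &str) -> Result<FilterExpr, String> {
    let (idx, ch) = expr
        .char_indices()
        .find(|&(_, c)| matches!(c, '>' | '<' | '=' | '~'))
        .ok_or_else(|| format!("no operator in filter: {}", expr))?;
    let op = match ch {
        '>' => Op::Gt,
        '<' => Op::Lt,
        '=' => Op::Eq,
        _ => Op::Contains, // '~'
    };
    let column = expr[..idx].trim();
    let value = expr[idx + ch.len_utf8()..].trim();
    if column.is_empty() || value.is_empty() {
        return Err(format!("incomplete filter: {}", expr));
    }
    Ok(FilterExpr { column: column.to_string(), op, value: value.to_string() })
}

// Parse `Column:asc` or `Column:desc` into (column, descending).
// Assumption: a bare column name sorts ascending.
pub fn parse_sort(spec: &str) -> Result<(String, bool), String> {
    match spec.rsplit_once(':') {
        Some((col, "desc")) => Ok((col.to_string(), true)),
        Some((col, "asc")) => Ok((col.to_string(), false)),
        Some((_, other)) => Err(format!("unknown sort direction: {}", other)),
        None => Ok((spec.to_string(), false)),
    }
}
```

Because columns are addressed by name only, the parsed `column` string would be matched directly against the DataFrame's column names, with no letter-to-index translation step.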

### dtdiff

```
dtdiff <FILE_A> <FILE_B> [OPTIONS]

Options:
  --format <FMT>         Override format detection (both files must match format)
  --sheet <NAME|INDEX>   Select sheet (Excel only)
  --key <COL>...         Key columns for matched comparison
  --tolerance <N>        Float comparison tolerance (default: 1e-10)
  --json                 Output as JSON
  --csv                  Output as CSV
```

Same-format only: `dtdiff` errors if the detected formats of FILE_A and FILE_B differ. CSV and TSV are treated as the same format family (delimited text) and can be compared.

## Exit Codes

All tools: 0 = success, 1 = runtime error, 2 = invalid arguments.

Exception: `dtdiff` uses the diff(1) convention: 0 = no differences, 1 = differences found, 2 = error.

## Code Reuse from xl-cli-tools

**Ported (verbatim unless noted):**
- `formatter.rs`: pure Polars DataFrame formatting, no changes needed
- `filter.rs`: remove letter-based column resolution, keep everything else
- `diff.rs`: pure Polars DataFrame comparison, no changes needed

**Written fresh or adapted:**
- `format.rs`: magic byte reading, extension matching, format enum
- `reader.rs`: dispatch function and ReadOptions struct
- `readers/*.rs`: thin wrappers around Polars readers (~30-50 lines each)
- `readers/excel.rs`: adapted from xl-cli-tools reader.rs
- `readers/dta.rs`: readstat FFI binding, behind feature flag
- `metadata.rs`: format-aware file inspection
- `dtcat.rs`, `dtfilter.rs`, `dtdiff.rs`: adapted from xl-cli-tools binaries

Roughly 60% ported, 40% new. The new code is mostly thin plumbing. The complex logic (filtering pipeline, diff algorithm, table formatting) transfers from xl-cli-tools.
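
Two small pieces of `dtdiff` behavior described earlier, the `--tolerance` float comparison and the diff(1)-style exit codes, can be illustrated with the sketch below. It is not the ported `diff.rs`: the function names are hypothetical, the tolerance is assumed to be absolute rather than relative, and treating two NaN values as equal is an assumption the spec does not state.

```rust
// Compare two floats under an absolute tolerance.
// Assumptions: tolerance is absolute; NaN compared with NaN counts as equal.
pub fn floats_equal(a: f64, b: f64, tolerance: f64) -> bool {
    if a.is_nan() || b.is_nan() {
        return a.is_nan() && b.is_nan();
    }
    // `a == b` covers equal infinities, where `a - b` would be NaN.
    a == b || (a - b).abs() <= tolerance
}

// Map a diff result to the diff(1) convention from the Exit Codes section:
// 0 = no differences, 1 = differences found, 2 = error.
// `Ok(true)` means at least one difference was found.
pub fn diff_exit_code(result: &Result<bool, String>) -> i32 {
    match result {
        Ok(false) => 0,
        Ok(true) => 1,
        Err(_) => 2,
    }
}
```

A binary would pass the returned value to `std::process::exit`. With the default tolerance of `1e-10`, values such as `1.0` and `1.0 + 1e-12` compare as equal, while `1.0` and `1.1` do not.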

## Dependencies

| Crate | Purpose | Feature Flag |
|-------|---------|-------------|
| polars | DataFrame engine, CSV/Parquet/Arrow/JSON readers | default |
| calamine | Excel reading (.xlsx, .xls, .xlsb, .ods) | default |
| clap | CLI argument parsing (derive) | default |
| anyhow | Error handling | default |
| serde_json | JSON output for dtdiff | default |
| readstat-rs | DTA (Stata) reading | `dta` |

## Testing

- Unit tests port with their modules (formatter, filter, diff)
- New unit tests for format detection (magic bytes, extensions, ambiguous cases)
- Integration tests per binary, per format: a matrix of (tool x format) using fixture files in `demo/`
- DTA tests gated behind `#[cfg(feature = "dta")]`

## Milestones

- **v0.1:** `dtcat`, `dtfilter`, `dtdiff` (read-only tools, all default formats)
- **v1.0:** Add `dtset` (write/edit support for formats where it makes sense)