# dt-cli-tools Design Spec

**Date:** 2026-03-30
**Status:** Approved

## Summary

A new Rust CLI tool suite for inspecting, querying, and comparing tabular data files across formats. Three read-only tools ship in v0.1: `dtcat`, `dtfilter`, `dtdiff`. A write tool (`dtset`) is planned for v1.0.

The project reuses the format-agnostic modules from xl-cli-tools (formatter, filter, diff) and adds a multi-format reader layer with automatic format detection.

## Supported Formats

| Format | Extensions | Magic Bytes | Crate | Feature Flag |
|--------|-----------|-------------|-------|-------------|
| CSV | `.csv`, `.tsv`, `.tab` | Heuristic (text + delimiters) | polars | default |
| Parquet | `.parquet`, `.pq` | `PAR1` (4 bytes) | polars | default |
| Arrow/Feather | `.arrow`, `.feather`, `.ipc` | `ARROW1` (6 bytes) | polars | default |
| JSON/NDJSON | `.json`, `.ndjson`, `.jsonl` | `[` or `{` at start | polars | default |
| Excel | `.xlsx`, `.xls`, `.xlsb`, `.ods` | ZIP (`PK`) or OLE (`D0 CF`) | calamine | default |
| DTA (Stata) | `.dta` | Version byte `0x71`-`0x73` (formats 113-115) or `<stata_dta>` header (formats 117-119) | readstat | `dta` |

Detection priority: `--format` flag, then magic bytes, then file extension. If none of these identifies the format, detection fails with an error.
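
As a rough illustration of that priority order, the sketch below resolves a format from an already-parsed override, the file's leading bytes, and its extension. The function names `sniff_magic`, `from_extension`, and `resolve` are hypothetical. The override is taken as `Option<Format>` rather than the `Option<&str>` used by `detect_format`, errors are plain `String`s, and the DTA check and the CSV text heuristic are left out. Mapping `.tab` to `Tsv` is also an assumption.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Format {
    Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel, Dta,
}

/// Classify a file from its leading bytes. `None` means the bytes were
/// inconclusive and the caller should fall back to the extension.
pub fn sniff_magic(head: &[u8]) -> Option<Format> {
    if head.starts_with(b"PAR1") {
        return Some(Format::Parquet);
    }
    if head.starts_with(b"ARROW1") {
        return Some(Format::Arrow);
    }
    // ZIP container (xlsx, xlsb, ods) or OLE compound file (xls).
    if head.starts_with(b"PK") || head.starts_with(&[0xD0u8, 0xCF]) {
        return Some(Format::Excel);
    }
    // JSON: first non-whitespace byte is `[` or `{`. Telling NDJSON apart
    // from a single JSON object would need a closer look; not attempted here.
    match head.iter().copied().find(|b| !b.is_ascii_whitespace()) {
        Some(b'[') | Some(b'{') => Some(Format::Json),
        _ => None,
    }
}

/// Map a file extension (case-insensitive) to a format.
pub fn from_extension(path: &Path) -> Option<Format> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "csv" => Some(Format::Csv),
        "tsv" | "tab" => Some(Format::Tsv), // assumption: `.tab` is tab-separated
        "parquet" | "pq" => Some(Format::Parquet),
        "arrow" | "feather" | "ipc" => Some(Format::Arrow),
        "json" => Some(Format::Json),
        "ndjson" | "jsonl" => Some(Format::Ndjson),
        "xlsx" | "xls" | "xlsb" | "ods" => Some(Format::Excel),
        "dta" => Some(Format::Dta),
        _ => None,
    }
}

/// Apply the priority order: override, magic bytes, extension, error.
pub fn resolve(
    override_fmt: Option<Format>,
    head: &[u8],
    path: &Path,
) -> Result<Format, String> {
    override_fmt
        .or_else(|| sniff_magic(head))
        .or_else(|| from_extension(path))
        .ok_or_else(|| format!("cannot determine format of {}", path.display()))
}
```

Under these assumptions, a Parquet file misnamed `data.csv` still resolves to `Parquet`, because magic bytes outrank the extension.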

CSV delimiter detection: auto-detect comma vs tab vs semicolon by sampling the first few lines.
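
One possible heuristic for this, sketched under assumptions rather than taken from the implementation: count each candidate delimiter on the sampled lines and accept the one that appears at least once and equally often on every line. The name `sniff_delimiter` is hypothetical, and the sketch ignores quoted fields, which a real reader would have to handle.

```rust
/// Guess the delimiter from the first `max_lines` non-empty lines of `sample`.
/// Returns `None` when no candidate appears consistently.
pub fn sniff_delimiter(sample: &str, max_lines: usize) -> Option<u8> {
    let lines: Vec<&str> = sample
        .lines()
        .filter(|l| !l.is_empty())
        .take(max_lines)
        .collect();
    let first = *lines.first()?;

    let mut best: Option<(usize, u8)> = None;
    for cand in [b',', b'\t', b';'] {
        let count = |l: &str| l.bytes().filter(|&b| b == cand).count();
        let n = count(first);
        // A plausible delimiter occurs at least once and the same number
        // of times on every sampled line.
        if n == 0 || lines.iter().any(|&l| count(l) != n) {
            continue;
        }
        // Among consistent candidates, prefer the one yielding more columns.
        if best.map_or(true, |(m, _)| n > m) {
            best = Some((n, cand));
        }
    }
    best.map(|(_, delim)| delim)
}
```

For example, a semicolon-separated sample whose values contain commas on only some lines would resolve to `;`, since the comma count is not consistent across lines.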

## Project Structure

```
dt-cli-tools/
  Cargo.toml
  src/
    lib.rs
    format.rs              # format detection (magic bytes + extension)
    reader.rs              # Format enum, ReadOptions, read_file dispatch
    formatter.rs           # ported from xl-cli-tools
    filter.rs              # ported from xl-cli-tools (letter-based column resolution removed)
    diff.rs                # ported from xl-cli-tools
    metadata.rs            # format-aware file inspection
    readers/
      csv.rs
      parquet.rs
      arrow.rs
      json.rs
      excel.rs
      dta.rs               # behind feature flag
  src/bin/
    dtcat.rs
    dtfilter.rs
    dtdiff.rs
  tests/
    integration/
  demo/
```

Library crate name: `dtcore`.

## Format Detection and Reading

```rust
use std::path::Path;

use anyhow::Result;             // assumed from the Dependencies table
use polars::prelude::DataFrame;

/// Every input format the reader layer can dispatch on.
pub enum Format {
    Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel, Dta,
}

/// Per-read settings passed to every reader.
pub struct ReadOptions {
    pub sheet: Option<String>,    // Excel only
    pub skip_rows: Option<usize>,
    pub separator: Option<u8>,    // CSV override
}

/// Resolve the format from the `--format` override, magic bytes, or extension.
pub fn detect_format(path: &Path, override_fmt: Option<&str>) -> Result<Format>;
/// Read `path` as `format` into a DataFrame.
pub fn read_file(path: &Path, format: Format, opts: &ReadOptions) -> Result<DataFrame>;
```

Format dispatch uses a `match` on `Format` rather than a trait with dynamic dispatch. Each reader module exposes a single `read(path, opts) -> Result<DataFrame>` function.
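
A minimal sketch of what match-based dispatch might look like. To stay self-contained it replaces the Polars `DataFrame` and the `anyhow` result with stand-in types, and the helper functions `read_csv` and `read_other` are placeholders for the per-format `read` functions. Routing TSV through the CSV reader with a tab separator, and the error returned for DTA, are assumptions.

```rust
use std::path::Path;

// Stand-ins so the sketch compiles without Polars or anyhow; the real
// DataFrame and Result types would come from those crates.
#[derive(Debug, PartialEq)]
pub struct DataFrame {
    pub reader: &'static str,
    pub separator: Option<u8>,
}
pub type Result<T> = std::result::Result<T, String>;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Format {
    Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel, Dta,
}

#[derive(Debug, Default, Clone)]
pub struct ReadOptions {
    pub sheet: Option<String>,
    pub skip_rows: Option<usize>,
    pub separator: Option<u8>,
}

// Placeholder readers: each stands for a `readers::<format>::read` function
// and only records which code path ran.
fn read_csv(_path: &Path, opts: &ReadOptions) -> Result<DataFrame> {
    Ok(DataFrame { reader: "csv", separator: opts.separator })
}

fn read_other(reader: &'static str) -> Result<DataFrame> {
    Ok(DataFrame { reader, separator: None })
}

pub fn read_file(path: &Path, format: Format, opts: &ReadOptions) -> Result<DataFrame> {
    match format {
        Format::Csv => read_csv(path, opts),
        Format::Tsv => {
            // Assumed: TSV reuses the CSV reader with a tab separator
            // unless the caller supplied an explicit override.
            let mut tsv = opts.clone();
            tsv.separator = tsv.separator.or(Some(b'\t'));
            read_csv(path, &tsv)
        }
        Format::Parquet => read_other("parquet"),
        Format::Arrow => read_other("arrow"),
        Format::Json => read_other("json"),
        Format::Ndjson => read_other("ndjson"),
        Format::Excel => read_other("excel"),
        Format::Dta => Err("DTA support requires the `dta` feature".to_string()),
    }
}
```

One consequence of this design is that the `match` has no wildcard arm, so adding a variant to `Format` makes the compiler flag `read_file` until the new format is handled.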

## Binary Interfaces

### dtcat

```
dtcat <FILE> [OPTIONS]

Options:
  --format <FMT>        Override format detection
  --sheet <NAME|INDEX>  Select sheet (Excel only)
  --skip <N>            Skip first N rows
  --schema              Show column names and types only
  --describe            Show summary statistics
  --head <N>            Show first N rows (default: 50)
  --tail <N>            Show last N rows
  --csv                 Output as CSV instead of markdown table
  --info                Show file metadata (size, format, shape, sheets)
```

### dtfilter

```
dtfilter <FILE> [OPTIONS]

Options:
  --format <FMT>        Override format detection
  --sheet <NAME|INDEX>  Select sheet (Excel only)
  --skip <N>            Skip first N rows
  --filter <EXPR>...    Filter expressions (Amount>1000, Name~john)
  --sort <SPEC>...      Sort specifications (Amount:desc, Name:asc)
  --columns <COLS>      Select columns by name (comma-separated)
  --head <N>            First N rows (applied before filter)
  --tail <N>            Last N rows (applied before filter)
  --limit <N>           Max rows in output (applied after filter)
  --csv                 Output as CSV
```

Column selection by name only. No letter-based addressing.
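
To make the `--filter` and `--sort` syntax concrete, here is a hedged sketch of how such strings could be split into parts. Only `>` and `~` appear in the examples above; the remaining operators (`>=`, `<=`, `!=`, `=`, `<`), the reading of `~` as a substring match, the default ascending order, and the names `Op`, `FilterExpr`, `parse_filter`, and `parse_sort` are assumptions for illustration, not the ported `filter.rs` API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op { Eq, Ne, Ge, Le, Gt, Lt, Contains }

#[derive(Debug, PartialEq)]
pub struct FilterExpr {
    pub column: String,
    pub op: Op,
    pub value: String,
}

/// Split an expression such as `Amount>1000` into column, operator, value.
pub fn parse_filter(expr: &str) -> Result<FilterExpr, String> {
    const OPS: [(&str, Op); 7] = [
        (">=", Op::Ge), ("<=", Op::Le), ("!=", Op::Ne),
        ("=", Op::Eq), (">", Op::Gt), ("<", Op::Lt), ("~", Op::Contains),
    ];
    // Take the operator that starts earliest; on a tie prefer the longer
    // token so that `>=` is not read as `>` followed by `=5`.
    let hit = OPS
        .iter()
        .filter_map(|(tok, op)| expr.find(*tok).map(|i| (i, *tok, *op)))
        .min_by_key(|(i, tok, _)| (*i, usize::MAX - tok.len()));
    let (i, tok, op) = hit.ok_or_else(|| format!("no operator in filter `{}`", expr))?;

    let column = expr[..i].trim();
    let value = expr[i + tok.len()..].trim();
    if column.is_empty() || value.is_empty() {
        return Err(format!("incomplete filter `{}`", expr));
    }
    Ok(FilterExpr { column: column.to_string(), op, value: value.to_string() })
}

/// Split a sort spec such as `Amount:desc` into (column, descending).
/// A spec without a direction is assumed to sort ascending.
pub fn parse_sort(spec: &str) -> Result<(String, bool), String> {
    match spec.rsplit_once(':') {
        None => Ok((spec.to_string(), false)),
        Some((col, dir)) => match dir.to_ascii_lowercase().as_str() {
            "asc" => Ok((col.to_string(), false)),
            "desc" => Ok((col.to_string(), true)),
            other => Err(format!("unknown sort direction `{}`", other)),
        },
    }
}
```

Because columns are addressed by name only, the `column` field would be matched against DataFrame column names as-is; this sketch does not attempt that lookup.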

### dtdiff

```
dtdiff <FILE_A> <FILE_B> [OPTIONS]

Options:
  --format <FMT>        Override format detection (both files must match format)
  --sheet <NAME|INDEX>  Select sheet (Excel only)
  --key <COL>...        Key columns for matched comparison
  --tolerance <N>       Float comparison tolerance (default: 1e-10)
  --json                Output as JSON
  --csv                 Output as CSV
```

Same-format only: errors if the detected formats of FILE_A and FILE_B differ. CSV and TSV are treated as the same format family (delimited text) and can be compared.
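
The same-format rule and the `--tolerance` option can be illustrated with a short sketch. The helper names `family`, `check_comparable`, and `floats_equal` are hypothetical, and treating the tolerance as an absolute difference and two NaN values as equal are assumptions; the ported `diff.rs` may define both differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Format {
    Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel, Dta,
}

/// Collapse formats that belong to the same family. Only CSV and TSV
/// (delimited text) share a family; every other format stands alone.
fn family(f: Format) -> Format {
    match f {
        Format::Tsv => Format::Csv,
        other => other,
    }
}

/// Reject comparisons across format families.
pub fn check_comparable(a: Format, b: Format) -> Result<(), String> {
    if family(a) == family(b) {
        Ok(())
    } else {
        Err(format!("cannot compare {:?} with {:?}: formats differ", a, b))
    }
}

/// Compare two floats with an absolute tolerance (assumed semantics).
pub fn floats_equal(a: f64, b: f64, tol: f64) -> bool {
    // `a == b` covers equal infinities, whose difference would be NaN.
    // Two NaNs are treated as equal so that missing-vs-missing is not
    // reported as a change (an assumption).
    a == b || (a.is_nan() && b.is_nan()) || (a - b).abs() <= tol
}
```

With the default tolerance of `1e-10`, values that differ only by floating-point noise from a format round trip would compare equal under this sketch, while a change in the third decimal place would not.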

## Exit Codes

`dtcat` and `dtfilter`: 0 = success, 1 = runtime error, 2 = invalid arguments.

`dtdiff` follows the diff(1) convention instead: 0 = no differences, 1 = differences found, 2 = error.
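
The two conventions could be mapped to process exit statuses as sketched below. The `Failure` enum and the function names are hypothetical, and representing a diff result as `Result<bool, String>` is an assumption made to keep the example small.

```rust
use std::process::ExitCode;

/// How a run of `dtcat` or `dtfilter` can fail (hypothetical type).
#[derive(Debug)]
pub enum Failure {
    Usage(String),
    Runtime(String),
}

/// Status for `dtcat` and `dtfilter`: 0 success, 1 runtime error, 2 bad arguments.
pub fn tool_status(result: &Result<(), Failure>) -> u8 {
    match result {
        Ok(()) => 0,
        Err(Failure::Runtime(_)) => 1,
        Err(Failure::Usage(_)) => 2,
    }
}

/// Status for `dtdiff`, following diff(1). `Ok(true)` means differences
/// were found; any error, including bad arguments, maps to 2.
pub fn diff_status(result: &Result<bool, String>) -> u8 {
    match result {
        Ok(false) => 0,
        Ok(true) => 1,
        Err(_) => 2,
    }
}

/// Convert a numeric status into the value a `main` function can return.
pub fn to_exit_code(status: u8) -> ExitCode {
    ExitCode::from(status)
}
```

A binary's `main` declared as `fn main() -> ExitCode` could then end with `to_exit_code(diff_status(&result))`. Keeping the mapping in one small function per convention makes the difference between `dtdiff` and the other tools easy to test.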

## Code Reuse from xl-cli-tools

**Ported from xl-cli-tools:**
- `formatter.rs`: pure Polars DataFrame formatting, ported verbatim
- `filter.rs`: ported with letter-based column resolution removed, everything else kept
- `diff.rs`: pure Polars DataFrame comparison, ported verbatim

**New or adapted:**
- `format.rs`: magic byte reading, extension matching, format enum
- `reader.rs`: dispatch function and ReadOptions struct
- `readers/*.rs`: thin wrappers around Polars readers (~30-50 lines each)
- `readers/excel.rs`: adapted from xl-cli-tools reader.rs
- `readers/dta.rs`: readstat FFI binding, behind feature flag
- `metadata.rs`: format-aware file inspection
- `dtcat.rs`, `dtfilter.rs`, `dtdiff.rs`: adapted from xl-cli-tools binaries

Roughly 60% ported, 40% new. The new code is mostly thin plumbing. The complex logic (filtering pipeline, diff algorithm, table formatting) transfers from xl-cli-tools.

## Dependencies

| Crate | Purpose | Feature Flag |
|-------|---------|-------------|
| polars | DataFrame engine, CSV/Parquet/Arrow/JSON readers | default |
| calamine | Excel reading (.xlsx, .xls, .xlsb, .ods) | default |
| clap | CLI argument parsing (derive) | default |
| anyhow | Error handling | default |
| serde_json | JSON output for dtdiff | default |
| readstat-rs | DTA (Stata) reading | `dta` |

## Testing

- Unit tests port with their modules (formatter, filter, diff)
- New unit tests for format detection (magic bytes, extensions, ambiguous cases)
- Integration tests per binary, per format: a matrix of (tool x format) using fixture files in `demo/`
- DTA tests gated behind `#[cfg(feature = "dta")]`
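
The feature gating mentioned in the last bullet might look roughly like this. It is a sketch only: `read_dta` and `dta_enabled` are hypothetical names, the row count returned in place of a DataFrame is a stand-in, and the gated branch does not call readstat.

```rust
use std::path::Path;

/// Compiled only when the crate is built with `--features dta`.
#[cfg(feature = "dta")]
pub fn read_dta(_path: &Path) -> Result<usize, String> {
    // The real implementation would call into readstat here.
    Err("DTA reading is not implemented in this sketch".to_string())
}

/// Fallback for default builds, so callers get a clear message
/// instead of a missing symbol.
#[cfg(not(feature = "dta"))]
pub fn read_dta(path: &Path) -> Result<usize, String> {
    Err(format!(
        "{}: rebuild with `--features dta` to read Stata files",
        path.display()
    ))
}

/// Reports at run time whether the `dta` feature was compiled in.
pub fn dta_enabled() -> bool {
    cfg!(feature = "dta")
}

// Tests that need the DTA reader are compiled only when both the test
// harness and the feature are active.
#[cfg(all(test, feature = "dta"))]
mod dta_tests {
    use super::*;

    #[test]
    fn dta_reader_is_compiled_in() {
        assert!(dta_enabled());
    }
}
```

With this layout, `cargo test` exercises only the default formats, and `cargo test --features dta` adds the gated module.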

## Milestones

- **v0.1:** `dtcat`, `dtfilter`, `dtdiff`: read-only tools, all default formats
- **v1.0:** Add `dtset`: write/edit support for formats where it makes sense