dt-cli-tools

CLI tools for viewing, filtering, and comparing tabular data files

2026-04-04-v0.2.0-sample-convert-design.md


# dt-cli-tools v0.2.0: sample and convert

## Summary

Add two flags to dtcat: `--sample N` for random row sampling, and `--convert FORMAT` (with `-o PATH`) for format conversion. No changes to dtfilter or dtdiff.

## `--sample N`

Randomly select N rows from the DataFrame after reading.

- Mutually exclusive with `--head`, `--tail`, `--all`
- Mutually exclusive with `--schema`, `--describe`, `--info`
- Works with both `--csv` output and the default markdown output
- Works with `--skip` and `--sheet` (both are applied before sampling)
- Non-deterministic (no seed flag)
- If N >= row count, return all rows (no error)

### Examples

```bash
dtcat huge.parquet --sample 20
dtcat huge.parquet --sample 50 --csv
dtcat report.xlsx --sheet Data --sample 10
```

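The exclusivity rules for `--sample` listed above could be checked along these lines. This is a minimal hand-rolled sketch for illustration only: the function name `check_sample_conflicts`, the `SAMPLE_CONFLICTS` table, and the error wording are assumptions, and the actual binary may express the same rules declaratively through its argument parser.

```rust
/// Flags that `--sample` may not be combined with, taken from the list above.
const SAMPLE_CONFLICTS: [&str; 6] = [
    "--head", "--tail", "--all", "--schema", "--describe", "--info",
];

/// `given` holds the flag names present on the command line.
/// Returns an error naming the first flag that conflicts with `--sample`.
/// (Hypothetical helper, not part of dtcat.)
fn check_sample_conflicts(given: &[&str]) -> Result<(), String> {
    // Nothing to check unless --sample was actually passed.
    if !given.contains(&"--sample") {
        return Ok(());
    }
    for flag in given {
        if SAMPLE_CONFLICTS.contains(flag) {
            return Err(format!("--sample cannot be used with {flag}"));
        }
    }
    Ok(())
}
```

Flags such as `--csv`, `--skip`, and `--sheet` are absent from the table, so they pass through unchanged.
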
## `--convert FORMAT` with `-o PATH`

Read any supported format and write it out in a different format.

- `--convert FORMAT`: target format, one of csv, tsv, parquet, arrow, json, ndjson
- `-o PATH`: output file path
- For text formats (csv, tsv, json, ndjson): if `-o` is omitted, write to stdout
- For binary formats (parquet, arrow): `-o` is required; error if missing
- Mutually exclusive with all display flags (`--schema`, `--describe`, `--info`, `--csv`, `--head`, `--tail`, `--all`, `--sample`)
- Works with `--skip` and `--sheet` (both select data before converting)

### Examples

```bash
dtcat data.csv --convert parquet -o data.parquet
dtcat report.xlsx --sheet Revenue --convert csv -o revenue.csv
dtcat data.json --convert arrow -o data.arrow
dtcat data.parquet --convert ndjson          # stdout
```

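The stdout-versus-file rule for `-o` can be sketched as a small decision function. Everything named here (`Format`, `Output`, `resolve_output`, the error text, and the choice to accept only lowercase format names) is an assumption made for illustration, not the dtcat implementation.

```rust
use std::path::PathBuf;

/// The six target formats listed for `--convert`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format { Csv, Tsv, Parquet, Arrow, Json, Ndjson }

impl Format {
    /// Parse the FORMAT argument. Assumption: names are lowercase only.
    fn parse(s: &str) -> Result<Format, String> {
        match s {
            "csv" => Ok(Format::Csv),
            "tsv" => Ok(Format::Tsv),
            "parquet" => Ok(Format::Parquet),
            "arrow" => Ok(Format::Arrow),
            "json" => Ok(Format::Json),
            "ndjson" => Ok(Format::Ndjson),
            other => Err(format!("unknown format: {other}")),
        }
    }

    /// Binary formats cannot be written to a terminal, so they need `-o`.
    fn is_binary(self) -> bool {
        matches!(self, Format::Parquet | Format::Arrow)
    }
}

/// Where the converted data should go.
#[derive(Debug, PartialEq)]
enum Output { Stdout, File(PathBuf) }

/// Apply the `-o` rules: an explicit path always wins, text formats fall
/// back to stdout, and binary formats without a path are an error.
fn resolve_output(format: Format, path: Option<&str>) -> Result<Output, String> {
    match (path, format.is_binary()) {
        (Some(p), _) => Ok(Output::File(PathBuf::from(p))),
        (None, false) => Ok(Output::Stdout),
        (None, true) => Err(format!("{format:?} output requires -o PATH")),
    }
}
```

Keeping this decision separate from the writers means the "binary formats error without `-o`" case can be tested without producing any files.
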
## Architecture

Both features are additions to `src/bin/dtcat.rs` (new CLI args) and `dtcore` (writers).

### New library code

- `src/writers/` module with: `csv.rs`, `parquet.rs`, `arrow.rs`, `json.rs`
- Each writer takes a `&mut DataFrame` and either a `Write` implementor or a file path, and returns `Result<()>`
- `src/lib.rs` exports the new `writers` module

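To illustrate the writer shape described above (data in, any `Write` sink out, `Result` back), here is a std-only sketch of a delimited-text writer. `Table` is a hypothetical stand-in for a Polars `DataFrame`, and `write_delimited` and `escape` are illustrative names; the real writers take `&mut DataFrame` and would presumably delegate to Polars rather than format cells by hand.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a DataFrame: column names plus rows of
/// already-stringified cells. Not part of dtcore.
struct Table {
    columns: Vec<String>,
    rows: Vec<Vec<String>>,
}

/// Quote a field only when it contains the delimiter, a quote, or a line break.
fn escape(field: &str, delimiter: char) -> String {
    let needs_quotes = field.contains(delimiter)
        || field.contains('"')
        || field.contains('\n')
        || field.contains('\r');
    if needs_quotes {
        // Embedded quotes are doubled, then the whole field is wrapped.
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Write `table` as delimited text to any `Write` sink: stdout, a file,
/// or an in-memory buffer. Use ',' for CSV and '\t' for TSV.
fn write_delimited<W: Write>(table: &Table, out: &mut W, delimiter: char) -> io::Result<()> {
    let sep = delimiter.to_string();
    let header: Vec<String> = table.columns.iter().map(|c| escape(c, delimiter)).collect();
    writeln!(out, "{}", header.join(sep.as_str()))?;
    for row in &table.rows {
        let cells: Vec<String> = row.iter().map(|c| escape(c, delimiter)).collect();
        writeln!(out, "{}", cells.join(sep.as_str()))?;
    }
    Ok(())
}
```

Because the sink is generic over `Write`, the same function serves both the stdout path and the `-o PATH` path for text formats, and tests can write into a `Vec<u8>`.
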
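### Sampling

- Use Polars `DataFrame::sample_n` or equivalent
- Implemented in the dtcat binary: applied after the read (including `--skip` and `--sheet`) and before display. Because `--convert` is mutually exclusive with `--sample`, a sampled frame is never passed to a writer.
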
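The intended sampling semantics (N distinct rows, all rows when N >= row count, no seed) can be sketched without Polars by choosing row indices. This is a std-only illustration, not the planned implementation: `sample_indices` is a hypothetical name, the xorshift generator is a swapped-in stand-in for a proper RNG or for Polars' own sampling, and returning rows in their original order when N >= row count is an assumption.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Pick `n` distinct row indices out of `row_count`, in random order.
/// If `n >= row_count`, every index is returned in its original order.
fn sample_indices(row_count: usize, n: usize) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..row_count).collect();
    if n >= row_count {
        return indices;
    }
    // Seed from std's randomly keyed hasher. There is no seed flag, so
    // results differ between runs. `| 1` keeps the xorshift state non-zero.
    let mut state = RandomState::new().build_hasher().finish() | 1;
    // Partial Fisher-Yates shuffle: only the first `n` positions are fixed.
    for i in 0..n {
        // One xorshift64 step.
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Choose j in i..row_count (slight modulo bias, fine for a sketch).
        let j = i + (state % (row_count - i) as u64) as usize;
        indices.swap(i, j);
    }
    indices.truncate(n);
    indices
}
```

With the indices in hand, the selected rows can be taken from the frame before it is handed to the display code.
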
## Testing

### `--sample N`
- Verify output has exactly N rows (when N < total)
- Verify N >= total returns all rows
- Verify mutual exclusivity with `--head`/`--tail`/`--all`
- Verify it works on CSV, Parquet, and Excel inputs

### `--convert FORMAT`
- Roundtrip: CSV → Parquet → CSV, verify data matches
- All 6 target formats produce valid output
- Text formats work without `-o` (stdout)
- Binary formats error without `-o`
- `--skip` and `--sheet` apply before conversion
- Mutual exclusivity with display flags

### Existing tests
- All 187+ existing tests still pass

## Not in scope

- No changes to dtfilter or dtdiff
- No new binaries
- No dtset, dtvalidate, or dtjoin