2026-04-04-v0.2.0-sample-convert-design.md (2666B)
# dt-cli-tools v0.2.0: sample and convert

## Summary

Add two flags to dtcat: `--sample N` for random row sampling and `--convert FORMAT` with `-o PATH` for format conversion. No changes to dtfilter or dtdiff.

## `--sample N`

Randomly select N rows from the DataFrame after reading.

- Mutually exclusive with `--head`, `--tail`, `--all`
- Works with `--csv` output and default markdown
- Works with `--skip` and `--sheet` (applied before sampling)
- Mutually exclusive with `--schema`, `--describe`, `--info`
- Non-deterministic (no seed flag)
- If N >= row count, return all rows (no error)

### Examples

```bash
dtcat huge.parquet --sample 20
dtcat huge.parquet --sample 50 --csv
dtcat report.xlsx --sheet Data --sample 10
```

## `--convert FORMAT` with `-o PATH`

Read any supported format, write to a different format.

- `--convert FORMAT`: target format, one of csv, tsv, parquet, arrow, json, ndjson
- `-o PATH`: output file path
- For text formats (csv, tsv, json, ndjson): if `-o` is omitted, write to stdout
- For binary formats (parquet, arrow): `-o` is required; error if missing
- Mutually exclusive with all display flags (`--schema`, `--describe`, `--info`, `--csv`, `--head`, `--tail`, `--all`, `--sample`)
- Works with `--skip` and `--sheet` (select data before converting)

### Examples

```bash
dtcat data.csv --convert parquet -o data.parquet
dtcat report.xlsx --sheet Revenue --convert csv -o revenue.csv
dtcat data.json --convert arrow -o data.arrow
dtcat data.parquet --convert ndjson  # stdout
```

## Architecture

Both features are additions to `src/bin/dtcat.rs` (new CLI args) and `dtcore` (writers).
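The `-o` rule from the `--convert` section (stdout allowed for text formats, a path required for binary ones) could be checked on the CLI side roughly as follows. This is a minimal, std-only sketch: `TargetFormat`, `Destination`, and `resolve_destination` are hypothetical names chosen for illustration, not existing dtcat or `dtcore` items, and the error type is a plain `String` rather than whatever error type the project uses.

```rust
use std::path::PathBuf;

/// Target formats listed for `--convert` (hypothetical type, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TargetFormat {
    Csv,
    Tsv,
    Parquet,
    Arrow,
    Json,
    Ndjson,
}

impl TargetFormat {
    /// Parse the FORMAT argument; unknown names are rejected.
    fn parse(s: &str) -> Result<Self, String> {
        match s {
            "csv" => Ok(Self::Csv),
            "tsv" => Ok(Self::Tsv),
            "parquet" => Ok(Self::Parquet),
            "arrow" => Ok(Self::Arrow),
            "json" => Ok(Self::Json),
            "ndjson" => Ok(Self::Ndjson),
            other => Err(format!("unsupported target format: {other}")),
        }
    }

    /// Binary formats cannot be written to stdout under the rules above.
    fn is_binary(self) -> bool {
        matches!(self, Self::Parquet | Self::Arrow)
    }
}

/// Where the converted output goes (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Destination {
    Stdout,
    File(PathBuf),
}

/// Apply the `-o` rule: text formats fall back to stdout,
/// binary formats fail when no path is given.
fn resolve_destination(
    format: TargetFormat,
    output: Option<PathBuf>,
) -> Result<Destination, String> {
    match (output, format.is_binary()) {
        (Some(path), _) => Ok(Destination::File(path)),
        (None, false) => Ok(Destination::Stdout),
        (None, true) => Err(format!("--convert {format:?} requires -o PATH")),
    }
}

fn main() {
    // Text format without -o: falls back to stdout.
    let fmt = TargetFormat::parse("ndjson").unwrap();
    println!("{:?}", resolve_destination(fmt, None));

    // Binary format without -o: rejected.
    let fmt = TargetFormat::parse("parquet").unwrap();
    println!("{:?}", resolve_destination(fmt, None));
}
```

Keeping the check in one pure function, separate from argument parsing, would let the "binary formats error without `-o`" case be unit-tested without touching the filesystem.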
### New library code

- `src/writers/` module with: `csv.rs`, `parquet.rs`, `arrow.rs`, `json.rs`
- Each writer takes a `&mut DataFrame` and a `Write` or file path, returns `Result<()>`
- `src/lib.rs` exports new `writers` module

### Sampling

- Use Polars `DataFrame::sample_n` or equivalent
- Implemented in dtcat binary after read, before display/convert

## Testing

### `--sample N`
- Verify output has exactly N rows (when N < total)
- Verify N >= total returns all rows
- Verify mutual exclusivity with `--head`/`--tail`/`--all`
- Works on CSV, Parquet, Excel

### `--convert FORMAT`
- Roundtrip: CSV → Parquet → CSV, verify data matches
- All 6 target formats produce valid output
- Text formats work without `-o` (stdout)
- Binary formats error without `-o`
- `--skip` and `--sheet` apply before conversion
- Mutual exclusivity with display flags

### Existing tests
- All 187+ existing tests still pass

## Not in scope

- No changes to dtfilter or dtdiff
- No new binaries
- No dtset, dtvalidate, or dtjoin