# v0.2.0: --sample and --convert Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add random row sampling (`--sample N`) and format conversion (`--convert FORMAT -o PATH`) to dtcat.

**Architecture:** Both features extend the existing dtcat binary. `--sample` uses Polars `DataFrame::sample_n_literal` after reading, before display. `--convert` requires new writer functions in a `src/writers/` module mirroring `src/readers/`, plus a write path in dtcat that short-circuits display.

**Tech Stack:** Polars (ParquetWriter, IpcWriter, JsonWriter, CsvWriter), clap, anyhow.

---

### Task 1: Add `--sample N` flag and validation

**Files:**
- Modify: `src/bin/dtcat.rs`
- Test: `tests/dtcat.rs`

- [ ] **Step 1: Write the failing tests**

  Add to `tests/dtcat.rs`:

  ```rust
  #[test]
  fn sample_returns_n_rows() {
      // 18-row fixture, sample 5
      let out = dtcat().arg("demo/sales.csv").arg("--sample").arg("5").arg("--csv")
          .assert().success();
      let stdout = String::from_utf8(out.get_output().stdout.clone()).unwrap();
      // CSV header + 5 data rows = 6 lines (after trimming the trailing newline)
      let lines: Vec<&str> = stdout.trim().lines().collect();
      assert_eq!(lines.len(), 6, "expected header + 5 rows, got {}", lines.len());
  }

  #[test]
  fn sample_ge_total_returns_all() {
      let f = csv_file("x\n1\n2\n3\n");
      dtcat().arg(f.path()).arg("--sample").arg("100").arg("--csv")
          .assert().success();
  }

  #[test]
  fn sample_conflicts_with_head() {
      let f = csv_file("x\n1\n");
      dtcat().arg(f.path()).arg("--sample").arg("1").arg("--head").arg("1")
          .assert().code(2);
  }

  #[test]
  fn sample_conflicts_with_tail() {
      let f = csv_file("x\n1\n");
      dtcat().arg(f.path()).arg("--sample").arg("1").arg("--tail").arg("1")
          .assert().code(2);
  }

  #[test]
  fn sample_conflicts_with_all() {
      let f = csv_file("x\n1\n");
      dtcat().arg(f.path()).arg("--sample").arg("1").arg("--all")
          .assert().code(2);
  }
  ```

- [ ] **Step 2: Run tests to verify they fail**

  Run: `cargo test --test dtcat sample`
  Expected: FAIL (unknown arg `--sample`)

- [ ] **Step 3: Add `--sample` arg and validation to dtcat**

  In `src/bin/dtcat.rs`, add to the `Args` struct after the `all` field:

  ```rust
  /// Randomly sample N rows
  #[arg(long, value_name = "N")]
  sample: Option<usize>,
  ```

  Update `validate_args`:

  ```rust
  fn validate_args(args: &Args) -> Result<()> {
      if args.schema && args.describe {
          bail!("--schema and --describe are mutually exclusive");
      }
      if args.sample.is_some() {
          if args.head.is_some() {
              bail!("--sample and --head are mutually exclusive");
          }
          if args.tail.is_some() {
              bail!("--sample and --tail are mutually exclusive");
          }
          if args.all {
              bail!("--sample and --all are mutually exclusive");
          }
      }
      Ok(())
  }
  ```

- [ ] **Step 4: Implement sampling logic**

  There are two places the sampling could live. Branching inside the display match in `src/bin/dtcat.rs` (the `let output = match ...` section) would look like:

  ```rust
  // Determine what to display
  let output = if let Some(n) = args.sample {
      let sampled = if n >= df.height() {
          df
      } else {
          df.sample_n_literal(n, false, false, None)?
      };
      format_data_table(&sampled)
  } else {
      match (args.head, args.tail) {
          (Some(h), Some(t)) => {
              format_head_tail(&df, h, t)
          }
          (Some(h), None) => {
              let sliced = df.head(Some(h));
              format_data_table(&sliced)
          }
          (None, Some(t)) => {
              let sliced = df.tail(Some(t));
              format_data_table(&sliced)
          }
          (None, None) => {
              if args.all || df.height() <= DEFAULT_THRESHOLD {
                  format_data_table(&df)
              } else {
                  format_head_tail(&df, DEFAULT_HEAD_TAIL, DEFAULT_HEAD_TAIL)
              }
          }
      }
  };
  ```

  However, the current `--csv` branch exits early, before this display match ever runs, so `--sample` combined with `--csv` would need separate handling. The simpler approach, and the one to implement, is to apply sampling once, immediately after reading and before the `--csv` check, leaving the display match block unchanged. After the line `let df = read_file(&path, fmt, &opts)?;`, add:

  ```rust
  // Apply sampling if requested (before any display mode)
  let df = if let Some(n) = args.sample {
      if n >= df.height() {
          df
      } else {
          df.sample_n_literal(n, false, false, None)?
      }
  } else {
      df
  };
  ```

  With sampling applied up front, `--sample` composes naturally with `--csv` and with every display mode.
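The clamp-then-sample rule above (return the whole frame when N is at least the row count, otherwise take N distinct rows) can be illustrated without Polars. The following is a dependency-free sketch, not dtcat's actual code: `Lcg` and `sample_rows` are names invented for the illustration, and dtcat itself delegates to `DataFrame::sample_n_literal`.

```rust
/// Tiny linear congruential generator so the sketch needs no crates.
/// (Demo-quality randomness only.)
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

/// Sample `n` distinct rows without replacement, clamping `n` to the
/// row count, mirroring the `n >= df.height()` guard in the plan.
fn sample_rows<T: Clone>(rows: &[T], n: usize, seed: u64) -> Vec<T> {
    if n >= rows.len() {
        return rows.to_vec(); // --sample N with N >= height returns everything
    }
    let mut rng = Lcg(seed);
    let mut pool: Vec<T> = rows.to_vec();
    // Partial Fisher-Yates: after i swaps, the first i slots hold a uniform sample.
    for i in 0..n {
        let j = i + (rng.next_u64() as usize) % (pool.len() - i);
        pool.swap(i, j);
    }
    pool.truncate(n);
    pool
}

fn main() {
    let rows: Vec<u32> = (1..=18).collect();
    assert_eq!(sample_rows(&rows, 5, 42).len(), 5);
    assert_eq!(sample_rows(&rows, 100, 42).len(), 18); // clamp: all rows returned
    println!("sampling sketch ok");
}
```

Sampling here is without replacement, which is what the boolean arguments in the `sample_n_literal(n, false, false, None)` calls above request from Polars.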
- [ ] **Step 5: Run tests to verify they pass**

  Run: `cargo test --test dtcat sample`
  Expected: all 5 sample tests PASS

- [ ] **Step 6: Commit**

  ```bash
  git add src/bin/dtcat.rs tests/dtcat.rs
  git commit -m "feat: add --sample N flag to dtcat"
  ```

---

### Task 2: Create writers module

**Files:**
- Create: `src/writers/mod.rs`
- Create: `src/writers/csv.rs`
- Create: `src/writers/parquet.rs`
- Create: `src/writers/arrow.rs`
- Create: `src/writers/json.rs`
- Modify: `src/lib.rs`

- [ ] **Step 1: Create `src/writers/mod.rs`**

  ```rust
  pub mod arrow;
  pub mod csv;
  pub mod json;
  pub mod parquet;
  ```

- [ ] **Step 2: Create `src/writers/csv.rs`**

  ```rust
  use anyhow::Result;
  use polars::prelude::*;
  use std::io::Write;
  use std::path::Path;

  use crate::format::Format;

  pub fn write(df: &mut DataFrame, path: Option<&Path>, format: Format) -> Result<()> {
      let separator = match format {
          Format::Tsv => b'\t',
          _ => b',',
      };

      match path {
          Some(p) => {
              let file = std::fs::File::create(p)?;
              CsvWriter::new(file)
                  .with_separator(separator)
                  .finish(df)?;
          }
          None => {
              let mut buf = Vec::new();
              CsvWriter::new(&mut buf)
                  .with_separator(separator)
                  .finish(df)?;
              std::io::stdout().write_all(&buf)?;
          }
      }
      Ok(())
  }

  #[cfg(test)]
  mod tests {
      use super::*;
      use tempfile::NamedTempFile;

      #[test]
      fn write_csv_roundtrip() {
          let s1 = Series::new("name".into(), &["Alice", "Bob"]);
          let s2 = Series::new("value".into(), &[100i64, 200]);
          let mut df = DataFrame::new(vec![s1.into_column(), s2.into_column()]).unwrap();

          let f = NamedTempFile::with_suffix(".csv").unwrap();
          write(&mut df, Some(f.path()), Format::Csv).unwrap();

          let result = crate::readers::csv::read(f.path(), &crate::reader::ReadOptions::default()).unwrap();
          assert_eq!(result.height(), 2);
          assert_eq!(result.get_column_names(), df.get_column_names());
      }

      #[test]
      fn write_tsv_uses_tab() {
          let s = Series::new("x".into(), &[1i64]);
          let mut df = DataFrame::new(vec![s.into_column()]).unwrap();

          let f = NamedTempFile::with_suffix(".tsv").unwrap();
          write(&mut df, Some(f.path()), Format::Tsv).unwrap();

          let content = std::fs::read_to_string(f.path()).unwrap();
          assert!(!content.contains(','));
      }
  }
  ```

- [ ] **Step 3: Create `src/writers/parquet.rs`**

  ```rust
  use anyhow::Result;
  use polars::prelude::*;
  use std::path::Path;

  pub fn write(df: &mut DataFrame, path: Option<&Path>) -> Result<()> {
      let path = path.ok_or_else(|| anyhow::anyhow!("--convert parquet requires -o PATH"))?;
      let file = std::fs::File::create(path)?;
      ParquetWriter::new(file).finish(df)?;
      Ok(())
  }

  #[cfg(test)]
  mod tests {
      use super::*;
      use tempfile::NamedTempFile;

      #[test]
      fn write_parquet_roundtrip() {
          let s1 = Series::new("name".into(), &["Alice", "Bob"]);
          let s2 = Series::new("value".into(), &[100i64, 200]);
          let mut df = DataFrame::new(vec![s1.into_column(), s2.into_column()]).unwrap();

          let f = NamedTempFile::with_suffix(".parquet").unwrap();
          write(&mut df, Some(f.path())).unwrap();

          let result = crate::readers::parquet::read(f.path(), &crate::reader::ReadOptions::default()).unwrap();
          assert_eq!(result.height(), 2);
      }

      #[test]
      fn write_parquet_no_path_errors() {
          let s = Series::new("x".into(), &[1i64]);
          let mut df = DataFrame::new(vec![s.into_column()]).unwrap();
          assert!(write(&mut df, None).is_err());
      }
  }
  ```

- [ ] **Step 4: Create `src/writers/arrow.rs`**

  ```rust
  use anyhow::Result;
  use polars::prelude::*;
  use std::path::Path;

  pub fn write(df: &mut DataFrame, path: Option<&Path>) -> Result<()> {
      let path = path.ok_or_else(|| anyhow::anyhow!("--convert arrow requires -o PATH"))?;
      let file = std::fs::File::create(path)?;
      IpcWriter::new(file).finish(df)?;
      Ok(())
  }

  #[cfg(test)]
  mod tests {
      use super::*;
      use tempfile::NamedTempFile;

      #[test]
      fn write_arrow_roundtrip() {
          let s = Series::new("x".into(), &[1i64, 2, 3]);
          let mut df = DataFrame::new(vec![s.into_column()]).unwrap();

          let f = NamedTempFile::with_suffix(".arrow").unwrap();
          write(&mut df, Some(f.path())).unwrap();

          let result = crate::readers::arrow::read(f.path(), &crate::reader::ReadOptions::default()).unwrap();
          assert_eq!(result.height(), 3);
      }

      #[test]
      fn write_arrow_no_path_errors() {
          let s = Series::new("x".into(), &[1i64]);
          let mut df = DataFrame::new(vec![s.into_column()]).unwrap();
          assert!(write(&mut df, None).is_err());
      }
  }
  ```

- [ ] **Step 5: Create `src/writers/json.rs`**

  ```rust
  use anyhow::Result;
  use polars::prelude::*;
  use std::io::Write as IoWrite;
  use std::path::Path;

  use crate::format::Format;

  pub fn write(df: &mut DataFrame, path: Option<&Path>, format: Format) -> Result<()> {
      match format {
          Format::Ndjson => write_ndjson(df, path),
          _ => write_json(df, path),
      }
  }

  fn write_json(df: &mut DataFrame, path: Option<&Path>) -> Result<()> {
      match path {
          Some(p) => {
              let file = std::fs::File::create(p)?;
              // Request a JSON array explicitly; JsonWriter's default is JSON Lines.
              JsonWriter::new(file)
                  .with_json_format(JsonFormat::Json)
                  .finish(df)?;
          }
          None => {
              let mut buf = Vec::new();
              JsonWriter::new(&mut buf)
                  .with_json_format(JsonFormat::Json)
                  .finish(df)?;
              std::io::stdout().write_all(&buf)?;
          }
      }
      Ok(())
  }

  fn write_ndjson(df: &mut DataFrame, path: Option<&Path>) -> Result<()> {
      match path {
          Some(p) => {
              let file = std::fs::File::create(p)?;
              JsonWriter::new(file)
                  .with_json_format(JsonFormat::JsonLines)
                  .finish(df)?;
          }
          None => {
              let mut buf = Vec::new();
              JsonWriter::new(&mut buf)
                  .with_json_format(JsonFormat::JsonLines)
                  .finish(df)?;
              std::io::stdout().write_all(&buf)?;
          }
      }
      Ok(())
  }

  #[cfg(test)]
  mod tests {
      use super::*;
      use tempfile::NamedTempFile;

      #[test]
      fn write_json_roundtrip() {
          let s = Series::new("x".into(), &[1i64, 2]);
          let mut df = DataFrame::new(vec![s.into_column()]).unwrap();

          let f = NamedTempFile::with_suffix(".json").unwrap();
          write(&mut df, Some(f.path()), Format::Json).unwrap();

          let result = crate::readers::json::read(f.path(), Format::Json, &crate::reader::ReadOptions::default()).unwrap();
          assert_eq!(result.height(), 2);
      }

      #[test]
      fn write_ndjson_roundtrip() {
          let s = Series::new("x".into(), &[1i64, 2]);
          let mut df = DataFrame::new(vec![s.into_column()]).unwrap();

          let f = NamedTempFile::with_suffix(".ndjson").unwrap();
          write(&mut df, Some(f.path()), Format::Ndjson).unwrap();

          let result = crate::readers::json::read(f.path(), Format::Ndjson, &crate::reader::ReadOptions::default()).unwrap();
          assert_eq!(result.height(), 2);
      }
  }
  ```

- [ ] **Step 6: Add `writers` to `src/lib.rs`**

  Replace the contents of `src/lib.rs` with:

  ```rust
  pub mod diff;
  pub mod filter;
  pub mod format;
  pub mod formatter;
  pub mod metadata;
  pub mod reader;
  pub mod readers;
  pub mod writers;
  ```

- [ ] **Step 7: Run unit tests**

  Run: `cargo test --lib`
  Expected: all unit tests pass, including the new writer tests

- [ ] **Step 8: Commit**

  ```bash
  git add src/writers/ src/lib.rs
  git commit -m "feat: add writers module (csv, tsv, parquet, arrow, json, ndjson)"
  ```

---

### Task 3: Add write_file dispatch function

**Files:**
- Create: `src/writer.rs`
- Modify: `src/lib.rs`

- [ ] **Step 1: Create `src/writer.rs`**

  ```rust
  use anyhow::{bail, Result};
  use polars::prelude::*;
  use std::path::Path;

  use crate::format::Format;
  use crate::writers;

  /// Write a DataFrame to a file or stdout, dispatching to the appropriate writer.
  ///
  /// For binary formats (Parquet, Arrow), `path` is required.
  /// For text formats (CSV, TSV, JSON, NDJSON), `path` is optional (None = stdout).
  /// Excel writing is not supported.
  pub fn write_file(df: &mut DataFrame, path: Option<&Path>, format: Format) -> Result<()> {
      match format {
          Format::Csv | Format::Tsv => writers::csv::write(df, path, format),
          Format::Parquet => writers::parquet::write(df, path),
          Format::Arrow => writers::arrow::write(df, path),
          Format::Json | Format::Ndjson => writers::json::write(df, path, format),
          Format::Excel => bail!("writing Excel format is not supported; use csv or parquet"),
      }
  }
  ```

- [ ] **Step 2: Add `writer` to `src/lib.rs`**

  ```rust
  pub mod diff;
  pub mod filter;
  pub mod format;
  pub mod formatter;
  pub mod metadata;
  pub mod reader;
  pub mod readers;
  pub mod writer;
  pub mod writers;
  ```

- [ ] **Step 3: Run tests**

  Run: `cargo test --lib`
  Expected: PASS

- [ ] **Step 4: Commit**

  ```bash
  git add src/writer.rs src/lib.rs
  git commit -m "feat: add write_file dispatch function"
  ```

---

### Task 4: Add `--convert` and `-o` flags to dtcat

**Files:**
- Modify: `src/bin/dtcat.rs`
- Test: `tests/dtcat.rs`

- [ ] **Step 1: Write the failing tests**

  Add to `tests/dtcat.rs`:

  ```rust
  #[test]
  fn convert_csv_to_parquet() {
      let out = NamedTempFile::with_suffix(".parquet").unwrap();
      dtcat().arg("tests/fixtures/data.csv")
          .arg("--convert").arg("parquet")
          .arg("-o").arg(out.path())
          .assert().success();
      // Read back and verify
      dtcat().arg(out.path()).arg("--csv")
          .assert().success()
          .stdout(predicate::str::contains("Alice"))
          .stdout(predicate::str::contains("Charlie"));
  }

  #[test]
  fn convert_parquet_to_csv_file() {
      let out = NamedTempFile::with_suffix(".csv").unwrap();
      dtcat().arg("tests/fixtures/data.parquet")
          .arg("--convert").arg("csv")
          .arg("-o").arg(out.path())
          .assert().success();
      dtcat().arg(out.path())
          .assert().success()
          .stdout(predicate::str::contains("Alice"));
  }

  #[test]
  fn convert_csv_to_json_stdout() {
      dtcat().arg("tests/fixtures/data.csv")
          .arg("--convert").arg("json")
          .assert().success()
          .stdout(predicate::str::contains("Alice"));
  }

  #[test]
  fn convert_csv_to_ndjson_stdout() {
      dtcat().arg("tests/fixtures/data.csv")
          .arg("--convert").arg("ndjson")
          .assert().success()
          .stdout(predicate::str::contains("Alice"));
  }

  #[test]
  fn convert_parquet_no_output_errors() {
      dtcat().arg("tests/fixtures/data.csv")
          .arg("--convert").arg("parquet")
          .assert().failure();
  }

  #[test]
  fn convert_arrow_no_output_errors() {
      dtcat().arg("tests/fixtures/data.csv")
          .arg("--convert").arg("arrow")
          .assert().failure();
  }

  #[test]
  fn convert_conflicts_with_schema() {
      let f = csv_file("x\n1\n");
      dtcat().arg(f.path()).arg("--convert").arg("csv").arg("--schema")
          .assert().code(2);
  }

  #[test]
  fn convert_with_skip() {
      let f = csv_file("meta\nname,value\nAlice,100\n");
      dtcat().arg(f.path()).arg("--skip").arg("1").arg("--convert").arg("csv")
          .assert().success()
          .stdout(predicate::str::contains("Alice"));
  }
  ```

- [ ] **Step 2: Run tests to verify they fail**

  Run: `cargo test --test dtcat convert`
  Expected: FAIL (unknown arg `--convert`)

- [ ] **Step 3: Add `--convert` and `-o` args and validation**

  In `src/bin/dtcat.rs`, add to the `Args` struct:

  ```rust
  /// Convert to format (csv, tsv, parquet, arrow, json, ndjson)
  #[arg(long, value_name = "FORMAT")]
  convert: Option<String>,

  /// Output file path (required for binary formats with --convert)
  #[arg(short = 'o', value_name = "PATH")]
  output: Option<String>,
  ```

  Add to the imports at the top of the file:

  ```rust
  use dtcore::format::parse_format_str;
  use dtcore::writer::write_file;
  ```

  Update `validate_args`, adding after the sample checks:

  ```rust
  if args.convert.is_some() {
      if args.schema || args.describe || args.info || args.csv
          || args.head.is_some() || args.tail.is_some()
          || args.all || args.sample.is_some()
      {
          bail!("--convert is mutually exclusive with display flags");
      }
  }
  ```

- [ ] **Step 4: Add convert logic to the run function**

  In `src/bin/dtcat.rs`, insert after the sampling block and before the empty DataFrame check:

  ```rust
  // --convert: write to a different format and exit
  if let Some(ref convert_str) = args.convert {
      let target_fmt = parse_format_str(convert_str)?;
      let out_path = args.output.as_deref().map(std::path::Path::new);
      let mut df = df;
      write_file(&mut df, out_path, target_fmt)?;
      return Ok(());
  }
  ```

- [ ] **Step 5: Run tests to verify they pass**

  Run: `cargo test --test dtcat convert`
  Expected: all 8 convert tests PASS

- [ ] **Step 6: Run all tests**

  Run: `cargo test`
  Expected: all tests PASS

- [ ] **Step 7: Commit**

  ```bash
  git add src/bin/dtcat.rs tests/dtcat.rs
  git commit -m "feat: add --convert FORMAT and -o PATH to dtcat"
  ```

---

### Task 5: Bump version and final verification

**Files:**
- Modify: `Cargo.toml`

- [ ] **Step 1: Bump version**

  In `Cargo.toml`, change:

  ```toml
  version = "0.2.0"
  ```

- [ ] **Step 2: Run full test suite**

  Run: `cargo test`
  Expected: all tests PASS

- [ ] **Step 3: Run clippy**

  Run: `cargo clippy --release`
  Expected: no warnings

- [ ] **Step 4: Verify CLI help**

  Run: `cargo run --release --bin dtcat -- --help`
  Expected: output includes `--sample`, `--convert`, `-o`

- [ ] **Step 5: Commit**

  ```bash
  git add Cargo.toml
  git commit -m "chore: bump version to 0.2.0"
  ```
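As a closing sanity check on the write-path design, the dispatch rule from Task 3 (text formats may stream to stdout, binary formats require `-o PATH`, Excel is rejected) can be sketched without Polars. `Fmt` and `write_sketch` are toy names invented for this illustration; the real `write_file` matches on `dtcore`'s `Format` and calls the Polars-backed writers:

```rust
// Dependency-free sketch of the write_file dispatch shape. The writers
// return strings here instead of performing I/O, so only the branching
// logic (stdout fallback, mandatory path, unsupported format) is shown.

#[derive(Clone, Copy, Debug)]
enum Fmt { Csv, Tsv, Parquet, Arrow, Json, Ndjson, Excel }

fn write_sketch(path: Option<&str>, format: Fmt) -> Result<String, String> {
    match format {
        // Text formats: a missing path means "write to stdout".
        Fmt::Csv | Fmt::Tsv | Fmt::Json | Fmt::Ndjson => {
            Ok(path.map_or_else(|| "stdout".to_string(), |p| format!("file:{p}")))
        }
        // Binary formats: a path is mandatory.
        Fmt::Parquet | Fmt::Arrow => path
            .map(|p| format!("file:{p}"))
            .ok_or_else(|| format!("--convert {format:?} requires -o PATH")),
        // No writer exists for Excel.
        Fmt::Excel => Err("writing Excel format is not supported; use csv or parquet".into()),
    }
}

fn main() {
    assert_eq!(write_sketch(None, Fmt::Csv).unwrap(), "stdout");
    assert_eq!(write_sketch(Some("out.parquet"), Fmt::Parquet).unwrap(), "file:out.parquet");
    assert!(write_sketch(None, Fmt::Arrow).is_err());
    assert!(write_sketch(Some("report.xlsx"), Fmt::Excel).is_err());
    println!("dispatch sketch ok");
}
```

Grouping `Csv | Tsv` and `Json | Ndjson` into shared arms mirrors the plan's writers, which take the `Format` as a parameter to pick the separator or JSON layout.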