# Reading HTML Tables

Parse HTML tables into DataFrames: a Julia-native replacement for pandas' `read_html`.

---

## Quick Start

```julia
using BazerUtils

# From a URL
dfs = read_html_tables("https://en.wikipedia.org/wiki/List_of_Alabama_state_parks")

# From a raw HTML string
dfs = read_html_tables("<table><tr><th>A</th></tr><tr><td>1</td></tr></table>")
```

`read_html_tables` returns a `Vector{DataFrame}`, with one `DataFrame` per `<table>` element found.

---

## API

```@docs
read_html_tables
```

---

## Keyword Arguments

### `match`

Pass a `Regex` to keep only tables whose text content matches:

```julia
dfs = read_html_tables(url; match=r"Population"i)
```

### `flatten`

Controls how multi-level headers (multiple header rows inside `<thead>`) become column names.
DataFrames requires `String` column names, so multi-level tuples are flattened:

| Value | Column name example | Description |
|:------|:--------------------|:------------|
| `nothing` (default) | `"(Region, Name)"` | Tuple string representation |
| `:join` | `"Region_Name"` | Levels joined with `_` |
| `:last` | `"Name"` | Last header level only |

```julia
dfs = read_html_tables(html; flatten=:join)
```

---

## How It Works

1. **Fetch**: URLs (starting with `http`) are downloaded via `HTTP.jl`; raw strings are parsed directly.
2. **Parse**: HTML is parsed with `Gumbo.jl`; `<table>` elements are selected with `Cascadia.jl`.
3. **Classify rows**: `<thead>` rows become headers, `<tbody>`/`<tfoot>` rows become body data. Without an explicit `<thead>`, consecutive all-`<th>` rows at the top are promoted to headers.
4. **Expand spans**: `colspan` and `rowspan` attributes are expanded into a dense grid (same algorithm as pandas' `_expand_colspan_rowspan`).
5. **Build DataFrame**: Empty cells become `missing`. Duplicate column names get `.1`, `.2` suffixes.
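
Steps 4 and 5 can be illustrated with a minimal sketch that uses only Base Julia. It is not the package's implementation: the `Cell` type and the `expand_spans` and `dedupe_names` helpers are hypothetical names introduced here, and the suffix numbering (first duplicate gets `.1`) is an assumption modelled on pandas' behaviour.

```julia
# Hypothetical cell representation: text plus its span attributes.
struct Cell
    text::String
    colspan::Int
    rowspan::Int
end
Cell(text; colspan=1, rowspan=1) = Cell(text, colspan, rowspan)

# Expand colspan/rowspan into a dense grid of strings.
# `pending` maps a column index to (text, rows still to fill) for cells
# whose rowspan reaches into later rows.
function expand_spans(rows)
    grid = Vector{Vector{String}}()
    pending = Dict{Int,Tuple{String,Int}}()
    for row in rows
        line = String[]
        carry = Dict{Int,Tuple{String,Int}}()
        col = 1
        for cell in row
            # Copy down cells that an earlier row spans into this position.
            while haskey(pending, col)
                text, left = pending[col]
                push!(line, text)
                left > 1 && (carry[col] = (text, left - 1))
                col += 1
            end
            # Duplicate the cell across every column it covers.
            for _ in 1:cell.colspan
                push!(line, cell.text)
                cell.rowspan > 1 && (carry[col] = (cell.text, cell.rowspan - 1))
                col += 1
            end
        end
        # Row-spanned cells that sit after the last explicit cell.
        while haskey(pending, col)
            text, left = pending[col]
            push!(line, text)
            left > 1 && (carry[col] = (text, left - 1))
            col += 1
        end
        pending = carry
        push!(grid, line)
    end
    return grid
end

# Append .1, .2, ... to repeated names (assumed numbering; the first
# occurrence keeps its original name).
function dedupe_names(names)
    seen = Dict{String,Int}()
    out = String[]
    for name in names
        n = get(seen, name, 0)
        push!(out, n == 0 ? name : string(name, ".", n))
        seen[name] = n + 1
    end
    return out
end

expand_spans([[Cell("A"; rowspan=2), Cell("B")], [Cell("C")]])
# [["A", "B"], ["A", "C"]]

dedupe_names(["A", "B", "A", "A"])
# ["A", "B", "A.1", "A.2"]
```

Tracking row spans in a per-column dictionary keeps the expansion to a single pass over the rows, at the cost of only handling spans that are contiguous with the cells already placed.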

---

## Examples

### Filter tables by content

```julia
# Only tables mentioning "GDP"
dfs = read_html_tables(url; match=r"GDP"i)
```

### Multi-level headers

```julia
html = """
<table>
  <thead>
    <tr><th colspan="2">Region</th></tr>
    <tr><th>Name</th><th>Pop</th></tr>
  </thead>
  <tbody>
    <tr><td>East</td><td>100</td></tr>
  </tbody>
</table>
"""

read_html_tables(html; flatten=:join)
# 1×2 DataFrame: columns "Region_Name", "Region_Pop"
```

### Tables with colspan/rowspan

Spanned cells are duplicated into every position they cover, so the resulting DataFrame has a regular rectangular shape with no gaps.

---

## See Also

- [`Gumbo.jl`](https://github.com/JuliaWeb/Gumbo.jl): HTML parser
- [`Cascadia.jl`](https://github.com/Algocircle/Cascadia.jl): CSS selector engine
- [`HTTP.jl`](https://github.com/JuliaWeb/HTTP.jl): HTTP client
- [`DataFrames.jl`](https://github.com/JuliaData/DataFrames.jl): Tabular data