BazerData.jl

Data manipulation utilities for Julia
Log | Files | Refs | README | LICENSE

commit a0cf07b9f57a8c2c17237a747373de6c9308392f
parent 1203c402a3c50602f5ea707c84134bcf6f58bfc0
Author: Erik Loualiche <eloualic@umn.edu>
Date:   Tue, 20 May 2025 17:51:26 -0500

readme + doc

Diffstat:
MREADME.md | 41++++++++++-------------------------------
Mdocs/src/index.md | 125++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 134 insertions(+), 32 deletions(-)

diff --git a/README.md b/README.md @@ -7,7 +7,7 @@ `BazerData.jl` is a placeholder package for some functions that I use in julia frequently. -So far the package provides a four functions +So far the package provides a five functions 1. tabulate some data ([`tabulate`](#tabulate-data)) 2. create category based on quantile ([`xtile`](#xtile)) @@ -107,38 +107,17 @@ panel_fill(df_panel, :id, :t, [:v1, :v2, :v3], ### Leads and lags This is largely "borrowed" (copied) from @FuZhiyu [`PanelShift.jl`](https://github.com/FuZhiyu/PanelShift.jl) package. +See the tests for more examples. ```julia -t, v = [1;2;4], [1;2;3]; -julia> tlag(t, v) # the default lag period is the unitary difference in t, here 1 -3-element Vector{Union{Missing, Int64}}: - missing - 1 - missing - - -julia> tlag(t, v, 2) # we can also specify lags using the third argument -3-element Vector{Union{Missing, Int64}}: - missing - missing - 2 - - -julia> using Dates; -julia> t = [Date(2020,1,1); Date(2020,1,2); Date(2020,1,4)]; -julia> tlag(t, [1, 2, 3]) # customized types of the time vector are also supported -3-element Vector{Union{Missing, Int64}}: - missing - 1 - missing - - -julia> tlag(t, [1, 2, 3], Day(2)) # specify two-day lags -3-element Vector{Union{Missing, Int64}}: - missing - missing - 2 - +x, t = [1, 2, 3], [1, 2, 4] +tlag(x, t) +tlag(x, t, n=2) + +using Dates; +t = [Date(2020,1,1); Date(2020,1,2); Date(2020,1,4)]; +tlag(x, t) +tlag(x, t, n=Day(2)) # specify two-day lags ``` diff --git a/docs/src/index.md b/docs/src/index.md @@ -1,4 +1,127 @@ # BazerData.jl -Useful functions for working with data. +Useful functions for working with data: `BazerData.jl` is a placeholder package for some functions that I use in julia frequently. +So far the package provides a five functions + + 1. tabulate some data ([`tabulate`](#tabulate-data)) + 2. create category based on quantile ([`xtile`](#xtile)) + 3. winsorize some data ([`winsorize`](#winsorize-data)) + 4. fill unbalanced panel data ([`panel_fill`](#filling-an-unbalanced-panel)) + 5. lead and lag functions ([`tlead|tlag`](#leads-and-lags)) + +Note that as the package grow in different directions, dependencies might become overwhelming. +The readme serves as documentation; there might be more examples inside of the test folder. + +## Installation + +`BazerData.jl` is a not yet a registered package. +You can install it from github via +```julia +import Pkg +Pkg.add(url="https://github.com/eloualiche/BazerData.jl") +``` + + +## Usage + +### Tabulate data + +The `tabulate` function tries to emulate the tabulate function from stata (see oneway [here](https://www.stata.com/manuals/rtabulateoneway.pdf) or twoway [here](https://www.stata.com/manuals13/rtabulatetwoway.pdf)). +This relies on the `DataFrames.jl` package and is useful to get a quick overview of the data. + +```julia +using DataFrames +using BazerData +using PalmerPenguins + +df = DataFrame(PalmerPenguins.load()) + +tabulate(df, :island) +tabulate(df, [:island, :species]) + +# If you are looking for groups by type (detect missing e.g.) +df = DataFrame(x = [1, 2, 2, "NA", missing], y = ["c", "c", "b", "z", "d"]) +tabulate(df, [:x, :y], group_type = :type) # only types for all group variables +tabulate(df, [:x, :y], group_type = [:value, :type]) # mix value and types +``` +I have not implemented all the features of the stata tabulate function, but I am open to [suggestions](#3). + + +### xtile + +See the [doc](https://eloualiche.github.io/BazerData.jl/dev/man/xtile_guide) or the [tests](test/UnitTests/xtile.jl) for examples. +```julia +sales = rand(10_000); +a = xtile(sales, 10); +b = xtile(sales, 10, weights=Weights(repeat([1], length(sales))) ); +# works on strings +cities = [randstr() for _ in 10] +xtile(cities, 10) +``` + + +### Winsorize data + +See the doc for [examples](https://eloualiche.github.io/BazerData.jl/dev/man/winsorize_guide) + +This is fairly standard and I offer options to specify probabilities or cutpoints; moreover you can replace the values that are winsorized with a missing, the cutpoints, or some specific values. +There is a [`winsor`](https://juliastats.org/StatsBase.jl/stable/robust/#StatsBase.winsor) function in StatsBase.jl but I think it's a little less full-featured. + +See the doc for [examples](https://eloualiche.github.io/BazerData.jl/dev/man/winsorize_guide) +```julia +df = DataFrame(PalmerPenguins.load()) +winsorize(df.flipper_length_mm, probs=(0.05, 0.95)) # skipmissing by default +transform(df, :flipper_length_mm => + (x->winsorize(x, probs=(0.05, 0.95), replace_value=missing)), renamecols=false) +``` + + +### Filling an unbalanced panel + +Sometimes it is unpractical to work with unbalanced panel data. +There are many ways to fill values between dates (what interpolation to use) and I try to implement a few of them. +I use the function sparingly, so it has not been tested extensively. + +See the following example (or the test suite) for more information. +```julia +df_panel = DataFrame( # missing t=2 for id=1 + id = ["a","a", "b","b", "c","c","c", "d","d","d","d"], + t = [Date(1990, 1, 1), Date(1990, 4, 1), Date(1990, 8, 1), Date(1990, 9, 1), + Date(1990, 1, 1), Date(1990, 2, 1), Date(1990, 4, 1), + Date(1999, 11, 10), Date(1999, 12, 21), Date(2000, 2, 5), Date(2000, 4, 1)], + v1 = [1,1, 1,6, 6,0,0, 1,4,11,13], + v2 = [1,2,3,6,6,4,5, 1,2,3,4], + v3 = [1,5,4,6,6,15,12.25, 21,22.5,17.2,1]) + +panel_fill(df_panel, :id, :t, [:v1, :v2, :v3], + gap=Month(1), method=:backwards, uniquecheck=true, flag=true, merge=true) +panel_fill(df_panel, :id, :t, [:v1, :v2, :v3], + gap=Month(1), method=:forwards, uniquecheck=true, flag=true, merge=true) +panel_fill(df_panel, :id, :t, [:v1, :v2, :v3], + gap=Month(1), method=:linear, uniquecheck=true, flag=true, merge=true) +``` + +### Leads and lags +This is largely "borrowed" (copied) from @FuZhiyu [`PanelShift.jl`](https://github.com/FuZhiyu/PanelShift.jl) package. +See the tests for more examples. + +```julia +x, t = [1, 2, 3], [1, 2, 4] +tlag(x, t) +tlag(x, t, n=2) + +using Dates; +t = [Date(2020,1,1); Date(2020,1,2); Date(2020,1,4)]; +tlag(x, t) +tlag(x, t, n=Day(2)) # specify two-day lags +``` + + +## Other stuff + + +See my other package + - [BazerUtils.jl](https://github.com/eloualiche/BazerUtils.jl) which groups together data wrangling functions. + - [FinanceRoutines.jl](https://github.com/eloualiche/FinanceRoutines.jl) which is more focused and centered on working with financial data. + - [TigerFetch.jl](https://github.com/eloualiche/TigerFetch.jl) which simplifies downloading shape files from the Census.