# BazerData

[CI](https://github.com/louloulibs/BazerData.jl/actions/workflows/CI.yml)
[codecov](https://codecov.io/gh/LouLouLibs/BazerData.jl)

`BazerData.jl` is a placeholder package for some functions that I use frequently in Julia.

So far the package provides five sets of functions to:

1. tabulate some data ([`tabulate`](#tabulate-data))
2. create categories based on quantiles ([`xtile`](#xtile))
3. winsorize some data ([`winsorize`](#winsorize-data))
4. fill unbalanced panel data ([`panel_fill`](#filling-an-unbalanced-panel))
5. compute leads and lags ([`tlead|tlag`](#leads-and-lags))

Note that as the package grows in different directions, dependencies might become overwhelming.
The readme serves as documentation; there might be more examples inside the test folder.

## Installation

`BazerData.jl` is now a registered package. You can install it from the main Julia registry via the Julia package manager:
```julia
julia> import Pkg; Pkg.add("BazerData")
# or in package mode in the REPL
pkg> add BazerData
# or from the main branch on GitHub
julia> import Pkg; Pkg.add(url="https://github.com/louloulibs/BazerData.jl", rev="main")
```

## Usage

### Tabulate data

The `tabulate` function tries to emulate the `tabulate` command from Stata (see oneway [here](https://www.stata.com/manuals/rtabulateoneway.pdf) or twoway [here](https://www.stata.com/manuals13/rtabulatetwoway.pdf)).
It relies on the `DataFrames.jl` package and is useful for getting a quick overview of the data.

```julia
using DataFrames
using BazerData
using PalmerPenguins

df = DataFrame(PalmerPenguins.load())

tabulate(df, :island)
tabulate(df, [:island, :species])

# If you are looking for groups by type (e.g. to detect missing values)
df = DataFrame(x = [1, 2, 2, "NA", missing], y = ["c", "c", "b", "z", "d"])
tabulate(df, [:x, :y], group_type = :type)            # only types for all group variables
tabulate(df, [:x, :y], group_type = [:value, :type])  # mix values and types
```
I have not implemented all the features of the Stata `tabulate` command, but I am open to [suggestions](#3).

### xtile

See the [doc](https://louloulibs.github.io/BazerData.jl/dev/man/xtile_guide) or the [tests](test/UnitTests/xtile.jl) for examples.
```julia
using StatsBase  # for Weights
using Random     # for randstring

sales = rand(10_000);
a = xtile(sales, 10);
b = xtile(sales, 10, weights=Weights(repeat([1], length(sales))));
# works on strings
cities = [randstring() for _ in 1:10]
xtile(cities, 10)
```

### Winsorize data

This is fairly standard. I offer options to specify either probabilities or cutpoints; moreover, you can replace the winsorized values with `missing`, with the cutpoints, or with specific values of your choice.
There is a [`winsor`](https://juliastats.org/StatsBase.jl/stable/robust/#StatsBase.winsor) function in StatsBase.jl, but I think it is a little less full-featured.

See the doc for [examples](https://louloulibs.github.io/BazerData.jl/dev/man/winsorize_guide).
```julia
df = DataFrame(PalmerPenguins.load())
winsorize(df.flipper_length_mm, probs=(0.05, 0.95)) # skipmissing by default
transform(df, :flipper_length_mm =>
    (x -> winsorize(x, probs=(0.05, 0.95), replace_value=missing)), renamecols=false)
```

### Filling an unbalanced panel

Sometimes it is impractical to work with unbalanced panel data.
There are many ways to fill values between dates (that is, many choices of interpolation), and I try to implement a few of them.
I use the function sparingly, so it has not been tested extensively.

See the following example (or the test suite) for more information.
```julia
using Dates  # for Date and Month

df_panel = DataFrame(        # missing t=2 for id=1
    id = ["a","a", "b","b", "c","c","c", "d","d","d","d"],
    t  = [Date(1990, 1, 1), Date(1990, 4, 1), Date(1990, 8, 1), Date(1990, 9, 1),
          Date(1990, 1, 1), Date(1990, 2, 1), Date(1990, 4, 1),
          Date(1999, 11, 10), Date(1999, 12, 21), Date(2000, 2, 5), Date(2000, 4, 1)],
    v1 = [1,1, 1,6, 6,0,0, 1,4,11,13],
    v2 = [1,2,3,6,6,4,5, 1,2,3,4],
    v3 = [1,5,4,6,6,15,12.25, 21,22.5,17.2,1])

panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:backwards, uniquecheck=true, flag=true, merge=true)
panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:forwards, uniquecheck=true, flag=true, merge=true)
panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:linear, uniquecheck=true, flag=true, merge=true)
```

### Leads and lags

This is largely "borrowed" (copied) from @FuZhiyu's [`PanelShift.jl`](https://github.com/FuZhiyu/PanelShift.jl) package.
See the tests for more examples.

```julia
x, t = [1, 2, 3], [1, 2, 4]
tlag(x, t)
tlag(x, t, n=2)

using Dates;
t = [Date(2020,1,1); Date(2020,1,2); Date(2020,1,4)];
tlag(x, t)
tlag(x, t, n=Day(2)) # specify two-day lags
```

## Other stuff

See my other packages:
- [BazerUtils.jl](https://github.com/louloulibs/BazerUtils.jl), which groups together data wrangling functions.
- [FinanceRoutines.jl](https://github.com/louloulibs/FinanceRoutines.jl), which is more focused and centered on working with financial data.
- [TigerFetch.jl](https://github.com/louloulibs/TigerFetch.jl), which simplifies downloading shape files from the Census.