# BazerData.jl

Useful functions for working with data: `BazerData.jl` is a placeholder package for some functions that I use frequently in Julia.

So far the package provides five functions:

1. tabulate some data ([`tabulate`](#tabulate-data))
2. create categories based on quantiles ([`xtile`](#xtile))
3. winsorize some data ([`winsorize`](#winsorize-data))
4. fill unbalanced panel data ([`panel_fill`](#filling-an-unbalanced-panel))
5. lead and lag functions ([`tlead|tlag`](#leads-and-lags))

Note that as the package grows in different directions, dependencies might become overwhelming.
The readme serves as documentation; there might be more examples inside the test folder.

## Installation

`BazerData.jl` is a registered package.
You can install it via
```julia
import Pkg
Pkg.add("BazerData")
```


## Usage

### Tabulate data

The `tabulate` function tries to emulate the `tabulate` command from Stata (see oneway [here](https://www.stata.com/manuals/rtabulateoneway.pdf) or twoway [here](https://www.stata.com/manuals13/rtabulatetwoway.pdf)).
This relies on the `DataFrames.jl` package and is useful to get a quick overview of the data.

```julia
using DataFrames
using BazerData
using PalmerPenguins

df = DataFrame(PalmerPenguins.load())

tabulate(df, :island)              # one-way table
tabulate(df, [:island, :species])  # two-way table

# Group by type instead of value (e.g. to detect missing values)
df = DataFrame(x = [1, 2, 2, "NA", missing], y = ["c", "c", "b", "z", "d"])
tabulate(df, [:x, :y], group_type = :type)            # only types for all group variables
tabulate(df, [:x, :y], group_type = [:value, :type])  # mix value and types
```
I have not implemented all the features of the Stata tabulate function, but I am open to suggestions.
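
For intuition about what a one-way table reports, here is a minimal sketch in plain Julia with no packages. `oneway_table` is a hypothetical helper written only for this illustration; it is not part of `BazerData.jl` and does not reproduce the package's output format. It only computes the frequency, percent, and cumulative-percent columns that Stata's one-way `tabulate` displays.

```julia
# Hypothetical helper (not part of BazerData.jl): one-way frequency table.
function oneway_table(values)
    # Count occurrences of each distinct value (missing is a valid Dict key)
    counts = Dict{Any,Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    total = length(values)

    # Order rows by descending frequency; break ties on the printed value
    rows = sort(collect(counts); by = p -> (-p.second, string(p.first)))

    # Build the table with percent and cumulative percent columns
    cum = 0.0
    table = NamedTuple[]
    for (value, freq) in rows
        pct = 100 * freq / total
        cum += pct
        push!(table, (value = value, freq = freq, pct = pct, cum = cum))
    end
    return table
end

islands = ["Biscoe", "Dream", "Biscoe", "Torgersen", "Dream", "Biscoe"]
for row in oneway_table(islands)
    println(row)
end
```

The sketch works on any iterable, including vectors with `missing`, which is counted as its own category here.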

### xtile

See the [doc](https://eloualiche.github.io/BazerData.jl/dev/man/xtile_guide) or the [tests](https://github.com/eloualiche/BazerData.jl/blob/main/test/UnitTests/xtile.jl) for examples.
```julia
using StatsBase  # provides Weights
using Random     # provides randstring

sales = rand(10_000);
a = xtile(sales, 10);
b = xtile(sales, 10, weights=Weights(repeat([1], length(sales))) );
# works on strings
cities = [randstring() for _ in 1:10]
xtile(cities, 10)
```


### Winsorize data

This is fairly standard and I offer options to specify probabilities or cutpoints; moreover you can replace the values that are winsorized with a missing, the cutpoints, or some specific values.
There is a [`winsor`](https://juliastats.org/StatsBase.jl/stable/robust/#StatsBase.winsor) function in StatsBase.jl but I think it's a little less full-featured.

See the doc for [examples](https://eloualiche.github.io/BazerData.jl/dev/man/winsorize_guide).
```julia
df = DataFrame(PalmerPenguins.load())
winsorize(df.flipper_length_mm, probs=(0.05, 0.95)) # skipmissing by default
transform(df, :flipper_length_mm =>
    (x->winsorize(x, probs=(0.05, 0.95), replace_value=missing)), renamecols=false)
```


### Filling an unbalanced panel

Sometimes it is impractical to work with unbalanced panel data.
There are many ways to fill values between dates (what interpolation to use) and I try to implement a few of them.
I use the function sparingly, so it has not been tested extensively.

See the following example (or the test suite) for more information.
```julia
using Dates

df_panel = DataFrame(        # unbalanced: e.g. id "a" has no rows between 1990-01-01 and 1990-04-01
    id = ["a","a", "b","b", "c","c","c", "d","d","d","d"],
    t  = [Date(1990, 1, 1), Date(1990, 4, 1), Date(1990, 8, 1), Date(1990, 9, 1),
          Date(1990, 1, 1), Date(1990, 2, 1), Date(1990, 4, 1),
          Date(1999, 11, 10), Date(1999, 12, 21), Date(2000, 2, 5), Date(2000, 4, 1)],
    v1 = [1,1, 1,6, 6,0,0, 1,4,11,13],
    v2 = [1,2,3,6,6,4,5, 1,2,3,4],
    v3 = [1,5,4,6,6,15,12.25, 21,22.5,17.2,1])

panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:backwards, uniquecheck=true, flag=true)
panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:forwards, uniquecheck=true, flag=true)
panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:linear, uniquecheck=true, flag=true)
```

### Leads and lags
This is largely "borrowed" (copied) from @FuZhiyu's [`PanelShift.jl`](https://github.com/FuZhiyu/PanelShift.jl) package.
See the tests for more examples.

```julia
x, t = [1, 2, 3], [1, 2, 4]
tlag(x, t)
tlag(x, t, n=2)

using Dates;
t = [Date(2020,1,1); Date(2020,1,2); Date(2020,1,4)];
tlag(x, t)
tlag(x, t, n=Day(2)) # specify two-day lags
```


## Other stuff


See my other packages:
  - [BazerUtils.jl](https://github.com/eloualiche/BazerUtils.jl) which groups together data wrangling functions.
  - [FinanceRoutines.jl](https://github.com/eloualiche/FinanceRoutines.jl) which is more focused and centered on working with financial data.
  - [TigerFetch.jl](https://github.com/eloualiche/TigerFetch.jl) which simplifies downloading shape files from the Census.