BazerData.jl

Data manipulation utilities for Julia

# BazerData

[![CI](https://github.com/louloulibs/BazerData.jl/actions/workflows/CI.yml/badge.svg)](https://github.com/louloulibs/BazerData.jl/actions/workflows/CI.yml)
[![Lifecycle:Experimental](https://img.shields.io/badge/Lifecycle-Experimental-339999)](https://github.com/louloulibs/BazerData.jl/actions/workflows/CI.yml)
[![codecov](https://codecov.io/gh/LouLouLibs/BazerData.jl/graph/badge.svg?token=AQR1GHLLHG)](https://codecov.io/gh/LouLouLibs/BazerData.jl)

`BazerData.jl` is a placeholder package for some functions that I use frequently in Julia.

So far the package provides five functions:

  1. tabulate data ([`tabulate`](#tabulate-data))
  2. create categories based on quantiles ([`xtile`](#xtile))
  3. winsorize data ([`winsorize`](#winsorize-data))
  4. fill unbalanced panel data ([`panel_fill`](#filling-an-unbalanced-panel))
  5. compute leads and lags ([`tlead|tlag`](#leads-and-lags))

Note that as the package grows in different directions, its dependencies might become overwhelming.
The README serves as documentation; there might be more examples in the test folder.

## Installation

`BazerData.jl` is now a registered package. You can install it from the main Julia registry via the Julia package manager:
```julia
julia> import Pkg; Pkg.add("BazerData")
# or in package mode in the REPL
pkg> add BazerData
# or from the main branch on GitHub
julia> import Pkg; Pkg.add(url="https://github.com/louloulibs/BazerData.jl", rev="main")
```

## Usage

### Tabulate data

The `tabulate` function tries to emulate the `tabulate` command from Stata (see one-way [here](https://www.stata.com/manuals/rtabulateoneway.pdf) or two-way [here](https://www.stata.com/manuals13/rtabulatetwoway.pdf)).
It relies on the `DataFrames.jl` package and is useful for getting a quick overview of the data.

```julia
using DataFrames
using BazerData
using PalmerPenguins

df = DataFrame(PalmerPenguins.load())

tabulate(df, :island)
tabulate(df, [:island, :species])

# If you want to group by type (e.g. to detect missing values)
df = DataFrame(x = [1, 2, 2, "NA", missing], y = ["c", "c", "b", "z", "d"])
tabulate(df, [:x, :y], group_type = :type) # only types for all group variables
tabulate(df, [:x, :y], group_type = [:value, :type]) # mix value and types
```
I have not implemented all the features of the Stata `tabulate` command, but I am open to [suggestions](#3).
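
To make clear what a one-way tabulation reports, here is a minimal sketch in plain Julia. It is not how `tabulate` is implemented, and `oneway_table` is a hypothetical helper that does not exist in the package; it only counts each distinct value and converts the counts to percentages.

```julia
# Hypothetical helper (not part of BazerData): frequency and percent per distinct value.
function oneway_table(values)
    counts = Dict{eltype(values), Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    total = length(values)
    rows = [(value = k, freq = c, pct = 100 * c / total) for (k, c) in counts]
    # Most frequent values first
    return sort(rows; by = r -> r.freq, rev = true)
end

islands = ["Biscoe", "Dream", "Biscoe", "Torgersen", "Biscoe", "Dream"]
table = oneway_table(islands)
for row in table
    println(rpad(row.value, 12), lpad(row.freq, 4), lpad(round(row.pct; digits = 1), 8))
end
```

The package function works on a `DataFrame` and handles several grouping columns; the sketch only covers the single-vector case.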

### xtile

See the [doc](https://louloulibs.github.io/BazerData.jl/dev/man/xtile_guide) or the [tests](test/UnitTests/xtile.jl) for examples.
```julia
using StatsBase  # for Weights
using Random     # for randstring

sales = rand(10_000);
a = xtile(sales, 10);
b = xtile(sales, 10, weights=Weights(repeat([1], length(sales))) );
# works on strings
cities = [randstring() for _ in 1:10]
xtile(cities, 10)
```
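
The idea behind quantile categories can be sketched with `Statistics.quantile`. This is only an illustration under my own assumptions: `quantile_bins` is a hypothetical name, bins are labelled 1 to `n`, and weights are ignored. `xtile` itself may use a different labelling or tie-breaking convention.

```julia
using Statistics

# Hypothetical helper (not part of BazerData): assign each value to one of n quantile bins.
function quantile_bins(x::AbstractVector{<:Real}, n::Integer)
    # n - 1 interior cutpoints at probabilities 1/n, 2/n, ..., (n-1)/n
    cuts = quantile(x, (1:n-1) ./ n)
    # Bin label = index of the first cutpoint that is >= the value (n if none is)
    return [searchsortedfirst(cuts, v) for v in x]
end

x = collect(1.0:100.0)
bins = quantile_bins(x, 4)
counts = [count(==(k), bins) for k in 1:4]  # observations per bin
```

With 100 evenly spaced values and 4 bins, each bin should receive the same number of observations.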

### Winsorize data

This is fairly standard, and I offer options to specify either probabilities or cutpoints. Moreover, you can replace the winsorized values with `missing`, the cutpoints, or some specific values.
There is a [`winsor`](https://juliastats.org/StatsBase.jl/stable/robust/#StatsBase.winsor) function in StatsBase.jl, but I think it is a little less full-featured.

See the doc for [examples](https://louloulibs.github.io/BazerData.jl/dev/man/winsorize_guide).
```julia
df = DataFrame(PalmerPenguins.load())
winsorize(df.flipper_length_mm, probs=(0.05, 0.95)) # skipmissing by default
transform(df, :flipper_length_mm =>
    (x->winsorize(x, probs=(0.05, 0.95), replace_value=missing)), renamecols=false)
```
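
For readers unfamiliar with the operation, winsorizing with the cutpoints as replacement values amounts to clamping the data to two quantiles. The sketch below shows that idea in plain Julia. `winsorize_sketch` is a hypothetical helper, not the package's implementation, and it assumes there are no missing values.

```julia
using Statistics

# Hypothetical helper (not part of BazerData): clamp values to the quantiles given by probs.
function winsorize_sketch(x::AbstractVector{<:Real}; probs = (0.05, 0.95))
    lo, hi = quantile(x, collect(probs))
    # Values below lo become lo, values above hi become hi
    return clamp.(x, lo, hi)
end

x = collect(1.0:100.0)
w = winsorize_sketch(x)
println(extrema(x), " => ", extrema(w))
```

Values inside the two cutpoints are left unchanged; only the tails are pulled in.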

### Filling an unbalanced panel

Sometimes it is impractical to work with unbalanced panel data.
There are many ways to fill values between dates (which interpolation to use), and I try to implement a few of them.
I use the function sparingly, so it has not been tested extensively.

See the following example (or the test suite) for more information.
```julia
using Dates

df_panel = DataFrame(        # unbalanced panel, e.g. id "a" has no rows between 1990-01 and 1990-04
    id = ["a","a", "b","b", "c","c","c", "d","d","d","d"],
    t  = [Date(1990, 1, 1), Date(1990, 4, 1), Date(1990, 8, 1), Date(1990, 9, 1),
          Date(1990, 1, 1), Date(1990, 2, 1), Date(1990, 4, 1),
          Date(1999, 11, 10), Date(1999, 12, 21), Date(2000, 2, 5), Date(2000, 4, 1)],
    v1 = [1,1, 1,6, 6,0,0, 1,4,11,13],
    v2 = [1,2,3,6,6,4,5, 1,2,3,4],
    v3 = [1,5,4,6,6,15,12.25, 21,22.5,17.2,1])

panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:backwards, uniquecheck=true, flag=true, merge=true)
panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:forwards, uniquecheck=true, flag=true, merge=true)
panel_fill(df_panel, :id, :t, [:v1, :v2, :v3],
    gap=Month(1), method=:linear, uniquecheck=true, flag=true, merge=true)
```
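
As an illustration of what filling a gap can mean, here is a minimal sketch for a single panel unit that builds a regular date grid and carries the last observed value forward. `fill_forward` is a hypothetical helper, and whether it matches the package's `:forwards` method exactly is an assumption on my part.

```julia
using Dates

# Hypothetical helper (not part of BazerData): expand one unit's observations to a
# regular grid and carry the last observed value forward. Assumes sorted dates.
function fill_forward(dates::AbstractVector{Date}, values::AbstractVector; gap = Month(1))
    issorted(dates) || throw(ArgumentError("dates must be sorted"))
    grid = collect(first(dates):gap:last(dates))
    filled = similar(values, length(grid))
    j = 1  # index of the latest observation at or before the current grid date
    for (i, d) in enumerate(grid)
        while j < length(dates) && dates[j + 1] <= d
            j += 1
        end
        filled[i] = values[j]
    end
    return grid, filled
end

# id "a" from the example above: observed in January and April 1990 only
grid, filled = fill_forward([Date(1990, 1, 1), Date(1990, 4, 1)], [1, 5])
```

Backward or linear filling would differ only in which neighbouring observations are used for the grid dates inside a gap.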

### Leads and lags

This is largely "borrowed" (copied) from @FuZhiyu's [`PanelShift.jl`](https://github.com/FuZhiyu/PanelShift.jl) package.
See the tests for more examples.

```julia
x, t = [1, 2, 3], [1, 2, 4]
tlag(x, t)
tlag(x, t, n=2)

using Dates
t = [Date(2020,1,1); Date(2020,1,2); Date(2020,1,4)];
tlag(x, t)
tlag(x, t, n=Day(2)) # specify two-day lags
```
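
The point of a time-aware lag is that it looks up the value observed `n` time units earlier instead of simply shifting the vector by one position. The sketch below shows that logic in plain Julia. `lag_by_time` is a hypothetical helper written for illustration, not the package's `tlag`, and it assumes the time stamps are unique.

```julia
using Dates

# Hypothetical helper (not part of BazerData): value of x observed at time t - n,
# or missing when no observation exists at that time.
function lag_by_time(x, t; n = 1)
    lookup = Dict(zip(t, x))
    return [get(lookup, ti - n, missing) for ti in t]
end

x = [1, 2, 3]
int_lag = lag_by_time(x, [1, 2, 4])        # no observation at t = 3, so the last entry is missing

dates = [Date(2020, 1, 1), Date(2020, 1, 2), Date(2020, 1, 4)]
date_lag = lag_by_time(x, dates; n = Day(2))  # only 2020-01-04 has an observation two days earlier
```

A plain positional shift would instead return the previous element regardless of the gap in `t`.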

## Other stuff

See my other packages:
  - [BazerUtils.jl](https://github.com/louloulibs/BazerUtils.jl) which groups together data wrangling functions.
  - [FinanceRoutines.jl](https://github.com/louloulibs/FinanceRoutines.jl) which is more focused and centered on working with financial data.
  - [TigerFetch.jl](https://github.com/louloulibs/TigerFetch.jl) which simplifies downloading shape files from the Census.