Title: Utilities for Streamlined Data Import, Imputation and Modelling
Description: Provides functions streamlining the data analysis workflow: Outsourcing data import, renaming and type casting to a *.csv. Manipulating imputed datasets and fitting models on them. Summarizing models.
Authors: J. Peter Marquardt [aut, cre], Till D. Best [aut]
Maintainer: J. Peter Marquardt <[email protected]>
License: GPL (>= 3)
Version: 1.1.5
Built: 2024-11-12 05:58:17 UTC
Source: https://github.com/codeblue-team/basecamb
A helper function to scale a variable in a dataframe. Divides 'variable' by 'scaling_denominator'.
.scale_variable(data, variable, scaling_denominator)

Arguments:
data: a data.frame
variable: a character indicating the variable to be scaled
scaling_denominator: a numeric indicating the scaling; the variable is divided by scaling_denominator

Value: the input dataframe with the newly scaled 'variable'
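A minimal usage sketch: the leading dot suggests an internal helper, so it is assumed here to be unexported and accessed with basecamb:::, and the column name height_cm is made up for illustration.

# hypothetical data; divide the height_cm column by 100 to obtain metres
df <- data.frame(height_cm = c(170, 185, 162))
df <- basecamb:::.scale_variable(df, variable = "height_cm", scaling_denominator = 100)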
Use a data dictionary data.frame to apply the following tidying steps to your data.frame:
Remove superfluous columns
Rename columns
Ensure/coerce correct data type for each column
Assign factorial levels, including renaming and grouping
apply_data_dictionary(data, data_dictionary, na_action_default = "keep_NA", print_coerced_NA = TRUE)

Arguments:
data: data.frame to be cleaned
data_dictionary: data.frame with the required dictionary columns
na_action_default: character specifying what to do with NA values. Defaults to 'keep_NA'.
print_coerced_NA: logical indicating whether a message specifying the location of NAs introduced into data by apply_data_dictionary() should be printed

Value: clean data.frame

Author: J. Peter Marquardt
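A sketch of the intended dictionary-driven workflow. The dictionary column names used below (old_column_name, new_column_name, new_data_type, coding) are assumptions for illustration only and are not verified against the package; consult the package's data dictionary documentation for the exact required columns.

# hypothetical dictionary with assumed column names
dict <- data.frame(old_column_name = c("ht", "age"),
                   new_column_name = c("height_cm", "age"),
                   new_data_type = c("numeric", "integer"),
                   coding = c(NA, NA))
raw <- data.frame(ht = c("170", "185"), age = c(34, 51), junk = 1:2)  # 'junk' is not in the dictionary and would be dropped
clean <- apply_data_dictionary(raw, dict)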
Wrapper function to apply a function to each dataframe in an imputed dataset created with mice::mice().
apply_function_to_imputed_data(mice_data, fun, ...)

Arguments:
mice_data: a mids object generated by mice::mice()
fun: the function to apply to each dataframe; may only take one positional argument of type data.frame
...: other arguments passed to fun()

Value: a mids object with the transformed data.

Author: J. Peter Marquardt
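A minimal sketch, assuming a transformation that modifies an existing column in every imputed dataset:

imputed_data <- mice::mice(airquality, printFlag = FALSE)
# fun takes a single data.frame and returns it; here it rescales Wind in each completed dataset
rescale_wind <- function(df) {
  df$Wind <- df$Wind / 10
  df
}
transformed <- apply_function_to_imputed_data(imputed_data, rescale_wind)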
Use a named vector of keys (current value) and values for factorial columns to assign meaningful levels and/or group levels
assign_factorial_levels(data, factor_keys_values, na_action_default = "keep_NA")

Arguments:
data: data.frame to modify
factor_keys_values: named list of named vectors; each list element is named for a factor column and maps current levels (keys) to new levels (values)
na_action_default: character specifying what to do with NA values. Defaults to 'keep_NA'.

Value: data frame with new levels

Author: J. Peter Marquardt
Examples:
data <- data.frame(col1 = as.factor(rep(c('1', '2', '4'), 5)))
keys_1 <- list('col1' = c('1' = 'One', '2' = 'Two', '4' = 'Four'))
data_1 <- assign_factorial_levels(data, keys_1)
keys_2 <- list('col1' = c('1' = 'One', 'default' = 'Not_One'))
data_2 <- assign_factorial_levels(data, keys_2)
Verbosely assign tidy name and data type for each column of a data.frame and get rid of superfluous columns. Uses a .csv file for assignments to encourage a data dictionary based workflow. CAVE! Requires 'Date' type columns to already be read in as Date.
assign_types_names(data, meta_data)

Arguments:
data: data.frame to be tidied. Dates must already be of type Date.
meta_data: data.frame specifying old column names, new column names and data types of the columns in data

Value: clean data.frame

Author: J. Peter Marquardt
Build formula used in statistical models from vectors of strings with the option to specify an environment.
build_model_formula(outcome, predictors, censor_event = NULL, env = parent.frame())

Arguments:
outcome: character denoting the column with the outcome
predictors: vector of characters denoting the columns with the predictors
censor_event: character denoting the column with the censoring event, for use in survival-type models
env: environment to be used in formula creation

Value: formula for use in statistical models

Author: J. Peter Marquardt
Examples:
build_model_formula("outcome", c("pred_1", "pred_2"))
build_model_formula("outcome", c("pred_1", "pred_2"), censor_event = "cens_event")
Constructs a model and conducts a cox.zph test for each imputation of the data set.
cox.zph.mids(model, imputations, p_level = 0.05, global_only = TRUE, return_raw = FALSE, p_only = TRUE, verbose = TRUE)

Arguments:
model: Cox proportional hazards model to be evaluated
imputations: mids object containing the imputations
p_level: value below which violation of the proportional hazards assumption is assumed. Defaults to 0.05.
global_only: return the global p-value only. Implies p_only = TRUE.
return_raw: return the cox.zph objects in a list. If TRUE, the function will not return anything else.
p_only: return p-values of the test only. If FALSE, chi-squared statistics and degrees of freedom are returned as well.
verbose: set to FALSE to deactivate messages
Value: depending on the specified options, this function returns
default: a vector of global p-values
global_only = FALSE: a data.frame with p-values for all variables plus the global test
return_raw = TRUE: a list of cox.zph objects

Author: J. Peter Marquardt
Examples:
data <- data.frame(time = 101:200, status = rep(c(0, 1), 50), pred = rep(c(1:9, NA), 10))
imputed_data <- mice::mice(data)
cox_mod <- Hmisc::fit.mult.impute(survival::Surv(time, status) ~ pred, fitter = rms::cph, xtrans = imputed_data)
cox.zph.mids(cox_mod, imputed_data)
Deconstruct a formula object into strings of its components. Predictors are split by '+', so interaction terms will be returned as a single string.
deconstruct_formula(formula)

Arguments:
formula: formula object for use in statistical models

Value: a named list with fields:
outcome (character)
predictors (vector of characters)
censor_event (character, optional): censoring event, only for formulas including a Surv() object

Author: J. Peter Marquardt
Examples:
deconstruct_formula(stats::as.formula("outcome ~ predictor1 + predictor2 + predictor3"))
deconstruct_formula(stats::as.formula("Surv(outcome, censor_event) ~ predictor"))
Filter a dataframe for the nth entry of each subject in it. A typical use case would be to filter a dataset for the first or last measurement of a subject.
filter_nth_entry(data, ID_column, entry_column, n = 1, reverse_order = FALSE)

Arguments:
data: the data.frame to filter
ID_column: character; column identifying subjects
entry_column: character; column identifying the order of entries. This column can be of type Date, numeric, or any other type suitable for order().
n: integer; number of the entry to keep after ordering
reverse_order: logical; when TRUE, sorts entries from last to first before filtering

Value: data.frame with <= 1 entry per subject

Author: J. Peter Marquardt
Examples:
data <- data.frame(list(ID = rep(1:5, 3), encounter = rep(1:3, each = 5), value = rep(4:6, each = 5)))
filter_nth_entry(data, 'ID', 'encounter')
filter_nth_entry(data, 'ID', 'encounter', n = 2)
filter_nth_entry(data, 'ID', 'encounter', reverse_order = TRUE)
This function is a wrapper for fitting models with Hmisc::fit.mult.impute() on a multiply imputed dataset generated with mice::mice(). Cases with a missing outcome in the original dataset are removed from the mids object by using the "subset" argument of Hmisc::fit.mult.impute().
fit_mult_impute_obs_outcome(mids, formula, fitter, ...)

Arguments:
mids: a mids object, i.e. the imputed dataset
formula: a formula that describes the model to be fit. The outcome (y variable) in the formula will be used to remove missing cases.
fitter: a modeling function (not in quotes) that is compatible with Hmisc::fit.mult.impute()
...: additional arguments passed to Hmisc::fit.mult.impute()

Value: mod, a fit.mult.impute object

Author: Till D. Best
Examples:
# create an imputed dataset
imputed_data <- mice::mice(airquality)
fit_mult_impute_obs_outcome(mids = imputed_data, formula = Ozone ~ Solar.R + Wind, fitter = glm)
This function summarises regression models that return estimates on the log-odds scale and returns a dataframe with estimates and confidence intervals as odds ratios. P values are also provided.

Additionally, intercepts can be removed from the summary. This comes in handy when ordinal logistic regression models are fit. Ordinal regression models (such as proportional odds models) usually result in many intercepts that are not really of interest.

This function is also compatible with models obtained from multiply imputed datasets, for example models fitted with Hmisc::fit.mult.impute().
or_model_summary(model, conf_int = 1.96, print_intercept = FALSE, round_est = 3, round_p = 4)

Arguments:
model: a model object with estimates on the log-odds scale
conf_int: a numeric used to calculate the confidence intervals. The default of 1.96 gives the 95% confidence interval.
print_intercept: a logical flag indicating whether intercepts shall be removed. All variables that start with "y>=" will be removed. If there is a variable matching this pattern, it will also be removed!
round_est: the number of decimals returned for estimates (odds ratios) and confidence intervals
round_p: the number of decimals provided for p-values
Details: CAVE! The function does not check whether your estimates are on the log-odds scale. It will do the transformation no matter what!

Value: a dataframe with the adjusted odds ratios, confidence intervals and p-values

Author: Till D. Best
Examples:
# fit a logistic model
mod <- glm(formula = am ~ mpg + cyl, data = mtcars, family = binomial())
or_model_summary(model = mod)
Parse date columns in a data.frame as Date. Use a named list to specify each date column (key) and the format (value) it is coded in.
parse_date_columns(data, date_formats)

Arguments:
data: data.frame to modify
date_formats: named list mapping each date column (key) to the format (value) it is coded in

Value: data.frame with date columns of type Date

Author: J. Peter Marquardt
Examples:
data <- data.frame(date = rep('01/23/4567', 5))
data <- parse_date_columns(data, list(date = '%m/%d/%Y'))
Transforms a numeric vector into quantile groups. For each input value, the output value corresponds to the quantile that value is in. When grouping into n quantiles, the lowest 1/n of values are assigned 1, the highest 1/n are assigned n.
quantile_group(data, n, na.rm = TRUE)

Arguments:
data: a vector of type numeric with values to be grouped into quantiles
n: integer indicating the number of quantiles, minimum of 2. Must be smaller than length(data).
na.rm: logical; if TRUE, all NA values will be removed before calculating groups; if FALSE, no NA values are permitted
Details: Tied values will be assigned to the lower quantile group rather than estimating a distribution. In extreme cases this can mean one or more quantile groups are not represented. If uneven group sizes cannot be avoided, values will be assigned to the higher quantile group.

Value: vector of length length(data) with the quantile groups

Author: J. Peter Marquardt
Examples:
quantile_group(10:1, 3)
quantile_group(c(rep(1, 3), 10:1, NA), 5)
Removes rows that are duplicates of another row in all columns except exclude_columns
remove_duplicates(data, exclude_columns = NULL, ID_column = NULL, quiet = FALSE)

Arguments:
data: data.frame to check
exclude_columns: character vector; these columns are not considered in determining whether two rows are equal
ID_column: character; column with identifiers to scan if possible duplicates remain
quiet: logical; should messages be printed?

Details: Wraps unique().

Value: vector of row indices with non-unique data

Author: J. Peter Marquardt
Examples:
data <- data.frame(Study_ID = c("A", "B", "C"), ID = c(123, 456, 123), num_cars = c(10, 2, 10))
remove_duplicates(data, exclude_columns = "Study_ID")
remove_duplicates(data, exclude_columns = "Study_ID", ID_column = "ID")
Deprecated, use apply_function_to_imputed_data() instead.
remove_missing_from_mids(mids, var)

Arguments:
mids: mids object that is filtered
var: a string or vector of strings specifying the variable(s). All cases (i.e. rows) for which there are missing values are removed.
Details: remove_missing_from_mids is used to filter a mids object, removing cases with missing values of var in the original dataset. This is useful for situations where you want to use as many observations as possible for imputation but only fit your model on a subset of these, or if you want to create one large imputed dataset from which multiple analyses with multiple outcomes are derived.

Value: a mids object filtered for observed cases of var.

Author: Till D. Best

See also: apply_function_to_imputed_data
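A minimal sketch (the function is deprecated; apply_function_to_imputed_data is the recommended replacement):

# impute on the full airquality data, then keep only cases whose Ozone was observed originally
imputed_data <- mice::mice(airquality, printFlag = FALSE)
imputed_obs_ozone <- remove_missing_from_mids(imputed_data, var = "Ozone")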
This function linearly scales variables in data objects according to a data dictionary. The data dictionary has at least two columns, "variable" and "scaling_denominator". "Variable" is divided by "scaling_denominator".
scale_continuous_predictors(data, scaling_dictionary)

Arguments:
data: a data object with variables to scale
scaling_dictionary: a data.frame with two columns called "variable" and "scaling_denominator"

Value: the data with the newly scaled 'variables'.

Author: Till D. Best
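A minimal sketch using the two documented dictionary columns; the variables chosen from airquality are arbitrary:

# divide Temp by 10 and Wind by 5
scaling <- data.frame(variable = c("Temp", "Wind"),
                      scaling_denominator = c(10, 5))
scaled_air <- scale_continuous_predictors(airquality, scaling)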
Identify duplicate values in a vector representing a set
setduplicates(vect)

Arguments:
vect: a vector of any type

Value: a vector of duplicate elements

Author: J. Peter Marquardt
Examples:
setduplicates(c(1, 2, 2, 3))
Create Box-Cox transformation using different optimal lambda values for each stratum
stratified_boxcox(data, value_col, strat_cols, plot = FALSE, return = "values", buffer = 0, inverse = FALSE, lambdas = NULL)

Arguments:
data: data.frame containing the data
value_col: character; name of the column with values to be transformed
strat_cols: character (vector); name(s) of the columns to stratify by
plot: logical; should the lambda distribution be plotted?
return: character; either "values" or "lambdas"
buffer: numeric; buffer value added before transformation, used to ensure all values are positive
inverse: logical; if TRUE, the function reverses the transformation given a list of lambdas
lambdas: if inverse == TRUE, nested list of the lambdas used in the original transformation. Can be obtained by using return = "lambdas" on untransformed data.
Value: if return = "values", a vector of transformed values; if return = "lambdas", a nested named list of the lambdas used. The buffer will be equal for all strata.

Author: J. Peter Marquardt
Examples:
data <- data.frame("value" = c(1:50, rnorm(50, 100, 10)),
                   "strat_var" = rep(c(1, 2), each = 50),
                   "strat_var2" = rep(c(1, 2), 50))
lambdas <- stratified_boxcox(data = data, value_col = "value",
                             strat_cols = c("strat_var", "strat_var2"), return = "lambdas")
data$value_boxed <- stratified_boxcox(data = data, value_col = "value",
                                      strat_cols = c("strat_var", "strat_var2"), return = "values")
data$value_unboxed <- stratified_boxcox(data = data, value_col = "value_boxed",
                                        strat_cols = c("strat_var", "strat_var2"),
                                        inverse = TRUE, lambdas = lambdas)