MrIML is a R package allows users to generate and interpret multi-response models (i.e., joint species distribution models) leveraging advances in data science and machine learning. MrIML couples the tidymodel infrastructure developed by Max Kuhn and colleagues with model agnostic interpretable machine learning tools to gain insights into multiple response data such as. As such MrIML is flexible and easily extendable allowing users to construct everything from simple linear models to tree-based methods for each response using the same syntax and same way to compare predictive performance.In this vignette we will guide you through how to apply the MrIML framework to identify and benchmark biosecurity practices in veterinary disease. This working example is built upon a classification framework. More information about this can be found in the “Classification working example”.
This is a synthesized dataset to simulate PRRSV infection, biosecurity practices and farm demographics on swine farms across the united states. Please note that the data included in this package is simulated, and is not a reflection of any real farm, company or state.
#other packages needed library(parsnip); library(mrIML); library(tidymodels); library(workflows); library(tidyverse); library(flashlight); library(devtools); library(DALEX); library(plyr);library(vip); library(ggsci); library(ggrepel); library(breakDown) set.seed(130) #set the seed to ensure consistency data <- as.data.frame(mrIML::biosecurity_vignette_data)
This is a random forest model, as used in the classification example. However there are more models available within MrIML and can be found here.
#split predictor variables and outcome X <- as.data.frame(data %>% select(1)) Y <- data %>% select(-c(1, 44, 45)) #remove class, group and ID model1 <- #model used to generate yhat #specify that the model is a random forest rand_forest(trees = 100, mtry = tune(), min_n = tune(), mode = "classification") %>% #select the engine/package that underlies the model set_engine("ranger", importance = c("impurity","impurity_corrected")) %>% #must be set to classification for classification problems set_mode("classification")
Now we can train and test our model.
yhats <- mrIMLpredicts(X=X, #response data Y=Y, #features/predictors model1=model1, #specify your model balance_data='no', #chose how to balance your data model='classification', #chose your mode (classification versus regression) parallel = FALSE, #do you want to run the model in parallel? seed = 120) #set seed
We can then assess our model performance using a number of performance metrics including area under the curve (AUC), sensitivity, specificity and Matthews correlation coefficient (MCC).
ModelPerf <- mrIMLperformance(yhats, model1, X=X, model='classification') ModelPerf[] #predictive performance for individual responses #> response model_name roc_AUC mcc sensitivity #> 1 Class rand_forest 0.551282051282051 0.116131449170815 0.5 #> specificity prevalence #> 1 0.615384615384615 0.5 ModelPerf[]#overall predictive performance. r2 for regression and MCC for classification #>  0.5512821
Here we can look at the variable importance. This is the dependence between our outcome and biosecurity variables.
#calculate variable importance VI <- mrVip(yhats, Y=Y) #plot cumulative variable importance plot_vi_biosecurity(VI=VI, X=X,Y=Y, modelPerf=ModelPerf, cutoff= 0.6, plot.pca='no') #Follow instructions in console
#> Press [enter] to plot individual variable importance summaries #>  ""
We can explore further and assess the partial dependence of each variable. Here we isolate the dependence of one variable and visualize how this dependence changes over different observed values.
#create a flashlight object fl <- mrFlashlight(yhats, X, Y, response = "single", index=1, model='classification') #plot partial dependence profiles plot(light_profile(fl, v = "Premises_in_3_miles"))
The following function allows you to visualize and compare predicted risk among and within groups
#apply the trained model to the entire data set to provide risk of predicted outbreak fit_bio <- pull_workflow_fit(yhats[]$mod1_k) preds_pos <- predict(fit_bio, data, "prob") data$Predicted <- preds_pos$.pred_1 #process data ready for the function data$Class <- as.factor(data$Class) data<-data%>% mutate(Class = revalue(Class, c("1" = "Positive", "0" = "Negative"))) data$Class <- relevel(data$Class, "Positive") #plot among group comparison of predicted risk mrBenchmark(my.data = "data", outcome = "Class", y = "Predicted", group = "Group", type = "external")
#plot within group comparison of predicted risk mrBenchmark(my.data = "data", outcome = "Class", y = "Predicted", group = "Group", label_by = "ID", type = "internal")
To investigate and interpret the contribution of variables at an individual level, we must use a local explanation method. Here we implement a local breakDown explainer.
mrLocalExplainer produces both aggregated and individual results of variable contribution. Variables with a phi > 0 contribute to an increase in predicted PRRSV outbreak risk, while variables with a phi < 0 contribute to a decrease in predicted outbreak risk.
#Use this function to implement the local explainer data<-data%>% mutate(Class = revalue(Class, c("1" = "Positive", "0" = "Negative"))) data$Class <- relevel(data$Class, "Positive") mrLocalExplainer(my.data = Y, model_data = yhats, outcome = data$Class) #> Preparation of a new explainer is initiated #> -> model label : ranger ( [33m default [39m ) #> -> data : 200 rows 42 cols #> -> target variable : not specified! ( [31m WARNING [39m ) #> -> predict function : yhat.ranger will be used ( [33m default [39m ) #> -> predicted values : No value for predict function target column. ( [33m default [39m ) #> -> model_info : package ranger , ver. 0.12.1 , task classification ( [33m default [39m ) #> -> model_info : Model info detected classification task but 'y' is a NULL . ( [31m WARNING [39m ) #> -> model_info : By deafult classification tasks supports only numercical 'y' parameter. #> -> model_info : Consider changing to numerical vector with 0 and 1 values. #> -> model_info : Otherwise I will not be able to calculate residuals or loss function. #> -> predicted values : numerical, min = 0.1618017 , mean = 0.4919367 , max = 0.8665756 #> -> residual function : difference between y and yhat ( [33m default [39m ) #> [32m A new explainer has been created! [39m
Individual plots are produced in the list object
LE_indiv_plots. Each plot corresponds to a single farm. An example of a positive farm and a negative farm are shown here.
LE_indiv_plots #> []
LE_indiv_plots #> []
At the positive farm we can see that yards from the public road and garbage collection are contributing to an increased outbreak risk, while manure removals and repair visits are contributing to a decreased outbreak risk. Likewise, at the negative farm yards to the public road and and the downtime required for manure removal personnel are contributing to an increased outbreak risk, while feed deliveries and manure removals are contributing to a decreased outbreak risk.
This session of the MrIML has been funded by Critical Agricultural Research and Extension 2019-68008-29910 from the USDA National Institute of Food and Agriculture and is a join effort by Dr. Linhares group at Iowa State University and Dr. Machado group at North Carolina State University