Common errors in mrIML

In this vignette, we guide you through some of the common errors generated by MrIML and how to approach fixing them. As mrIML uses the tidymodels R packages on the back end, some of the errors you can get come straight from tidymodels. See https://www.tidyverse.org/blog/2023/11/tidymodels-errors-q4/#fn:1 for an overview. The package tidysdm produces errors similar to those you see in MrIML, and I recommend their troubleshooting vignette too: https://evolecolgroup.github.io/tidysdm/articles/a3_troubleshooting.html. For example, NAs in the predictor or response sets will generate the same error messages as in tidysdm.

Most importantly, before running MrIML 2.0, make sure all the row IDs match across the response and predictor data frames (X, X1 and Y). If they don’t, your models will be meaningless. The following code is one way to check this.

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Sample data frames (replace these with your actual data frames)
df1 <- data.frame(ID = 1:5, Value = letters[1:5])
df2 <- data.frame(ID = 1:5, Value = LETTERS[1:5])
df3 <- data.frame(ID = 1:5, Value = month.name[1:5])

#we will add this as a mrIML error in the next version.

# Function to check if row IDs match across data frames
check_row_ids <- function(df_list) {
  # Extract row IDs for each data frame
  row_ids <- df_list %>%
    map(rownames)
  
  # Compare every set of row IDs with the first one.
  # identical() checks both the values and their order.
  ids_match <- map_lgl(row_ids, ~ identical(.x, row_ids[[1]]))
  
  if (all(ids_match)) {
    cat("Row IDs match across all data frames.\n")
    return(TRUE)
  } else {
    cat("Row IDs do not match between data frames.\n")
    return(FALSE)
  }
}

# Usage
data_frames <- list(df1, df2, df3)
check_row_ids(data_frames)
#> Row IDs match across all data frames.
#> [1] TRUE

Once you have made sure everything matches, there are three key things to consider first when MrIML models aren’t working.

Are there NAs in the data? If so, many of the algorithms will give the following error:

Warning: All models failed. Run show_notes(.Last.tune.result) for more information.

To fix this, you can either remove all the rows with missing values or impute them using a package such as missForest. Do not include your response variable in the imputation, to avoid data leakage.

library(missForest)

#import data
X_noNAmissF <- missForest(X,
                          variablewise = TRUE) #default values have worked fine previously

X_noNAmissF$OOBerror #per-variable out-of-bag error; make sure it is low (e.g. low MSE)

X <- X_noNAmissF$ximp #assign X the data without missing values
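If you would rather drop incomplete rows than impute them, the rows have to be removed from the predictors and the responses together so that the row IDs still line up. The following is a minimal sketch in base R; `X_demo` and `Y_demo` are made-up stand-ins for your own X and Y, not objects supplied by mrIML.

```r
# Hypothetical toy data: X_demo holds predictors, Y_demo holds responses.
X_demo <- data.frame(temp = c(10.2, NA, 12.5, 9.8, 11.1, 13.0),
                     rain = c(100, 80, NA, 95, 110, 70))
Y_demo <- data.frame(sp1 = c(1, 0, 1, 0, 1, 0),
                     sp2 = c(0, 0, 1, NA, 1, 1))

# Count missing values per column to see where the problem is.
colSums(is.na(X_demo))
colSums(is.na(Y_demo))

# Keep only the rows that are complete in both data frames.
keep <- complete.cases(X_demo) & complete.cases(Y_demo)
X_complete <- X_demo[keep, , drop = FALSE]
Y_complete <- Y_demo[keep, , drop = FALSE]

# Row names are carried over, so they still match after filtering.
identical(rownames(X_complete), rownames(Y_complete))
```

Filtering with a single logical vector, rather than calling `na.omit()` on each data frame separately, is what keeps the two data frames aligned.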

Is there enough data to do cross validation? This is a particular problem for smaller data sets. The warning you’ll see is:

A | warning: No control observations were detected in truth with control level ‘1’

To fix this, you can increase the amount of data in the testing set by increasing the ‘prop’ argument in mrIMLpredicts.
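Before changing anything, it can help to see which responses are too rare to survive data splitting and cross-validation. This is a rough sketch in base R; `Y_demo` is a made-up presence/absence data frame standing in for your own Y, and the threshold `min_count` is an arbitrary illustration rather than a value recommended by mrIML.

```r
# Hypothetical presence/absence responses (1 = present, 0 = absent).
Y_demo <- data.frame(sp1 = c(1, 0, 1, 0, 1, 0, 1, 0),
                     sp2 = c(0, 0, 0, 0, 0, 0, 0, 1),
                     sp3 = c(1, 1, 1, 0, 1, 1, 0, 1))

# Number of observations in the rarer class for each response.
minority_n <- sapply(Y_demo, function(y) min(sum(y == 1), sum(y == 0)))
minority_n

# Flag responses whose rarer class has fewer observations than an
# arbitrary threshold (an assumption; pick one that suits your k and prop).
min_count <- 2
rare_responses <- names(minority_n)[minority_n < min_count]
rare_responses
```

Responses flagged this way are the ones most likely to end up with a resample that contains only one class.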

Also, for small data sets, you can see the error associated with the racing tuning algorithm:

2dpo(.M2sym(from)) : not a positive definite matrix (and positive semidefiniteness is not checked)

In this case, go back to standard grid-based tuning by setting ‘racing = FALSE’:

yhats_rf <- mrIMLpredicts(X=X,
                          Y=Y,
                          Model=model_rf,
                          balance_data='no',
                          mode='classification',
                          k=5,
                          tune_grid_size=5, 
                          seed = 123,
                          racing=FALSE)

Do you have categorical predictors?

Some algorithms, such as linear models and extreme gradient boosting, can’t parse variables that are factors. To get around this issue, you’ll have to either dummify them using one-hot encoding or use likelihood encoding.

You’ll see the warning:

Warning: All models failed. Run show_notes(.Last.tune.result) for more information.
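As an illustration of the one-hot encoding option, the sketch below expands a factor into indicator columns with base R's `model.matrix()`. The data frame `X_demo` and its columns are invented for the example; a tidymodels recipe step is another route, which is not shown here.

```r
# Hypothetical predictors with one factor column.
X_demo <- data.frame(elevation = c(120, 340, 560, 210),
                     habitat = factor(c("forest", "grass", "forest", "wetland")))

# Build indicator columns for the factor. "- 1" drops the intercept so that
# every level of the first factor in the formula gets its own column.
X_dummy <- as.data.frame(model.matrix(~ . - 1, data = X_demo))

# The result has only numeric columns and keeps the original row names.
names(X_dummy)
```

With more than one factor, only the first receives a column for every level under this formula, so check the resulting column names before passing the data on.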

See this blogpost for more details: https://animalecologyinfocus.com/2019/10/11/how-to-make-the-most-out-of-machine-learning-models-and-what-can-go-wrong/