The aim of this work is to validate a method for quantifying carotenoid contents in roots of M. esculenta from colorimetric data obtained with the CIELAB color system. The underlying assumption is that statistical and machine learning regression techniques can correlate colorimetric data, easily collected in the field, with the contents obtained through traditional quantification techniques such as UV-visible spectrophotometry or HPLC and, from this, build prediction models of carotenoid content for this type of biomass.
Roots of fifty M. esculenta genotypes belonging to the EPAGRI’s germplasm bank were sampled in the 2014/2015 season. Carotenoids were extracted from fresh roots and the absorbances of the organosolvent extracts were collected on a UV-visible spectrophotometer using a spectral window from 200 to 700 nm. Aliquots (10 µl) of the extracts were also injected into a liquid chromatograph. The color attributes of the samples were measured by a colorimeter and the results were expressed according to the CIELAB color space scale.
To run this script the following packages are necessary:
library(specmine)
library(xlsx)
Setting working directory:
setwd("C:/Users/Telma/Desktop/CassavaCarotenoids")
set.seed(12345)
The machine learning models used in this analysis are listed in the table below. These belong to the caret package, which is used by specmine.
Model | “Method” Value | Built-in Feature Selection |
---|---|---|
Conditional Inference Random Forest | cforest | YES |
Conditional Inference Tree | ctree | YES |
Decision Trees | rpart | YES |
Elastic Net | enet | YES |
K-Nearest Neighbors | knn | NO |
Lasso Regression | lasso | YES |
Linear Regression | lm | NO |
Linear Regression (w/ Backwards Selection) | leapBackward | YES |
Linear Regression (w/ Forward Selection) | leapForward | YES |
Linear Regression (w/ Stepwise Selection) | leapSeq | YES |
Partial Least Squares | kernelpls, pls, simpls, widekernelpls | YES |
Random Forest | rf | YES |
Ridge Regression | ridge | NO |
Ridge Regression (w/ Feature Selection) | foba | YES
Support Vector Machines (kernlab package) | svmLinear | NO |
Support Vector Machines (e1071 package) | svmLinear2 | NO |
The following function is used to retrieve the model name given the “method” value.
getModelName <- function(model) {
if (model == 'lasso') name = 'Lasso'
else if (model == 'ridge') name = 'Ridge Regression'
else if (model == 'foba') name = 'Ridge Regression (w/ FS)'
else if (model == 'rf') name = 'Random Forest'
else if (model == 'cforest') name = 'Conditional Inference Random Forest'
else if (model == 'enet') name = 'Elastic Net'
else if (model == 'pls') name = 'Partial Least Squares (pls)'
else if (model == 'kernelpls') name = 'Partial Least Squares (kernelpls)'
else if (model == 'simpls') name = 'Partial Least Squares (simpls)'
else if (model == 'widekernelpls') name = 'Partial Least Squares (widekernelpls)'
else if (model == 'rpart') name = 'Decision Trees'
else if (model == 'ctree') name = 'Conditional Inference Tree'
else if (model == 'svmLinear') name = 'Support Vector Machines (kernlab)'
else if (model == 'svmLinear2') name = 'Support Vector Machines (e1071)'
else if (model == 'knn') name = 'K-Nearest Neighbors'
else if (model == 'lm') name = 'Linear Regression'
else if (model == 'leapBackward') name = 'Linear Regression (w/ Backwards Selection)'
else if (model == 'leapForward') name = 'Linear Regression (w/ Forward Selection)'
else if (model == 'leapSeq') name = 'Linear Regression (w/ Stepwise Selection)'
else return()
return (name)
}
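As an aside, the same mapping could be written more compactly as a named lookup vector; the sketch below is a functionally equivalent alternative and is not part of the original script.
model_names = c(lasso = 'Lasso', ridge = 'Ridge Regression', foba = 'Ridge Regression (w/ FS)',
                rf = 'Random Forest', cforest = 'Conditional Inference Random Forest',
                enet = 'Elastic Net', pls = 'Partial Least Squares (pls)',
                kernelpls = 'Partial Least Squares (kernelpls)',
                simpls = 'Partial Least Squares (simpls)',
                widekernelpls = 'Partial Least Squares (widekernelpls)',
                rpart = 'Decision Trees', ctree = 'Conditional Inference Tree',
                svmLinear = 'Support Vector Machines (kernlab)',
                svmLinear2 = 'Support Vector Machines (e1071)',
                knn = 'K-Nearest Neighbors', lm = 'Linear Regression',
                leapBackward = 'Linear Regression (w/ Backwards Selection)',
                leapForward = 'Linear Regression (w/ Forward Selection)',
                leapSeq = 'Linear Regression (w/ Stepwise Selection)')
getModelName2 <- function(model) unname(model_names[model]) # returns NA for unknown methods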
The following function returns a data frame with the result of applying one or more machine learning models to a selected dataset. The metadata variable for prediction must be supplied.
perform_ML <- function(dataset, models, pred_var) {
res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
for (model in models) {
name = getModelName(model)
ml_res = train_models_performance(dataset, c(model), pred_var, "repeatedcv",
num.folds = 5, compute.varimp = F)
res[name,] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
assign('res', res, envir = .GlobalEnv)
}
return(res)
}
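Note that the assign call inside the loop keeps a copy of the partial results data frame in the global environment, so the results accumulated up to that point are not lost if a later model fails to train.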
The following function returns a data frame with the results of applying a machine learning model to a dataset after several preprocessing methods, including scaling, smoothing interpolation, background, offset and baseline corrections, first derivative and multiplicative scatter correction. The metadata variable for prediction must be supplied.
perform_ML_preproc <- function(dataset, model, pred_var) {
res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
ds.sc = specmine::scaling(dataset)
ds.wavelens = get_x_values_as_num(dataset)
x.axis.sm = seq(min(ds.wavelens), max(ds.wavelens),10)
ds.smooth = smoothing_interpolation(dataset, method = "loess", x.axis = x.axis.sm)
ds.bg = data_correction(dataset, 'background')
ds.offset = data_correction(ds.bg, 'offset')
ds.baseline = data_correction(ds.offset, 'baseline')
ds.fd = first_derivative(dataset)
ds.msc = msc_correction(dataset)
datasets = list('No preprocessing' = dataset, 'Scaling' = ds.sc, 'Smoothing' = ds.smooth,
'Background cor' = ds.bg, 'Background + Offset cors' = ds.offset,
'Background + Offset + Baseline cors' = ds.baseline, 'First Derivative' = ds.fd,
'Multiplicative Scatter Cor' = ds.msc)
i = 1
for (ds in datasets) {
ml_res = train_models_performance(ds, c(model), pred_var, "repeatedcv", num.folds = 5, compute.varimp = F)
res[names(datasets)[i],] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
assign('res', res, envir = .GlobalEnv)
i = i + 1
}
return(res)
}
UV data is stored in 150 .xlsx files (3 replicates for each of the 50 genotypes), each file containing the absorbance values read between 200 and 700 nm.
files = list.files("data/UV")
datamat = matrix(nrow = 501, ncol = length(files))
rownames(datamat) = 200:700 #data recorded between 200-700nm
colnames(datamat) = gsub("\\.xlsx?$", "", files) # strip the file extension to keep sample IDs (e.g. "101.1")
for (i in 1:length(files)){
tab_excel = read.xlsx(paste("data/UV/", files[i], sep = ""), sheetIndex = 1, header = F)
datamat[,i] = c(tab_excel[,2], rep(NA, 501-length(tab_excel[,2])))
}
datamat[1:6, 1:6]
## 101.1 101.2 101.3 102.1 102.2 102.3
## 200 0.08763 0.1863 0.10565 0.10565 0.1482 0.13221
## 201 0.09468 0.2184 0.13756 0.12944 0.1254 0.08732
## 202 0.06238 0.1792 0.08410 0.09159 0.1437 0.09159
## 203 0.11513 0.1776 0.13093 0.13497 0.1190 0.07799
## 204 0.11364 0.2038 0.05227 0.11364 0.1376 0.08368
## 205 0.13941 0.1820 0.10809 0.09691 0.1006 0.10809
Besides information regarding sample varieties and replicates, the metadata file also contains information about HPLC concentration measurements and CIELAB data.
file.metadata = "metadata/Carotenoides_Colorimetria.csv"
metadata = read_metadata(file.metadata)
description = "UV data for cassava cultivars - carotenoids"
label.x = "Wavelength"
label.values = "Absorbance"
head(metadata)
## Varieties Replicates Cielab_L Cielab_A Cielab_B CarotenoidsContent_TCCS Lutein Betacryptoxanthin
## 3.1 3 1 85.72 -2.70 22.28 4.853 0.03248 0.06543
## 3.2 3 2 86.18 -2.48 21.39 4.809 0.03248 0.06543
## 3.3 3 3 85.25 -2.64 22.38 4.951 0.03248 0.06543
## 5.1 5 1 85.47 -1.76 6.74 3.098 0.02598 0.07023
## 5.2 5 2 82.29 -2.00 7.02 4.046 0.02598 0.07023
## 5.3 5 3 84.99 -1.86 7.25 3.383 0.02598 0.07023
## Alphacarotene Cisbetacarotene transbetacarotene Lycopene TCCHPLC
## 3.1 0.06021 2.250 3.269 0 5.678
## 3.2 0.06021 2.250 3.269 0 5.678
## 3.3 0.06021 2.250 3.269 0 5.678
## 5.1 0.08319 2.679 2.860 0 5.719
## 5.2 0.08319 2.679 2.860 0 5.719
## 5.3 0.08319 2.679 2.860 0 5.719
After creating a matrix from the UV .xlsx files and reading the metadata, a dataset can be easily created.
Carotenoides_Colorimetria = create_dataset(type = "uvv-spectra", datamatrix = datamat, metadata = metadata,
label.x = label.x, label.values = label.values,
description = description)
sum_dataset(Carotenoides_Colorimetria)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids
## Type of data: uvv-spectra
## Number of samples: 150
## Number of data points 501
## Number of metadata variables: 13
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 4224
## Mean of data values: 0.3301
## Median of data values: 0.1048
## Standard deviation: 0.6824
## Range of values: -0.06964 4.191
## Quantiles:
## 0% 25% 50% 75% 100%
## -0.06964 0.02003 0.10478 0.23166 4.19051
Because the majority of carotenoids absorb in the visible region of the spectrum, between 400 and 500 nm, a subset of the original dataset was created containing only the data points in this wavelength interval. Also, because the dataset contains some missing values, as shown in the summary above, these were replaced with the mean of each variable’s values.
carot_sub = subset_x_values_by_interval(Carotenoides_Colorimetria, 400, 500) # Absorbances between 400-500nm
carot_sub_nomissing = missingvalues_imputation(carot_sub, method = "mean")
sum_dataset(carot_sub_nomissing)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids; Missing value imputation with method mean
## Type of data: uvv-spectra
## Number of samples: 150
## Number of data points 101
## Number of metadata variables: 13
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 0.2316
## Median of data values: 0.187
## Standard deviation: 0.1907
## Range of values: -0.002721 1.574
## Quantiles:
## 0% 25% 50% 75% 100%
## -0.002721 0.130033 0.186963 0.261674 1.574271
The data were then aggregated so that there is a single sample per genotype instead of three replicates (150 samples -> 50 samples).
indexes = rep(seq(1, num_samples(carot_sub_nomissing)/3), each = 3)
carotAg = aggregate_samples(carot_sub_nomissing, indexes, meta.to.remove = c("Replicates"))
sum_dataset(carotAg)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids; Missing value imputation with method mean
## Type of data: uvv-spectra
## Number of samples: 50
## Number of data points 101
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 0.2316
## Median of data values: 0.1871
## Standard deviation: 0.188
## Range of values: 0.00136 1.299
## Quantiles:
## 0% 25% 50% 75% 100%
## 0.00136 0.13380 0.18708 0.26038 1.29949
The dataset is now ready to be used in the subsequent analysis.
The following step consisted in applying a variety of machine learning regression approaches to the data, testing different output variables and various preprocessing methods.
To test model performance for the prediction of carotenoid content, the machine learning models listed above were applied to the created dataset using different output variables. The evaluation metric chosen to compare model performance was the Root-Mean-Square Error (RMSE), since it explicitly shows how much the model predictions deviate, on average, from the actual values in the dataset.
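For reference, for \(n\) samples with observed values \(y_{i}\) and cross-validated predictions \(\hat{y}_{i}\), the RMSE is defined as
\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}\]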
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')
#Using CarotenoidsContent_TCCS variable
res1 = perform_ML(carotAg, models, pred_var = 'CarotenoidsContent_TCCS')
res1[order(res1$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Ridge Regression 3.361 0.9453 2.606 0.06067
## Partial Least Squares (widekernelpls) 3.392 0.9392 2.419 0.05604
## Partial Least Squares (kernelpls) 3.515 0.9498 2.366 0.04934
## Partial Least Squares (simpls) 3.563 0.9293 2.649 0.12503
## Linear Regression (w/ Backwards Selection) 3.587 0.8794 2.809 0.17217
## Elastic Net 3.750 0.9244 3.028 0.12689
## Partial Least Squares (pls) 3.824 0.9238 2.884 0.15686
## Ridge Regression (w/ FS) 3.826 0.9353 2.642 0.09911
## Random Forest 3.838 0.9696 2.215 0.03146
## Support Vector Machines (e1071) 3.860 0.9205 3.112 0.15125
## Support Vector Machines (kernlab) 4.342 0.9228 3.345 0.14426
## Linear Regression (w/ Forward Selection) 4.355 0.8581 3.628 0.21425
## Linear Regression (w/ Stepwise Selection) 4.761 0.8179 4.176 0.24028
## K-Nearest Neighbors 5.245 0.8721 3.902 0.15636
## Lasso 5.369 0.8270 4.485 0.23804
## Conditional Inference Random Forest 6.764 0.7787 3.095 0.12982
## Conditional Inference Tree 7.552 0.6522 3.576 0.19633
## Decision Trees 7.647 0.6644 3.198 0.19817
## Linear Regression 18.372 0.5572 31.137 0.34458
mean(get_metadata(carotAg)$CarotenoidsContent_TCCS) # CarotenoidsContent_TCCS variable mean values
## [1] 10.67
The results using the “CarotenoidsContent_TCCS” variable show that the models achieving the lowest RMSE values for the given data were ridge regression, with an RMSE of 3.361, and partial least squares (“widekernelpls” and “kernelpls” methods), with RMSEs of 3.392 and 3.515, respectively. These errors are still considerable, however, given that the mean value of the “CarotenoidsContent_TCCS” variable is only 10.67.
Overall, the coefficient of determination (\(R^{2}\)) shows a good fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 18.372 and an \(R^{2}\) of 0.5572.
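Note that caret, the engine behind train_models_performance, by default reports \(R^{2}\) as the squared Pearson correlation between the cross-validated predictions and the observed values, \(R^{2} = \mathrm{cor}(y, \hat{y})^{2}\), rather than the classical \(1 - SS_{res}/SS_{tot}\) definition.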
#Using TCCHPLC variable
res2 = perform_ML(carotAg, models, pred_var = 'TCCHPLC')
res2[order(res2$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Partial Least Squares (kernelpls) 5.725 0.5707 4.038 0.3318
## Partial Least Squares (simpls) 5.770 0.5962 3.751 0.3275
## Partial Least Squares (widekernelpls) 5.843 0.5930 3.948 0.3269
## Support Vector Machines (e1071) 5.881 0.5235 3.937 0.3119
## Partial Least Squares (pls) 5.888 0.5992 4.034 0.3227
## Elastic Net 5.899 0.5939 3.557 0.3148
## Ridge Regression (w/ FS) 6.018 0.6326 4.017 0.3127
## Support Vector Machines (kernlab) 6.263 0.6171 4.362 0.2797
## Linear Regression (w/ Backwards Selection) 6.415 0.4996 3.838 0.3113
## K-Nearest Neighbors 6.557 0.4233 4.029 0.2852
## Conditional Inference Random Forest 6.715 0.5135 3.936 0.3083
## Ridge Regression 6.855 0.5322 4.317 0.2988
## Conditional Inference Tree 7.079 0.4570 3.794 0.2994
## Random Forest 7.105 0.3762 3.339 0.3058
## Decision Trees 7.375 0.4819 3.346 0.2853
## Linear Regression (w/ Stepwise Selection) 7.735 0.4760 6.605 0.3510
## Linear Regression (w/ Forward Selection) 8.303 0.4804 6.755 0.2756
## Lasso 18.403 0.2409 12.110 0.2671
## Linear Regression 513.237 0.2688 1649.554 0.2781
mean(get_metadata(carotAg)$TCCHPLC) # TCCHPLC variable mean values
## [1] 10.84
The results using the “TCCHPLC” variable show that, overall, RMSE values increased compared to those obtained with the “CarotenoidsContent_TCCS” variable. The models achieving the lowest RMSE values were partial least squares with methods “kernelpls”, “simpls” and “widekernelpls” (RMSEs of 5.725, 5.770 and 5.843, respectively), support vector machines with an RMSE of 5.881, and elastic net with an RMSE of 5.899.
Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 513.237 and an \(R^{2}\) of 0.2688. Lasso regression also performed poorly in this case, with an RMSE of 18.403.
#Using transbetacarotene variable
res3 = perform_ML(carotAg, models, pred_var = 'transbetacarotene')
res3[order(res3$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Ridge Regression (w/ FS) 4.051 0.40198 3.970 0.3178
## Elastic Net 4.084 0.42169 4.135 0.3501
## Partial Least Squares (pls) 4.137 0.45437 4.172 0.3346
## Partial Least Squares (kernelpls) 4.169 0.51105 4.183 0.3267
## Partial Least Squares (simpls) 4.217 0.49752 4.278 0.3177
## Ridge Regression 4.253 0.32796 4.184 0.3446
## Support Vector Machines (e1071) 4.344 0.42478 4.306 0.3365
## Partial Least Squares (widekernelpls) 4.362 0.42517 4.322 0.3125
## Support Vector Machines (kernlab) 4.389 0.50181 4.218 0.3303
## K-Nearest Neighbors 4.536 0.22342 4.089 0.2083
## Conditional Inference Random Forest 4.724 0.39563 3.985 0.2772
## Linear Regression (w/ Backwards Selection) 4.918 0.27839 4.177 0.2350
## Conditional Inference Tree 4.929 0.24248 3.954 0.2621
## Linear Regression (w/ Forward Selection) 5.023 0.34750 4.157 0.3227
## Decision Trees 5.133 0.08755 4.003 0.1218
## Random Forest 5.641 0.22644 3.829 0.2584
## Linear Regression (w/ Stepwise Selection) 5.782 0.30538 4.320 0.2974
## Lasso 16.450 0.17465 14.823 0.2256
## Linear Regression 271.132 0.25855 482.988 0.2680
mean(get_metadata(carotAg)$transbetacarotene) # transbetacarotene variable mean values
## [1] 5.897
Trans-beta-carotene concentrations were also used, considering that it was the carotenoid with the highest concentration levels. The results using the “transbetacarotene” variable show that, overall, RMSE values increased compared to those obtained with the “CarotenoidsContent_TCCS” variable and decreased compared to those obtained with the “TCCHPLC” variable. The models achieving the lowest RMSE values were ridge regression (w/ feature selection) with an RMSE of 4.051, elastic net with an RMSE of 4.084 and partial least squares (“pls” method) with an RMSE of 4.137.
Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, as in the previous cases, with an RMSE of 271.132 and an \(R^{2}\) of 0.25855. Lasso regression also performed poorly, with an RMSE of 16.450.
All the results above point to better model performance when using the “CarotenoidsContent_TCCS” metadata variable. This was somewhat expected, since these concentrations were calculated from the UV data itself using the Beer-Lambert law. However, in this report the variable used in the subsequent analysis is “transbetacarotene”.
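For reference, the Beer-Lambert law relates the absorbance \(A\) at a given wavelength to the analyte concentration \(c\), through the molar absorption coefficient \(\varepsilon\) and the optical path length \(l\):
\[A = \varepsilon \, l \, c\]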
For the best models from the previous analysis (using the “transbetacarotene” metadata variable), the variable importance was calculated. Those models were ridge regression (w/ feature selection), elastic net and partial least squares (“pls” method).
# Ridge regression (w/ feature selection)
varImp1 = train_models_performance(carotAg, c('foba'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Elastic Network
varImp2 = train_models_performance(carotAg, c('enet'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Partial Least Squares
varImp3 = train_models_performance(carotAg, c('pls'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Top 20 variables: Ridge regression | Elastic Network | Partial Least Squares
div = rep(' | ', dim(varImp1$vips[[1]])[1])
cbind(varImp1$vips[[1]], div, varImp2$vips[[1]], div, varImp3$vips[[1]])[1:20,]
## Overall Mean div Overall Mean div Overall Mean
## 472 100.00 100.00 | 100.00 100.00 | 100.00 100.00
## 471 99.94 99.94 | 99.94 99.94 | 95.04 95.04
## 473 99.91 99.91 | 99.91 99.91 | 90.35 90.35
## 469 99.12 99.12 | 99.12 99.12 | 83.18 83.18
## 474 99.12 99.12 | 99.12 99.12 | 78.01 78.01
## 470 98.68 98.68 | 98.68 98.68 | 70.01 70.01
## 468 97.77 97.77 | 97.77 97.77 | 64.82 64.82
## 475 97.64 97.64 | 97.64 97.64 | 57.09 57.09
## 467 96.74 96.74 | 96.74 96.74 | 52.61 52.61
## 479 96.65 96.65 | 96.65 96.65 | 46.24 46.24
## 466 96.03 96.03 | 96.03 96.03 | 42.95 42.95
## 476 95.35 95.35 | 95.35 95.35 | 39.93 39.93
## 480 94.55 94.55 | 94.55 94.55 | 39.60 39.60
## 465 94.05 94.05 | 94.05 94.05 | 39.44 39.44
## 477 93.81 93.81 | 93.81 93.81 | 38.59 38.59
## 464 91.31 91.31 | 91.31 91.31 | 38.44 38.44
## 481 91.24 91.24 | 91.24 91.24 | 38.17 38.17
## 478 90.00 90.00 | 90.00 90.00 | 37.42 37.42
## 463 89.46 89.46 | 89.46 89.46 | 37.15 37.15
## 462 87.80 87.80 | 87.80 87.80 | 36.76 36.76
The results for variable importance show that the predictors with the most impact on the results are the ones around the 470 nm wavelength, the most important variable being the one corresponding to the 472 nm wavelength.
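As a quick visual check (a sketch, not part of the original analysis), the importances returned for the ridge regression model can be plotted against wavelength to highlight the region around 470 nm.
imp = varImp1$vips[[1]]                  # data frame; row names are the wavelengths (nm)
wl = as.numeric(rownames(imp))
plot(wl, imp$Overall, type = "h",
     xlab = "Wavelength (nm)", ylab = "Variable importance (scaled to 100)")
abline(v = 472, lty = 2)                 # most important predictor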
The next step consisted in testing the best models from the analysis using the “transbetacarotene” metadata variable (ridge regression (w/ feature selection), elastic net and partial least squares (“pls” method)) on preprocessed versions of the dataset, to see if model performance improved.
# Ridge regression (w/ feature selection)
res4 = perform_ML_preproc(carotAg, 'foba', 'transbetacarotene')
res4[order(res4$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## First Derivative 3.651 0.4247 1.035 0.3939
## Background + Offset cors 4.258 0.4539 4.212 0.3544
## Background cor 4.369 0.3611 4.164 0.3330
## Scaling 4.389 0.3468 4.193 0.3247
## No preprocessing 4.393 0.3600 4.205 0.3266
## Smoothing 4.464 0.3418 4.025 0.3048
## Background + Offset + Baseline cors 4.754 0.3985 4.250 0.3097
## Multiplicative Scatter Cor 6.296 0.3188 5.128 0.2791
Applying the ridge regression model to the preprocessed datasets showed an improvement in model performance when using the first derivative (RMSE 3.651), the combination of background and offset corrections (RMSE 4.258), background correction alone (RMSE 4.369) and scaling (RMSE 4.389) as preprocessing methods.
# Elastic network
res5 = perform_ML_preproc(carotAg, 'enet', 'transbetacarotene')
res5[order(res5$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Scaling 4.100 0.4513 4.156 0.3341
## No preprocessing 4.109 0.4158 3.895 0.3545
## Background cor 4.148 0.4497 4.201 0.3329
## Smoothing 4.164 0.4310 4.124 0.3389
## Background + Offset cors 4.383 0.4407 3.904 0.3346
## Background + Offset + Baseline cors 4.574 0.3391 4.051 0.3081
## First Derivative 6.705 0.3191 5.300 0.3265
## Multiplicative Scatter Cor 8.377 0.1888 6.880 0.2420
Applying the elastic net model to the preprocessed datasets showed a marginal improvement in model performance only when scaling the dataset (RMSE 4.100).
# Partial Least Squares
res6 = perform_ML_preproc(carotAg, 'pls', 'transbetacarotene')
res6[order(res6$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## No preprocessing 4.353 0.4189 4.093 0.3050
## Smoothing 4.376 0.4165 4.190 0.2794
## Background cor 4.402 0.3936 4.211 0.2926
## Background + Offset cors 4.407 0.3973 4.171 0.3016
## Scaling 4.441 0.4079 3.973 0.2910
## Background + Offset + Baseline cors 4.558 0.3881 4.252 0.2776
## First Derivative 5.026 0.3081 4.175 0.2547
## Multiplicative Scatter Cor 5.866 0.2485 4.658 0.2991
Applying the partial least squares model to the preprocessed datasets showed no improvement in model performance with any of the preprocessing methods.
The data was also filtered to determine whether feature selection could improve model performance. A flat pattern filter, using the inter-quartile range as the filter function, was applied to the dataset, removing 80%, 60% and 40% of the variables (those with the flattest patterns, i.e. the lowest inter-quartile range) in each run.
#Filtering 80% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 80)
res7 = perform_ML(carotAg.filt, models, 'transbetacarotene')
# Results of 80% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res7-res3
res7_3 = cbind(round(res7,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res7_3[order(res7_3$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 3.216 0.50101 2.629 0.30427 | -0.83539 0.09902
## Support Vector Machines (e1071) 4.156 0.47259 4.311 0.32108 | -0.18735 0.04782
## Support Vector Machines (kernlab) 4.310 0.46897 4.385 0.32987 | -0.07832 -0.03285
## Elastic Net 4.627 0.43365 3.942 0.30884 | 0.54239 0.01196
## Partial Least Squares (widekernelpls) 4.635 0.39989 4.164 0.30880 | 0.27252 -0.02529
## Ridge Regression 4.635 0.39935 4.224 0.31891 | 0.38230 0.07138
## Partial Least Squares (simpls) 4.656 0.47520 3.997 0.26854 | 0.43893 -0.02232
## Partial Least Squares (pls) 4.679 0.42499 4.203 0.28594 | 0.54156 -0.02938
## K-Nearest Neighbors 4.685 0.26368 4.333 0.23468 | 0.14939 0.04026
## Partial Least Squares (kernelpls) 4.703 0.41374 4.229 0.27793 | 0.53325 -0.09732
## Conditional Inference Random Forest 4.805 0.41761 4.109 0.30500 | 0.08018 0.02198
## Conditional Inference Tree 4.950 0.00729 3.819 0.00791 | 0.02176 -0.23519
## Lasso 5.001 0.25549 3.893 0.28308 | -11.44858 0.08084
## Decision Trees 5.120 0.06824 3.938 0.07566 | -0.01369 -0.01931
## Linear Regression (w/ Backwards Selection) 5.258 0.21770 3.944 0.24775 | 0.34047 -0.06068
## Linear Regression (w/ Stepwise Selection) 5.375 0.29623 3.941 0.29042 | -0.40775 -0.00915
## Linear Regression (w/ Forward Selection) 5.381 0.30270 4.051 0.29856 | 0.35851 -0.04480
## Random Forest 5.658 0.15240 3.857 0.20071 | 0.01747 -0.07404
## Linear Regression 11.379 0.22750 5.501 0.25548 | -259.75350 -0.03106
Filtering 80% of the data showed mixed results, with model performance increasing or decreasing depending on the model, in comparison to the results obtained with the original dataset. It did, however, massively improve the performance of the linear model (without selection), decreasing its RMSE by about 260 units. Ridge regression (w/ FS) (RMSE 3.216), SVMs (RMSEs 4.156 and 4.310) and elastic net (RMSE 4.627) had the best performance.
#Filtering 60% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 60)
res8 = perform_ML(carotAg.filt, models, 'transbetacarotene')
# Results of 60% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res8-res3
res8_3 = cbind(round(res8,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res8_3[order(res8_3$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 3.857 0.40095 3.927 0.30102 | -0.19381 -0.00103
## Support Vector Machines (kernlab) 4.191 0.47469 4.254 0.32990 | -0.19764 -0.02712
## Support Vector Machines (e1071) 4.303 0.45114 4.458 0.30532 | -0.04048 0.02636
## K-Nearest Neighbors 4.550 0.22615 4.276 0.21575 | 0.01407 0.00273
## Elastic Net 4.569 0.45515 4.015 0.31017 | 0.48495 0.03346
## Ridge Regression 4.579 0.39625 4.131 0.29045 | 0.32670 0.06829
## Partial Least Squares (widekernelpls) 4.630 0.43855 4.302 0.30985 | 0.26802 0.01337
## Partial Least Squares (kernelpls) 4.632 0.44113 4.178 0.31995 | 0.46241 -0.06992
## Partial Least Squares (pls) 4.643 0.38746 4.143 0.26302 | 0.50542 -0.06691
## Conditional Inference Random Forest 4.663 0.38400 4.019 0.26385 | -0.06145 -0.01163
## Partial Least Squares (simpls) 4.711 0.45436 4.168 0.30475 | 0.49394 -0.04316
## Decision Trees 5.018 0.17522 3.767 0.18968 | -0.11492 0.08767
## Conditional Inference Tree 5.044 0.01171 3.868 0.00634 | 0.11572 -0.23077
## Linear Regression (w/ Stepwise Selection) 5.116 0.23261 4.053 0.26651 | -0.66645 -0.07276
## Linear Regression (w/ Backwards Selection) 5.353 0.31545 4.897 0.28876 | 0.43548 0.03706
## Linear Regression (w/ Forward Selection) 5.640 0.39462 4.559 0.34223 | 0.61763 0.04712
## Random Forest 5.951 0.19547 3.799 0.23954 | 0.30999 -0.03097
## Lasso 12.325 0.28306 19.120 0.34952 | -4.12458 0.10841
## Linear Regression 498.929 0.37062 1422.767 0.33383 | 227.79654 0.11206
Filtering 60% of the data showed mixed results, with model performance increasing or decreasing depending on the model, in comparison to the results obtained with the original dataset. Here, ridge regression (w/ FS) had the best performance, with an RMSE of 3.857. The SVM models also performed well, with RMSEs of 4.191 and 4.303 for the kernlab and e1071 packages, respectively.
#Filtering 40% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)
res9 = perform_ML(carotAg.filt, models, 'transbetacarotene')
# Results of 40% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res9-res3
res9_3 = cbind(round(res9,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res9_3[order(res9_3$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (kernlab) 4.118 0.49265 4.381 0.33108 | -0.27066 -0.00916
## Support Vector Machines (e1071) 4.252 0.49738 4.466 0.32575 | -0.09194 0.07261
## Ridge Regression (w/ FS) 4.268 0.39253 4.138 0.29393 | 0.21669 -0.00945
## Elastic Net 4.488 0.43447 3.832 0.32264 | 0.40366 0.01278
## Ridge Regression 4.532 0.31906 4.139 0.29795 | 0.27883 -0.00890
## Partial Least Squares (widekernelpls) 4.567 0.44318 4.115 0.31425 | 0.20509 0.01801
## K-Nearest Neighbors 4.573 0.22989 4.163 0.23887 | 0.03643 0.00647
## Partial Least Squares (kernelpls) 4.587 0.43197 4.240 0.28393 | 0.41756 -0.07909
## Partial Least Squares (pls) 4.603 0.43710 4.121 0.30858 | 0.46592 -0.01727
## Partial Least Squares (simpls) 4.609 0.46393 4.116 0.33188 | 0.39225 -0.03358
## Conditional Inference Random Forest 4.791 0.41726 4.082 0.26592 | 0.06683 0.02163
## Conditional Inference Tree 4.820 0.02649 3.947 0.03402 | -0.10838 -0.21599
## Decision Trees 5.106 0.16324 3.745 0.18865 | -0.02708 0.07569
## Linear Regression (w/ Forward Selection) 5.138 0.38590 4.598 0.30334 | 0.11586 0.03840
## Linear Regression (w/ Backwards Selection) 5.177 0.30938 4.294 0.32440 | 0.25971 0.03099
## Random Forest 5.737 0.21476 4.189 0.25318 | 0.09633 -0.01168
## Linear Regression (w/ Stepwise Selection) 6.096 0.33878 4.658 0.33445 | 0.31336 0.03341
## Lasso 10.649 0.30848 7.090 0.30926 | -5.80082 0.13383
## Linear Regression 203.134 0.36460 307.890 0.31247 | -67.99830 0.10605
Filtering 40% of the data showed results similar to the previous case, with model performance increasing or decreasing depending on the model, in comparison to the results obtained with the original dataset. Here, the best RMSE values were achieved by the SVMs (kernlab and e1071 packages), with RMSEs of 4.118 and 4.252, respectively, and ridge regression (w/ FS), with an RMSE of 4.268.
A machine learning analysis using the CIELAB data was also performed.
The CIELAB data is stored in the metadata file. Therefore, it needs to be extracted first to create the cielab dataset.
color.values = t(get_metadata(carotAg)[2:4]) #L a b
filtered.meta = get_metadata(carotAg)[5:12]
carotCielab = create_dataset(datamatrix = color.values, metadata = filtered.meta, label.x = "cielab",
label.values = "color values", description = "Dataset from cielab values")
head(carotCielab$data)[,1:12] #Cielab values for first 12 samples
## 101.1 102.1 103.1 105.1 11.1 119.1 123.1 125.1 21.1 23.1 27.1 3.1
## Cielab_L 77.670 85.017 81.25 69.25 83.59 69.510 82.893 68.563 74.113 70.240 83.983 85.717
## Cielab_A -3.397 -3.663 -4.46 -4.95 -3.44 -5.457 -2.123 -4.733 -4.277 -1.437 -2.140 -2.607
## Cielab_B 16.493 18.477 18.49 31.96 16.81 37.693 8.213 36.790 20.107 16.160 8.683 22.017
sum_dataset(carotCielab) # Dataset summary
## Dataset summary:
## Valid dataset
## Description: Dataset from cielab values
## Type of data: undefined
## Number of samples: 50
## Number of data points 3
## Number of metadata variables: 8
## Label of x-axis values: cielab
## Label of data points: color values
## Number of missing values in data: 0
## Mean of data values: 31.99
## Median of data values: 18.69
## Standard deviation: 35.84
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.457 -3.070 18.685 75.292 88.283
The same machine learning models used on the UV dataset were applied to the CIELAB dataset, with the exception of the linear regression models with selection, since it does not make sense to use these with only 3 features in the dataset (the L, a and b values). The metadata variable used for prediction was “transbetacarotene”.
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls',
'widekernelpls', 'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm')
#Using transbetacarotene variable
res10 = perform_ML(carotCielab, models, pred_var = 'transbetacarotene')
# Results w/ CIELAB data and difference to unprocessed UV data results (Two last columns)
diff = res10-res3[-c(17,18,19),]
res10_3 = cbind(round(res10,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res10_3[order(res10_3$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Partial Least Squares (widekernelpls) 4.551 0.2794 3.847 0.26530 | 0.18909 -0.14573
## Partial Least Squares (pls) 4.667 0.2446 3.881 0.25617 | 0.52949 -0.20981
## Conditional Inference Random Forest 4.667 0.2066 3.872 0.18505 | -0.05693 -0.18902
## Partial Least Squares (simpls) 4.731 0.2371 4.058 0.24897 | 0.51431 -0.26037
## Partial Least Squares (kernelpls) 4.785 0.2278 3.883 0.23562 | 0.61592 -0.28330
## Elastic Net 4.787 0.1840 3.882 0.21587 | 0.70242 -0.23774
## Ridge Regression (w/ FS) 4.802 0.2020 3.739 0.21105 | 0.75108 -0.20000
## Lasso 4.826 0.1539 4.069 0.18745 | -11.62383 -0.02078
## Support Vector Machines (e1071) 4.829 0.1506 3.974 0.20949 | 0.48553 -0.27414
## Support Vector Machines (kernlab) 4.878 0.2043 4.162 0.24571 | 0.48898 -0.29747
## Ridge Regression 4.886 0.1774 3.817 0.20406 | 0.63283 -0.15055
## Conditional Inference Tree 4.934 0.1105 3.699 0.09806 | 0.00557 -0.13193
## Linear Regression 4.937 0.2424 3.570 0.18929 | -266.19504 -0.01620
## K-Nearest Neighbors 4.997 0.2036 3.861 0.20608 | 0.46096 -0.01987
## Decision Trees 5.015 0.2880 3.661 0.23662 | -0.11834 0.20042
## Random Forest 5.148 0.1532 3.714 0.17119 | -0.49215 -0.07326
From the results above it is clear that there is an overall decrease in model performance when using the CIELAB data compared with the UV data, with increased RMSE values. However, the linear model performed much better than with the UV data, with an RMSE of 4.937, and lasso regression also improved, with an RMSE of 4.826. The best performance was achieved by partial least squares (“widekernelpls” and “pls” methods), with RMSEs of 4.551 and 4.667, respectively, and the conditional inference random forest, with an RMSE of 4.667.
The variable importance was calculated for the models that achieved the best performance using the CIELAB data. These models were partial least squares (“widekernelpls” and “pls” methods) and the conditional inference random forest.
# Partial Least Squares ("widekernelpls")
varImp4 = train_models_performance(carotCielab, c('widekernelpls'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Partial Least Squares ("pls")
varImp5 = train_models_performance(carotCielab, c('pls'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Conditional inference random forest
varImp6 = train_models_performance(carotCielab, c('cforest'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Variable Importance: PLS ("widekernelpls") | PLS ("pls") | Conditional inference random forest
div = rep(' | ', dim(varImp4$vips[[1]])[1])
cbind(varImp4$vips[[1]], div, varImp5$vips[[1]], div, varImp6$vips[[1]])
## Overall Mean div Overall Mean div Overall Mean
## Cielab_B 100.00 100.00 | 100.00 100.00 | 100.000 100.000
## Cielab_L 58.19 58.19 | 58.19 58.19 | 9.722 9.722
## Cielab_A 0.00 0.00 | 0.00 0.00 | 0.000 0.000
The results for variable importance show that the predictor with the most impact on the results is the CIELAB b value, which is consistent with the b* axis encoding the yellow-blue component of color, the attribute most directly affected by the yellow-orange carotenoid pigments.
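As a simple sanity check (a sketch, not part of the original analysis), the correlation between the CIELAB b value and the trans-beta-carotene concentration can be computed directly from the metadata of the aggregated dataset.
meta = get_metadata(carotAg)
cor(meta$Cielab_B, meta$transbetacarotene) # Pearson correlation between b* and trans-beta-carotene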
The dataset was then scaled to test whether scaling the CIELAB data could improve the results.
carotCielab.sc = specmine::scaling(carotCielab)
sum_dataset(carotCielab.sc)
## Dataset summary:
## Valid dataset
## Description: Dataset from cielab values; Scaling with method auto
## Type of data: undefined
## Number of samples: 50
## Number of data points 3
## Number of metadata variables: 8
## Label of x-axis values: cielab
## Label of data points: color values
## Number of missing values in data: 0
## Mean of data values: 1.49e-16
## Median of data values: 0.06326
## Standard deviation: 0.9933
## Range of values: -2.187 3.695
## Quantiles:
## 0% 25% 50% 75% 100%
## -2.18663 -0.49244 0.06326 0.52084 3.69515
res11 = perform_ML(carotCielab.sc, models, pred_var = 'transbetacarotene')
# Results w/ scaled CIELAB data and difference to unscaled CIELAB data results (Two last columns)
diff = res11-res10
res11_10 = cbind(round(res11,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res11_10[order(res11_10$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Elastic Net 4.690 0.2121 3.741 0.2559 | -0.09717 0.02811
## Support Vector Machines (kernlab) 4.745 0.2016 4.027 0.2366 | -0.13278 -0.00273
## Partial Least Squares (simpls) 4.781 0.1943 3.921 0.2214 | 0.05033 -0.04280
## Conditional Inference Random Forest 4.782 0.2230 3.818 0.2185 | 0.11426 0.01639
## Lasso 4.793 0.1824 3.777 0.2107 | -0.03334 0.02849
## Support Vector Machines (e1071) 4.800 0.1554 3.985 0.2039 | -0.02938 0.00472
## Partial Least Squares (kernelpls) 4.815 0.1624 3.987 0.2180 | 0.02924 -0.06539
## Ridge Regression 4.848 0.2382 3.897 0.2210 | -0.03791 0.06075
## Partial Least Squares (widekernelpls) 4.857 0.1706 3.832 0.2382 | 0.30591 -0.10882
## Partial Least Squares (pls) 4.859 0.1645 4.008 0.2278 | 0.19266 -0.08001
## Conditional Inference Tree 4.929 0.1303 3.733 0.1356 | -0.00476 0.01980
## Linear Regression 4.945 0.2200 3.741 0.2224 | 0.00796 -0.02233
## Ridge Regression (w/ FS) 4.951 0.2381 3.705 0.2267 | 0.14908 0.03607
## K-Nearest Neighbors 4.956 0.1536 3.879 0.1839 | -0.04106 -0.04991
## Decision Trees 5.000 0.2977 3.820 0.2277 | -0.01538 0.00975
## Random Forest 5.393 0.1497 4.042 0.2072 | 0.24458 -0.00352
Applying the machine learning models to the scaled CIELAB data showed mixed results, with model performance increasing or decreasing depending on the model used. These changes were, however, small.
A machine learning analysis using fused UV and CIELAB data was also performed.
Two fusion datasets were created: one using the full UV data and another using the UV data after filtering 40% of the variables.
# Not filtered
carot.fus = low_level_fusion(list(carotAg, carotCielab))
sum_dataset(carot.fus)
## Dataset summary:
## Valid dataset
## Description: Data integration from types: uvv-spectra,undefined
## Type of data: integrated-data
## Number of samples: 50
## Number of data points 104
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 1.148
## Median of data values: 0.1881
## Standard deviation: 8.069
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.4567 0.1335 0.1881 0.2673 88.2833
# 40% data filtered
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)
carot.fus.filt = low_level_fusion(list(carotAg.filt, carotCielab))
sum_dataset(carot.fus.filt)
## Dataset summary:
## Valid dataset
## Description: Data integration from types: uvv-spectra,undefined
## Type of data: integrated-data
## Number of samples: 50
## Number of data points 63
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 1.782
## Median of data values: 0.217
## Standard deviation: 10.32
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.4567 0.1700 0.2170 0.3074 88.2833
The same machine learning models applied to the UV dataset were used for the UV and CIELAB fusion datasets. The metadata variable used for prediction was “transbetacarotene”.
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')
# Using unfiltered dataset
res12 = perform_ML(carot.fus, models, pred_var = 'transbetacarotene')
# Results w/ unfiltered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res12-res3
res12_3 = cbind(round(res12,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res12_3[order(res12_3$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 4.353 0.3551 4.250 0.3107 | 0.00985 -0.06969
## Support Vector Machines (kernlab) 4.436 0.4241 4.261 0.3113 | 0.04756 -0.07773
## Elastic Net 4.450 0.3256 4.086 0.2937 | 0.36575 -0.09606
## Partial Least Squares (widekernelpls) 4.592 0.2804 4.034 0.2734 | 0.23010 -0.14481
## Conditional Inference Random Forest 4.645 0.3540 4.073 0.2351 | -0.07892 -0.04160
## Partial Least Squares (pls) 4.652 0.2530 4.154 0.2529 | 0.51453 -0.20133
## Partial Least Squares (kernelpls) 4.681 0.2745 3.964 0.2656 | 0.51153 -0.23659
## Partial Least Squares (simpls) 4.685 0.2545 4.028 0.2627 | 0.46842 -0.24302
## Ridge Regression (w/ FS) 4.730 0.3430 4.114 0.3034 | 0.67862 -0.05898
## Linear Regression (w/ Forward Selection) 4.860 0.2492 4.095 0.2565 | -0.16216 -0.09835
## Conditional Inference Tree 4.870 0.2706 4.030 0.2714 | -0.05825 0.02809
## Linear Regression (w/ Stepwise Selection) 4.909 0.2406 3.983 0.2518 | -0.87333 -0.06477
## K-Nearest Neighbors 4.996 0.1359 3.897 0.1837 | 0.45952 -0.08747
## Linear Regression (w/ Backwards Selection) 5.179 0.2966 4.147 0.2610 | 0.26124 0.01821
## Decision Trees 5.221 0.2181 3.744 0.2196 | 0.08775 0.13058
## Ridge Regression 5.627 0.2783 3.886 0.3185 | 1.37378 -0.04969
## Random Forest 6.101 0.1621 4.067 0.2230 | 0.46009 -0.06436
## Lasso 13.821 0.1487 7.389 0.1694 | -2.62904 -0.02593
## Linear Regression 264.936 0.2363 457.315 0.3017 | -6.19636 -0.02223
The machine learning analysis with the unprocessed fusion data showed a decrease in model performance, with an overall increase in RMSE values compared to the unprocessed UV data results. The best performance was achieved by the SVM models (e1071 and kernlab packages), with RMSEs of 4.353 and 4.436, respectively, and elastic net, with an RMSE of 4.450.
The variable importance was calculated for the models that achieved the best performance using the unprocessed fusion data. These models were support vector machines (e1071 and kernlab packages) and elastic net.
# Support Vector Machines (e1071 package)
varImp7 = train_models_performance(carot.fus, c('svmLinear2'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Support Vector Machines (kernlab package)
varImp8 = train_models_performance(carot.fus, c('svmLinear'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Elastic Network
varImp9 = train_models_performance(carot.fus, c('enet'), 'transbetacarotene', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Variable Importance: SVMs (e1071 package) | SVMs (kernlab package) | Elastic Network
div = rep(' | ', dim(varImp7$vips[[1]])[1])
cbind(varImp7$vips[[1]], div, varImp8$vips[[1]], div, varImp9$vips[[1]])
## Overall Mean div Overall Mean div Overall Mean
## 472 100.00 100.00 | 100.00 100.00 | 100.00 100.00
## 471 99.96 99.96 | 99.96 99.96 | 99.96 99.96
## 473 99.94 99.94 | 99.94 99.94 | 99.94 99.94
## 469 99.39 99.39 | 99.39 99.39 | 99.39 99.39
## 474 99.39 99.39 | 99.39 99.39 | 99.39 99.39
## 470 99.08 99.08 | 99.08 99.08 | 99.08 99.08
## 468 98.45 98.45 | 98.45 98.45 | 98.45 98.45
## 475 98.36 98.36 | 98.36 98.36 | 98.36 98.36
## 467 97.74 97.74 | 97.74 97.74 | 97.74 97.74
## 479 97.68 97.68 | 97.68 97.68 | 97.68 97.68
## 466 97.25 97.25 | 97.25 97.25 | 97.25 97.25
## 476 96.77 96.77 | 96.77 96.77 | 96.77 96.77
## 480 96.22 96.22 | 96.22 96.22 | 96.22 96.22
## 465 95.87 95.87 | 95.87 95.87 | 95.87 95.87
## 477 95.70 95.70 | 95.70 95.70 | 95.70 95.70
## 464 93.97 93.97 | 93.97 93.97 | 93.97 93.97
## 481 93.92 93.92 | 93.92 93.92 | 93.92 93.92
## 478 93.06 93.06 | 93.06 93.06 | 93.06 93.06
## 463 92.68 92.68 | 92.68 92.68 | 92.68 92.68
## 462 91.53 91.53 | 91.53 91.53 | 91.53 91.53
## 459 91.30 91.30 | 91.30 91.30 | 91.30 91.30
## 460 90.88 90.88 | 90.88 90.88 | 90.88 90.88
## 461 90.09 90.09 | 90.09 90.09 | 90.09 90.09
## 458 89.76 89.76 | 89.76 89.76 | 89.76 89.76
## 457 88.25 88.25 | 88.25 88.25 | 88.25 88.25
## 482 87.84 87.84 | 87.84 87.84 | 87.84 87.84
## 494 86.27 86.27 | 86.27 86.27 | 86.27 86.27
## 456 85.51 85.51 | 85.51 85.51 | 85.51 85.51
## 495 85.49 85.49 | 85.49 85.49 | 85.49 85.49
## 455 84.02 84.02 | 84.02 84.02 | 84.02 84.02
## 483 83.89 83.89 | 83.89 83.89 | 83.89 83.89
## 486 83.82 83.82 | 83.82 83.82 | 83.82 83.82
## 444 82.87 82.87 | 82.87 82.87 | 82.87 82.87
## 487 82.71 82.71 | 82.71 82.71 | 82.71 82.71
## 445 82.41 82.41 | 82.41 82.41 | 82.41 82.41
## 489 82.36 82.36 | 82.36 82.36 | 82.36 82.36
## 443 82.16 82.16 | 82.16 82.16 | 82.16 82.16
## 446 82.07 82.07 | 82.07 82.07 | 82.07 82.07
## 488 82.06 82.06 | 82.06 82.06 | 82.06 82.06
## 454 81.36 81.36 | 81.36 81.36 | 81.36 81.36
## 442 81.27 81.27 | 81.27 81.27 | 81.27 81.27
## 447 81.16 81.16 | 81.16 81.16 | 81.16 81.16
## 484 80.81 80.81 | 80.81 80.81 | 80.81 80.81
## 448 80.72 80.72 | 80.72 80.72 | 80.72 80.72
## 440 80.69 80.69 | 80.69 80.69 | 80.69 80.69
## 441 80.44 80.44 | 80.44 80.44 | 80.44 80.44
## 485 80.43 80.43 | 80.43 80.43 | 80.43 80.43
## 453 80.26 80.26 | 80.26 80.26 | 80.26 80.26
## 452 79.83 79.83 | 79.83 79.83 | 79.83 79.83
## 439 79.79 79.79 | 79.79 79.79 | 79.79 79.79
## 451 79.74 79.74 | 79.74 79.74 | 79.74 79.74
## 449 79.73 79.73 | 79.73 79.73 | 79.73 79.73
## 493 79.49 79.49 | 79.49 79.49 | 79.49 79.49
## 450 79.04 79.04 | 79.04 79.04 | 79.04 79.04
## 490 77.55 77.55 | 77.55 77.55 | 77.55 77.55
## 496 76.75 76.75 | 76.75 76.75 | 76.75 76.75
## 438 76.60 76.60 | 76.60 76.60 | 76.60 76.60
## 492 76.50 76.50 | 76.50 76.50 | 76.50 76.50
## 491 76.39 76.39 | 76.39 76.39 | 76.39 76.39
## 437 75.56 75.56 | 75.56 75.56 | 75.56 75.56
## 436 73.16 73.16 | 73.16 73.16 | 73.16 73.16
## 435 71.03 71.03 | 71.03 71.03 | 71.03 71.03
## 497 70.22 70.22 | 70.22 70.22 | 70.22 70.22
## 425 69.75 69.75 | 69.75 69.75 | 69.75 69.75
## 424 69.53 69.53 | 69.53 69.53 | 69.53 69.53
## 426 69.04 69.04 | 69.04 69.04 | 69.04 69.04
## 434 68.98 68.98 | 68.98 68.98 | 68.98 68.98
## 433 68.81 68.81 | 68.81 68.81 | 68.81 68.81
## 427 68.63 68.63 | 68.63 68.63 | 68.63 68.63
## 418 68.55 68.55 | 68.55 68.55 | 68.55 68.55
## 423 68.53 68.53 | 68.53 68.53 | 68.53 68.53
## 419 68.00 68.00 | 68.00 68.00 | 68.00 68.00
## 432 67.65 67.65 | 67.65 67.65 | 67.65 67.65
## 422 67.64 67.64 | 67.64 67.64 | 67.64 67.64
## 428 67.50 67.50 | 67.50 67.50 | 67.50 67.50
## 429 67.17 67.17 | 67.17 67.17 | 67.17 67.17
## 431 67.15 67.15 | 67.15 67.15 | 67.15 67.15
## 430 67.13 67.13 | 67.13 67.13 | 67.13 67.13
## 417 66.71 66.71 | 66.71 66.71 | 66.71 66.71
## 421 66.44 66.44 | 66.44 66.44 | 66.44 66.44
## 416 66.40 66.40 | 66.40 66.40 | 66.40 66.40
## 415 65.73 65.73 | 65.73 65.73 | 65.73 65.73
## 414 64.25 64.25 | 64.25 64.25 | 64.25 64.25
## 420 63.74 63.74 | 63.74 63.74 | 63.74 63.74
## 413 62.04 62.04 | 62.04 62.04 | 62.04 62.04
## Cielab_B 61.22 61.22 | 61.22 61.22 | 61.22 61.22
## 498 60.58 60.58 | 60.58 60.58 | 60.58 60.58
## 412 59.64 59.64 | 59.64 59.64 | 59.64 59.64
## 499 57.96 57.96 | 57.96 57.96 | 57.96 57.96
## 411 57.41 57.41 | 57.41 57.41 | 57.41 57.41
## 500 54.35 54.35 | 54.35 54.35 | 54.35 54.35
## Cielab_A 54.33 54.33 | 54.33 54.33 | 54.33 54.33
## 410 53.95 53.95 | 53.95 53.95 | 53.95 53.95
## 409 52.99 52.99 | 52.99 52.99 | 52.99 52.99
## 408 51.17 51.17 | 51.17 51.17 | 51.17 51.17
## 407 48.08 48.08 | 48.08 48.08 | 48.08 48.08
## 406 43.24 43.24 | 43.24 43.24 | 43.24 43.24
## 405 41.29 41.29 | 41.29 41.29 | 41.29 41.29
## 404 38.71 38.71 | 38.71 38.71 | 38.71 38.71
## 403 37.47 37.47 | 37.47 37.47 | 37.47 37.47
## 402 34.33 34.33 | 34.33 34.33 | 34.33 34.33
## 401 33.05 33.05 | 33.05 33.05 | 33.05 33.05
## 400 30.58 30.58 | 30.58 30.58 | 30.58 30.58
## Cielab_L 0.00 0.00 | 0.00 0.00 | 0.00 0.00
The results for variable importance again show that the predictors with the most impact on the results are the ones around the 470 nm wavelength, the most important variable being the one corresponding to the 472 nm wavelength; the CIELAB variables rank below most of the retained wavelengths.
# Using dataset w/ 40% data filtered
res13 = perform_ML(carot.fus.filt, models, pred_var = 'transbetacarotene')
# Results w/ 40% filtered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res13-res3
res13_3 = cbind(round(res13,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res13_3[order(res13_3$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 4.341 0.43004 4.260 0.31603 | -0.00278 0.00527
## Support Vector Machines (kernlab) 4.458 0.44320 4.251 0.31300 | 0.06938 -0.05861
## Partial Least Squares (simpls) 4.577 0.26942 3.864 0.25090 | 0.36021 -0.22809
## Partial Least Squares (pls) 4.604 0.25776 3.975 0.25330 | 0.46726 -0.19661
## Partial Least Squares (widekernelpls) 4.642 0.29400 3.987 0.28253 | 0.27956 -0.13118
## Partial Least Squares (kernelpls) 4.664 0.25297 3.996 0.23998 | 0.49425 -0.25808
## Elastic Net 4.698 0.22024 3.925 0.24457 | 0.61387 -0.20145
## Conditional Inference Random Forest 4.709 0.31113 4.092 0.22349 | -0.01546 -0.08450
## Ridge Regression (w/ FS) 4.845 0.21226 3.876 0.24484 | 0.79396 -0.18973
## K-Nearest Neighbors 4.881 0.20095 3.765 0.20428 | 0.34527 -0.02248
## Linear Regression (w/ Forward Selection) 4.924 0.28887 3.990 0.28127 | -0.09826 -0.05863
## Decision Trees 4.951 0.29144 3.724 0.26478 | -0.18239 0.20389
## Conditional Inference Tree 4.970 0.02511 3.916 0.02603 | 0.04152 -0.21737
## Ridge Regression 5.010 0.16075 4.024 0.16434 | 0.75689 -0.16721
## Linear Regression (w/ Backwards Selection) 5.095 0.33564 3.928 0.32173 | 0.17730 0.05725
## Random Forest 5.912 0.17026 3.727 0.21035 | 0.27105 -0.05618
## Linear Regression (w/ Stepwise Selection) 5.945 0.30782 4.398 0.31041 | 0.16268 0.00244
## Lasso 11.800 0.25968 10.428 0.28364 | -4.65051 0.08503
## Linear Regression 586.915 0.27394 2747.641 0.29894 | 315.78277 0.01539
The machine learning analysis with the filtered fusion data showed an overall decrease in model performance compared with the results obtained with the unprocessed UV data (higher RMSE values). The best performance was achieved by support vector machines (e1071 and kernlab packages), with RMSEs of 4.341 and 4.458, respectively.
Both the filtered and unfiltered fusion datasets were scaled and the machine learning models were applied to these scaled datasets.
# Using unfiltered dataset
carot.fus.sc = specmine::scaling(carot.fus)
res14 = perform_ML(carot.fus.sc, models, pred_var = 'transbetacarotene')
# Results w/ unfiltered scaled fusion data and difference to unprocessed UV data results (Two last columns)
diff = res14-res3
res14_3 = cbind(round(res14,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res14_3[order(res14_3$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 4.409 0.4003 4.438 0.3390 | 0.06584 -0.02448
## Support Vector Machines (kernlab) 4.418 0.3873 4.236 0.3288 | 0.02988 -0.11448
## Elastic Net 4.460 0.3272 4.102 0.3089 | 0.37561 -0.09450
## Partial Least Squares (widekernelpls) 4.514 0.4432 4.214 0.3161 | 0.15209 0.01806
## Ridge Regression (w/ FS) 4.519 0.3453 3.991 0.2899 | 0.46776 -0.05667
## Partial Least Squares (pls) 4.522 0.3912 4.157 0.3054 | 0.38480 -0.06320
## Partial Least Squares (simpls) 4.529 0.4455 4.083 0.3245 | 0.31228 -0.05206
## Partial Least Squares (kernelpls) 4.561 0.4649 4.232 0.3154 | 0.39112 -0.04610
## K-Nearest Neighbors 4.709 0.2221 4.098 0.2119 | 0.17331 -0.00136
## Conditional Inference Random Forest 4.752 0.3363 4.240 0.2308 | 0.02744 -0.05929
## Linear Regression (w/ Forward Selection) 4.894 0.2738 4.110 0.2860 | -0.12881 -0.07374
## Conditional Inference Tree 4.915 0.2458 4.056 0.2403 | -0.01313 0.00334
## Decision Trees 5.111 0.2385 3.766 0.2313 | -0.02184 0.15100
## Ridge Regression 5.315 0.2682 4.093 0.2836 | 1.06263 -0.05971
## Linear Regression (w/ Backwards Selection) 5.348 0.2695 4.003 0.2436 | 0.43029 -0.00890
## Linear Regression (w/ Stepwise Selection) 5.540 0.2226 4.222 0.2761 | -0.24252 -0.08276
## Random Forest 5.879 0.1874 3.932 0.2465 | 0.23885 -0.03904
## Lasso 16.820 0.2086 15.214 0.2308 | 0.37003 0.03396
## Linear Regression 442.244 0.2519 1153.205 0.2742 | 171.11181 -0.00662
The machine learning analysis with the scaled fusion data showed a decrease in model performance, with increased RMSE values compared to the unprocessed UV data results. The best performance was achieved by the SVM models (e1071 and kernlab packages), with RMSEs of 4.409 and 4.418, respectively.
# Using dataset w/ 40% data filtered
carot.fus.filt.sc = specmine::scaling(carot.fus.filt)
res15 = perform_ML(carot.fus.filt.sc, models, pred_var = 'transbetacarotene')
# Results w/ 40% filtered and scaled fusion data and difference to unprocessed UV data results (Two last columns)
diff = res15-res3
res15_3 = cbind(round(res15,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res15_3[order(res15_3$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 4.187 0.3731 4.231 0.32637 | -0.15695 -0.05166
## Support Vector Machines (kernlab) 4.341 0.3911 4.254 0.31080 | -0.04792 -0.11073
## Partial Least Squares (widekernelpls) 4.595 0.4442 4.115 0.29971 | 0.23263 0.01900
## K-Nearest Neighbors 4.603 0.2077 4.050 0.20361 | 0.06682 -0.01569
## Partial Least Squares (kernelpls) 4.626 0.4058 4.296 0.29314 | 0.45646 -0.10523
## Partial Least Squares (simpls) 4.727 0.4132 4.173 0.31029 | 0.51009 -0.08435
## Elastic Net 4.728 0.2311 3.972 0.23533 | 0.64333 -0.19063
## Conditional Inference Random Forest 4.737 0.3699 4.067 0.24418 | 0.01266 -0.02578
## Partial Least Squares (pls) 4.744 0.4098 4.311 0.31637 | 0.60660 -0.04456
## Ridge Regression 5.007 0.1578 3.904 0.16725 | 0.75378 -0.17013
## Conditional Inference Tree 5.013 0.0089 3.939 0.01087 | 0.08435 -0.23358
## Ridge Regression (w/ FS) 5.086 0.1747 4.134 0.19989 | 1.03500 -0.22732
## Decision Trees 5.179 0.2505 3.834 0.23159 | 0.04596 0.16292
## Linear Regression (w/ Forward Selection) 5.187 0.3103 3.995 0.29537 | 0.16485 -0.03722
## Linear Regression (w/ Stepwise Selection) 5.469 0.3291 4.114 0.29630 | -0.31317 0.02377
## Random Forest 5.638 0.2566 4.006 0.24846 | -0.00253 0.03017
## Linear Regression (w/ Backwards Selection) 5.963 0.2701 4.914 0.27612 | 1.04565 -0.00825
## Lasso 11.350 0.2755 7.541 0.27930 | -5.09964 0.10083
## Linear Regression 585.131 0.2704 2300.894 0.29489 | 313.99835 0.01184
Using the filtered and scaled fusion data resulted in an overall decrease in model performance compared to the unprocessed UV data results (higher RMSE values). The best performance was achieved by the SVM models (e1071 and kernlab packages), with RMSEs of 4.187 and 4.341, respectively.
UV Data: the best results for the “transbetacarotene” variable were obtained with ridge regression (w/ feature selection), with an RMSE of 3.216 after filtering 80% of the data and 3.651 with first-derivative preprocessing.
CIELAB Data: the best result was obtained with partial least squares (“widekernelpls” method), with an RMSE of 4.551 on the unscaled data.
Fusion Data: the best result was obtained with support vector machines (e1071 package), with an RMSE of 4.187 on the filtered and scaled data.