The aim of this work is to validate a quantification method for carotenoid content in roots of M. esculenta based on colorimetric data obtained with the CIELAB color system. The underlying assumption is that predictive statistical techniques and machine learning can correlate colorimetric data, which are easily obtained in the field, with the contents measured by traditional quantification techniques such as UV-visible spectrophotometry or HPLC and, from this, construct prediction models of carotenoid content for this type of biomass.
Roots of fifty M. esculenta genotypes belonging to EPAGRI’s germplasm bank were sampled in the 2014/2015 season. Carotenoids were extracted from fresh roots and the absorbances of the organosolvent extracts were recorded on a UV-visible spectrophotometer over a spectral window from 200 to 700 nm. Aliquots (10 µL) of the extracts were also injected into a liquid chromatograph. The color attributes of the samples were measured with a colorimeter and the results were expressed on the CIELAB color space scale.
To run this script the following packages are necessary:
library(specmine)
library(xlsx)
Setting working directory:
setwd("C:/Users/Telma/Desktop/CassavaCarotenoids")
set.seed(12345)
The machine learning models used in this analysis are listed in the table below. These belong to the caret package, which is used by specmine.
Model | “Method” Value | Built-in Feature Selection |
---|---|---|
Conditional Inference Random Forest | cforest | YES |
Conditional Inference Tree | ctree | YES |
Decision Trees | rpart | YES |
Elastic Net | enet | YES |
K-Nearest Neighbors | knn | NO |
Lasso Regression | lasso | YES |
Linear Regression | lm | NO |
Linear Regression (w/ Backwards Selection) | leapBackward | YES |
Linear Regression (w/ Forward Selection) | leapForward | YES |
Linear Regression (w/ Stepwise Selection) | leapSeq | YES |
Partial Least Squares | kernelpls, pls, simpls, widekernelpls | YES |
Random Forest | rf | YES |
Ridge Regression | ridge | NO |
Ridge Regression (w/ Feature Selection) | foba | YES
Support Vector Machines (kernlab package) | svmLinear | NO |
Support Vector Machines (e1071 package) | svmLinear2 | NO |
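The “method” values in the table are the identifiers accepted by caret (and therefore by specmine). Whether a given method performs built-in (implicit) feature selection can be checked directly in caret’s model registry, as in the sketch below; the exact label and tags returned depend on the installed caret version.
library(caret)
enet.info = getModelInfo("enet", regex = FALSE)[[1]] # registry entry for the elastic net model
enet.info$label # model label registered in caret
enet.info$tags  # tags such as "Implicit Feature Selection"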
The following function is used to retrieve the model name given the “method” value.
getModelName <- function(model) {
if (model == 'lasso') name = 'Lasso'
else if (model == 'ridge') name = 'Ridge Regression'
else if (model == 'foba') name = 'Ridge Regression (w/ FS)'
else if (model == 'rf') name = 'Random Forest'
else if (model == 'cforest') name = 'Conditional Inference Random Forest'
else if (model == 'enet') name = 'Elastic Net'
else if (model == 'pls') name = 'Partial Least Squares (pls)'
else if (model == 'kernelpls') name = 'Partial Least Squares (kernelpls)'
else if (model == 'simpls') name = 'Partial Least Squares (simpls)'
else if (model == 'widekernelpls') name = 'Partial Least Squares (widekernelpls)'
else if (model == 'rpart') name = 'Decision Trees'
else if (model == 'ctree') name = 'Conditional Inference Tree'
else if (model == 'svmLinear') name = 'Support Vector Machines (kernlab)'
else if (model == 'svmLinear2') name = 'Support Vector Machines (e1071)'
else if (model == 'knn') name = 'K-Nearest Neighbors'
else if (model == 'lm') name = 'Linear Regression'
else if (model == 'leapBackward') name = 'Linear Regression (w/ Backwards Selection)'
else if (model == 'leapForward') name = 'Linear Regression (w/ Forward Selection)'
else if (model == 'leapSeq') name = 'Linear Regression (w/ Stepwise Selection)'
else return()
return (name)
}
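As an aside, the same mapping can be written more compactly as a named lookup vector. The sketch below is equivalent to getModelName() but is not used in the rest of the analysis (model.names and getModelName2 are illustrative names only).
model.names = c(lasso = 'Lasso', ridge = 'Ridge Regression', foba = 'Ridge Regression (w/ FS)',
                rf = 'Random Forest', cforest = 'Conditional Inference Random Forest',
                enet = 'Elastic Net', pls = 'Partial Least Squares (pls)',
                kernelpls = 'Partial Least Squares (kernelpls)',
                simpls = 'Partial Least Squares (simpls)',
                widekernelpls = 'Partial Least Squares (widekernelpls)',
                rpart = 'Decision Trees', ctree = 'Conditional Inference Tree',
                svmLinear = 'Support Vector Machines (kernlab)',
                svmLinear2 = 'Support Vector Machines (e1071)',
                knn = 'K-Nearest Neighbors', lm = 'Linear Regression',
                leapBackward = 'Linear Regression (w/ Backwards Selection)',
                leapForward = 'Linear Regression (w/ Forward Selection)',
                leapSeq = 'Linear Regression (w/ Stepwise Selection)')
# Equivalent to getModelName(); returns NA for unknown "method" values
getModelName2 <- function(model) unname(model.names[model])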
The following function returns a data frame with the result of applying one or more machine learning models to a selected dataset. The metadata variable for prediction must be supplied.
perform_ML <- function(dataset, models, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
  for (model in models) {
    name = getModelName(model)
    # 5-fold repeated cross-validation; variable importance is not computed here
    ml_res = train_models_performance(dataset, c(model), pred_var, "repeatedcv",
                                      num.folds = 5, compute.varimp = F)
    # store the mean and standard deviation of RMSE and R-squared for this model
    res[name,] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
                   ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    # keep a copy of the partial results in the global environment
    assign('res', res, envir = .GlobalEnv)
  }
  return(res)
}
The following function returns a data frame with the results of applying a machine learning model to a dataset after several preprocessing methods have been applied, including scaling, smoothing interpolation, background, offset and baseline corrections, first derivative and multiplicative scatter correction. The metadata variable for prediction must be supplied.
perform_ML_preproc <- function(dataset, model, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
  # scaling
  ds.sc = specmine::scaling(dataset)
  # smoothing interpolation (loess) over a 10 nm grid
  ds.wavelens = get_x_values_as_num(dataset)
  x.axis.sm = seq(min(ds.wavelens), max(ds.wavelens), 10)
  ds.smooth = smoothing_interpolation(dataset, method = "loess", x.axis = x.axis.sm)
  # background, offset and baseline corrections (applied cumulatively)
  ds.bg = data_correction(dataset, 'background')
  ds.offset = data_correction(ds.bg, 'offset')
  ds.baseline = data_correction(ds.offset, 'baseline')
  # first derivative and multiplicative scatter correction
  ds.fd = first_derivative(dataset)
  ds.msc = msc_correction(dataset)
  datasets = list('No preprocessing' = dataset, 'Scaling' = ds.sc, 'Smoothing' = ds.smooth,
                  'Background cor' = ds.bg, 'Background + Offset cors' = ds.offset,
                  'Background + Offset + Baseline cors' = ds.baseline, 'First Derivative' = ds.fd,
                  'Multiplicative Scatter Cor' = ds.msc)
  i = 1
  for (ds in datasets) {
    ml_res = train_models_performance(ds, c(model), pred_var, "repeatedcv", num.folds = 5, compute.varimp = F)
    res[names(datasets)[i],] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
                                 ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    assign('res', res, envir = .GlobalEnv)
    i = i + 1
  }
  return(res)
}
The UV data are stored in 150 .xlsx files (3 replicates for each of the 50 genotypes), each file containing the absorbance values read between 200 and 700 nm.
files = list.files("data/UV")
datamat = matrix(nrow = 501, ncol = length(files))
rownames(datamat) = 200:700 # data recorded between 200-700 nm
colnames(datamat) = gsub("\\.xlsx?$", "", files) # sample names = file names without the extension
for (i in 1:length(files)){
  tab_excel = read.xlsx(paste("data/UV/", files[i], sep = ""), sheetIndex = 1, header = F)
  # pad shorter spectra with NA so that every column has 501 values
  datamat[,i] = c(tab_excel[,2], rep(NA, 501 - length(tab_excel[,2])))
}
datamat[1:6, 1:6]
## 101.1 101.2 101.3 102.1 102.2 102.3
## 200 0.08763 0.1863 0.10565 0.10565 0.1482 0.13221
## 201 0.09468 0.2184 0.13756 0.12944 0.1254 0.08732
## 202 0.06238 0.1792 0.08410 0.09159 0.1437 0.09159
## 203 0.11513 0.1776 0.13093 0.13497 0.1190 0.07799
## 204 0.11364 0.2038 0.05227 0.11364 0.1376 0.08368
## 205 0.13941 0.1820 0.10809 0.09691 0.1006 0.10809
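A quick sanity check on the assembled matrix (the NAs come from the padding in the loop above) can be run as follows:
dim(datamat)        # 501 wavelengths x 150 samples (3 replicates x 50 genotypes)
sum(is.na(datamat)) # number of padded (missing) values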
Besides information regarding sample varieties and replicates, the metadata file also contains information about HPLC concentration measurements and CIELAB data.
file.metadata = "metadata/Carotenoides_Colorimetria.csv"
metadata = read_metadata(file.metadata)
description = "UV data for cassava cultivars - carotenoids"
label.x = "Wavelength"
label.values = "Absorbance"
head(metadata)
## Varieties Replicates Cielab_L Cielab_A Cielab_B CarotenoidsContent_TCCS Lutein Betacryptoxanthin
## 3.1 3 1 85.72 -2.70 22.28 4.853 0.03248 0.06543
## 3.2 3 2 86.18 -2.48 21.39 4.809 0.03248 0.06543
## 3.3 3 3 85.25 -2.64 22.38 4.951 0.03248 0.06543
## 5.1 5 1 85.47 -1.76 6.74 3.098 0.02598 0.07023
## 5.2 5 2 82.29 -2.00 7.02 4.046 0.02598 0.07023
## 5.3 5 3 84.99 -1.86 7.25 3.383 0.02598 0.07023
## Alphacarotene Cisbetacarotene transbetacarotene Lycopene TCCHPLC
## 3.1 0.06021 2.250 3.269 0 5.678
## 3.2 0.06021 2.250 3.269 0 5.678
## 3.3 0.06021 2.250 3.269 0 5.678
## 5.1 0.08319 2.679 2.860 0 5.719
## 5.2 0.08319 2.679 2.860 0 5.719
## 5.3 0.08319 2.679 2.860 0 5.719
After creating a matrix from the UV .xlsx files and reading the metadata, a dataset can be easily created.
Carotenoides_Colorimetria = create_dataset(type = "uvv-spectra", datamatrix = datamat, metadata = metadata,
label.x = label.x, label.values = label.values,
description = description)
sum_dataset(Carotenoides_Colorimetria)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids
## Type of data: uvv-spectra
## Number of samples: 150
## Number of data points 501
## Number of metadata variables: 13
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 4224
## Mean of data values: 0.3301
## Median of data values: 0.1048
## Standard deviation: 0.6824
## Range of values: -0.06964 4.191
## Quantiles:
## 0% 25% 50% 75% 100%
## -0.06964 0.02003 0.10478 0.23166 4.19051
Because most carotenoids absorb in the visible region of the spectrum, between 400 and 500 nm, a subset of the original dataset was created containing only the data points within this wavelength interval. Also, because the dataset has some missing values, as shown in the summary above, these were imputed with the mean of each variable’s values.
carot_sub = subset_x_values_by_interval(Carotenoides_Colorimetria, 400, 500) # Absorbances between 400-500nm
carot_sub_nomissing = missingvalues_imputation(carot_sub, method = "mean")
sum_dataset(carot_sub_nomissing)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids; Missing value imputation with method mean
## Type of data: uvv-spectra
## Number of samples: 150
## Number of data points 101
## Number of metadata variables: 13
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 0.2316
## Median of data values: 0.187
## Standard deviation: 0.1907
## Range of values: -0.002721 1.574
## Quantiles:
## 0% 25% 50% 75% 100%
## -0.002721 0.130033 0.186963 0.261674 1.574271
The data were then aggregated so that there is a single sample per genotype instead of three replicates (150 samples -> 50 samples).
indexes = rep(seq(1, num_samples(carot_sub_nomissing)/3), each = 3)
carotAg = aggregate_samples(carot_sub_nomissing, indexes, meta.to.remove = c("Replicates"))
sum_dataset(carotAg)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids; Missing value imputation with method mean
## Type of data: uvv-spectra
## Number of samples: 50
## Number of data points 101
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 0.2316
## Median of data values: 0.1871
## Standard deviation: 0.188
## Range of values: 0.00136 1.299
## Quantiles:
## 0% 25% 50% 75% 100%
## 0.00136 0.13380 0.18708 0.26038 1.29949
The dataset is now ready to be used in the subsequent analysis.
The following step consisted in applying a variety of machine learning regression approaches to the data, testing different output variables and various data preprocessing methods.
To test model performance for the prediction of carotenoid content, the machine learning models listed above were applied to the created dataset using different output variables. The evaluation metric chosen to compare model performance was the Root Mean Square Error (RMSE), since it explicitly shows how much the model predictions deviate, on average, from the actual values in the dataset.
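For reference, the two metrics reported below are computed from observed and predicted values essentially as in the sketch below; caret computes them for each cross-validation resample and averages the results (the function names here are illustrative only).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2)) # root-mean-square error
rsq  <- function(obs, pred) cor(obs, pred)^2           # R-squared as squared correlation between obs and pred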
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')
#Using CarotenoidsContent_TCCS variable
res1 = perform_ML(carotAg, models, pred_var = 'CarotenoidsContent_TCCS')
res1[order(res1$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Ridge Regression 3.361 0.9453 2.606 0.06067
## Partial Least Squares (widekernelpls) 3.392 0.9392 2.419 0.05604
## Partial Least Squares (kernelpls) 3.515 0.9498 2.366 0.04934
## Partial Least Squares (simpls) 3.563 0.9293 2.649 0.12503
## Linear Regression (w/ Backwards Selection) 3.587 0.8794 2.809 0.17217
## Elastic Net 3.750 0.9244 3.028 0.12689
## Partial Least Squares (pls) 3.824 0.9238 2.884 0.15686
## Ridge Regression (w/ FS) 3.826 0.9353 2.642 0.09911
## Random Forest 3.838 0.9696 2.215 0.03146
## Support Vector Machines (e1071) 3.860 0.9205 3.112 0.15125
## Support Vector Machines (kernlab) 4.342 0.9228 3.345 0.14426
## Linear Regression (w/ Forward Selection) 4.355 0.8581 3.628 0.21425
## Linear Regression (w/ Stepwise Selection) 4.761 0.8179 4.176 0.24028
## K-Nearest Neighbors 5.245 0.8721 3.902 0.15636
## Lasso 5.369 0.8270 4.485 0.23804
## Conditional Inference Random Forest 6.764 0.7787 3.095 0.12982
## Conditional Inference Tree 7.552 0.6522 3.576 0.19633
## Decision Trees 7.647 0.6644 3.198 0.19817
## Linear Regression 18.372 0.5572 31.137 0.34458
mean(get_metadata(carotAg)$CarotenoidsContent_TCCS) # CarotenoidsContent_TCCS variable mean values
## [1] 10.67
The results using the “CarotenoidsContent_TCCS” variable show that the models achieving the lowest RMSE values for these data were ridge regression, with an RMSE of 3.361, and partial least squares (widekernelpls and kernelpls), with RMSEs of 3.392 and 3.515, respectively. These errors are still considerable, however, given the average value of the “CarotenoidsContent_TCCS” variable (10.67).
Overall, the coefficient of determination (\(R^{2}\)) shows a good fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 18.372 and an \(R^{2}\) of 0.5572.
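To put these errors in perspective, the best RMSE can be expressed as a fraction of the mean of the target variable, using the objects created above; here it amounts to roughly a third of the mean.
# Best RMSE relative to the mean CarotenoidsContent_TCCS value (about 3.36 / 10.67)
res1['Ridge Regression', 'RMSE'] / mean(get_metadata(carotAg)$CarotenoidsContent_TCCS)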
#Using TCCHPLC variable
res2 = perform_ML(carotAg, models, pred_var = 'TCCHPLC')
res2[order(res2$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Partial Least Squares (kernelpls) 5.725 0.5707 4.038 0.3318
## Partial Least Squares (simpls) 5.770 0.5962 3.751 0.3275
## Partial Least Squares (widekernelpls) 5.843 0.5930 3.948 0.3269
## Support Vector Machines (e1071) 5.881 0.5235 3.937 0.3119
## Partial Least Squares (pls) 5.888 0.5992 4.034 0.3227
## Elastic Net 5.899 0.5939 3.557 0.3148
## Ridge Regression (w/ FS) 6.018 0.6326 4.017 0.3127
## Support Vector Machines (kernlab) 6.263 0.6171 4.362 0.2797
## Linear Regression (w/ Backwards Selection) 6.415 0.4996 3.838 0.3113
## K-Nearest Neighbors 6.557 0.4233 4.029 0.2852
## Conditional Inference Random Forest 6.715 0.5135 3.936 0.3083
## Ridge Regression 6.855 0.5322 4.317 0.2988
## Conditional Inference Tree 7.079 0.4570 3.794 0.2994
## Random Forest 7.105 0.3762 3.339 0.3058
## Decision Trees 7.375 0.4819 3.346 0.2853
## Linear Regression (w/ Stepwise Selection) 7.735 0.4760 6.605 0.3510
## Linear Regression (w/ Forward Selection) 8.303 0.4804 6.755 0.2756
## Lasso 18.403 0.2409 12.110 0.2671
## Linear Regression 513.237 0.2688 1649.554 0.2781
mean(get_metadata(carotAg)$TCCHPLC) # TCCHPLC variable mean values
## [1] 10.84
The results using the “TCCHPLC” variable show that, overall, RMSE values increased compared to those obtained with the “CarotenoidsContent_TCCS” variable. The models that achieved the lowest RMSE values for these data were partial least squares with methods “kernelpls”, “simpls” and “widekernelpls”, with RMSEs of 5.725, 5.770 and 5.843, respectively, support vector machines with an RMSE of 5.881 and elastic net with an RMSE of 5.899.
Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 513.237 and an \(R^{2}\) of 0.2688. Lasso regression also performed considerably worse in this case, with an RMSE of 18.403.
#Using transbetacarotene variable
res3 = perform_ML(carotAg, models, pred_var = 'transbetacarotene')
res3[order(res3$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Ridge Regression (w/ FS) 4.051 0.40198 3.970 0.3178
## Elastic Net 4.084 0.42169 4.135 0.3501
## Partial Least Squares (pls) 4.137 0.45437 4.172 0.3346
## Partial Least Squares (kernelpls) 4.169 0.51105 4.183 0.3267
## Partial Least Squares (simpls) 4.217 0.49752 4.278 0.3177
## Ridge Regression 4.253 0.32796 4.184 0.3446
## Support Vector Machines (e1071) 4.344 0.42478 4.306 0.3365
## Partial Least Squares (widekernelpls) 4.362 0.42517 4.322 0.3125
## Support Vector Machines (kernlab) 4.389 0.50181 4.218 0.3303
## K-Nearest Neighbors 4.536 0.22342 4.089 0.2083
## Conditional Inference Random Forest 4.724 0.39563 3.985 0.2772
## Linear Regression (w/ Backwards Selection) 4.918 0.27839 4.177 0.2350
## Conditional Inference Tree 4.929 0.24248 3.954 0.2621
## Linear Regression (w/ Forward Selection) 5.023 0.34750 4.157 0.3227
## Decision Trees 5.133 0.08755 4.003 0.1218
## Random Forest 5.641 0.22644 3.829 0.2584
## Linear Regression (w/ Stepwise Selection) 5.782 0.30538 4.320 0.2974
## Lasso 16.450 0.17465 14.823 0.2256
## Linear Regression 271.132 0.25855 482.988 0.2680
mean(get_metadata(carotAg)$transbetacarotene) # transbetacarotene variable mean values
## [1] 5.897
Trans-beta-carotene concentrations were also used, since this was the carotenoid with the highest concentration levels. The results using the “transbetacarotene” variable show that, overall, RMSE values increased compared to the “CarotenoidsContent_TCCS” variable and decreased compared to the “TCCHPLC” variable. The models that achieved the lowest RMSE values for these data were ridge regression (w/ feature selection) with an RMSE of 4.051, elastic net with an RMSE of 4.084 and partial least squares (pls) with an RMSE of 4.137.
Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, as in the previous cases, with an RMSE of 271.132 and an \(R^{2}\) of 0.25855. Lasso regression also performed considerably worse in this case, with an RMSE of 16.450.
All the results above point to a better model performance when using the “CarotenoidsContent_TCCS” metadata variable. This was somewhat expected, since these concentrations were calculated automatically from the UV data. However, the variable of greatest interest is “TCCHPLC”, since it corresponds to the concentrations measured by HPLC, and it was therefore the variable used in the subsequent analysis.
For the best models from the previous analysis (using the “TCCHPLC” metadata variable), the variable importance was calculated. Those models were partial least squares, support vector machines and elastic net.
# Partial least squares
varImp1 = train_models_performance(carotAg, c('kernelpls'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Support vector machines
varImp2 = train_models_performance(carotAg, c('svmLinear2'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Elastic Net
varImp3 = train_models_performance(carotAg, c('enet'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Top 20 variables: Partial least squares | Support vector machines | Elastic Net
div = rep(' | ', dim(varImp1$vips[[1]])[1])
cbind(varImp1$vips[[1]], div, varImp2$vips[[1]], div, varImp3$vips[[1]])[1:20,]
## Overall Mean div Overall Mean div Overall Mean
## 449 100.00 100.00 | 100.00 100.00 | 100.00 100.00
## 448 99.93 99.93 | 99.78 99.78 | 99.78 99.78
## 450 99.76 99.76 | 99.72 99.72 | 99.72 99.72
## 447 99.66 99.66 | 99.41 99.41 | 99.41 99.41
## 446 98.89 98.89 | 99.03 99.03 | 99.03 99.03
## 451 98.31 98.31 | 98.35 98.35 | 98.35 98.35
## 445 97.89 97.89 | 98.20 98.20 | 98.20 98.20
## 452 97.80 97.80 | 97.85 97.85 | 97.85 97.85
## 444 96.43 96.43 | 97.12 97.12 | 97.12 97.12
## 453 96.06 96.06 | 97.08 97.08 | 97.08 97.08
## 443 94.58 94.58 | 95.96 95.96 | 95.96 95.96
## 454 94.45 94.45 | 95.03 95.03 | 95.03 95.03
## 442 92.71 92.71 | 94.72 94.72 | 94.72 94.72
## 455 92.13 92.13 | 93.07 93.07 | 93.07 93.07
## 441 91.43 91.43 | 90.25 90.25 | 90.25 90.25
## 456 90.93 90.93 | 88.18 88.18 | 88.18 88.18
## 440 89.02 89.02 | 87.48 87.48 | 87.48 87.48
## 457 88.53 88.53 | 86.83 86.83 | 86.83 86.83
## 458 87.34 87.34 | 86.16 86.16 | 86.16 86.16
## 439 86.77 86.77 | 85.53 85.53 | 85.53 85.53
The results for variable importance show that the predictors with the most impact on the results are those around the 450 nm wavelength, with the 449 nm variable being the most important.
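The top wavelengths can also be extracted programmatically from the importance tables above, assuming, as in the output shown, that the rows are ordered by decreasing importance (with wavelengths as row names):
head(rownames(varImp1$vips[[1]]), 5) # five most important wavelengths for the partial least squares model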
The next step consisted in testing the best models from the analysis with the “TCCHPLC” metadata variable (partial least squares, support vector machines and elastic net) on preprocessed datasets, to see whether model performance improved.
# Partial least squares
res4 = perform_ML_preproc(carotAg, 'kernelpls', 'TCCHPLC')
res4[order(res4$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Background + Offset cors 5.658 0.5894 4.145 0.3167
## No preprocessing 5.702 0.6206 3.871 0.3022
## Scaling 5.727 0.6206 3.960 0.3338
## Smoothing 5.737 0.5695 3.926 0.3160
## Background cor 5.758 0.5538 4.043 0.3352
## Background + Offset + Baseline cors 6.007 0.5801 4.181 0.3212
## First Derivative 6.432 0.4771 3.904 0.3354
## Multiplicative Scatter Cor 11.802 0.2321 12.659 0.2362
Applying the partial least squares model to the preprocessed datasets showed an improvement in model performance when using a combination of background and offset corrections (RMSE 5.658) as preprocessing.
# Support vector Machines
res5 = perform_ML_preproc(carotAg, 'svmLinear2', 'TCCHPLC')
res5[order(res5$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Smoothing 5.773 0.6053 4.144 0.2957
## Background + Offset cors 5.936 0.5927 4.040 0.3123
## Background cor 6.175 0.5956 4.369 0.3174
## No preprocessing 6.194 0.5581 4.387 0.3185
## Scaling 6.447 0.5740 4.400 0.3277
## Background + Offset + Baseline cors 9.397 0.4780 6.150 0.3145
## First Derivative 10.774 0.4482 6.596 0.3153
## Multiplicative Scatter Cor 11.621 0.3245 9.831 0.2649
Applying the support vector machines model to the preprocessed datasets showed an improvement in model performance when applying smoothing interpolation (RMSE 5.773), a combination of background and offset corrections (RMSE 5.936) or background correction alone (RMSE 6.175).
# Elastic Network
res6 = perform_ML_preproc(carotAg, 'enet', 'TCCHPLC')
res6[order(res6$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Background cor 5.835 0.5967 3.729 0.3154
## No preprocessing 5.846 0.6269 3.675 0.3244
## Smoothing 5.851 0.5999 3.674 0.3150
## Scaling 5.999 0.5948 3.876 0.3098
## Background + Offset + Baseline cors 6.527 0.4202 3.700 0.3210
## Background + Offset cors 6.587 0.5706 4.147 0.3060
## Multiplicative Scatter Cor 7.459 0.3229 3.915 0.2955
## First Derivative 7.803 0.5023 5.645 0.3269
Applying the elastic net model to the preprocessed datasets showed an improvement in model performance when using background correction (RMSE 5.835) as preprocessing.
The data were also filtered in order to determine whether feature selection could improve model performance. A flat pattern filter with the inter-quartile range as the filter function was applied to the dataset, removing 80%, 60% and 40% of the data points each time.
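Conceptually, this filter ranks each wavelength by its inter-quartile range (IQR) across samples and discards the flattest ones. The base-R sketch below illustrates the idea for the 80% case, using the dataset’s $data matrix (wavelengths as rows, samples as columns); the actual filtering relies on specmine’s flat_pattern_filter, whose implementation details may differ.
# Keep only the 20% of wavelengths with the largest IQR (i.e. remove 80% of the data points)
iqr.per.wavelength = apply(carotAg$data, 1, IQR)
keep = iqr.per.wavelength >= quantile(iqr.per.wavelength, 0.80)
sum(keep) # number of wavelengths that would be retained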
#Filtering 80% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 80)
res7 = perform_ML(carotAg.filt, models, 'TCCHPLC')
# Results of 80% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res7-res2
res7_2 = cbind(round(res7,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res7_2[order(res7_2$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 4.091 0.7141 1.466 0.2168 | -1.92679 0.08155
## Support Vector Machines (e1071) 4.962 0.6365 4.039 0.2974 | -0.91884 0.11295
## Elastic Net 5.226 0.6341 3.882 0.3188 | -0.67341 0.04018
## Ridge Regression 5.385 0.5996 3.970 0.3215 | -1.47047 0.06740
## Partial Least Squares (widekernelpls) 5.409 0.5717 3.874 0.3255 | -0.43347 -0.02129
## Support Vector Machines (kernlab) 5.442 0.5871 4.135 0.3202 | -0.82049 -0.02999
## Partial Least Squares (kernelpls) 5.458 0.5505 3.731 0.3032 | -0.26635 -0.02015
## Partial Least Squares (simpls) 5.563 0.5955 3.931 0.3278 | -0.20714 -0.00070
## Partial Least Squares (pls) 5.581 0.5904 3.841 0.3477 | -0.30729 -0.00888
## K-Nearest Neighbors 6.533 0.4546 3.990 0.2846 | -0.02334 0.03133
## Linear Regression (w/ Backwards Selection) 6.595 0.5261 3.744 0.3440 | 0.18012 0.02648
## Linear Regression (w/ Forward Selection) 6.616 0.5200 4.185 0.3456 | -1.68655 0.03959
## Lasso 6.645 0.5152 3.633 0.3200 | -11.75798 0.27432
## Linear Regression (w/ Stepwise Selection) 6.661 0.5144 4.316 0.3592 | -1.07448 0.03841
## Conditional Inference Random Forest 6.708 0.5213 3.888 0.2878 | -0.00623 0.00781
## Conditional Inference Tree 7.073 0.4311 3.440 0.2967 | -0.00638 -0.02593
## Random Forest 7.192 0.3895 3.785 0.3076 | 0.08720 0.01329
## Decision Trees 7.286 0.3926 3.294 0.2937 | -0.08906 -0.08938
## Linear Regression 12.534 0.3206 6.213 0.3027 | -500.70380 0.05184
Filtering out 80% of the data showed an overall increase in model performance, with RMSE values decreasing in comparison to the results obtained with the original dataset. It also massively improved the performance of the linear model (without selection), decreasing its RMSE by roughly 500 units. Ridge regression with feature selection (RMSE 4.091), SVMs (RMSE 4.962) and elastic net (RMSE 5.226) had the best performance.
#Filtering 60% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 60)
res8 = perform_ML(carotAg.filt, models, 'TCCHPLC')
# Results of 60% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res8-res2
res8_2 = cbind(round(res8,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res8_2[order(res8_2$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 4.778 0.7225 3.576 0.3232 | -1.24013 0.08987
## Support Vector Machines (e1071) 5.052 0.6195 4.028 0.3103 | -0.82826 0.09602
## Ridge Regression 5.256 0.6556 3.939 0.2921 | -1.59950 0.12341
## Support Vector Machines (kernlab) 5.314 0.6051 4.127 0.3080 | -0.94894 -0.01192
## Partial Least Squares (widekernelpls) 5.564 0.5908 3.786 0.3001 | -0.27888 -0.00226
## Elastic Net 5.600 0.6177 4.073 0.3203 | -0.29891 0.02378
## Partial Least Squares (pls) 5.675 0.6224 4.027 0.3141 | -0.21307 0.02311
## Partial Least Squares (kernelpls) 5.689 0.5988 3.910 0.3174 | -0.03559 0.02817
## Partial Least Squares (simpls) 5.780 0.5604 4.086 0.3248 | 0.00978 -0.03578
## Linear Regression (w/ Forward Selection) 6.563 0.5485 4.762 0.3251 | -1.74017 0.06807
## K-Nearest Neighbors 6.588 0.4490 3.729 0.3019 | 0.03182 0.02569
## Conditional Inference Random Forest 6.740 0.5150 3.802 0.3058 | 0.02539 0.00152
## Linear Regression (w/ Stepwise Selection) 6.838 0.4619 4.428 0.3840 | -0.89759 -0.01417
## Linear Regression (w/ Backwards Selection) 6.900 0.5083 6.579 0.3558 | 0.48483 0.00866
## Random Forest 6.994 0.3906 3.869 0.3156 | -0.11121 0.01434
## Conditional Inference Tree 7.002 0.4273 3.721 0.2816 | -0.07776 -0.02973
## Decision Trees 7.120 0.4182 3.494 0.3049 | -0.25446 -0.06377
## Lasso 25.806 0.3420 53.673 0.3067 | 7.40305 0.10108
## Linear Regression 514.449 0.3039 2066.774 0.2799 | 1.21193 0.03508
Filtering out 60% of the data also showed an overall increase in model performance, with RMSE values decreasing in comparison to the results obtained with the original dataset. Here, ridge regression (with and without feature selection) showed the best RMSE values, 4.778 and 5.256, respectively. SVMs also performed well, with an RMSE of 5.052.
#Filtering 40% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)
res9 = perform_ML(carotAg.filt, models, 'TCCHPLC')
# Results of 40% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res9-res2
res9_2 = cbind(round(res9,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res9_2[order(res9_2$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 5.318 0.6197 3.959 0.2904 | -0.56264 0.09614
## Elastic Net 5.484 0.6062 3.929 0.3314 | -0.41515 0.01234
## Support Vector Machines (kernlab) 5.499 0.5945 4.116 0.3143 | -0.76331 -0.02262
## Partial Least Squares (widekernelpls) 5.569 0.5877 4.075 0.3054 | -0.27335 -0.00535
## Ridge Regression 5.629 0.6341 4.094 0.3300 | -1.22593 0.10197
## Partial Least Squares (simpls) 5.742 0.5858 3.951 0.3263 | -0.02804 -0.01044
## Partial Least Squares (kernelpls) 5.745 0.5919 3.949 0.3217 | 0.02038 0.02125
## Partial Least Squares (pls) 5.789 0.5840 3.799 0.3187 | -0.09871 -0.01523
## Ridge Regression (w/ FS) 5.935 0.5806 4.192 0.3165 | -0.08308 -0.05201
## Linear Regression (w/ Backwards Selection) 6.211 0.5404 4.019 0.3207 | -0.20407 0.04072
## K-Nearest Neighbors 6.463 0.4871 3.995 0.2800 | -0.09316 0.06384
## Conditional Inference Random Forest 6.646 0.5203 3.887 0.2866 | -0.06877 0.00679
## Conditional Inference Tree 6.791 0.4626 3.884 0.2868 | -0.28859 0.00558
## Decision Trees 6.817 0.5025 3.911 0.2459 | -0.55748 0.02055
## Random Forest 7.016 0.4021 3.453 0.2857 | -0.08870 0.02585
## Linear Regression (w/ Stepwise Selection) 7.582 0.4614 3.873 0.3747 | -0.15293 -0.01460
## Linear Regression (w/ Forward Selection) 7.765 0.4322 5.458 0.3186 | -0.53767 -0.04827
## Lasso 15.062 0.2878 13.411 0.2821 | -3.34068 0.04687
## Linear Regression 260.984 0.3011 428.663 0.2756 | -252.25366 0.03231
Filtering out 40% of the data gave results similar to the previous case, with an overall increase in model performance and RMSE values decreasing in comparison to the results obtained with the original dataset. Here, the best RMSE values were achieved by SVMs (from the e1071 and kernlab packages), with RMSEs of 5.318 and 5.499, respectively, and elastic net, with an RMSE of 5.484. However, filtering out 80% of the data gave better results than filtering out either 60% or 40%.
A machine learning analysis using the CIELAB data was also performed.
The CIELAB data are stored in the metadata file; therefore, they first need to be extracted to create the CIELAB dataset.
color.values = t(get_metadata(carotAg)[2:4]) #L a b
filtered.meta = get_metadata(carotAg)[5:12]
carotCielab = create_dataset(datamatrix = color.values, metadata = filtered.meta, label.x = "cielab",
label.values = "color values", description = "Dataset from cielab values")
head(carotCielab$data)[,1:12] #Cielab values for first 12 samples
## 101.1 102.1 103.1 105.1 11.1 119.1 123.1 125.1 21.1 23.1 27.1 3.1
## Cielab_L 77.670 85.017 81.25 69.25 83.59 69.510 82.893 68.563 74.113 70.240 83.983 85.717
## Cielab_A -3.397 -3.663 -4.46 -4.95 -3.44 -5.457 -2.123 -4.733 -4.277 -1.437 -2.140 -2.607
## Cielab_B 16.493 18.477 18.49 31.96 16.81 37.693 8.213 36.790 20.107 16.160 8.683 22.017
sum_dataset(carotCielab) # Dataset summary
## Dataset summary:
## Valid dataset
## Description: Dataset from cielab values
## Type of data: undefined
## Number of samples: 50
## Number of data points 3
## Number of metadata variables: 8
## Label of x-axis values: cielab
## Label of data points: color values
## Number of missing values in data: 0
## Mean of data values: 31.99
## Median of data values: 18.69
## Standard deviation: 35.84
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.457 -3.070 18.685 75.292 88.283
The same machine learning models used on the UV dataset were applied to the CIELAB dataset, with the exception of the linear regression models with feature selection, since these make little sense when the dataset has only 3 features (the L, a and b values). The metadata variable used for prediction was “TCCHPLC”.
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls',
'widekernelpls', 'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm')
#Using TCCHPLC variable
res10 = perform_ML(carotCielab, models, pred_var = 'TCCHPLC')
# Results w/ CIELAB data and difference to unprocessed UV data results (Two last columns)
diff = res10-res2[-c(17,18,19),]
res10_2 = cbind(round(res10,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res10_2[order(res10_2$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Elastic Net 6.534 0.4129 2.996 0.2830 | 0.6345 -0.18095
## Support Vector Machines (kernlab) 6.534 0.3662 3.465 0.2739 | 0.2716 -0.25090
## Ridge Regression 6.584 0.4213 2.862 0.2966 | -0.2707 -0.11088
## Partial Least Squares (pls) 6.622 0.3946 3.151 0.2600 | 0.7343 -0.20461
## Support Vector Machines (e1071) 6.645 0.3840 3.270 0.3081 | 0.7643 -0.13954
## Ridge Regression (w/ FS) 6.653 0.3895 3.210 0.2809 | 0.6349 -0.24309
## Lasso 6.669 0.4110 3.025 0.2985 | -11.7339 0.17012
## Partial Least Squares (widekernelpls) 6.696 0.3960 3.037 0.3010 | 0.8534 -0.19708
## Linear Regression 6.749 0.4004 3.195 0.2848 | -506.4886 0.13160
## Partial Least Squares (kernelpls) 6.756 0.4319 3.240 0.2960 | 1.0308 -0.13878
## Partial Least Squares (simpls) 6.789 0.4142 3.212 0.2773 | 1.0188 -0.18205
## Conditional Inference Random Forest 6.930 0.4085 3.318 0.2538 | 0.2157 -0.10503
## K-Nearest Neighbors 7.278 0.2569 3.355 0.2319 | 0.7210 -0.16636
## Conditional Inference Tree 7.307 0.3842 2.863 0.2451 | 0.2275 -0.07285
## Random Forest 7.571 0.2938 3.450 0.2716 | 0.4660 -0.08241
## Decision Trees 7.641 0.3534 3.533 0.2531 | 0.2663 -0.12851
The results above show an overall decrease in model performance when using the CIELAB data compared to the UV data, with increased RMSE values. However, the linear model performed much better than with the UV data, with an RMSE of 6.749. Lasso regression also performed better than with the UV data, with an RMSE of 6.669. The best performances were achieved by elastic net (RMSE 6.534), SVMs (RMSE 6.534) and ridge regression (RMSE 6.584).
The variable importance was calculated for the models that achieved the best performance with the CIELAB data. These models were elastic net, SVMs and ridge regression.
# Elastic Network
varImp4 = train_models_performance(carotCielab, c('enet'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Support vector Machines
varImp5 = train_models_performance(carotCielab, c('svmLinear'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Ridge Regression
varImp6 = train_models_performance(carotCielab, c('ridge'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Variable Importance: Elastic Network | Support vector machines | Ridge Regression
div = rep(' | ', dim(varImp4$vips[[1]])[1])
cbind(varImp4$vips[[1]], div, varImp5$vips[[1]], div, varImp6$vips[[1]])
## Overall Mean div Overall Mean div Overall Mean
## Cielab_B 100.0000 100.0000 | 100.0000 100.0000 | 100.0000 100.0000
## Cielab_A 0.5596 0.5596 | 0.5596 0.5596 | 0.5596 0.5596
## Cielab_L 0.0000 0.0000 | 0.0000 0.0000 | 0.0000 0.0000
The results for variable importance show that the predictor with the most impact on the results is the CIELAB b value.
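A quick complementary check is the plain correlation between the b values and the target variable, computed with the accessors already used above:
# Correlation between CIELAB b and the total carotenoid content measured by HPLC
cor(as.numeric(carotCielab$data["Cielab_B", ]), get_metadata(carotCielab)$TCCHPLC)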
The dataset was then scaled to test whether scaling the CIELAB data could improve the results.
carotCielab.sc = specmine::scaling(carotCielab)
sum_dataset(carotCielab.sc)
## Dataset summary:
## Valid dataset
## Description: Dataset from cielab values; Scaling with method auto
## Type of data: undefined
## Number of samples: 50
## Number of data points 3
## Number of metadata variables: 8
## Label of x-axis values: cielab
## Label of data points: color values
## Number of missing values in data: 0
## Mean of data values: 1.49e-16
## Median of data values: 0.06326
## Standard deviation: 0.9933
## Range of values: -2.187 3.695
## Quantiles:
## 0% 25% 50% 75% 100%
## -2.18663 -0.49244 0.06326 0.52084 3.69515
res11 = perform_ML(carotCielab.sc, models, pred_var = 'TCCHPLC')
# Results w/ scaled CIELAB data and difference to unprocessed CIELAB data results (Two last columns)
diff = res11-res10
res11_10 = cbind(round(res11,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res11_10[order(res11_10$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 6.467 0.3800 3.116 0.2806 | -0.17772 -0.00393
## Support Vector Machines (kernlab) 6.523 0.4093 3.305 0.3068 | -0.01158 0.04316
## Partial Least Squares (widekernelpls) 6.535 0.3935 3.074 0.2944 | -0.16054 -0.00248
## Partial Least Squares (simpls) 6.548 0.4175 3.007 0.3027 | -0.24076 0.00334
## Ridge Regression (w/ FS) 6.564 0.4289 3.360 0.2752 | -0.08906 0.03941
## Elastic Net 6.587 0.4468 3.505 0.3012 | 0.05361 0.03392
## Lasso 6.598 0.4303 3.294 0.3135 | -0.07043 0.01932
## Ridge Regression 6.730 0.3962 2.966 0.2905 | 0.14568 -0.02512
## Partial Least Squares (pls) 6.752 0.3746 3.159 0.2902 | 0.12935 -0.02000
## Linear Regression 6.765 0.3883 2.937 0.3284 | 0.01591 -0.01207
## Partial Least Squares (kernelpls) 6.832 0.3547 3.091 0.2944 | 0.07681 -0.07723
## Conditional Inference Random Forest 6.881 0.3756 3.707 0.2443 | -0.04892 -0.03283
## Conditional Inference Tree 7.237 0.3911 3.131 0.2537 | -0.06968 0.00691
## K-Nearest Neighbors 7.431 0.2521 3.508 0.2673 | 0.15381 -0.00479
## Decision Trees 7.682 0.3380 3.518 0.2466 | 0.04137 -0.01538
## Random Forest 7.703 0.2899 3.546 0.2518 | 0.13271 -0.00385
Applying the machine learning models to the scaled CIELAB data showed mixed results, with model performance increasing or decreasing depending on the model used. These changes were, however, small.
A machine learning analysis using fused UV and CIELAB data was also performed.
Two datasets were created: one using the UV data with 80% of the data points filtered out and another using the entire UV data.
# Not filtered
carot.fus = low_level_fusion(list(carotAg, carotCielab))
sum_dataset(carot.fus)
## Dataset summary:
## Valid dataset
## Description: Data integration from types: uvv-spectra,undefined
## Type of data: integrated-data
## Number of samples: 50
## Number of data points 104
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 1.148
## Median of data values: 0.1881
## Standard deviation: 8.069
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.4567 0.1335 0.1881 0.2673 88.2833
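Conceptually, low-level fusion concatenates the feature matrices of the two datasets over their common samples; metadata merging and the remaining bookkeeping are handled by low_level_fusion itself. A minimal sketch of the idea, using the $data matrices (object names here are illustrative only):
# 101 UV data points + 3 CIELAB values = 104 rows for the 50 common samples
common = intersect(colnames(carotAg$data), colnames(carotCielab$data))
fused.mat = rbind(carotAg$data[, common], carotCielab$data[, common])
dim(fused.mat)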
# 80% data filtered
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 80)
carot.fus.filt = low_level_fusion(list(carotAg.filt, carotCielab))
sum_dataset(carot.fus.filt)
## Dataset summary:
## Valid dataset
## Description: Data integration from types: uvv-spectra,undefined
## Type of data: integrated-data
## Number of samples: 50
## Number of data points 23
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 4.43
## Median of data values: 0.2416
## Standard deviation: 16.75
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.4567 0.1893 0.2416 0.3700 88.2833
The same machine learning models applied to the UV dataset were used for the UV and CIELAB fusion datasets. The metadata variable used for prediction was “TCCHPLC”.
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')
# Using unfiltered dataset
res12 = perform_ML(carot.fus, models, pred_var = 'TCCHPLC')
# Results w/ unfiltered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res12-res2
res12_2 = cbind(round(res12,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res12_2[order(res12_2$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 5.981 0.5781 3.832 0.3119 | -0.03728 -0.05444
## Partial Least Squares (kernelpls) 6.010 0.5263 3.661 0.3106 | 0.28552 -0.04438
## Partial Least Squares (widekernelpls) 6.031 0.4448 3.841 0.3220 | 0.18847 -0.14820
## Partial Least Squares (simpls) 6.082 0.5032 3.759 0.3277 | 0.31264 -0.09297
## Elastic Net 6.160 0.6031 3.551 0.3314 | 0.26108 0.00920
## Partial Least Squares (pls) 6.187 0.4834 3.661 0.3238 | 0.29893 -0.11580
## Support Vector Machines (kernlab) 6.299 0.5249 4.232 0.2700 | 0.03584 -0.09221
## Support Vector Machines (e1071) 6.379 0.5477 4.273 0.3121 | 0.49848 0.02420
## Linear Regression (w/ Backwards Selection) 6.385 0.5419 3.888 0.3149 | -0.03043 0.04222
## Conditional Inference Random Forest 6.531 0.5158 3.902 0.2892 | -0.18345 0.00233
## Conditional Inference Tree 6.923 0.4351 3.737 0.2969 | -0.15601 -0.02197
## Random Forest 7.114 0.3527 3.431 0.2833 | 0.00927 -0.02354
## K-Nearest Neighbors 7.355 0.2622 3.415 0.2168 | 0.79798 -0.16101
## Decision Trees 7.789 0.3427 3.084 0.2420 | 0.41456 -0.13923
## Linear Regression (w/ Stepwise Selection) 8.052 0.5295 4.423 0.3233 | 0.31701 0.05349
## Linear Regression (w/ Forward Selection) 8.279 0.4734 7.324 0.3164 | -0.02357 -0.00705
## Ridge Regression 8.469 0.4723 4.898 0.2712 | 1.61379 -0.05990
## Lasso 18.784 0.2545 12.304 0.2825 | 0.38157 0.01365
## Linear Regression 548.940 0.2992 1430.369 0.2976 | 35.70205 0.03036
The machine learning analysis with the unprocessed fusion data showed a decrease in model performance, with an overall increase in RMSE values when compared to the unprocessed UV data results. The best performance was achieved by the ridge regression model with feature selection, with an RMSE of 5.981.
The variable importance was calculated for the models that achieved the best performance with the unprocessed fusion data. These models were ridge regression, partial least squares and elastic net.
# Ridge Regression
varImp7 = train_models_performance(carot.fus, c('foba'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Partial Least Squares
varImp8 = train_models_performance(carot.fus, c('kernelpls'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Elastic Network
varImp9 = train_models_performance(carot.fus, c('enet'), 'TCCHPLC', "repeatedcv",
num.folds = 5, compute.varimp = T)
# Variable Importance: Ridge Regression | Partial Least Squares | Elastic Network
div = rep(' | ', dim(varImp7$vips[[1]])[1])
cbind(varImp7$vips[[1]], div, varImp8$vips[[1]], div, varImp9$vips[[1]])
## Overall Mean div Overall Mean div Overall Mean
## 473 100.0000 100.0000 | 100.00000 100.00000 | 100.0000 100.0000
## 474 99.9113 99.9113 | 61.82867 61.82867 | 99.9113 99.9113
## 472 99.8868 99.8868 | 46.83905 46.83905 | 99.8868 99.8868
## 475 99.7654 99.7654 | 5.41013 5.41013 | 99.7654 99.7654
## 471 99.6128 99.6128 | 5.38756 5.38756 | 99.6128 99.6128
## 476 99.3395 99.3395 | 5.36948 5.36948 | 99.3395 99.3395
## 470 99.2810 99.2810 | 5.35150 5.35150 | 99.2810 99.2810
## 469 99.1404 99.1404 | 5.31719 5.31719 | 99.1404 99.1404
## 477 98.8480 98.8480 | 5.29151 5.29151 | 98.8480 98.8480
## 468 98.8306 98.8306 | 5.26227 5.26227 | 98.8306 98.8306
## 467 98.3802 98.3802 | 5.21587 5.21587 | 98.3802 98.3802
## 478 98.0079 98.0079 | 5.19450 5.19450 | 98.0079 98.0079
## 466 97.8873 97.8873 | 5.19340 5.19340 | 97.8873 97.8873
## 465 97.2259 97.2259 | 5.18352 5.18352 | 97.2259 97.2259
## 464 96.0949 96.0949 | 5.14655 5.14655 | 96.0949 96.0949
## 463 95.2659 95.2659 | 5.12585 5.12585 | 95.2659 95.2659
## 459 94.9843 94.9843 | 5.10804 5.10804 | 94.9843 94.9843
## 460 94.7241 94.7241 | 5.10741 5.10741 | 94.7241 94.7241
## 458 94.4573 94.4573 | 5.06545 5.06545 | 94.4573 94.4573
## 462 94.2039 94.2039 | 5.05908 5.05908 | 94.2039 94.2039
## 480 94.1510 94.1510 | 5.05102 5.05102 | 94.1510 94.1510
## 479 94.0085 94.0085 | 5.02047 5.02047 | 94.0085 94.0085
## 457 93.9935 93.9935 | 5.01932 5.01932 | 93.9935 93.9935
## 486 93.6229 93.6229 | 4.97254 4.97254 | 93.6229 93.6229
## 487 93.5485 93.5485 | 4.93916 4.93916 | 93.5485 93.5485
## 488 93.5467 93.5467 | 4.92974 4.92974 | 93.5467 93.5467
## 456 93.5197 93.5197 | 4.87376 4.87376 | 93.5197 93.5197
## 481 93.4256 93.4256 | 4.86100 4.86100 | 93.4256 93.4256
## 489 93.3707 93.3707 | 4.83570 4.83570 | 93.3707 93.3707
## 455 93.2616 93.2616 | 4.83495 4.83495 | 93.2616 93.2616
## 461 93.0875 93.0875 | 4.78544 4.78544 | 93.0875 93.0875
## 454 92.8529 92.8529 | 4.74143 4.74143 | 92.8529 92.8529
## 453 92.7929 92.7929 | 4.72591 4.72591 | 92.7929 92.7929
## 452 92.7572 92.7572 | 4.71433 4.71433 | 92.7572 92.7572
## 448 92.7186 92.7186 | 4.68942 4.68942 | 92.7186 92.7186
## 446 92.6654 92.6654 | 4.68730 4.68730 | 92.6654 92.6654
## 451 92.6466 92.6466 | 4.63035 4.63035 | 92.6466 92.6466
## 447 92.5967 92.5967 | 4.62820 4.62820 | 92.5967 92.5967
## 449 92.5944 92.5944 | 4.62320 4.62320 | 92.5944 92.5944
## 450 92.5356 92.5356 | 4.62091 4.62091 | 92.5356 92.5356
## 445 92.5343 92.5343 | 4.57829 4.57829 | 92.5343 92.5343
## 444 92.3814 92.3814 | 4.52236 4.52236 | 92.3814 92.3814
## 482 92.2752 92.2752 | 4.46326 4.46326 | 92.2752 92.2752
## 443 92.0251 92.0251 | 4.36791 4.36791 | 92.0251 92.0251
## 442 91.5698 91.5698 | 4.24604 4.24604 | 91.5698 91.5698
## 485 91.3638 91.3638 | 4.14840 4.14840 | 91.3638 91.3638
## 441 91.2927 91.2927 | 4.07951 4.07951 | 91.2927 91.2927
## 490 91.2712 91.2712 | 3.97698 3.97698 | 91.2712 91.2712
## 484 91.2403 91.2403 | 3.87960 3.87960 | 91.2403 91.2403
## 483 91.2153 91.2153 | 3.77792 3.77792 | 91.2153 91.2153
## 440 91.1322 91.1322 | 3.63267 3.63267 | 91.1322 91.1322
## 439 90.8556 90.8556 | 3.61297 3.61297 | 90.8556 90.8556
## 438 90.2867 90.2867 | 3.48471 3.48471 | 90.2867 90.2867
## 491 90.1094 90.1094 | 3.39727 3.39727 | 90.1094 90.1094
## 437 89.7082 89.7082 | 3.30467 3.30467 | 89.7082 89.7082
## 436 89.2866 89.2866 | 3.27905 3.27905 | 89.2866 89.2866
## 435 88.9328 88.9328 | 3.14163 3.14163 | 88.9328 88.9328
## 434 88.5372 88.5372 | 3.11541 3.11541 | 88.5372 88.5372
## 433 87.9509 87.9509 | 3.07926 3.07926 | 87.9509 87.9509
## 432 87.6793 87.6793 | 3.04597 3.04597 | 87.6793 87.6793
## 431 87.3251 87.3251 | 3.01064 3.01064 | 87.3251 87.3251
## 430 87.0876 87.0876 | 2.97654 2.97654 | 87.0876 87.0876
## 429 86.4354 86.4354 | 2.94674 2.94674 | 86.4354 86.4354
## 428 85.8934 85.8934 | 2.93409 2.93409 | 85.8934 85.8934
## 427 85.6262 85.6262 | 2.91394 2.91394 | 85.6262 85.6262
## 492 85.5259 85.5259 | 2.85108 2.85108 | 85.5259 85.5259
## 426 85.2602 85.2602 | 2.84014 2.84014 | 85.2602 85.2602
## 425 84.8398 84.8398 | 2.73474 2.73474 | 84.8398 84.8398
## 424 84.5834 84.5834 | 2.72985 2.72985 | 84.5834 84.5834
## 423 83.9714 83.9714 | 2.65158 2.65158 | 83.9714 83.9714
## 422 83.7231 83.7231 | 2.53016 2.53016 | 83.7231 83.7231
## 419 83.1661 83.1661 | 2.51231 2.51231 | 83.1661 83.1661
## 421 82.9022 82.9022 | 2.41503 2.41503 | 82.9022 82.9022
## 418 82.7248 82.7248 | 2.24439 2.24439 | 82.7248 82.7248
## 417 82.1483 82.1483 | 2.20749 2.20749 | 82.1483 82.1483
## 420 82.0810 82.0810 | 2.12849 2.12849 | 82.0810 82.0810
## 416 81.4623 81.4623 | 2.01512 2.01512 | 81.4623 81.4623
## 415 80.7710 80.7710 | 1.93270 1.93270 | 80.7710 80.7710
## 414 79.2893 79.2893 | 1.84125 1.84125 | 79.2893 79.2893
## 493 79.0456 79.0456 | 1.83911 1.83911 | 79.0456 79.0456
## 413 77.9810 77.9810 | 1.67628 1.67628 | 77.9810 77.9810
## 412 76.4233 76.4233 | 1.65685 1.65685 | 76.4233 76.4233
## 411 75.0567 75.0567 | 1.59053 1.59053 | 75.0567 75.0567
## 495 73.4449 73.4449 | 1.49947 1.49947 | 73.4449 73.4449
## 410 73.4330 73.4330 | 1.47414 1.47414 | 73.4330 73.4330
## 409 72.5276 72.5276 | 1.44002 1.44002 | 72.5276 72.5276
## 494 72.4653 72.4653 | 1.31672 1.31672 | 72.4653 72.4653
## 408 71.2844 71.2844 | 1.31045 1.31045 | 71.2844 71.2844
## 407 70.0532 70.0532 | 1.27632 1.27632 | 70.0532 70.0532
## 496 69.4530 69.4530 | 1.17053 1.17053 | 69.4530 69.4530
## 406 68.1506 68.1506 | 1.14647 1.14647 | 68.1506 68.1506
## 497 68.1173 68.1173 | 1.01843 1.01843 | 68.1173 68.1173
## 405 67.1213 67.1213 | 0.88107 0.88107 | 67.1213 67.1213
## 404 66.0282 66.0282 | 0.73976 0.73976 | 66.0282 66.0282
## 403 65.3844 65.3844 | 0.64795 0.64795 | 65.3844 65.3844
## Cielab_B 65.2941 65.2941 | 0.52680 0.52680 | 65.2941 65.2941
## 402 64.1510 64.1510 | 0.47331 0.47331 | 64.1510 64.1510
## 401 63.5632 63.5632 | 0.37289 0.37289 | 63.5632 63.5632
## 499 62.9532 62.9532 | 0.32512 0.32512 | 62.9532 62.9532
## 400 62.5822 62.5822 | 0.26140 0.26140 | 62.5822 62.5822
## 498 61.4938 61.4938 | 0.21537 0.21537 | 61.4938 61.4938
## 500 59.9483 59.9483 | 0.12386 0.12386 | 59.9483 59.9483
## Cielab_A 0.3654 0.3654 | 0.08284 0.08284 | 0.3654 0.3654
## Cielab_L 0.0000 0.0000 | 0.00000 0.00000 | 0.0000 0.0000
The results for variable importance show that the predictors with the most impact on the results are those around the 475 nm wavelength, with the 473 nm variable being the most important.
# Using dataset w/ 80% data filtered
res13 = perform_ML(carot.fus.filt, models, pred_var = 'TCCHPLC')
# Results w/ 80% filtered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res13-res2
res13_2 = cbind(round(res13,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res13_2[order(res13_2$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 5.230 0.5699 4.256 0.3325 | -0.65073 0.04635
## Support Vector Machines (kernlab) 5.664 0.5736 3.989 0.3041 | -0.59864 -0.04344
## Ridge Regression (w/ FS) 5.822 0.4950 3.831 0.3096 | -0.19622 -0.13759
## Elastic Net 6.055 0.5412 3.650 0.3327 | 0.15572 -0.05270
## Linear Regression (w/ Backwards Selection) 6.129 0.5769 3.624 0.3497 | -0.28577 0.07728
## Ridge Regression 6.149 0.5420 4.097 0.3355 | -0.70643 0.00981
## Linear Regression (w/ Stepwise Selection) 6.167 0.5045 3.682 0.3638 | -1.56792 0.02849
## Partial Least Squares (kernelpls) 6.178 0.3924 3.729 0.3416 | 0.45283 -0.17825
## Partial Least Squares (widekernelpls) 6.372 0.4839 3.778 0.3133 | 0.52905 -0.10908
## Partial Least Squares (pls) 6.405 0.3725 3.519 0.3126 | 0.51683 -0.22677
## Partial Least Squares (simpls) 6.461 0.4239 3.712 0.3357 | 0.69104 -0.17228
## Conditional Inference Random Forest 6.562 0.5141 4.113 0.3170 | -0.15211 0.00064
## Linear Regression (w/ Forward Selection) 6.799 0.5017 4.095 0.3203 | -1.50357 0.02121
## Random Forest 6.859 0.3502 3.401 0.2760 | -0.24582 -0.02600
## Conditional Inference Tree 7.044 0.4405 3.755 0.3167 | -0.03502 -0.01660
## K-Nearest Neighbors 7.529 0.2429 3.231 0.1996 | 0.97217 -0.18039
## Decision Trees 7.632 0.3346 3.318 0.2554 | 0.25704 -0.14733
## Lasso 8.481 0.3257 3.866 0.3053 | -9.92112 0.08476
## Linear Regression 17.403 0.3234 12.997 0.3136 | -495.83481 0.05456
The machine learning analysis with the filtered fusion data showed an overall increase in model performance compared to the results obtained with the unprocessed UV data. The best performance was achieved by support vector machines (e1071 package), with an RMSE of 5.230.
Both the filtered and the unfiltered fusion datasets were scaled, and the machine learning models were applied to these scaled datasets.
# Using unfiltered dataset
carot.fus.sc = specmine::scaling(carot.fus)
res14 = perform_ML(carot.fus.sc, models, pred_var = 'TCCHPLC')
# Results w/ unfiltered scaled fusion data and difference to unprocessed UV data results (Two last columns)
diff = res14-res2
res14_2 = cbind(round(res14,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res14_2[order(res14_2$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 5.555 0.5875 3.367 0.2981 | -0.46293 -0.04509
## Partial Least Squares (kernelpls) 5.673 0.5851 3.874 0.3297 | -0.05204 0.01440
## Partial Least Squares (widekernelpls) 5.676 0.5805 3.954 0.3146 | -0.16619 -0.01253
## Partial Least Squares (simpls) 5.686 0.6008 3.830 0.3193 | -0.08341 0.00459
## Partial Least Squares (pls) 5.779 0.6015 3.980 0.3180 | -0.10955 0.00229
## Elastic Net 6.089 0.5704 3.595 0.3142 | 0.18974 -0.02349
## Support Vector Machines (kernlab) 6.134 0.5299 4.064 0.3265 | -0.12848 -0.08722
## Conditional Inference Random Forest 6.230 0.5282 4.081 0.3023 | -0.48430 0.01466
## Support Vector Machines (e1071) 6.239 0.5059 3.966 0.3103 | 0.35813 -0.01765
## Linear Regression (w/ Backwards Selection) 6.370 0.5005 4.118 0.3030 | -0.04483 0.00089
## K-Nearest Neighbors 6.615 0.4235 3.751 0.2764 | 0.05843 0.00026
## Conditional Inference Tree 6.844 0.4447 3.666 0.3091 | -0.23569 -0.01229
## Linear Regression (w/ Stepwise Selection) 6.895 0.4522 3.937 0.3530 | -0.84007 -0.02384
## Random Forest 7.185 0.3555 3.497 0.3137 | 0.07982 -0.02071
## Decision Trees 7.450 0.3575 3.637 0.2773 | 0.07547 -0.12447
## Linear Regression (w/ Forward Selection) 7.872 0.4280 5.514 0.3271 | -0.43100 -0.05244
## Ridge Regression 9.473 0.5319 5.808 0.2654 | 2.61802 -0.00024
## Lasso 19.002 0.2275 10.383 0.2497 | 0.59964 -0.01341
## Linear Regression 522.783 0.2867 1791.650 0.2795 | 9.54573 0.01786
The machine learning analysis with the scaled fusion data showed mixed results, with model performance increasing or decreasing depending on the model, when compared to the unprocessed UV data results. The best performance was achieved by the ridge regression model with feature selection, with an RMSE of 5.555.
# Using dataset w/ 80% data filtered
carot.fus.filt.sc = specmine::scaling(carot.fus.filt)
res15 = perform_ML(carot.fus.filt.sc, models, pred_var = 'TCCHPLC')
# Results w/ 80% filtered, scaled fusion data and difference to unprocessed UV data results (Two last columns)
diff = res15-res2
res15_2 = cbind(round(res15,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res15_2[order(res15_2$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 5.290 0.5863 3.964 0.3226 | -0.59023 0.06274
## Support Vector Machines (kernlab) 5.621 0.5269 4.179 0.3141 | -0.64153 -0.09021
## Partial Least Squares (kernelpls) 5.701 0.6101 3.906 0.3054 | -0.02333 0.03941
## Partial Least Squares (widekernelpls) 5.800 0.5723 3.962 0.3333 | -0.04211 -0.02070
## Ridge Regression (w/ FS) 5.815 0.5014 3.707 0.3298 | -0.20282 -0.13115
## Partial Least Squares (pls) 5.865 0.6040 3.732 0.3115 | -0.02286 0.00478
## Partial Least Squares (simpls) 5.904 0.5834 4.005 0.3077 | 0.13434 -0.01284
## Elastic Net 6.014 0.5010 3.539 0.3167 | 0.11476 -0.09284
## Ridge Regression 6.121 0.5446 3.839 0.3252 | -0.73378 0.01247
## Linear Regression (w/ Stepwise Selection) 6.172 0.4696 3.845 0.3402 | -1.56280 -0.00646
## Linear Regression (w/ Backwards Selection) 6.190 0.5517 3.752 0.3522 | -0.22557 0.05202
## Linear Regression (w/ Forward Selection) 6.332 0.4665 3.747 0.3544 | -1.97084 -0.01390
## K-Nearest Neighbors 6.594 0.4208 3.685 0.2841 | 0.03708 -0.00249
## Conditional Inference Random Forest 6.660 0.4963 3.681 0.2929 | -0.05507 -0.01724
## Random Forest 6.886 0.3775 3.619 0.2873 | -0.21827 0.00129
## Conditional Inference Tree 6.934 0.4319 3.666 0.2686 | -0.14497 -0.02518
## Decision Trees 7.531 0.3532 3.451 0.2668 | 0.15612 -0.12877
## Lasso 8.349 0.3603 3.512 0.3103 | -10.05353 0.11944
## Linear Regression 17.405 0.2767 10.207 0.2588 | -495.83278 0.00788
Using the filtered and scaled fusion data resulted in an overall increase in model performance when compared to the unprocessed UV data results. The best performance was achieved by support vector machines (e1071 package), with an RMSE of 5.290.
UV Data:
CIELAB Data:
Fusion Data: