The aim of this work is to validate a method for quantifying carotenoid content in roots of M. esculenta from colorimetric data, using the CIE L*a*b* system. The working assumption is that predictive statistical and machine learning techniques can correlate colorimetric data, easily obtained in the field, with the levels measured by traditional quantification techniques such as UV-visible spectrophotometry and HPLC, and from this correlation build prediction models of carotenoid content for this type of biomass.
Roots of fifty M. esculenta genotypes belonging to the EPAGRI’s germplasm bank were sampled in the 2014/2015 season. Carotenoids were extracted from fresh roots and the absorbances of the organosolvent extracts were collected on a UV-visible spectrophotometer using a spectral window from 200 to 700 nm. Aliquots (10 µl) of the extracts were also injected into a liquid chromatograph. The color attributes of the samples were measured with a colorimeter and the results were expressed according to the CIELAB color space scale.
To run this script, the following packages are necessary:
library(specmine)
library(xlsx)
Setting working directory:
setwd("C:/Users/Telma/Desktop/CassavaCarotenoids")
set.seed(12345)
The machine learning models used in this analysis are listed in the table below. These belong to the caret package, which is used by specmine.
Model | “Method” Value | Built-in Feature Selection |
---|---|---|
Conditional Inference Random Forest | cforest | YES |
Conditional Inference Tree | ctree | YES |
Decision Trees | rpart | YES |
Elastic Net | enet | YES |
K-Nearest Neighbors | knn | NO |
Lasso Regression | lasso | YES |
Linear Regression | lm | NO |
Linear Regression (w/ Backwards Selection) | leapBackward | YES |
Linear Regression (w/ Forward Selection) | leapForward | YES |
Linear Regression (w/ Stepwise Selection) | leapSeq | YES |
Partial Least Squares | kernelpls, pls, simpls, widekernelpls | YES |
Random Forest | rf | YES |
Ridge Regression | ridge | NO |
Ridge Regression (w/ Feature Selection) | foba | YES |
Support Vector Machines (kernlab package) | svmLinear | NO |
Support Vector Machines (e1071 package) | svmLinear2 | NO |
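All of these “method” values can also be passed directly to caret. As a minimal sketch of the correspondence (illustration only; the analysis below goes through specmine, and the repeats value here is an assumption), training a single model directly with caret would look like:
library(caret)
ctrl = trainControl(method = "repeatedcv", number = 5, repeats = 3)
# x: samples-by-features matrix, y: numeric response (hypothetical placeholders)
# fit = train(x, y, method = "pls", trControl = ctrl)
# fit$results  # cross-validated RMSE and Rsquared per tuning value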
The following function is used to retrieve the model name given the “method” value.
getModelName <- function(model) {
  # Map caret "method" values to human-readable model names;
  # returns NULL for unknown values
  switch(model,
         lasso = 'Lasso',
         ridge = 'Ridge Regression',
         foba = 'Ridge Regression (w/ FS)',
         rf = 'Random Forest',
         cforest = 'Conditional Inference Random Forest',
         enet = 'Elastic Net',
         pls = 'Partial Least Squares (pls)',
         kernelpls = 'Partial Least Squares (kernelpls)',
         simpls = 'Partial Least Squares (simpls)',
         widekernelpls = 'Partial Least Squares (widekernelpls)',
         rpart = 'Decision Trees',
         ctree = 'Conditional Inference Tree',
         svmLinear = 'Support Vector Machines (kernlab)',
         svmLinear2 = 'Support Vector Machines (e1071)',
         knn = 'K-Nearest Neighbors',
         lm = 'Linear Regression',
         leapBackward = 'Linear Regression (w/ Backwards Selection)',
         leapForward = 'Linear Regression (w/ Forward Selection)',
         leapSeq = 'Linear Regression (w/ Stepwise Selection)')
}
The following function returns a data frame with the results of applying one or more machine learning models to a selected dataset. The metadata variable for prediction must be supplied.
perform_ML <- function(dataset, models, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0),
                   RMSESD = numeric(0), RsquaredSD = numeric(0))
  for (model in models) {
    name = getModelName(model)
    # Train each model with 5-fold repeated cross-validation
    ml_res = train_models_performance(dataset, c(model), pred_var, "repeatedcv",
                                      num.folds = 5, compute.varimp = F)
    res[name,] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
                   ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    # Keep partial results in the global environment in case a later model fails
    assign('res', res, envir = .GlobalEnv)
  }
  return(res)
}
The following function returns a data frame with the results of applying a machine learning model to a dataset after each of several preprocessing methods: scaling, smoothing interpolation, background, offset and baseline corrections (applied cumulatively), first derivative, and multiplicative scatter correction. The metadata variable for prediction must be supplied.
perform_ML_preproc <- function(dataset, model, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0),
                   RMSESD = numeric(0), RsquaredSD = numeric(0))
  ds.sc = specmine::scaling(dataset)
  # Smoothing: LOESS interpolation over a 10 nm grid
  ds.wavelens = get_x_values_as_num(dataset)
  x.axis.sm = seq(min(ds.wavelens), max(ds.wavelens), 10)
  ds.smooth = smoothing_interpolation(dataset, method = "loess", x.axis = x.axis.sm)
  # Cumulative background, offset and baseline corrections
  ds.bg = data_correction(dataset, 'background')
  ds.offset = data_correction(ds.bg, 'offset')
  ds.baseline = data_correction(ds.offset, 'baseline')
  ds.fd = first_derivative(dataset)
  ds.msc = msc_correction(dataset)
  datasets = list('No preprocessing' = dataset, 'Scaling' = ds.sc, 'Smoothing' = ds.smooth,
                  'Background cor' = ds.bg, 'Background + Offset cors' = ds.offset,
                  'Background + Offset + Baseline cors' = ds.baseline, 'First Derivative' = ds.fd,
                  'Multiplicative Scatter Cor' = ds.msc)
  i = 1
  for (ds in datasets) {
    ml_res = train_models_performance(ds, c(model), pred_var, "repeatedcv",
                                      num.folds = 5, compute.varimp = F)
    res[names(datasets)[i],] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
                                 ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    # Keep partial results in the global environment in case a later model fails
    assign('res', res, envir = .GlobalEnv)
    i = i + 1
  }
  return(res)
}
UV data is stored in 150 .xlsx files (3 replicates for each of the 50 genotypes), each containing the absorbance values read between 200 and 700 nm.
files = list.files("data/UV")
datamat = matrix(nrow = 501, ncol = length(files))
rownames(datamat) = 200:700 # data recorded between 200-700 nm
colnames(datamat) = gsub("\\.xlsx?$", "", files) # sample names from file names
for (i in 1:length(files)){
  tab_excel = read.xlsx(paste("data/UV/", files[i], sep = ""), sheetIndex = 1, header = F)
  # Pad shorter spectra with NA so every column has 501 values
  datamat[,i] = c(tab_excel[,2], rep(NA, 501-length(tab_excel[,2])))
}
datamat[1:6, 1:6]
## 101.1 101.2 101.3 102.1 102.2 102.3
## 200 0.08763 0.1863 0.10565 0.10565 0.1482 0.13221
## 201 0.09468 0.2184 0.13756 0.12944 0.1254 0.08732
## 202 0.06238 0.1792 0.08410 0.09159 0.1437 0.09159
## 203 0.11513 0.1776 0.13093 0.13497 0.1190 0.07799
## 204 0.11364 0.2038 0.05227 0.11364 0.1376 0.08368
## 205 0.13941 0.1820 0.10809 0.09691 0.1006 0.10809
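A quick visual check of the assembled matrix can be done with base R, overlaying all spectra; a minimal sketch:
matplot(200:700, datamat, type = "l", lty = 1,
        xlab = "Wavelength (nm)", ylab = "Absorbance") # one curve per replicate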
Besides information regarding sample varieties and replicates, the metadata file also contains information about HPLC concentration measurements and CIELAB data.
file.metadata = "metadata/Carotenoides_Colorimetria.csv"
metadata = read_metadata(file.metadata)
description = "UV data for cassava cultivars - carotenoids"
label.x = "Wavelength"
label.values = "Absorbance"
head(metadata)
## Varieties Replicates Cielab_L Cielab_A Cielab_B CarotenoidsContent_TCCS Lutein Betacryptoxanthin
## 3.1 3 1 85.72 -2.70 22.28 4.853 0.03248 0.06543
## 3.2 3 2 86.18 -2.48 21.39 4.809 0.03248 0.06543
## 3.3 3 3 85.25 -2.64 22.38 4.951 0.03248 0.06543
## 5.1 5 1 85.47 -1.76 6.74 3.098 0.02598 0.07023
## 5.2 5 2 82.29 -2.00 7.02 4.046 0.02598 0.07023
## 5.3 5 3 84.99 -1.86 7.25 3.383 0.02598 0.07023
## Alphacarotene Cisbetacarotene transbetacarotene Lycopene TCCHPLC
## 3.1 0.06021 2.250 3.269 0 5.678
## 3.2 0.06021 2.250 3.269 0 5.678
## 3.3 0.06021 2.250 3.269 0 5.678
## 5.1 0.08319 2.679 2.860 0 5.719
## 5.2 0.08319 2.679 2.860 0 5.719
## 5.3 0.08319 2.679 2.860 0 5.719
After creating a matrix from the UV .xlsx files and reading the metadata, a dataset can be easily created.
Carotenoides_Colorimetria = create_dataset(type = "uvv-spectra", datamatrix = datamat, metadata = metadata,
label.x = label.x, label.values = label.values,
description = description)
sum_dataset(Carotenoides_Colorimetria)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids
## Type of data: uvv-spectra
## Number of samples: 150
## Number of data points 501
## Number of metadata variables: 13
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 4224
## Mean of data values: 0.3301
## Median of data values: 0.1048
## Standard deviation: 0.6824
## Range of values: -0.06964 4.191
## Quantiles:
## 0% 25% 50% 75% 100%
## -0.06964 0.02003 0.10478 0.23166 4.19051
Because most carotenoids absorb in the visible region of the spectrum, between 400 and 500 nm, a subset of the original dataset was created, restricted to this wavelength interval. Also, because the dataset has some missing values, as seen in the summary above, missing values were replaced with the mean of each variable's values.
carot_sub = subset_x_values_by_interval(Carotenoides_Colorimetria, 400, 500) # Absorbances between 400-500nm
carot_sub_nomissing = missingvalues_imputation(carot_sub, method = "mean")
sum_dataset(carot_sub_nomissing)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids; Missing value imputation with method mean
## Type of data: uvv-spectra
## Number of samples: 150
## Number of data points 101
## Number of metadata variables: 13
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 0.2316
## Median of data values: 0.187
## Standard deviation: 0.1907
## Range of values: -0.002721 1.574
## Quantiles:
## 0% 25% 50% 75% 100%
## -0.002721 0.130033 0.186963 0.261674 1.574271
The data was then aggregated so that there is a single sample per genotype, with no replicates (150 samples -> 50 samples).
indexes = rep(seq(1, num_samples(carot_sub_nomissing)/3), each = 3)
carotAg = aggregate_samples(carot_sub_nomissing, indexes, meta.to.remove = c("Replicates"))
sum_dataset(carotAg)
## Dataset summary:
## Valid dataset
## Description: UV data for cassava cultivars - carotenoids; Missing value imputation with method mean
## Type of data: uvv-spectra
## Number of samples: 50
## Number of data points 101
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 0.2316
## Median of data values: 0.1871
## Standard deviation: 0.188
## Range of values: 0.00136 1.299
## Quantiles:
## 0% 25% 50% 75% 100%
## 0.00136 0.13380 0.18708 0.26038 1.29949
The dataset is now ready to be used in the subsequent analysis.
The next step consisted in using a variety of machine learning regression approaches to determine which model and/or variables could best predict carotenoid content in roots of M. esculenta.
To determine which of the metadata variables would perform better in the prediction of carotenoid content, the machine learning models listed above were applied to the created dataset using different output variables. The evaluation metric chosen to compare model performance was the Root-Mean-Square Error (RMSE), since it expresses how much the model predictions deviate, on average, from the actual values, in the units of the response.
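For reference, for \(n\) samples with observed values \(y_{i}\) and predictions \(\hat{y}_{i}\), \[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}, \] expressed in the same units as the response variable.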
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')
#Using CarotenoidsContent_TCCS variable
res1 = perform_ML(carotAg, models, pred_var = 'CarotenoidsContent_TCCS')
res1[order(res1$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Partial Least Squares (simpls) 3.492 0.9208 2.760 0.11351
## Support Vector Machines (e1071) 3.709 0.9316 2.823 0.08047
## Partial Least Squares (widekernelpls) 3.732 0.9238 3.197 0.14257
## Random Forest 3.768 0.9483 2.224 0.05348
## Elastic Net 3.793 0.9185 3.539 0.13289
## Partial Least Squares (pls) 3.800 0.9529 2.098 0.06209
## Ridge Regression (w/ FS) 3.855 0.9478 2.506 0.04542
## Ridge Regression 3.877 0.9283 3.344 0.08096
## Support Vector Machines (kernlab) 3.928 0.9409 2.743 0.06560
## Partial Least Squares (kernelpls) 4.096 0.8962 3.502 0.18642
## Linear Regression (w/ Stepwise Selection) 4.158 0.9192 3.211 0.11126
## Linear Regression (w/ Forward Selection) 4.178 0.8883 3.865 0.17517
## Linear Regression (w/ Backwards Selection) 4.392 0.8711 2.935 0.13775
## K-Nearest Neighbors 4.732 0.9224 5.058 0.08588
## Lasso 5.207 0.8174 4.008 0.25323
## Conditional Inference Random Forest 6.713 0.7917 3.604 0.12296
## Conditional Inference Tree 7.363 0.7114 3.053 0.16803
## Decision Trees 7.582 0.6833 3.051 0.20625
## Linear Regression 109.408 0.5563 378.967 0.32466
mean(get_metadata(carotAg)$CarotenoidsContent_TCCS) # CarotenoidsContent_TCCS variable mean values
## [1] 10.67
The results using the “CarotenoidsContent_TCCS” variable show that the models achieving the lowest RMSE values for the given data included partial least squares (simpls and widekernelpls), with RMSEs of 3.492 and 3.732, support vector machines (from the e1071 package) with an RMSE of 3.709, and random forest with an RMSE of 3.768. These errors are still considerable, however, relative to the mean of the “CarotenoidsContent_TCCS” variable (10.67).
Overall, the coefficient of determination (\(R^{2}\)) shows a good fit of the predictions to the observations. The linear regression model without feature selection showed by far the worst results, with an RMSE of 109.408 and an \(R^{2}\) of 0.5563; with 101 predictors and only 50 samples, an unregularized linear model is prone to severe overfitting, which explains its instability under cross-validation.
#Using TCCHPLC variable
res2 = perform_ML(carotAg, models, pred_var = 'TCCHPLC')
res2[order(res2$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Partial Least Squares (pls) 5.643 0.5971 4.049 0.3146
## Partial Least Squares (widekernelpls) 5.779 0.5701 3.840 0.3298
## Partial Least Squares (simpls) 5.789 0.5721 3.877 0.3213
## Support Vector Machines (e1071) 5.844 0.5975 4.099 0.2965
## Partial Least Squares (kernelpls) 5.878 0.5661 3.877 0.3498
## Ridge Regression (w/ FS) 5.880 0.6038 3.791 0.3322
## Support Vector Machines (kernlab) 5.907 0.5892 4.306 0.3088
## Elastic Net 5.934 0.6340 3.795 0.2997
## K-Nearest Neighbors 6.277 0.4451 3.985 0.2909
## Linear Regression (w/ Backwards Selection) 6.373 0.5226 3.921 0.2853
## Decision Trees 6.795 0.4736 3.947 0.3026
## Conditional Inference Random Forest 6.806 0.5588 4.034 0.3079
## Conditional Inference Tree 6.916 0.4805 3.808 0.2880
## Random Forest 7.275 0.3596 3.351 0.2736
## Ridge Regression 7.282 0.6163 4.579 0.2862
## Linear Regression (w/ Stepwise Selection) 8.341 0.5265 5.628 0.3311
## Linear Regression (w/ Forward Selection) 8.783 0.4716 6.292 0.3254
## Lasso 17.508 0.2494 14.130 0.2657
## Linear Regression 863.264 0.2830 3171.947 0.2985
mean(get_metadata(carotAg)$TCCHPLC) # TCCHPLC variable mean values
## [1] 10.84
The results using the “TCCHPLC” variable show that, overall, RMSE values increased compared to when using the “CarotenoidsContent_TCCS” variable. The models that achieved the lowest RMSE values for the given data were partial least squares with methods “pls”, “widekernelpls” and “simpls”, with RMSEs of 5.643, 5.779 and 5.789, respectively.
Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 863.264 and an \(R^{2}\) of 0.2830. Lasso regression also performed poorly in this case, with an RMSE of 17.508.
#Using transbetacarotene variable
res3 = perform_ML(carotAg, models, pred_var = 'transbetacarotene')
res3[order(res3$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Ridge Regression (w/ FS) 4.159 0.35640 4.057 0.3325
## Elastic Net 4.191 0.41274 4.202 0.3430
## Partial Least Squares (kernelpls) 4.211 0.42217 4.273 0.3081
## Support Vector Machines (e1071) 4.218 0.39924 4.330 0.3249
## Support Vector Machines (kernlab) 4.230 0.46608 4.219 0.3147
## Partial Least Squares (pls) 4.265 0.47090 4.131 0.3096
## Partial Least Squares (simpls) 4.309 0.36296 4.257 0.3156
## Partial Least Squares (widekernelpls) 4.324 0.45308 4.215 0.2861
## Ridge Regression 4.407 0.31655 4.219 0.3184
## K-Nearest Neighbors 4.597 0.22467 4.043 0.2204
## Conditional Inference Random Forest 4.703 0.36963 4.124 0.2694
## Conditional Inference Tree 4.894 0.28851 4.103 0.2487
## Linear Regression (w/ Forward Selection) 5.142 0.31153 4.534 0.3017
## Decision Trees 5.189 0.05344 3.750 0.0612
## Linear Regression (w/ Backwards Selection) 5.355 0.27887 4.194 0.2421
## Random Forest 5.753 0.23993 3.882 0.2553
## Linear Regression (w/ Stepwise Selection) 6.135 0.20603 5.378 0.2999
## Lasso 16.145 0.18959 14.917 0.2694
## Linear Regression 329.642 0.28887 621.822 0.2943
mean(get_metadata(carotAg)$transbetacarotene) # transbetacarotene variable mean values
## [1] 5.897
Trans-beta-carotene concentrations were also used, considering that this was the carotenoid with the highest concentration levels. The results using the “transbetacarotene” variable show that, overall, RMSE values increased compared to when using the “CarotenoidsContent_TCCS” variable. The models that achieved the lowest RMSE values for the given data included ridge regression (w/ feature selection) with an RMSE of 4.159, elastic net with an RMSE of 4.191 and partial least squares (kernelpls) with an RMSE of 4.211.
Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, as in the previous cases, with an RMSE of 329.642 and an \(R^{2}\) of 0.28887. Lasso regression also performed poorly here, with an RMSE of 16.145.
All the results above point to better model performance when using the “CarotenoidsContent_TCCS” metadata variable, which was therefore the variable chosen for the subsequent analysis.
The next step consisted in testing the best models from the previous analysis (when using the “CarotenoidsContent_TCCS” metadata variable) on a preprocessed dataset, to see if the model performance improved. Those models were partial least squares, support vector machines and random forests.
# Partial least squares
res4 = perform_ML_preproc(carotAg, 'simpls', 'CarotenoidsContent_TCCS')
res4[order(res4$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## No preprocessing 3.592 0.9326 2.848 0.08055
## Background + Offset + Baseline cors 3.639 0.9253 3.201 0.07748
## Background cor 3.748 0.9068 3.173 0.15577
## Smoothing 3.915 0.9411 2.828 0.08749
## Scaling 4.023 0.9367 2.613 0.07894
## Background + Offset cors 4.231 0.9138 2.860 0.08263
## First Derivative 5.114 0.7738 4.123 0.33141
## Multiplicative Scatter Cor 9.180 0.2886 7.773 0.30671
Applying the partial least squares model to the preprocessed datasets showed no improvement in model performance with any of the preprocessing methods.
# Support vector machines
res5 = perform_ML_preproc(carotAg, 'svmLinear2', 'CarotenoidsContent_TCCS')
res5[order(res5$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## No preprocessing 3.954 0.8900 3.797 0.1847
## First Derivative 3.973 0.8591 4.009 0.2296
## Scaling 4.049 0.9084 3.430 0.1812
## Background cor 4.189 0.8999 3.596 0.1776
## Smoothing 4.365 0.9250 3.116 0.1089
## Background + Offset cors 5.350 0.8479 4.842 0.2611
## Background + Offset + Baseline cors 5.846 0.7873 5.248 0.2638
## Multiplicative Scatter Cor 8.431 0.5065 4.978 0.3216
Applying the support vector machines model to the preprocessed datasets showed no improvement in model performance with any of the preprocessing methods.
# Random forests
res6 = perform_ML_preproc(carotAg, 'rf', 'CarotenoidsContent_TCCS')
res6[order(res6$RMSE),] #ordered by RMSE values
## RMSE Rsquared RMSESD RsquaredSD
## Smoothing 3.664 0.9617 2.027 0.03688
## Scaling 3.678 0.9501 2.616 0.05652
## Background cor 3.808 0.9580 2.509 0.03858
## No preprocessing 3.810 0.9673 2.707 0.03217
## Background + Offset cors 4.175 0.9310 2.263 0.06727
## Background + Offset + Baseline cors 4.338 0.8986 2.998 0.09572
## First Derivative 4.638 0.8910 3.619 0.10289
## Multiplicative Scatter Cor 5.958 0.5513 2.815 0.37261
Applying the random forest model to the preprocessed datasets showed a slight improvement in model performance when using smoothing interpolation (RMSE of 3.664), scaling (RMSE of 3.678) and background correction (RMSE of 3.808) as preprocessing methods.
The data was also filtered to determine whether feature selection could improve model performance. A flat pattern filter with the inter-quartile range as filter function was applied to the dataset, removing 80%, 60% and 40% of the variables each time.
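Conceptually, the filter ranks every wavelength by a spread statistic and discards the flattest ones; a minimal base-R sketch of the idea (not specmine’s implementation), assuming features in rows as in specmine data matrices:
# Sketch of IQR-based flat pattern filtering: drop the red.percent flattest
# features (rows); samples are in columns
flat_filter_iqr <- function(datamat, red.percent) {
  iqrs = apply(datamat, 1, IQR, na.rm = TRUE)
  keep = iqrs > quantile(iqrs, red.percent / 100)
  datamat[keep, , drop = FALSE]
}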
#Filtering 80% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 80)
res7 = perform_ML(carotAg.filt, models, 'CarotenoidsContent_TCCS')
# Results of 80% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res7-res1
res7_1 = cbind(round(res7,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res7_1[order(res7_1$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Support Vector Machines (e1071) 3.775 0.9272 2.865 0.10695 | 0.06607 -0.00439
## Ridge Regression (w/ FS) 3.784 0.9609 2.880 0.04962 | -0.07122 0.01313
## Elastic Net 3.819 0.9200 2.784 0.10532 | 0.02604 0.00153
## Support Vector Machines (kernlab) 4.004 0.9354 2.985 0.09008 | 0.07612 -0.00551
## Ridge Regression 4.026 0.9179 2.921 0.11112 | 0.14857 -0.01042
## Partial Least Squares (widekernelpls) 4.029 0.9165 2.783 0.10537 | 0.29741 -0.00733
## Random Forest 4.068 0.8764 3.461 0.17183 | 0.29960 -0.07194
## Partial Least Squares (pls) 4.105 0.9278 2.537 0.09569 | 0.30487 -0.02514
## Partial Least Squares (simpls) 4.218 0.9288 2.595 0.09155 | 0.72582 0.00805
## Partial Least Squares (kernelpls) 4.290 0.9470 2.748 0.07729 | 0.19427 0.05079
## Linear Regression (w/ Backwards Selection) 4.413 0.8663 3.434 0.16604 | 0.02065 -0.00480
## Linear Regression (w/ Stepwise Selection) 4.618 0.8835 3.240 0.15953 | 0.46010 -0.03570
## Linear Regression 4.895 0.7794 3.662 0.24235 | -104.51391 0.22306
## Lasso 5.181 0.7879 4.131 0.22043 | -0.02559 -0.02956
## K-Nearest Neighbors 5.346 0.8538 3.898 0.16235 | 0.61380 -0.06857
## Linear Regression (w/ Forward Selection) 5.574 0.7897 3.579 0.33281 | 1.39583 -0.09862
## Conditional Inference Random Forest 6.696 0.7714 2.901 0.12981 | -0.01732 -0.02031
## Decision Trees 7.163 0.7370 2.819 0.15669 | -0.41977 0.05371
## Conditional Inference Tree 7.293 0.7095 2.774 0.16195 | -0.06946 -0.00189
Filtering out 80% of the data showed an overall decrease in model performance, with RMSE values increasing in comparison to the results on the original dataset. It did, however, massively improve the linear model (without selection), decreasing its RMSE by about 104 units.
#Filtering 60% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 60)
res8 = perform_ML(carotAg.filt, models, 'CarotenoidsContent_TCCS')
# Results of 60% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res8-res1
res8_1 = cbind(round(res8,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res8_1[order(res8_1$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Elastic Net 2.975 0.9325 2.465 0.10707 | -0.81828 0.01401
## Ridge Regression 3.082 0.9597 2.287 0.07438 | -0.79492 0.03133
## Partial Least Squares (widekernelpls) 3.115 0.9515 2.494 0.04963 | -0.61633 0.02769
## Partial Least Squares (pls) 3.146 0.9432 2.406 0.06523 | -0.65462 -0.00970
## Support Vector Machines (kernlab) 3.184 0.9426 2.817 0.10225 | -0.74389 0.00166
## Ridge Regression (w/ FS) 3.269 0.9549 2.696 0.08504 | -0.58561 0.00713
## Partial Least Squares (simpls) 3.330 0.9412 2.612 0.06337 | -0.16275 0.02041
## Support Vector Machines (e1071) 3.414 0.9108 3.349 0.18377 | -0.29580 -0.02077
## Partial Least Squares (kernelpls) 3.549 0.9412 2.946 0.10128 | -0.54735 0.04500
## Random Forest 3.764 0.9592 2.564 0.03871 | -0.00474 0.01090
## Linear Regression (w/ Stepwise Selection) 4.408 0.8678 3.784 0.17481 | 0.25013 -0.05146
## Linear Regression (w/ Backwards Selection) 4.571 0.8832 3.584 0.18819 | 0.17868 0.01204
## K-Nearest Neighbors 4.624 0.9353 4.362 0.06769 | -0.10839 0.01297
## Linear Regression (w/ Forward Selection) 4.785 0.9189 4.420 0.15484 | 0.60672 0.03060
## Lasso 6.578 0.9142 12.959 0.12363 | 1.37104 0.09679
## Conditional Inference Random Forest 6.669 0.7811 3.427 0.12917 | -0.04438 -0.01059
## Decision Trees 7.255 0.7333 3.127 0.15816 | -0.32758 0.04998
## Conditional Inference Tree 7.482 0.7537 3.436 0.16905 | 0.11911 0.04232
## Linear Regression 35.529 0.6011 49.871 0.36117 | -73.87949 0.04477
Filtering out 60% of the data, on the other hand, showed an overall increase in model performance, with RMSE values decreasing in comparison to the results on the original dataset. Here, the elastic net model showed the lowest RMSE value so far, 2.975.
#Filtering 40% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)
res9 = perform_ML(carotAg.filt, models, 'CarotenoidsContent_TCCS')
# Results of 40% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res9-res1
res9_1 = cbind(round(res9,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res9_1[order(res9_1$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Elastic Net 2.788 0.9573 2.122 0.04566 | -1.00560 0.03886
## Partial Least Squares (widekernelpls) 3.022 0.9350 2.444 0.12042 | -0.70960 0.01115
## Linear Regression (w/ Forward Selection) 3.032 0.9488 2.332 0.05595 | -1.14588 0.06053
## Partial Least Squares (pls) 3.202 0.9672 2.095 0.04005 | -0.59860 0.01428
## Ridge Regression 3.287 0.9468 2.412 0.04975 | -0.59062 0.01848
## Ridge Regression (w/ FS) 3.387 0.9429 2.267 0.10049 | -0.46729 -0.00480
## Partial Least Squares (kernelpls) 3.399 0.9236 2.990 0.14775 | -0.69725 0.02743
## Random Forest 3.556 0.9311 2.533 0.09180 | -0.21273 -0.01720
## Partial Least Squares (simpls) 3.620 0.9412 3.000 0.13589 | 0.12811 0.02045
## Support Vector Machines (kernlab) 3.725 0.9360 3.091 0.12235 | -0.20346 -0.00496
## Support Vector Machines (e1071) 3.817 0.9128 3.203 0.17841 | 0.10740 -0.01876
## Linear Regression (w/ Stepwise Selection) 4.016 0.8875 3.789 0.18218 | -0.14253 -0.03173
## Linear Regression (w/ Backwards Selection) 4.096 0.8863 3.186 0.11189 | -0.29628 0.01516
## Lasso 4.282 0.9266 3.412 0.13305 | -0.92460 0.10920
## K-Nearest Neighbors 4.994 0.9290 3.993 0.06097 | 0.26195 0.00662
## Conditional Inference Random Forest 6.734 0.7649 2.720 0.12479 | 0.02156 -0.02687
## Decision Trees 7.168 0.7257 2.969 0.14286 | -0.41461 0.04241
## Conditional Inference Tree 7.264 0.7414 3.145 0.17168 | -0.09836 0.03003
## Linear Regression 85.785 0.5815 406.272 0.36570 | -23.62301 0.02515
Filtering out 40% of the data showed even better results than the previous case, with an overall increase in model performance and RMSE values decreasing in comparison to the results on the original dataset. Here, the elastic net model showed an even lower RMSE of 2.788.
A machine learning analysis using the CIELAB data was also performed.
The CIELAB data is stored in the metadata file; it therefore needs to be extracted first to create the CIELAB dataset.
color.values = t(get_metadata(carotAg)[2:4]) #L a b
filtered.meta = get_metadata(carotAg)[5:12]
carotCielab = create_dataset(datamatrix = color.values, metadata = filtered.meta, label.x = "cielab",
label.values = "color values", description = "Dataset from cielab values")
head(carotCielab$data)[,1:12] #Cielab values for first 12 samples
## 101.1 102.1 103.1 105.1 11.1 119.1 123.1 125.1 21.1 23.1 27.1 3.1
## Cielab_L 77.670 85.017 81.25 69.25 83.59 69.510 82.893 68.563 74.113 70.240 83.983 85.717
## Cielab_A -3.397 -3.663 -4.46 -4.95 -3.44 -5.457 -2.123 -4.733 -4.277 -1.437 -2.140 -2.607
## Cielab_B 16.493 18.477 18.49 31.96 16.81 37.693 8.213 36.790 20.107 16.160 8.683 22.017
sum_dataset(carotCielab) # Dataset summary
## Dataset summary:
## Valid dataset
## Description: Dataset from cielab values
## Type of data: undefined
## Number of samples: 50
## Number of data points 3
## Number of metadata variables: 8
## Label of x-axis values: cielab
## Label of data points: color values
## Number of missing values in data: 0
## Mean of data values: 31.99
## Median of data values: 18.69
## Standard deviation: 35.84
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.457 -3.070 18.685 75.292 88.283
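A quick way to eyeball the premise that colour tracks carotenoid content is to plot one CIELAB channel against the response (b* is the blue-yellow axis, hence the most relevant channel for yellow/orange roots); a minimal sketch:
plot(get_metadata(carotAg)$Cielab_B, get_metadata(carotAg)$CarotenoidsContent_TCCS,
     xlab = "CIELAB b*", ylab = "CarotenoidsContent_TCCS") # one point per genotype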
The same machine learning models used on the UV dataset were applied to the CIELAB dataset, with the exception of the linear regression models with selection, as these make little sense with only 3 features in the dataset (L, a and b values). The metadata variable used for prediction was “CarotenoidsContent_TCCS”.
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls',
'widekernelpls', 'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm')
#Using CarotenoidsContent_TCCS variable
res10 = perform_ML(carotCielab, models, pred_var = 'CarotenoidsContent_TCCS')
# Results w/ CIELAB data and difference to unprocessed UV data results (Two last columns)
diff = res10-res1[-c(17,18,19),]
res10_1 = cbind(round(res10,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res10_1[order(res10_1$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Linear Regression 6.295 0.5933 2.050 0.3114 | -103.113 0.03702
## Lasso 6.412 0.5503 2.191 0.2915 | 1.205 -0.26716
## Ridge Regression 6.417 0.5681 1.874 0.3271 | 2.540 -0.36027
## Elastic Net 6.456 0.5785 2.354 0.3005 | 2.663 -0.33996
## K-Nearest Neighbors 6.636 0.5336 3.660 0.3659 | 1.904 -0.38878
## Ridge Regression (w/ FS) 6.638 0.5628 2.200 0.3159 | 2.783 -0.38498
## Random Forest 6.647 0.5124 4.079 0.3346 | 2.879 -0.43592
## Partial Least Squares (pls) 6.939 0.5916 2.270 0.2887 | 3.139 -0.36136
## Partial Least Squares (simpls) 6.990 0.6022 2.410 0.2777 | 3.498 -0.31859
## Support Vector Machines (e1071) 7.015 0.5350 3.394 0.3027 | 3.306 -0.39664
## Partial Least Squares (kernelpls) 7.121 0.5827 2.315 0.2755 | 3.024 -0.31353
## Partial Least Squares (widekernelpls) 7.125 0.6221 2.691 0.2882 | 3.394 -0.30171
## Support Vector Machines (kernlab) 7.294 0.5040 3.719 0.3179 | 3.366 -0.43688
## Conditional Inference Random Forest 8.162 0.4385 4.159 0.2706 | 1.449 -0.35320
## Conditional Inference Tree 9.388 0.3063 3.570 0.2011 | 2.026 -0.40503
## Decision Trees 9.990 0.2679 3.170 0.2505 | 2.408 -0.41536
From the results above it is clear that there is an overall decrease in model performance when using CIELAB data in comparison to UV data, with increased RMSE values. However, the linear model performed better than any other model, with an RMSE of 6.295, unlike with the UV data, where it performed worst in almost every case.
The dataset was then scaled to test whether CIELAB data scaling could improve results.
carotCielab.sc = specmine::scaling(carotCielab)
sum_dataset(carotCielab.sc)
## Dataset summary:
## Valid dataset
## Description: Dataset from cielab values; Scaling with method auto
## Type of data: undefined
## Number of samples: 50
## Number of data points 3
## Number of metadata variables: 8
## Label of x-axis values: cielab
## Label of data points: color values
## Number of missing values in data: 0
## Mean of data values: 1.49e-16
## Median of data values: 0.06326
## Standard deviation: 0.9933
## Range of values: -2.187 3.695
## Quantiles:
## 0% 25% 50% 75% 100%
## -2.18663 -0.49244 0.06326 0.52084 3.69515
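For reference, a minimal sketch of what the “auto” scaling above does to a data matrix (mean-centring each feature and dividing by its standard deviation), assuming features in rows as in specmine datasets:
auto_scale <- function(datamat) {
  t(scale(t(datamat))) # scale() works column-wise, hence the double transpose
}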
res11 = perform_ML(carotCielab.sc, models, pred_var = 'CarotenoidsContent_TCCS')
# Results w/ scaled CIELAB data and difference to unprocessed CIELAB data results (Two last columns)
diff = res11-res10
res11_10 = cbind(round(res11,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res11_10[order(res11_10$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 6.469 0.6087 2.199 0.2984 | -0.16921 0.04594
## Ridge Regression 6.497 0.5909 2.223 0.3051 | 0.08049 0.02284
## Elastic Net 6.515 0.5736 2.280 0.2943 | 0.05852 -0.00493
## Linear Regression 6.651 0.5587 2.207 0.2954 | 0.35576 -0.03466
## Lasso 6.757 0.5759 2.867 0.2945 | 0.34593 0.02567
## Partial Least Squares (widekernelpls) 6.771 0.5416 2.285 0.3121 | -0.35489 -0.08055
## Partial Least Squares (kernelpls) 6.865 0.5404 2.318 0.3196 | -0.25558 -0.04225
## Support Vector Machines (kernlab) 6.919 0.5284 4.051 0.3021 | -0.37585 0.02440
## Partial Least Squares (simpls) 7.043 0.5433 2.955 0.2718 | 0.05302 -0.05884
## Partial Least Squares (pls) 7.085 0.5385 2.695 0.2821 | 0.14515 -0.05306
## Support Vector Machines (e1071) 7.136 0.5000 3.350 0.3098 | 0.12065 -0.03496
## K-Nearest Neighbors 7.267 0.5257 4.555 0.3774 | 0.63094 -0.00786
## Random Forest 7.280 0.4481 4.149 0.3327 | 0.63345 -0.06426
## Conditional Inference Random Forest 8.021 0.4546 4.547 0.2727 | -0.14069 0.01608
## Conditional Inference Tree 9.636 0.3393 3.778 0.2672 | 0.24707 0.03295
## Decision Trees 9.737 0.3168 3.303 0.2785 | -0.25366 0.04889
Applying the machine learning models to the scaled CIELAB data showed mixed results, with model performance increasing or decreasing depending on the model. These changes were, however, small.
A machine learning analysis using fused UV and CIELAB data was also performed.
Two fused datasets were created: one using the UV data with 40% of the variables filtered out, and another using the entire UV data.
# Not filtered
carot.fus = low_level_fusion(list(carotAg, carotCielab))
sum_dataset(carot.fus)
## Dataset summary:
## Valid dataset
## Description: Data integration from types: uvv-spectra,undefined
## Type of data: integrated-data
## Number of samples: 50
## Number of data points 104
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 1.148
## Median of data values: 0.1881
## Standard deviation: 8.069
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.4567 0.1335 0.1881 0.2673 88.2833
# 40% data filtered
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)
carot.fus.filt = low_level_fusion(list(carotAg.filt, carotCielab))
sum_dataset(carot.fus.filt)
## Dataset summary:
## Valid dataset
## Description: Data integration from types: uvv-spectra,undefined
## Type of data: integrated-data
## Number of samples: 50
## Number of data points 63
## Number of metadata variables: 12
## Label of x-axis values: Wavelength
## Label of data points: Absorbance
## Number of missing values in data: 0
## Mean of data values: 1.782
## Median of data values: 0.217
## Standard deviation: 10.32
## Range of values: -5.457 88.28
## Quantiles:
## 0% 25% 50% 75% 100%
## -5.4567 0.1700 0.2170 0.3074 88.2833
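Conceptually, low-level fusion concatenates the feature blocks of the two datasets over their common samples; a minimal sketch (metadata handling omitted; specmine’s low_level_fusion also merges the metadata), assuming samples in columns:
fuse_low_level <- function(mat1, mat2) {
  common = intersect(colnames(mat1), colnames(mat2)) # shared sample names
  rbind(mat1[, common, drop = FALSE], mat2[, common, drop = FALSE])
}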
The same machine learning models applied to the UV dataset were used for the UV and CIELAB fusion datasets. The metadata variable used for prediction was “CarotenoidsContent_TCCS”.
models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')
# Using unfiltered dataset
res12 = perform_ML(carot.fus, models, pred_var = 'CarotenoidsContent_TCCS')
# Results w/ unfiltered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res12-res1
res12_1 = cbind(round(res12,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res12_1[order(res12_1$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 3.570 0.9298 2.570 0.09139 | -0.28492 -0.01790
## Partial Least Squares (pls) 3.682 0.8931 2.208 0.16247 | -0.11817 -0.05986
## Partial Least Squares (simpls) 3.706 0.8746 2.115 0.22102 | 0.21384 -0.04617
## Random Forest 3.758 0.9444 3.193 0.06612 | -0.01067 -0.00397
## Elastic Net 3.775 0.9179 3.135 0.12661 | -0.01812 -0.00055
## Partial Least Squares (kernelpls) 3.804 0.8312 2.036 0.25565 | -0.29205 -0.06502
## Support Vector Machines (e1071) 3.875 0.8887 2.945 0.13634 | 0.16551 -0.04289
## Partial Least Squares (widekernelpls) 4.017 0.9247 3.009 0.09406 | 0.28545 0.00087
## Linear Regression (w/ Backwards Selection) 4.479 0.8020 3.793 0.28868 | 0.08673 -0.06914
## Support Vector Machines (kernlab) 4.612 0.8800 3.530 0.17235 | 0.68415 -0.06088
## Linear Regression (w/ Stepwise Selection) 4.718 0.7973 3.942 0.26786 | 0.55925 -0.12196
## Linear Regression (w/ Forward Selection) 4.829 0.8743 4.892 0.18620 | 0.65048 -0.01400
## Ridge Regression 4.839 0.8510 3.693 0.20439 | 0.96199 -0.07729
## Lasso 4.983 0.8076 3.935 0.25721 | -0.22357 -0.00988
## K-Nearest Neighbors 6.412 0.6320 3.724 0.32438 | 1.67962 -0.29035
## Conditional Inference Random Forest 6.663 0.7671 3.738 0.12083 | -0.05004 -0.02466
## Conditional Inference Tree 7.566 0.6697 3.125 0.14793 | 0.20316 -0.04168
## Decision Trees 8.021 0.6997 3.349 0.21125 | 0.43851 0.01642
## Linear Regression 37.304 0.5489 106.894 0.35402 | -72.10456 -0.00743
The machine learning analysis with the unfiltered fusion data showed mixed results compared to the unprocessed UV data, with performance increasing or decreasing depending on the model. The best performance was achieved by the ridge regression model (with selection), with an RMSE of 3.570.
# Using dataset w/ 40% data filtered
res13 = perform_ML(carot.fus.filt, models, pred_var = 'CarotenoidsContent_TCCS')
# Results w/ 40% filtered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res13-res1
res13_1 = cbind(round(res13,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res13_1[order(res13_1$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Partial Least Squares (widekernelpls) 3.084 0.9319 2.047 0.08297 | -0.64814 0.00804
## Ridge Regression (w/ FS) 3.270 0.9474 2.222 0.04949 | -0.58469 -0.00033
## Elastic Net 3.297 0.8981 3.157 0.15709 | -0.49655 -0.02037
## Partial Least Squares (kernelpls) 3.439 0.9337 2.324 0.07546 | -0.65670 0.03750
## Partial Least Squares (pls) 3.497 0.9492 2.314 0.05237 | -0.30310 -0.00371
## Partial Least Squares (simpls) 3.508 0.9028 3.208 0.14473 | 0.01595 -0.01804
## Ridge Regression 3.581 0.9175 2.596 0.12414 | -0.29581 -0.01084
## Random Forest 3.634 0.9422 2.461 0.06038 | -0.13417 -0.00615
## Support Vector Machines (kernlab) 3.698 0.9109 2.622 0.13331 | -0.22996 -0.03005
## Support Vector Machines (e1071) 3.822 0.9046 2.336 0.12510 | 0.11254 -0.02704
## Linear Regression (w/ Forward Selection) 3.973 0.8968 3.630 0.15522 | -0.20577 0.00848
## Linear Regression (w/ Stepwise Selection) 3.973 0.9137 3.111 0.14123 | -0.18510 -0.00557
## Linear Regression (w/ Backwards Selection) 4.216 0.8961 3.468 0.17868 | -0.17696 0.02496
## Lasso 4.520 0.8970 3.549 0.13102 | -0.68690 0.07961
## K-Nearest Neighbors 6.445 0.5573 3.881 0.36184 | 1.71233 -0.36511
## Conditional Inference Random Forest 6.813 0.8037 2.580 0.13109 | 0.10060 0.01198
## Decision Trees 7.398 0.7618 3.077 0.16124 | -0.18477 0.07850
## Conditional Inference Tree 7.471 0.7600 2.738 0.16739 | 0.10842 0.04860
## Linear Regression 34.045 0.5631 87.680 0.35951 | -75.36335 0.00675
The machine learning analysis with the filtered fusion data showed an overall increase in model performance compared to the results obtained with unprocessed UV data. The best performance was achieved by the partial least squares model (“widekernelpls”), with an RMSE of 3.084.
Both the filtered and unfiltered fusion datasets were then scaled and the machine learning models applied to these scaled datasets.
# Using unfiltered dataset
carot.fus.sc = specmine::scaling(carot.fus)
res14 = perform_ML(carot.fus.sc, models, pred_var = 'CarotenoidsContent_TCCS')
# Results w/ unfiltered scaled fusion data and difference to unprocessed UV data results (Two last columns)
diff = res14-res1
res14_1 = cbind(round(res14,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res14_1[order(res14_1$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 3.363 0.9562 2.444 0.04904 | -0.49190 0.00846
## Elastic Net 3.555 0.9134 3.354 0.14244 | -0.23861 -0.00503
## Random Forest 3.702 0.9431 2.547 0.07687 | -0.06679 -0.00520
## Partial Least Squares (widekernelpls) 3.821 0.9270 2.940 0.09650 | 0.08971 0.00315
## Partial Least Squares (simpls) 3.828 0.9047 2.888 0.13734 | 0.33613 -0.01612
## Linear Regression (w/ Forward Selection) 3.955 0.8887 3.734 0.16237 | -0.22347 0.00037
## Partial Least Squares (kernelpls) 4.018 0.9099 2.484 0.14167 | -0.07785 0.01372
## Partial Least Squares (pls) 4.044 0.9445 2.562 0.06779 | 0.24382 -0.00844
## Support Vector Machines (e1071) 4.149 0.9021 3.013 0.12431 | 0.43919 -0.02955
## Ridge Regression 4.277 0.8650 3.677 0.17924 | 0.39998 -0.06330
## Support Vector Machines (kernlab) 4.285 0.9174 2.871 0.09592 | 0.35657 -0.02351
## Linear Regression (w/ Stepwise Selection) 4.501 0.8917 5.755 0.17352 | 0.34253 -0.02755
## Linear Regression (w/ Backwards Selection) 4.625 0.7971 3.493 0.25177 | 0.23213 -0.07401
## K-Nearest Neighbors 5.085 0.9319 3.843 0.06019 | 0.35300 0.00957
## Lasso 5.380 0.7681 4.480 0.26323 | 0.17360 -0.04938
## Conditional Inference Random Forest 6.675 0.7819 2.961 0.14885 | -0.03773 -0.00982
## Conditional Inference Tree 7.430 0.7408 2.923 0.18633 | 0.06707 0.02940
## Decision Trees 8.279 0.6797 2.552 0.26684 | 0.69642 -0.00359
## Linear Regression 57.460 0.5357 185.598 0.32405 | -51.94857 -0.02059
The machine learning analysis with the scaled (unfiltered) fusion data showed mixed results compared to the unprocessed UV data, with performance increasing or decreasing depending on the model. The best performance was achieved by the ridge regression model (with selection), with an RMSE of 3.363.
# Using dataset w/ 40% data filtered
carot.fus.filt.sc = specmine::scaling(carot.fus.filt)
res15 = perform_ML(carot.fus.filt.sc, models, pred_var = 'CarotenoidsContent_TCCS')
# Results w/ 40% filtered, scaled fusion data and difference to unprocessed UV data results (Two last columns)
diff = res15-res1
res15_1 = cbind(round(res15,5), div = rep(' |', nrow(diff)), round(diff[-c(3,4)],5))
res15_1[order(res15_1$RMSE),]
## RMSE Rsquared RMSESD RsquaredSD div RMSE Rsquared
## Ridge Regression (w/ FS) 3.291 0.9358 2.390 0.10474 | -0.56366 -0.01194
## Partial Least Squares (pls) 3.295 0.9213 2.160 0.09737 | -0.50543 -0.03162
## Partial Least Squares (widekernelpls) 3.354 0.9260 2.448 0.11008 | -0.37727 0.00213
## Ridge Regression 3.413 0.9142 2.451 0.11500 | -0.46427 -0.01414
## Partial Least Squares (simpls) 3.421 0.9255 2.393 0.08205 | -0.07146 0.00472
## Partial Least Squares (kernelpls) 3.459 0.9032 2.261 0.10488 | -0.63719 0.00695
## Linear Regression (w/ Stepwise Selection) 3.490 0.9463 2.800 0.05425 | -0.66874 0.02704
## Elastic Net 3.499 0.9349 3.058 0.11903 | -0.29455 0.01644
## Linear Regression (w/ Forward Selection) 3.695 0.9222 2.430 0.08538 | -0.48370 0.03385
## Random Forest 3.712 0.9458 2.497 0.06147 | -0.05596 -0.00253
## Support Vector Machines (e1071) 3.795 0.9365 2.946 0.08364 | 0.08592 0.00492
## Support Vector Machines (kernlab) 4.023 0.8901 2.892 0.17257 | 0.09496 -0.05087
## Linear Regression (w/ Backwards Selection) 4.130 0.8808 3.474 0.15742 | -0.26264 0.00972
## Lasso 4.456 0.8864 3.621 0.19410 | -0.75126 0.06900
## K-Nearest Neighbors 4.966 0.9238 3.985 0.08355 | 0.23403 0.00144
## Conditional Inference Random Forest 6.536 0.7824 3.561 0.12064 | -0.17719 -0.00927
## Decision Trees 7.042 0.7087 2.832 0.15897 | -0.54042 0.02545
## Conditional Inference Tree 7.487 0.7142 1.911 0.16668 | 0.12455 0.00279
## Linear Regression 23.035 0.5023 35.928 0.34654 | -86.37294 -0.05395
Using filtered and scaled fusion data resulted in an overall increase in model performance compared to the unprocessed UV data results. The best performance was achieved by the ridge regression model (with selection), with an RMSE of 3.291.
UV Data: the best overall performance was obtained with the elastic net model on the 40%-filtered dataset (RMSE of 2.788).
CIELAB Data: the best performance was obtained with the linear regression model on the unscaled data (RMSE of 6.295).
Fusion Data: the best performance was obtained with the partial least squares model (“widekernelpls”) on the 40%-filtered, unscaled data (RMSE of 3.084).