1 Introduction

The aim of this work is to validate a quantification method for carotenoid content in roots of M. esculenta based on colorimetric data expressed in the CIE L*a*b* system. The underlying assumption is that predictive statistical and machine learning techniques can correlate colorimetric data, easily obtained in the field, with the levels obtained through traditional quantification techniques such as UV-visible spectrophotometry or HPLC, and thereby be used to build prediction models of carotenoid content for this type of biomass.

Roots of fifty M. esculenta genotypes belonging to EPAGRI’s germplasm bank were sampled in the 2014/2015 season. Carotenoids were extracted from fresh roots and the absorbances of the organosolvent extracts were recorded on a UV-visible spectrophotometer over a spectral window from 200 to 700 nm. Aliquots (10 µl) of the extracts were also injected into a liquid chromatograph. The color attributes of the samples were measured with a colorimeter and the results were expressed according to the CIELAB color space scale.

2 Necessary tools

To run this script the following packages are necessary:

library(specmine)
library(xlsx)
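
If either package is missing, an installation along these lines should work (a minimal sketch, assuming both packages are available from CRAN; the xlsx package additionally requires a working Java/rJava setup):

install.packages(c("specmine", "xlsx"))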

Setting working directory:

setwd("C:/Users/Telma/Desktop/CassavaCarotenoids")
set.seed(12345)

2.1 Used Models

The machine learning models used in this analysis are listed in the table below. These belong to the caret package, which is used by specmine.

Table 1 - Machine learning models used in this analysis. The first column shows the model’s name, the second column shows the value that should be given to the function and the third column indicates whether or not the model has built-in feature selection. For more information on any of the models visit https://topepo.github.io/caret/available-models.html
Model                                        “Method” Value                         Built-in Feature Selection
Conditional Inference Random Forest          cforest                                YES
Conditional Inference Tree                   ctree                                  YES
Decision Trees                               rpart                                  YES
Elastic Net                                  enet                                   YES
K-Nearest Neighbors                          knn                                    NO
Lasso Regression                             lasso                                  YES
Linear Regression                            lm                                     NO
Linear Regression (w/ Backwards Selection)   leapBackward                           YES
Linear Regression (w/ Forward Selection)     leapForward                            YES
Linear Regression (w/ Stepwise Selection)    leapSeq                                YES
Partial Least Squares                        kernelpls, pls, simpls, widekernelpls  YES
Random Forest                                rf                                     YES
Ridge Regression                             ridge                                  NO
Ridge Regression (w/ Feature Selection)      foba                                   YES
Support Vector Machines (kernlab package)    svmLinear                              NO
Support Vector Machines (e1071 package)      svmLinear2                             NO
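
As an illustration of how the “Method” value in the table is used, a single model can be evaluated directly with specmine’s train_models_performance (the same call used inside the helper functions below). This is a hypothetical example, assuming a dataset object named ds with a numeric metadata variable “TCCHPLC”:

ml_res = train_models_performance(ds, c("rf"), "TCCHPLC", "repeatedcv",
                                  num.folds = 5, compute.varimp = F)
ml_res$performance   # RMSE, Rsquared, RMSESD and RsquaredSD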

2.2 Auxiliary functions

The following function is used to retrieve the model name given the “method” value.

getModelName <- function(model) {
  if (model == 'lasso') name = 'Lasso'
  else if (model == 'ridge') name = 'Ridge Regression'
  else if (model == 'foba') name = 'Ridge Regression (w/ FS)'
  else if (model == 'rf') name = 'Random Forest'
  else if (model == 'cforest') name = 'Conditional Inference Random Forest'
  else if (model == 'enet') name = 'Elastic Net'
  else if (model == 'pls') name = 'Partial Least Squares (pls)'
  else if (model == 'kernelpls') name = 'Partial Least Squares (kernelpls)'
  else if (model == 'simpls') name = 'Partial Least Squares (simpls)'
  else if (model == 'widekernelpls') name = 'Partial Least Squares (widekernelpls)'
  else if (model == 'rpart') name = 'Decision Trees'
  else if (model == 'ctree') name = 'Conditional Inference Tree'
  else if (model == 'svmLinear') name = 'Support Vector Machines (kernlab)'
  else if (model == 'svmLinear2') name = 'Support Vector Machines (e1071)'
  else if (model == 'knn') name = 'K-Nearest Neighbors'
  else if (model == 'lm') name = 'Linear Regression'
  else if (model == 'leapBackward') name = 'Linear Regression (w/ Backwards Selection)'
  else if (model == 'leapForward') name = 'Linear Regression (w/ Forward Selection)'
  else if (model == 'leapSeq') name = 'Linear Regression (w/ Stepwise Selection)'
  else return()
  return (name)
}
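
A quick check of the mapping, for example:

getModelName('foba')
## [1] "Ridge Regression (w/ FS)"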

The following function returns a data frame with the result of applying one or more machine learning models to a selected dataset. The metadata variable for prediction must be supplied.

perform_ML <- function(dataset, models, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
  for (model in models) {
    name = getModelName(model)
    # train each model with 5-fold repeated cross-validation and collect its performance metrics
    ml_res = train_models_performance(dataset, c(model), pred_var, "repeatedcv", 
                                      num.folds = 5, compute.varimp = F)
    res[name,] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared, 
                   ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    # store the partial results in the global environment after each model
    assign('res', res, envir = .GlobalEnv)
  }
  return(res)
}
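
As a hypothetical usage example (assuming a dataset object named ds with a “TCCHPLC” metadata variable):

res_example = perform_ML(ds, c('pls', 'rf'), pred_var = 'TCCHPLC')
res_example   # one row per model, with RMSE, Rsquared, RMSESD and RsquaredSD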

The following function returns a data frame with the results of applying a machine learning model to a dataset after each of several preprocessing methods: scaling, smoothing interpolation, background, offset and baseline corrections, first derivative and multiplicative scatter correction. The metadata variable for prediction must be supplied.

perform_ML_preproc <- function(dataset, model, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
  
  # build the preprocessed versions of the supplied dataset
  ds.sc = specmine::scaling(dataset)
  ds.wavelens = get_x_values_as_num(dataset)
  x.axis.sm = seq(min(ds.wavelens), max(ds.wavelens), 10)
  ds.smooth = smoothing_interpolation(dataset, method = "loess", x.axis = x.axis.sm)
  ds.bg = data_correction(dataset, 'background')
  ds.offset = data_correction(ds.bg, 'offset')
  ds.baseline = data_correction(ds.offset, 'baseline')
  ds.fd = first_derivative(dataset)
  ds.msc = msc_correction(dataset)
  
  datasets = list('No preprocessing' = dataset, 'Scaling' = ds.sc, 'Smoothing' = ds.smooth, 
                  'Background cor' = ds.bg, 'Background + Offset cors' = ds.offset, 
                  'Background + Offset + Baseline cors' = ds.baseline, 'First Derivative' = ds.fd,
                  'Multiplicative Scatter Cor' = ds.msc)
  i = 1
  for (ds in datasets) {
    # train the model on each preprocessed dataset with 5-fold repeated cross-validation
    ml_res = train_models_performance(ds, c(model), pred_var, "repeatedcv", num.folds = 5, compute.varimp = F)
    res[names(datasets)[i],] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
                                 ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    # store the partial results in the global environment after each dataset
    assign('res', res, envir = .GlobalEnv)
    i = i + 1
  }
  return(res)
}

3 UV Data

3.1 Read data from xlsx files

UV data is stored in 150 .xlsx files (3 replicates for each of the 50 genotypes), each file containing the absorbance values read between 200 and 700 nm.

files = list.files("data/UV")
datamat = matrix(nrow = 501, ncol = length(files))
rownames(datamat) = 200:700   # data recorded between 200-700 nm
colnames(datamat) = gsub("\\.xlsx?$", "", files)   # strip the extension, keeping the genotype.replicate id

for (i in 1:length(files)){
  tab_excel = read.xlsx(paste("data/UV/", files[i], sep = ""), sheetIndex = 1, header = F)
  # second column holds the absorbances; pad with NAs if a file has fewer than 501 readings
  datamat[,i] = c(tab_excel[,2], rep(NA, 501-length(tab_excel[,2])))
}

datamat[1:6, 1:6]
##       101.1  101.2   101.3   102.1  102.2   102.3
## 200 0.08763 0.1863 0.10565 0.10565 0.1482 0.13221
## 201 0.09468 0.2184 0.13756 0.12944 0.1254 0.08732
## 202 0.06238 0.1792 0.08410 0.09159 0.1437 0.09159
## 203 0.11513 0.1776 0.13093 0.13497 0.1190 0.07799
## 204 0.11364 0.2038 0.05227 0.11364 0.1376 0.08368
## 205 0.13941 0.1820 0.10809 0.09691 0.1006 0.10809

3.2 Read metadata

Besides information regarding sample varieties and replicates, the metadata file also contains information about HPLC concentration measurements and CIELAB data.

file.metadata = "metadata/Carotenoides_Colorimetria.csv"
metadata = read_metadata(file.metadata)
description = "UV data for cassava cultivars - carotenoids"
label.x = "Wavelength"
label.values = "Absorbance"

head(metadata)
##     Varieties Replicates Cielab_L Cielab_A Cielab_B CarotenoidsContent_TCCS  Lutein Betacryptoxanthin
## 3.1         3          1    85.72    -2.70    22.28                   4.853 0.03248           0.06543
## 3.2         3          2    86.18    -2.48    21.39                   4.809 0.03248           0.06543
## 3.3         3          3    85.25    -2.64    22.38                   4.951 0.03248           0.06543
## 5.1         5          1    85.47    -1.76     6.74                   3.098 0.02598           0.07023
## 5.2         5          2    82.29    -2.00     7.02                   4.046 0.02598           0.07023
## 5.3         5          3    84.99    -1.86     7.25                   3.383 0.02598           0.07023
##     Alphacarotene Cisbetacarotene transbetacarotene Lycopene TCCHPLC
## 3.1       0.06021           2.250             3.269        0   5.678
## 3.2       0.06021           2.250             3.269        0   5.678
## 3.3       0.06021           2.250             3.269        0   5.678
## 5.1       0.08319           2.679             2.860        0   5.719
## 5.2       0.08319           2.679             2.860        0   5.719
## 5.3       0.08319           2.679             2.860        0   5.719

3.3 Create the dataset

After creating a matrix from the UV .xlsx files and reading the metadata, a dataset can be easily created.

Carotenoides_Colorimetria = create_dataset(type = "uvv-spectra", datamatrix = datamat, metadata = metadata, 
                                           label.x = label.x, label.values = label.values, 
                                           description = description)

sum_dataset(Carotenoides_Colorimetria)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids 
## Type of data:  uvv-spectra 
## Number of samples:  150 
## Number of data points 501 
## Number of metadata variables:  13 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  4224 
## Mean of data values:  0.3301 
## Median of data values:  0.1048 
## Standard deviation:  0.6824 
## Range of values:  -0.06964 4.191 
## Quantiles: 
##       0%      25%      50%      75%     100% 
## -0.06964  0.02003  0.10478  0.23166  4.19051

Because the majority of carotenoids absorb in the visible region of the spectrum, between 400 and 500 nm, a subset of the original dataset was created, keeping only the data points within this wavelength interval. Also, because the dataset has some missing values, as shown in the summary above, these were replaced with the mean of the corresponding variables’ values.

carot_sub = subset_x_values_by_interval(Carotenoides_Colorimetria, 400, 500) # Absorbances between 400-500nm
carot_sub_nomissing = missingvalues_imputation(carot_sub, method = "mean")
sum_dataset(carot_sub_nomissing)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids; Missing value imputation with method mean 
## Type of data:  uvv-spectra 
## Number of samples:  150 
## Number of data points 101 
## Number of metadata variables:  13 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  0.2316 
## Median of data values:  0.187 
## Standard deviation:  0.1907 
## Range of values:  -0.002721 1.574 
## Quantiles: 
##        0%       25%       50%       75%      100% 
## -0.002721  0.130033  0.186963  0.261674  1.574271

The data were then aggregated so that each genotype is represented by a single sample instead of three replicates (150 samples -> 50 samples).

indexes = rep(seq(1, num_samples(carot_sub_nomissing)/3), each = 3)
carotAg = aggregate_samples(carot_sub_nomissing, indexes, meta.to.remove = c("Replicates"))
sum_dataset(carotAg)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids; Missing value imputation with method mean 
## Type of data:  uvv-spectra 
## Number of samples:  50 
## Number of data points 101 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  0.2316 
## Median of data values:  0.1871 
## Standard deviation:  0.188 
## Range of values:  0.00136 1.299 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## 0.00136 0.13380 0.18708 0.26038 1.29949

The dataset is now ready to be used in the subsequent analysis.
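
Before moving to the modelling step, it can be useful to take a quick look at the aggregated spectra. Below is a minimal sketch using only base R’s matplot and objects already used in this document (get_x_values_as_num and the dataset’s $data matrix); the plot labels are illustrative.

# quick visual check of the 50 aggregated spectra (samples are the columns of the data matrix)
wavelengths = get_x_values_as_num(carotAg)
matplot(wavelengths, carotAg$data, type = "l", lty = 1, col = "grey40",
        xlab = "Wavelength (nm)", ylab = "Absorbance",
        main = "Aggregated UV-visible spectra (400-500 nm)")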

3.4 Machine Learning

The next step consisted in applying a variety of machine learning regression approaches to the data, testing different output variables and various preprocessing methods.

3.4.1 Testing Output Variables

To assess model performance in predicting carotenoid content, the machine learning models listed above were applied to the dataset, using different output variables. The Root-Mean-Square Error (RMSE) was chosen as the evaluation metric for comparing models, since it explicitly shows how much the model predictions deviate, on average, from the actual values in the dataset.
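
For \(n\) samples with observed values \(y_{i}\) and model predictions \(\hat{y}_{i}\), the RMSE is defined as

\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^{2}}\]

so it is expressed in the same units as the predicted variable, which makes it directly comparable to the variable means reported below.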

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
           'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')

#Using CarotenoidsContent_TCCS variable
res1 = perform_ML(carotAg, models, pred_var = 'CarotenoidsContent_TCCS') 
res1[order(res1$RMSE),] #ordered by RMSE values
##                                              RMSE Rsquared RMSESD RsquaredSD
## Ridge Regression                            3.361   0.9453  2.606    0.06067
## Partial Least Squares (widekernelpls)       3.392   0.9392  2.419    0.05604
## Partial Least Squares (kernelpls)           3.515   0.9498  2.366    0.04934
## Partial Least Squares (simpls)              3.563   0.9293  2.649    0.12503
## Linear Regression (w/ Backwards Selection)  3.587   0.8794  2.809    0.17217
## Elastic Net                                 3.750   0.9244  3.028    0.12689
## Partial Least Squares (pls)                 3.824   0.9238  2.884    0.15686
## Ridge Regression (w/ FS)                    3.826   0.9353  2.642    0.09911
## Random Forest                               3.838   0.9696  2.215    0.03146
## Support Vector Machines (e1071)             3.860   0.9205  3.112    0.15125
## Support Vector Machines (kernlab)           4.342   0.9228  3.345    0.14426
## Linear Regression (w/ Forward Selection)    4.355   0.8581  3.628    0.21425
## Linear Regression (w/ Stepwise Selection)   4.761   0.8179  4.176    0.24028
## K-Nearest Neighbors                         5.245   0.8721  3.902    0.15636
## Lasso                                       5.369   0.8270  4.485    0.23804
## Conditional Inference Random Forest         6.764   0.7787  3.095    0.12982
## Conditional Inference Tree                  7.552   0.6522  3.576    0.19633
## Decision Trees                              7.647   0.6644  3.198    0.19817
## Linear Regression                          18.372   0.5572 31.137    0.34458
mean(get_metadata(carotAg)$CarotenoidsContent_TCCS) # CarotenoidsContent_TCCS variable mean values
## [1] 10.67

The results using the “CarotenoidsContent_TCCS” variable show that the models achieving the lowest RMSE values for the given data included ridge regression, with an RMSE of 3.361, and partial least squares (widekernelpls and kernelpls), with RMSEs of 3.392 and 3.515, respectively. These values are still fairly high, however, considering the mean value of the “CarotenoidsContent_TCCS” variable (10.67).

Overall, the coefficient of determination (\(R^{2}\)) shows a good fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 18.372 and an \(R^{2}\) of 0.5572.

#Using TCCHPLC variable
res2 = perform_ML(carotAg, models, pred_var = 'TCCHPLC')
res2[order(res2$RMSE),] #ordered by RMSE values
##                                               RMSE Rsquared   RMSESD RsquaredSD
## Partial Least Squares (kernelpls)            5.725   0.5707    4.038     0.3318
## Partial Least Squares (simpls)               5.770   0.5962    3.751     0.3275
## Partial Least Squares (widekernelpls)        5.843   0.5930    3.948     0.3269
## Support Vector Machines (e1071)              5.881   0.5235    3.937     0.3119
## Partial Least Squares (pls)                  5.888   0.5992    4.034     0.3227
## Elastic Net                                  5.899   0.5939    3.557     0.3148
## Ridge Regression (w/ FS)                     6.018   0.6326    4.017     0.3127
## Support Vector Machines (kernlab)            6.263   0.6171    4.362     0.2797
## Linear Regression (w/ Backwards Selection)   6.415   0.4996    3.838     0.3113
## K-Nearest Neighbors                          6.557   0.4233    4.029     0.2852
## Conditional Inference Random Forest          6.715   0.5135    3.936     0.3083
## Ridge Regression                             6.855   0.5322    4.317     0.2988
## Conditional Inference Tree                   7.079   0.4570    3.794     0.2994
## Random Forest                                7.105   0.3762    3.339     0.3058
## Decision Trees                               7.375   0.4819    3.346     0.2853
## Linear Regression (w/ Stepwise Selection)    7.735   0.4760    6.605     0.3510
## Linear Regression (w/ Forward Selection)     8.303   0.4804    6.755     0.2756
## Lasso                                       18.403   0.2409   12.110     0.2671
## Linear Regression                          513.237   0.2688 1649.554     0.2781
mean(get_metadata(carotAg)$TCCHPLC) # TCCHPLC variable mean values
## [1] 10.84

The results using the “TCCHPLC” variable show that, overall, RMSE values increased compared to those obtained with the “CarotenoidsContent_TCCS” variable. The models achieving the lowest RMSE values included partial least squares with the “kernelpls”, “simpls” and “widekernelpls” methods, with RMSEs of 5.725, 5.770 and 5.843, respectively, support vector machines with an RMSE of 5.881 and elastic net with an RMSE of 5.899.

Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 513.237 and an \(R^{2}\) of 0.2688. Lasso regression also performed poorly in this case, with an RMSE of 18.403.

#Using transbetacarotene variable
res3 = perform_ML(carotAg, models, pred_var = 'transbetacarotene')
res3[order(res3$RMSE),] #ordered by RMSE values
##                                               RMSE Rsquared  RMSESD RsquaredSD
## Ridge Regression (w/ FS)                     4.051  0.40198   3.970     0.3178
## Elastic Net                                  4.084  0.42169   4.135     0.3501
## Partial Least Squares (pls)                  4.137  0.45437   4.172     0.3346
## Partial Least Squares (kernelpls)            4.169  0.51105   4.183     0.3267
## Partial Least Squares (simpls)               4.217  0.49752   4.278     0.3177
## Ridge Regression                             4.253  0.32796   4.184     0.3446
## Support Vector Machines (e1071)              4.344  0.42478   4.306     0.3365
## Partial Least Squares (widekernelpls)        4.362  0.42517   4.322     0.3125
## Support Vector Machines (kernlab)            4.389  0.50181   4.218     0.3303
## K-Nearest Neighbors                          4.536  0.22342   4.089     0.2083
## Conditional Inference Random Forest          4.724  0.39563   3.985     0.2772
## Linear Regression (w/ Backwards Selection)   4.918  0.27839   4.177     0.2350
## Conditional Inference Tree                   4.929  0.24248   3.954     0.2621
## Linear Regression (w/ Forward Selection)     5.023  0.34750   4.157     0.3227
## Decision Trees                               5.133  0.08755   4.003     0.1218
## Random Forest                                5.641  0.22644   3.829     0.2584
## Linear Regression (w/ Stepwise Selection)    5.782  0.30538   4.320     0.2974
## Lasso                                       16.450  0.17465  14.823     0.2256
## Linear Regression                          271.132  0.25855 482.988     0.2680
mean(get_metadata(carotAg)$transbetacarotene) # transbetacarotene variable mean values
## [1] 5.897

Trans-beta-carotene concentrations were also used, since this was the carotenoid with the highest concentration levels. The results using the “transbetacarotene” variable show that, overall, RMSE values increased compared to the “CarotenoidsContent_TCCS” variable and decreased compared to the “TCCHPLC” variable. The models achieving the lowest RMSE values included ridge regression (w/ feature selection) with an RMSE of 4.051, elastic net with an RMSE of 4.084 and partial least squares (pls) with an RMSE of 4.137.

Overall, the coefficient of determination shows a poor fit of the predictions to the observations. As in the previous cases, the linear regression model without feature selection showed the worst results, with an RMSE of 271.132 and an \(R^{2}\) of 0.25855. Lasso regression also performed poorly, with an RMSE of 16.450.

All the results above point to better model performance when using the “CarotenoidsContent_TCCS” metadata variable. This was somewhat expected, since these concentrations were calculated automatically from the UV data. However, the variable of greatest interest is “TCCHPLC”, as it corresponds to the concentrations measured by HPLC; it was therefore the variable used in the subsequent analysis.

3.4.2 Variable Importance

For the best models from the previous analysis (using the “TCCHPLC” metadata variable), the variable importance was calculated. Those models were partial least squares, support vector machines and elastic net.

# Partial least squares
varImp1 = train_models_performance(carotAg, c('kernelpls'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Support vector machines
varImp2 = train_models_performance(carotAg, c('svmLinear2'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Elastic Net
varImp3 = train_models_performance(carotAg, c('enet'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Top 20 variables: Partial least squares | Support vector machines | Elastic Net
div = rep(' | ', dim(varImp1$vips[[1]])[1])
cbind(varImp1$vips[[1]], div, varImp2$vips[[1]], div, varImp3$vips[[1]])[1:20,]
##     Overall   Mean div Overall   Mean div Overall   Mean
## 449  100.00 100.00  |   100.00 100.00  |   100.00 100.00
## 448   99.93  99.93  |    99.78  99.78  |    99.78  99.78
## 450   99.76  99.76  |    99.72  99.72  |    99.72  99.72
## 447   99.66  99.66  |    99.41  99.41  |    99.41  99.41
## 446   98.89  98.89  |    99.03  99.03  |    99.03  99.03
## 451   98.31  98.31  |    98.35  98.35  |    98.35  98.35
## 445   97.89  97.89  |    98.20  98.20  |    98.20  98.20
## 452   97.80  97.80  |    97.85  97.85  |    97.85  97.85
## 444   96.43  96.43  |    97.12  97.12  |    97.12  97.12
## 453   96.06  96.06  |    97.08  97.08  |    97.08  97.08
## 443   94.58  94.58  |    95.96  95.96  |    95.96  95.96
## 454   94.45  94.45  |    95.03  95.03  |    95.03  95.03
## 442   92.71  92.71  |    94.72  94.72  |    94.72  94.72
## 455   92.13  92.13  |    93.07  93.07  |    93.07  93.07
## 441   91.43  91.43  |    90.25  90.25  |    90.25  90.25
## 456   90.93  90.93  |    88.18  88.18  |    88.18  88.18
## 440   89.02  89.02  |    87.48  87.48  |    87.48  87.48
## 457   88.53  88.53  |    86.83  86.83  |    86.83  86.83
## 458   87.34  87.34  |    86.16  86.16  |    86.16  86.16
## 439   86.77  86.77  |    85.53  85.53  |    85.53  85.53

The results for variable importance show that the predictors with most impact on the results are the ones around the 450 nm wavelength, with the 449 nm variable being the most important.

3.4.3 Preprocessed Data

The next step consisted in testing the best models from the analysis using the “TCCHPLC” metadata variable (partial least squares, support vector machines and elastic net) on preprocessed versions of the dataset, to see whether model performance improved.

# Partial least squares
res4 = perform_ML_preproc(carotAg, 'kernelpls', 'TCCHPLC')
res4[order(res4$RMSE),] #ordered by RMSE values
##                                       RMSE Rsquared RMSESD RsquaredSD
## Background + Offset cors             5.658   0.5894  4.145     0.3167
## No preprocessing                     5.702   0.6206  3.871     0.3022
## Scaling                              5.727   0.6206  3.960     0.3338
## Smoothing                            5.737   0.5695  3.926     0.3160
## Background cor                       5.758   0.5538  4.043     0.3352
## Background + Offset + Baseline cors  6.007   0.5801  4.181     0.3212
## First Derivative                     6.432   0.4771  3.904     0.3354
## Multiplicative Scatter Cor          11.802   0.2321 12.659     0.2362

Applying the partial least squares model to the preprocessed datasets showed an improvement in model performance when using a combination of background and offset corrections (RMSE 5.658) as preprocessing.

# Support vector Machines
res5 = perform_ML_preproc(carotAg, 'svmLinear2', 'TCCHPLC')
res5[order(res5$RMSE),] #ordered by RMSE values
##                                       RMSE Rsquared RMSESD RsquaredSD
## Smoothing                            5.773   0.6053  4.144     0.2957
## Background + Offset cors             5.936   0.5927  4.040     0.3123
## Background cor                       6.175   0.5956  4.369     0.3174
## No preprocessing                     6.194   0.5581  4.387     0.3185
## Scaling                              6.447   0.5740  4.400     0.3277
## Background + Offset + Baseline cors  9.397   0.4780  6.150     0.3145
## First Derivative                    10.774   0.4482  6.596     0.3153
## Multiplicative Scatter Cor          11.621   0.3245  9.831     0.2649

Applying the support vector machines model to the preprocessed datasets showed an improvement in model performance when applying smoothing interpolation (RMSE 5.773), a combination of background and offset corrections (RMSE 5.936) or background correction alone (RMSE 6.175).

# Elastic Network
res6 = perform_ML_preproc(carotAg, 'enet', 'TCCHPLC')
res6[order(res6$RMSE),] #ordered by RMSE values
##                                      RMSE Rsquared RMSESD RsquaredSD
## Background cor                      5.835   0.5967  3.729     0.3154
## No preprocessing                    5.846   0.6269  3.675     0.3244
## Smoothing                           5.851   0.5999  3.674     0.3150
## Scaling                             5.999   0.5948  3.876     0.3098
## Background + Offset + Baseline cors 6.527   0.4202  3.700     0.3210
## Background + Offset cors            6.587   0.5706  4.147     0.3060
## Multiplicative Scatter Cor          7.459   0.3229  3.915     0.2955
## First Derivative                    7.803   0.5023  5.645     0.3269

Applying the elastic net model to the preprocessed datasets showed an improvement in model performance when using background correction (RMSE 5.835) as the preprocessing method.

3.4.4 Filtered Data

The data was also filtered in order to determine whether feature selection could improve model performance. A flat pattern filter with the inter-quartile range as filter function was applied to the dataset, removing 80%, 60% and 40% of the data in turn.

#Filtering 80% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 80)

res7 = perform_ML(carotAg.filt, models, 'TCCHPLC')
# Results of 80% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res7-res2
res7_2 = cbind(round(res7,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res7_2[order(res7_2$RMSE),]
##                                              RMSE Rsquared RMSESD RsquaredSD  div       RMSE Rsquared
## Ridge Regression (w/ FS)                    4.091   0.7141  1.466     0.2168    |   -1.92679  0.08155
## Support Vector Machines (e1071)             4.962   0.6365  4.039     0.2974    |   -0.91884  0.11295
## Elastic Net                                 5.226   0.6341  3.882     0.3188    |   -0.67341  0.04018
## Ridge Regression                            5.385   0.5996  3.970     0.3215    |   -1.47047  0.06740
## Partial Least Squares (widekernelpls)       5.409   0.5717  3.874     0.3255    |   -0.43347 -0.02129
## Support Vector Machines (kernlab)           5.442   0.5871  4.135     0.3202    |   -0.82049 -0.02999
## Partial Least Squares (kernelpls)           5.458   0.5505  3.731     0.3032    |   -0.26635 -0.02015
## Partial Least Squares (simpls)              5.563   0.5955  3.931     0.3278    |   -0.20714 -0.00070
## Partial Least Squares (pls)                 5.581   0.5904  3.841     0.3477    |   -0.30729 -0.00888
## K-Nearest Neighbors                         6.533   0.4546  3.990     0.2846    |   -0.02334  0.03133
## Linear Regression (w/ Backwards Selection)  6.595   0.5261  3.744     0.3440    |    0.18012  0.02648
## Linear Regression (w/ Forward Selection)    6.616   0.5200  4.185     0.3456    |   -1.68655  0.03959
## Lasso                                       6.645   0.5152  3.633     0.3200    |  -11.75798  0.27432
## Linear Regression (w/ Stepwise Selection)   6.661   0.5144  4.316     0.3592    |   -1.07448  0.03841
## Conditional Inference Random Forest         6.708   0.5213  3.888     0.2878    |   -0.00623  0.00781
## Conditional Inference Tree                  7.073   0.4311  3.440     0.2967    |   -0.00638 -0.02593
## Random Forest                               7.192   0.3895  3.785     0.3076    |    0.08720  0.01329
## Decision Trees                              7.286   0.3926  3.294     0.2937    |   -0.08906 -0.08938
## Linear Regression                          12.534   0.3206  6.213     0.3027    | -500.70380  0.05184

Filtering out 80% of the data showed an overall increase in model performance, with RMSE values decreasing in comparison to the results obtained with the original dataset. It also massively improved the performance of the linear model (without selection), decreasing its RMSE by about 500 units. Ridge regression with feature selection (RMSE 4.091), SVMs (RMSE 4.962) and elastic net (RMSE 5.226) performed best.

#Filtering 60% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 60)

res8 = perform_ML(carotAg.filt, models, 'TCCHPLC')
# Results of 60% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res8-res2
res8_2 = cbind(round(res8,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res8_2[order(res8_2$RMSE),]
##                                               RMSE Rsquared   RMSESD RsquaredSD  div     RMSE Rsquared
## Ridge Regression (w/ FS)                     4.778   0.7225    3.576     0.3232    | -1.24013  0.08987
## Support Vector Machines (e1071)              5.052   0.6195    4.028     0.3103    | -0.82826  0.09602
## Ridge Regression                             5.256   0.6556    3.939     0.2921    | -1.59950  0.12341
## Support Vector Machines (kernlab)            5.314   0.6051    4.127     0.3080    | -0.94894 -0.01192
## Partial Least Squares (widekernelpls)        5.564   0.5908    3.786     0.3001    | -0.27888 -0.00226
## Elastic Net                                  5.600   0.6177    4.073     0.3203    | -0.29891  0.02378
## Partial Least Squares (pls)                  5.675   0.6224    4.027     0.3141    | -0.21307  0.02311
## Partial Least Squares (kernelpls)            5.689   0.5988    3.910     0.3174    | -0.03559  0.02817
## Partial Least Squares (simpls)               5.780   0.5604    4.086     0.3248    |  0.00978 -0.03578
## Linear Regression (w/ Forward Selection)     6.563   0.5485    4.762     0.3251    | -1.74017  0.06807
## K-Nearest Neighbors                          6.588   0.4490    3.729     0.3019    |  0.03182  0.02569
## Conditional Inference Random Forest          6.740   0.5150    3.802     0.3058    |  0.02539  0.00152
## Linear Regression (w/ Stepwise Selection)    6.838   0.4619    4.428     0.3840    | -0.89759 -0.01417
## Linear Regression (w/ Backwards Selection)   6.900   0.5083    6.579     0.3558    |  0.48483  0.00866
## Random Forest                                6.994   0.3906    3.869     0.3156    | -0.11121  0.01434
## Conditional Inference Tree                   7.002   0.4273    3.721     0.2816    | -0.07776 -0.02973
## Decision Trees                               7.120   0.4182    3.494     0.3049    | -0.25446 -0.06377
## Lasso                                       25.806   0.3420   53.673     0.3067    |  7.40305  0.10108
## Linear Regression                          514.449   0.3039 2066.774     0.2799    |  1.21193  0.03508

Filtering out 60% of the data also showed an overall increase in model performance, with RMSE values decreasing in comparison to the results obtained with the original dataset. Here, ridge regression (with and without FS) achieved the best RMSE values, 4.778 and 5.256, respectively. SVMs also performed well, with an RMSE of 5.052.

#Filtering 40% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)

res9 = perform_ML(carotAg.filt, models, 'TCCHPLC')
# Results of 40% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res9-res2
res9_2 = cbind(round(res9,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res9_2[order(res9_2$RMSE),]
##                                               RMSE Rsquared  RMSESD RsquaredSD  div       RMSE Rsquared
## Support Vector Machines (e1071)              5.318   0.6197   3.959     0.2904    |   -0.56264  0.09614
## Elastic Net                                  5.484   0.6062   3.929     0.3314    |   -0.41515  0.01234
## Support Vector Machines (kernlab)            5.499   0.5945   4.116     0.3143    |   -0.76331 -0.02262
## Partial Least Squares (widekernelpls)        5.569   0.5877   4.075     0.3054    |   -0.27335 -0.00535
## Ridge Regression                             5.629   0.6341   4.094     0.3300    |   -1.22593  0.10197
## Partial Least Squares (simpls)               5.742   0.5858   3.951     0.3263    |   -0.02804 -0.01044
## Partial Least Squares (kernelpls)            5.745   0.5919   3.949     0.3217    |    0.02038  0.02125
## Partial Least Squares (pls)                  5.789   0.5840   3.799     0.3187    |   -0.09871 -0.01523
## Ridge Regression (w/ FS)                     5.935   0.5806   4.192     0.3165    |   -0.08308 -0.05201
## Linear Regression (w/ Backwards Selection)   6.211   0.5404   4.019     0.3207    |   -0.20407  0.04072
## K-Nearest Neighbors                          6.463   0.4871   3.995     0.2800    |   -0.09316  0.06384
## Conditional Inference Random Forest          6.646   0.5203   3.887     0.2866    |   -0.06877  0.00679
## Conditional Inference Tree                   6.791   0.4626   3.884     0.2868    |   -0.28859  0.00558
## Decision Trees                               6.817   0.5025   3.911     0.2459    |   -0.55748  0.02055
## Random Forest                                7.016   0.4021   3.453     0.2857    |   -0.08870  0.02585
## Linear Regression (w/ Stepwise Selection)    7.582   0.4614   3.873     0.3747    |   -0.15293 -0.01460
## Linear Regression (w/ Forward Selection)     7.765   0.4322   5.458     0.3186    |   -0.53767 -0.04827
## Lasso                                       15.062   0.2878  13.411     0.2821    |   -3.34068  0.04687
## Linear Regression                          260.984   0.3011 428.663     0.2756    | -252.25366  0.03231

Filtering out 40% of the data showed results similar to the previous case, with an overall increase in model performance and RMSE values decreasing in comparison to the original dataset. Here, the best RMSE values were achieved by SVMs (from the e1071 and kernlab packages), with RMSEs of 5.318 and 5.499, respectively, and elastic net, with an RMSE of 5.484. However, filtering out 80% of the data gave better results than filtering out 60% or 40%.

4 CIELAB Data

A machine learning analysis using the CIELAB data was also performed.

4.1 Create dataset

The CIELAB data is stored in the metadata file, so it first needs to be extracted in order to create the CIELAB dataset.

color.values = t(get_metadata(carotAg)[2:4]) #L a b
filtered.meta = get_metadata(carotAg)[5:12]

carotCielab = create_dataset(datamatrix = color.values, metadata = filtered.meta, label.x = "cielab",
                             label.values = "color values", description = "Dataset from cielab values")
head(carotCielab$data)[,1:12] #Cielab values for first 12 samples
##           101.1  102.1 103.1 105.1  11.1  119.1  123.1  125.1   21.1   23.1   27.1    3.1
## Cielab_L 77.670 85.017 81.25 69.25 83.59 69.510 82.893 68.563 74.113 70.240 83.983 85.717
## Cielab_A -3.397 -3.663 -4.46 -4.95 -3.44 -5.457 -2.123 -4.733 -4.277 -1.437 -2.140 -2.607
## Cielab_B 16.493 18.477 18.49 31.96 16.81 37.693  8.213 36.790 20.107 16.160  8.683 22.017
sum_dataset(carotCielab) # Dataset summary
## Dataset summary:
## Valid dataset
## Description:  Dataset from cielab values 
## Type of data:  undefined 
## Number of samples:  50 
## Number of data points 3 
## Number of metadata variables:  8 
## Label of x-axis values:  cielab 
## Label of data points:  color values 
## Number of missing values in data:  0 
## Mean of data values:  31.99 
## Median of data values:  18.69 
## Standard deviation:  35.84 
## Range of values:  -5.457 88.28 
## Quantiles: 
##     0%    25%    50%    75%   100% 
## -5.457 -3.070 18.685 75.292 88.283

4.2 Machine Learning

The same machine learning models used on the UV dataset were applied to the CIELAB dataset, with the exception of the linear regression models with selection, since it does not make sense to use these when the dataset has only 3 features (the L, a and b values). The metadata variable used for prediction was “TCCHPLC”.

4.2.1 Unprocessed data

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls',
           'widekernelpls', 'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm')

#Using TCCHPLC variable
res10 = perform_ML(carotCielab, models, pred_var = 'TCCHPLC')
# Results w/ CIELAB data and difference to unprocessed UV data results (Two last columns)
diff = res10-res2[-c(17,18,19),]
res10_2 = cbind(round(res10,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res10_2[order(res10_2$RMSE),]
##                                        RMSE Rsquared RMSESD RsquaredSD  div      RMSE Rsquared
## Elastic Net                           6.534   0.4129  2.996     0.2830    |    0.6345 -0.18095
## Support Vector Machines (kernlab)     6.534   0.3662  3.465     0.2739    |    0.2716 -0.25090
## Ridge Regression                      6.584   0.4213  2.862     0.2966    |   -0.2707 -0.11088
## Partial Least Squares (pls)           6.622   0.3946  3.151     0.2600    |    0.7343 -0.20461
## Support Vector Machines (e1071)       6.645   0.3840  3.270     0.3081    |    0.7643 -0.13954
## Ridge Regression (w/ FS)              6.653   0.3895  3.210     0.2809    |    0.6349 -0.24309
## Lasso                                 6.669   0.4110  3.025     0.2985    |  -11.7339  0.17012
## Partial Least Squares (widekernelpls) 6.696   0.3960  3.037     0.3010    |    0.8534 -0.19708
## Linear Regression                     6.749   0.4004  3.195     0.2848    | -506.4886  0.13160
## Partial Least Squares (kernelpls)     6.756   0.4319  3.240     0.2960    |    1.0308 -0.13878
## Partial Least Squares (simpls)        6.789   0.4142  3.212     0.2773    |    1.0188 -0.18205
## Conditional Inference Random Forest   6.930   0.4085  3.318     0.2538    |    0.2157 -0.10503
## K-Nearest Neighbors                   7.278   0.2569  3.355     0.2319    |    0.7210 -0.16636
## Conditional Inference Tree            7.307   0.3842  2.863     0.2451    |    0.2275 -0.07285
## Random Forest                         7.571   0.2938  3.450     0.2716    |    0.4660 -0.08241
## Decision Trees                        7.641   0.3534  3.533     0.2531    |    0.2663 -0.12851

From the results above, it is clear that there is an overall decrease in model performance when using CIELAB data instead of UV data, with increased RMSE values. However, the linear model performed much better than with UV data, with an RMSE of 6.749. Lasso regression also performed better than with UV data, with an RMSE of 6.669. The best performance was achieved by elastic net with an RMSE of 6.534, SVMs with an RMSE of 6.534 and ridge regression with an RMSE of 6.584.

4.2.2 Variable Importance

The variable importance was calculated for the models that achieved the best performance using CIELAB data. These models were elastic net, SVMs and ridge regression.

# Elastic Network
varImp4 = train_models_performance(carotCielab, c('enet'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Support vector Machines
varImp5 = train_models_performance(carotCielab, c('svmLinear'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Ridge Regression
varImp6 = train_models_performance(carotCielab, c('ridge'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Variable Importance: Elastic Network | Support vector machines | Ridge Regression
div = rep(' | ', dim(varImp4$vips[[1]])[1])
cbind(varImp4$vips[[1]], div, varImp5$vips[[1]], div, varImp6$vips[[1]])
##           Overall     Mean div  Overall     Mean div  Overall     Mean
## Cielab_B 100.0000 100.0000  |  100.0000 100.0000  |  100.0000 100.0000
## Cielab_A   0.5596   0.5596  |    0.5596   0.5596  |    0.5596   0.5596
## Cielab_L   0.0000   0.0000  |    0.0000   0.0000  |    0.0000   0.0000

The results for variable importance show that the predictor with most impact on results is the CIELAB b value.

4.2.3 Scaled data

The dataset was then scaled to test whether scaling the CIELAB data could improve the results.

carotCielab.sc = specmine::scaling(carotCielab)
sum_dataset(carotCielab.sc)
## Dataset summary:
## Valid dataset
## Description:  Dataset from cielab values; Scaling with method auto 
## Type of data:  undefined 
## Number of samples:  50 
## Number of data points 3 
## Number of metadata variables:  8 
## Label of x-axis values:  cielab 
## Label of data points:  color values 
## Number of missing values in data:  0 
## Mean of data values:  1.49e-16 
## Median of data values:  0.06326 
## Standard deviation:  0.9933 
## Range of values:  -2.187 3.695 
## Quantiles: 
##       0%      25%      50%      75%     100% 
## -2.18663 -0.49244  0.06326  0.52084  3.69515
res11 = perform_ML(carotCielab.sc, models, pred_var = 'TCCHPLC')
# Results w/ scaled CIELAB data and difference to unprocessed CIELAB data results (Two last columns)
diff = res11-res10
res11_10 = cbind(round(res11,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res11_10[order(res11_10$RMSE),]
##                                        RMSE Rsquared RMSESD RsquaredSD  div     RMSE Rsquared
## Support Vector Machines (e1071)       6.467   0.3800  3.116     0.2806    | -0.17772 -0.00393
## Support Vector Machines (kernlab)     6.523   0.4093  3.305     0.3068    | -0.01158  0.04316
## Partial Least Squares (widekernelpls) 6.535   0.3935  3.074     0.2944    | -0.16054 -0.00248
## Partial Least Squares (simpls)        6.548   0.4175  3.007     0.3027    | -0.24076  0.00334
## Ridge Regression (w/ FS)              6.564   0.4289  3.360     0.2752    | -0.08906  0.03941
## Elastic Net                           6.587   0.4468  3.505     0.3012    |  0.05361  0.03392
## Lasso                                 6.598   0.4303  3.294     0.3135    | -0.07043  0.01932
## Ridge Regression                      6.730   0.3962  2.966     0.2905    |  0.14568 -0.02512
## Partial Least Squares (pls)           6.752   0.3746  3.159     0.2902    |  0.12935 -0.02000
## Linear Regression                     6.765   0.3883  2.937     0.3284    |  0.01591 -0.01207
## Partial Least Squares (kernelpls)     6.832   0.3547  3.091     0.2944    |  0.07681 -0.07723
## Conditional Inference Random Forest   6.881   0.3756  3.707     0.2443    | -0.04892 -0.03283
## Conditional Inference Tree            7.237   0.3911  3.131     0.2537    | -0.06968  0.00691
## K-Nearest Neighbors                   7.431   0.2521  3.508     0.2673    |  0.15381 -0.00479
## Decision Trees                        7.682   0.3380  3.518     0.2466    |  0.04137 -0.01538
## Random Forest                         7.703   0.2899  3.546     0.2518    |  0.13271 -0.00385

Applying the machine learning models to the scaled CIELAB data showed mixed results, with model performance increasing or decreasing depending on the model used. These changes were, however, small.

5 UV and CIELAB Data Fusion

A machine learning analysis using fused UV and CIELAB data was also performed.

5.1 Create dataset

Two fused datasets were created: one using the UV data with 80% of the features filtered out and another using the full UV data.

# Not filtered
carot.fus = low_level_fusion(list(carotAg, carotCielab))
sum_dataset(carot.fus)
## Dataset summary:
## Valid dataset
## Description:  Data integration from types: uvv-spectra,undefined 
## Type of data:  integrated-data 
## Number of samples:  50 
## Number of data points 104 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  1.148 
## Median of data values:  0.1881 
## Standard deviation:  8.069 
## Range of values:  -5.457 88.28 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## -5.4567  0.1335  0.1881  0.2673 88.2833
# 80% data filtered
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 80)
carot.fus.filt = low_level_fusion(list(carotAg.filt, carotCielab))
sum_dataset(carot.fus.filt)
## Dataset summary:
## Valid dataset
## Description:  Data integration from types: uvv-spectra,undefined 
## Type of data:  integrated-data 
## Number of samples:  50 
## Number of data points 23 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  4.43 
## Median of data values:  0.2416 
## Standard deviation:  16.75 
## Range of values:  -5.457 88.28 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## -5.4567  0.1893  0.2416  0.3700 88.2833

5.2 Machine Learning

The same machine learning models applied to the UV dataset were used for the UV and CIELAB fusion datasets. The metadata variable used for prediction was “TCCHPLC”.

5.2.1 Unprocessed data

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
           'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')

# Using unfiltered dataset
res12 = perform_ML(carot.fus, models, pred_var = 'TCCHPLC')
# Results w/ unfiltered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res12-res2
res12_2 = cbind(round(res12,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res12_2[order(res12_2$RMSE),]
##                                               RMSE Rsquared   RMSESD RsquaredSD  div     RMSE Rsquared
## Ridge Regression (w/ FS)                     5.981   0.5781    3.832     0.3119    | -0.03728 -0.05444
## Partial Least Squares (kernelpls)            6.010   0.5263    3.661     0.3106    |  0.28552 -0.04438
## Partial Least Squares (widekernelpls)        6.031   0.4448    3.841     0.3220    |  0.18847 -0.14820
## Partial Least Squares (simpls)               6.082   0.5032    3.759     0.3277    |  0.31264 -0.09297
## Elastic Net                                  6.160   0.6031    3.551     0.3314    |  0.26108  0.00920
## Partial Least Squares (pls)                  6.187   0.4834    3.661     0.3238    |  0.29893 -0.11580
## Support Vector Machines (kernlab)            6.299   0.5249    4.232     0.2700    |  0.03584 -0.09221
## Support Vector Machines (e1071)              6.379   0.5477    4.273     0.3121    |  0.49848  0.02420
## Linear Regression (w/ Backwards Selection)   6.385   0.5419    3.888     0.3149    | -0.03043  0.04222
## Conditional Inference Random Forest          6.531   0.5158    3.902     0.2892    | -0.18345  0.00233
## Conditional Inference Tree                   6.923   0.4351    3.737     0.2969    | -0.15601 -0.02197
## Random Forest                                7.114   0.3527    3.431     0.2833    |  0.00927 -0.02354
## K-Nearest Neighbors                          7.355   0.2622    3.415     0.2168    |  0.79798 -0.16101
## Decision Trees                               7.789   0.3427    3.084     0.2420    |  0.41456 -0.13923
## Linear Regression (w/ Stepwise Selection)    8.052   0.5295    4.423     0.3233    |  0.31701  0.05349
## Linear Regression (w/ Forward Selection)     8.279   0.4734    7.324     0.3164    | -0.02357 -0.00705
## Ridge Regression                             8.469   0.4723    4.898     0.2712    |  1.61379 -0.05990
## Lasso                                       18.784   0.2545   12.304     0.2825    |  0.38157  0.01365
## Linear Regression                          548.940   0.2992 1430.369     0.2976    | 35.70205  0.03036

The machine learning analysis with the unprocessed fusion data showed a decrease in model performance, with an overall increase in RMSE values compared to the unprocessed UV data results. The best performance was achieved by the ridge regression model (with selection), with an RMSE of 5.981.

5.2.2 Variable Importance

The variable importance was calculated for the models that achieved the best performance using the unprocessed fusion data. These models were ridge regression, partial least squares and elastic net.

# Ridge Regression
varImp7 = train_models_performance(carot.fus, c('foba'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Partial Least Squares
varImp8 = train_models_performance(carot.fus, c('kernelpls'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Elastic Network
varImp9 = train_models_performance(carot.fus, c('enet'), 'TCCHPLC', "repeatedcv", 
                                      num.folds = 5, compute.varimp = T)
# Variable Importance: Ridge Regression | Partial Least Squares | Elastic Network
div = rep(' | ', dim(varImp7$vips[[1]])[1])
cbind(varImp7$vips[[1]], div, varImp8$vips[[1]], div, varImp9$vips[[1]])
##           Overall     Mean div   Overall      Mean div  Overall     Mean
## 473      100.0000 100.0000  |  100.00000 100.00000  |  100.0000 100.0000
## 474       99.9113  99.9113  |   61.82867  61.82867  |   99.9113  99.9113
## 472       99.8868  99.8868  |   46.83905  46.83905  |   99.8868  99.8868
## 475       99.7654  99.7654  |    5.41013   5.41013  |   99.7654  99.7654
## 471       99.6128  99.6128  |    5.38756   5.38756  |   99.6128  99.6128
## 476       99.3395  99.3395  |    5.36948   5.36948  |   99.3395  99.3395
## 470       99.2810  99.2810  |    5.35150   5.35150  |   99.2810  99.2810
## 469       99.1404  99.1404  |    5.31719   5.31719  |   99.1404  99.1404
## 477       98.8480  98.8480  |    5.29151   5.29151  |   98.8480  98.8480
## 468       98.8306  98.8306  |    5.26227   5.26227  |   98.8306  98.8306
## 467       98.3802  98.3802  |    5.21587   5.21587  |   98.3802  98.3802
## 478       98.0079  98.0079  |    5.19450   5.19450  |   98.0079  98.0079
## 466       97.8873  97.8873  |    5.19340   5.19340  |   97.8873  97.8873
## 465       97.2259  97.2259  |    5.18352   5.18352  |   97.2259  97.2259
## 464       96.0949  96.0949  |    5.14655   5.14655  |   96.0949  96.0949
## 463       95.2659  95.2659  |    5.12585   5.12585  |   95.2659  95.2659
## 459       94.9843  94.9843  |    5.10804   5.10804  |   94.9843  94.9843
## 460       94.7241  94.7241  |    5.10741   5.10741  |   94.7241  94.7241
## 458       94.4573  94.4573  |    5.06545   5.06545  |   94.4573  94.4573
## 462       94.2039  94.2039  |    5.05908   5.05908  |   94.2039  94.2039
## 480       94.1510  94.1510  |    5.05102   5.05102  |   94.1510  94.1510
## 479       94.0085  94.0085  |    5.02047   5.02047  |   94.0085  94.0085
## 457       93.9935  93.9935  |    5.01932   5.01932  |   93.9935  93.9935
## 486       93.6229  93.6229  |    4.97254   4.97254  |   93.6229  93.6229
## 487       93.5485  93.5485  |    4.93916   4.93916  |   93.5485  93.5485
## 488       93.5467  93.5467  |    4.92974   4.92974  |   93.5467  93.5467
## 456       93.5197  93.5197  |    4.87376   4.87376  |   93.5197  93.5197
## 481       93.4256  93.4256  |    4.86100   4.86100  |   93.4256  93.4256
## 489       93.3707  93.3707  |    4.83570   4.83570  |   93.3707  93.3707
## 455       93.2616  93.2616  |    4.83495   4.83495  |   93.2616  93.2616
## 461       93.0875  93.0875  |    4.78544   4.78544  |   93.0875  93.0875
## 454       92.8529  92.8529  |    4.74143   4.74143  |   92.8529  92.8529
## 453       92.7929  92.7929  |    4.72591   4.72591  |   92.7929  92.7929
## 452       92.7572  92.7572  |    4.71433   4.71433  |   92.7572  92.7572
## 448       92.7186  92.7186  |    4.68942   4.68942  |   92.7186  92.7186
## 446       92.6654  92.6654  |    4.68730   4.68730  |   92.6654  92.6654
## 451       92.6466  92.6466  |    4.63035   4.63035  |   92.6466  92.6466
## 447       92.5967  92.5967  |    4.62820   4.62820  |   92.5967  92.5967
## 449       92.5944  92.5944  |    4.62320   4.62320  |   92.5944  92.5944
## 450       92.5356  92.5356  |    4.62091   4.62091  |   92.5356  92.5356
## 445       92.5343  92.5343  |    4.57829   4.57829  |   92.5343  92.5343
## 444       92.3814  92.3814  |    4.52236   4.52236  |   92.3814  92.3814
## 482       92.2752  92.2752  |    4.46326   4.46326  |   92.2752  92.2752
## 443       92.0251  92.0251  |    4.36791   4.36791  |   92.0251  92.0251
## 442       91.5698  91.5698  |    4.24604   4.24604  |   91.5698  91.5698
## 485       91.3638  91.3638  |    4.14840   4.14840  |   91.3638  91.3638
## 441       91.2927  91.2927  |    4.07951   4.07951  |   91.2927  91.2927
## 490       91.2712  91.2712  |    3.97698   3.97698  |   91.2712  91.2712
## 484       91.2403  91.2403  |    3.87960   3.87960  |   91.2403  91.2403
## 483       91.2153  91.2153  |    3.77792   3.77792  |   91.2153  91.2153
## 440       91.1322  91.1322  |    3.63267   3.63267  |   91.1322  91.1322
## 439       90.8556  90.8556  |    3.61297   3.61297  |   90.8556  90.8556
## 438       90.2867  90.2867  |    3.48471   3.48471  |   90.2867  90.2867
## 491       90.1094  90.1094  |    3.39727   3.39727  |   90.1094  90.1094
## 437       89.7082  89.7082  |    3.30467   3.30467  |   89.7082  89.7082
## 436       89.2866  89.2866  |    3.27905   3.27905  |   89.2866  89.2866
## 435       88.9328  88.9328  |    3.14163   3.14163  |   88.9328  88.9328
## 434       88.5372  88.5372  |    3.11541   3.11541  |   88.5372  88.5372
## 433       87.9509  87.9509  |    3.07926   3.07926  |   87.9509  87.9509
## 432       87.6793  87.6793  |    3.04597   3.04597  |   87.6793  87.6793
## 431       87.3251  87.3251  |    3.01064   3.01064  |   87.3251  87.3251
## 430       87.0876  87.0876  |    2.97654   2.97654  |   87.0876  87.0876
## 429       86.4354  86.4354  |    2.94674   2.94674  |   86.4354  86.4354
## 428       85.8934  85.8934  |    2.93409   2.93409  |   85.8934  85.8934
## 427       85.6262  85.6262  |    2.91394   2.91394  |   85.6262  85.6262
## 492       85.5259  85.5259  |    2.85108   2.85108  |   85.5259  85.5259
## 426       85.2602  85.2602  |    2.84014   2.84014  |   85.2602  85.2602
## 425       84.8398  84.8398  |    2.73474   2.73474  |   84.8398  84.8398
## 424       84.5834  84.5834  |    2.72985   2.72985  |   84.5834  84.5834
## 423       83.9714  83.9714  |    2.65158   2.65158  |   83.9714  83.9714
## 422       83.7231  83.7231  |    2.53016   2.53016  |   83.7231  83.7231
## 419       83.1661  83.1661  |    2.51231   2.51231  |   83.1661  83.1661
## 421       82.9022  82.9022  |    2.41503   2.41503  |   82.9022  82.9022
## 418       82.7248  82.7248  |    2.24439   2.24439  |   82.7248  82.7248
## 417       82.1483  82.1483  |    2.20749   2.20749  |   82.1483  82.1483
## 420       82.0810  82.0810  |    2.12849   2.12849  |   82.0810  82.0810
## 416       81.4623  81.4623  |    2.01512   2.01512  |   81.4623  81.4623
## 415       80.7710  80.7710  |    1.93270   1.93270  |   80.7710  80.7710
## 414       79.2893  79.2893  |    1.84125   1.84125  |   79.2893  79.2893
## 493       79.0456  79.0456  |    1.83911   1.83911  |   79.0456  79.0456
## 413       77.9810  77.9810  |    1.67628   1.67628  |   77.9810  77.9810
## 412       76.4233  76.4233  |    1.65685   1.65685  |   76.4233  76.4233
## 411       75.0567  75.0567  |    1.59053   1.59053  |   75.0567  75.0567
## 495       73.4449  73.4449  |    1.49947   1.49947  |   73.4449  73.4449
## 410       73.4330  73.4330  |    1.47414   1.47414  |   73.4330  73.4330
## 409       72.5276  72.5276  |    1.44002   1.44002  |   72.5276  72.5276
## 494       72.4653  72.4653  |    1.31672   1.31672  |   72.4653  72.4653
## 408       71.2844  71.2844  |    1.31045   1.31045  |   71.2844  71.2844
## 407       70.0532  70.0532  |    1.27632   1.27632  |   70.0532  70.0532
## 496       69.4530  69.4530  |    1.17053   1.17053  |   69.4530  69.4530
## 406       68.1506  68.1506  |    1.14647   1.14647  |   68.1506  68.1506
## 497       68.1173  68.1173  |    1.01843   1.01843  |   68.1173  68.1173
## 405       67.1213  67.1213  |    0.88107   0.88107  |   67.1213  67.1213
## 404       66.0282  66.0282  |    0.73976   0.73976  |   66.0282  66.0282
## 403       65.3844  65.3844  |    0.64795   0.64795  |   65.3844  65.3844
## Cielab_B  65.2941  65.2941  |    0.52680   0.52680  |   65.2941  65.2941
## 402       64.1510  64.1510  |    0.47331   0.47331  |   64.1510  64.1510
## 401       63.5632  63.5632  |    0.37289   0.37289  |   63.5632  63.5632
## 499       62.9532  62.9532  |    0.32512   0.32512  |   62.9532  62.9532
## 400       62.5822  62.5822  |    0.26140   0.26140  |   62.5822  62.5822
## 498       61.4938  61.4938  |    0.21537   0.21537  |   61.4938  61.4938
## 500       59.9483  59.9483  |    0.12386   0.12386  |   59.9483  59.9483
## Cielab_A   0.3654   0.3654  |    0.08284   0.08284  |    0.3654   0.3654
## Cielab_L   0.0000   0.0000  |    0.00000   0.00000  |    0.0000   0.0000

The variable importance results show that the predictors with the greatest impact are those around the 475 nm wavelength, the most important variable being the one corresponding to the 473 nm wavelength.
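For reference, importance scores like the ones in the table above can be extracted from any fitted caret model with caret::varImp. The sketch below is illustrative only and assumes a trained model object named fit (a hypothetical stand-in, not the exact call used to produce the table above).

library(caret)
# Scaled importance (0-100) per predictor of a fitted caret model 'fit'
# ('fit' is a hypothetical object name, not part of the original script)
imp = caret::varImp(fit, scale = TRUE)$importance
# Ten predictors (wavelengths / colour attributes) with the highest importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)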

5.2.3 Filtered data

# Using dataset w/ 80% data filtered
res13 = perform_ML(carot.fus.filt, models, pred_var = 'TCCHPLC')
# Results w/ 80% filtered fusion data and difference to unprocessed UV data results (last two columns)
diff = res13-res2
res13_2 = cbind(round(res13,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res13_2[order(res13_2$RMSE),]
##                                              RMSE Rsquared RMSESD RsquaredSD  div       RMSE Rsquared
## Support Vector Machines (e1071)             5.230   0.5699  4.256     0.3325    |   -0.65073  0.04635
## Support Vector Machines (kernlab)           5.664   0.5736  3.989     0.3041    |   -0.59864 -0.04344
## Ridge Regression (w/ FS)                    5.822   0.4950  3.831     0.3096    |   -0.19622 -0.13759
## Elastic Net                                 6.055   0.5412  3.650     0.3327    |    0.15572 -0.05270
## Linear Regression (w/ Backwards Selection)  6.129   0.5769  3.624     0.3497    |   -0.28577  0.07728
## Ridge Regression                            6.149   0.5420  4.097     0.3355    |   -0.70643  0.00981
## Linear Regression (w/ Stepwise Selection)   6.167   0.5045  3.682     0.3638    |   -1.56792  0.02849
## Partial Least Squares (kernelpls)           6.178   0.3924  3.729     0.3416    |    0.45283 -0.17825
## Partial Least Squares (widekernelpls)       6.372   0.4839  3.778     0.3133    |    0.52905 -0.10908
## Partial Least Squares (pls)                 6.405   0.3725  3.519     0.3126    |    0.51683 -0.22677
## Partial Least Squares (simpls)              6.461   0.4239  3.712     0.3357    |    0.69104 -0.17228
## Conditional Inference Random Forest         6.562   0.5141  4.113     0.3170    |   -0.15211  0.00064
## Linear Regression (w/ Forward Selection)    6.799   0.5017  4.095     0.3203    |   -1.50357  0.02121
## Random Forest                               6.859   0.3502  3.401     0.2760    |   -0.24582 -0.02600
## Conditional Inference Tree                  7.044   0.4405  3.755     0.3167    |   -0.03502 -0.01660
## K-Nearest Neighbors                         7.529   0.2429  3.231     0.1996    |    0.97217 -0.18039
## Decision Trees                              7.632   0.3346  3.318     0.2554    |    0.25704 -0.14733
## Lasso                                       8.481   0.3257  3.866     0.3053    |   -9.92112  0.08476
## Linear Regression                          17.403   0.3234 12.997     0.3136    | -495.83481  0.05456

The machine learning analysis with the filtered fusion data showed an overall increase in model performance compared with the results obtained with the unprocessed UV data. The best performance was achieved by support vector machines (e1071 package), with an RMSE of 5.230.
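As a point of reference for what "80% data filtered" means in practice, the lines below sketch a flat-pattern filter in base R that keeps only the 20% of predictors with the largest interquartile range. This is purely illustrative: the actual filtering was performed earlier in the script on the specmine dataset object, and the matrix name X and the IQR criterion are assumptions.

# Illustrative flat-pattern filter: drop the flattest 80% of predictors by IQR.
# 'X' is a hypothetical samples-by-wavelengths matrix, not an object from this script.
iqr_per_var = apply(X, 2, IQR, na.rm = TRUE)
keep = iqr_per_var >= quantile(iqr_per_var, probs = 0.80)
X_filt = X[, keep, drop = FALSE]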

5.2.4 Scaled data

Both the filtered and unfiltered datasets were scaled, and the machine learning models were applied to these scaled datasets.
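As a rough illustration of what the scaling step amounts to, the snippet below autoscales a hypothetical predictor matrix X in base R, i.e. each variable is mean-centred and divided by its standard deviation. It is assumed here that specmine::scaling with its default settings performs an equivalent column-wise standardisation.

# Column-wise autoscaling (z-scores) of a hypothetical matrix 'X';
# assumed to approximate specmine::scaling() with default settings.
X_sc = scale(X, center = TRUE, scale = TRUE)
round(colMeans(X_sc), 10)   # ~0 for every variable
apply(X_sc, 2, sd)          # ~1 for every variable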

# Using unfiltered dataset
carot.fus.sc = specmine::scaling(carot.fus)
res14 = perform_ML(carot.fus.sc, models, pred_var = 'TCCHPLC')
# Results w/ unfiltered scaled fusion data and difference to unprocessed UV data results (last two columns)
diff = res14-res2
res14_2 = cbind(round(res14,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res14_2[order(res14_2$RMSE),]
##                                               RMSE Rsquared   RMSESD RsquaredSD  div     RMSE Rsquared
## Ridge Regression (w/ FS)                     5.555   0.5875    3.367     0.2981    | -0.46293 -0.04509
## Partial Least Squares (kernelpls)            5.673   0.5851    3.874     0.3297    | -0.05204  0.01440
## Partial Least Squares (widekernelpls)        5.676   0.5805    3.954     0.3146    | -0.16619 -0.01253
## Partial Least Squares (simpls)               5.686   0.6008    3.830     0.3193    | -0.08341  0.00459
## Partial Least Squares (pls)                  5.779   0.6015    3.980     0.3180    | -0.10955  0.00229
## Elastic Net                                  6.089   0.5704    3.595     0.3142    |  0.18974 -0.02349
## Support Vector Machines (kernlab)            6.134   0.5299    4.064     0.3265    | -0.12848 -0.08722
## Conditional Inference Random Forest          6.230   0.5282    4.081     0.3023    | -0.48430  0.01466
## Support Vector Machines (e1071)              6.239   0.5059    3.966     0.3103    |  0.35813 -0.01765
## Linear Regression (w/ Backwards Selection)   6.370   0.5005    4.118     0.3030    | -0.04483  0.00089
## K-Nearest Neighbors                          6.615   0.4235    3.751     0.2764    |  0.05843  0.00026
## Conditional Inference Tree                   6.844   0.4447    3.666     0.3091    | -0.23569 -0.01229
## Linear Regression (w/ Stepwise Selection)    6.895   0.4522    3.937     0.3530    | -0.84007 -0.02384
## Random Forest                                7.185   0.3555    3.497     0.3137    |  0.07982 -0.02071
## Decision Trees                               7.450   0.3575    3.637     0.2773    |  0.07547 -0.12447
## Linear Regression (w/ Forward Selection)     7.872   0.4280    5.514     0.3271    | -0.43100 -0.05244
## Ridge Regression                             9.473   0.5319    5.808     0.2654    |  2.61802 -0.00024
## Lasso                                       19.002   0.2275   10.383     0.2497    |  0.59964 -0.01341
## Linear Regression                          522.783   0.2867 1791.650     0.2795    |  9.54573  0.01786

The machine learning analysis with the scaled fusion data showed mixed results, with model performance increasing or decreasing relative to the unprocessed UV data results depending on the model. The best performance was achieved by the ridge regression model with feature selection, with an RMSE of 5.555.
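An alternative, if caret were called directly instead of through the perform_ML wrapper, would be to let caret centre and scale the predictors inside each resampling fold via its preProcess argument, so that the scaling parameters are never estimated on held-out samples. The sketch below is a generic example; X and y are hypothetical stand-ins for the fusion predictors and the TCCHPLC response.

library(caret)
# Centre and scale inside each cross-validation fold (illustrative only;
# 'X' and 'y' are hypothetical stand-ins, not objects from this script)
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 3)
fit_pls = train(x = X, y = y, method = "pls",
                preProcess = c("center", "scale"), trControl = ctrl)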

# Using dataset w/ 80% data filtered
carot.fus.filt.sc = specmine::scaling(carot.fus.filt)
res15 = perform_ML(carot.fus.filt.sc, models, pred_var = 'TCCHPLC')
# Results w/ 80% filtered and scaled fusion data and difference to unprocessed UV data results (last two columns)
diff = res15-res2
res15_2 = cbind(round(res15,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res15_2[order(res15_2$RMSE),]
##                                              RMSE Rsquared RMSESD RsquaredSD  div       RMSE Rsquared
## Support Vector Machines (e1071)             5.290   0.5863  3.964     0.3226    |   -0.59023  0.06274
## Support Vector Machines (kernlab)           5.621   0.5269  4.179     0.3141    |   -0.64153 -0.09021
## Partial Least Squares (kernelpls)           5.701   0.6101  3.906     0.3054    |   -0.02333  0.03941
## Partial Least Squares (widekernelpls)       5.800   0.5723  3.962     0.3333    |   -0.04211 -0.02070
## Ridge Regression (w/ FS)                    5.815   0.5014  3.707     0.3298    |   -0.20282 -0.13115
## Partial Least Squares (pls)                 5.865   0.6040  3.732     0.3115    |   -0.02286  0.00478
## Partial Least Squares (simpls)              5.904   0.5834  4.005     0.3077    |    0.13434 -0.01284
## Elastic Net                                 6.014   0.5010  3.539     0.3167    |    0.11476 -0.09284
## Ridge Regression                            6.121   0.5446  3.839     0.3252    |   -0.73378  0.01247
## Linear Regression (w/ Stepwise Selection)   6.172   0.4696  3.845     0.3402    |   -1.56280 -0.00646
## Linear Regression (w/ Backwards Selection)  6.190   0.5517  3.752     0.3522    |   -0.22557  0.05202
## Linear Regression (w/ Forward Selection)    6.332   0.4665  3.747     0.3544    |   -1.97084 -0.01390
## K-Nearest Neighbors                         6.594   0.4208  3.685     0.2841    |    0.03708 -0.00249
## Conditional Inference Random Forest         6.660   0.4963  3.681     0.2929    |   -0.05507 -0.01724
## Random Forest                               6.886   0.3775  3.619     0.2873    |   -0.21827  0.00129
## Conditional Inference Tree                  6.934   0.4319  3.666     0.2686    |   -0.14497 -0.02518
## Decision Trees                              7.531   0.3532  3.451     0.2668    |    0.15612 -0.12877
## Lasso                                       8.349   0.3603  3.512     0.3103    |  -10.05353  0.11944
## Linear Regression                          17.405   0.2767 10.207     0.2588    | -495.83278  0.00788

Using the filtered and scaled fusion data resulted in an overall increase in model performance compared with the unprocessed UV data results. The best performance was achieved by support vector machines (e1071 package), with an RMSE of 5.290.
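Before summarising, a quick way to pull the best-performing model out of each fusion result table computed above (res13 = filtered, res14 = scaled, res15 = filtered + scaled) is sketched below; this is an optional convenience, not part of the original analysis.

# Best model per fusion pre-processing variant (illustrative helper)
best_of = function(res, label) {
  top = res[which.min(res$RMSE), ]
  data.frame(Dataset = label, Model = rownames(top),
             RMSE = round(top$RMSE, 3), Rsquared = round(top$Rsquared, 3))
}
rbind(best_of(res13, 'Fusion, filtered'),
      best_of(res14, 'Fusion, scaled'),
      best_of(res15, 'Fusion, filtered + scaled'))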

6 Results Summary

UV Data:

CIELAB Data:

Fusion Data: