1 Introduction

The aim of this work is to validate a quantification method for carotenoid content in roots of M. esculenta from colorimetric data using the CIE L*a*b* system. The underlying assumption is that predictive statistical techniques and machine learning can correlate colorimetric data, which are easily obtained in the field, with the contents obtained through traditional quantification techniques such as UV-visible spectrophotometry or HPLC, and from this build prediction models of carotenoid content for this type of biomass.

Roots of fifty M. esculenta genotypes belonging to EPAGRI’s germplasm bank were sampled in the 2014/2015 season. Carotenoids were extracted from fresh roots, and the absorbances of the organosolvent extracts were collected on a UV-visible spectrophotometer over a spectral window from 200 to 700 nm. Aliquots (10 µl) of the extracts were also injected into a liquid chromatograph. The color attributes of the samples were measured with a colorimeter and the results were expressed according to the CIELAB color space scale.

2 Necessary tools

To run this script, the following packages are necessary:

library(specmine)
library(xlsx)

Setting the working directory and the random seed:

setwd("C:/Users/Telma/Desktop/CassavaCarotenoids")
set.seed(12345)

2.1 Used Models

The machine learning models used in this analysis are listed in the table below. These belong to the caret package, which is used by specmine.

Table 1 - Machine learning models used in this analysis. The first column shows the model’s name, the second column shows the value to pass as the “method” argument and the third column indicates whether or not the model has built-in feature selection. For more information on any of the models visit https://topepo.github.io/caret/available-models.html
Model “Method” Value Built-in Feature Selection
Conditional Inference Random Forest cforest YES
Conditional Inference Tree ctree YES
Decision Trees rpart YES
Elastic Net enet YES
K-Nearest Neighbors knn NO
Lasso Regression lasso YES
Linear Regression lm NO
Linear Regression (w/ Backwards Selection) leapBackward YES
Linear Regression (w/ Forward Selection) leapForward YES
Linear Regression (w/ Stepwise Selection) leapSeq YES
Partial Least Squares kernelpls, pls, simpls, widekernelpls YES
Random Forest rf YES
Ridge Regression ridge NO
Ridge Regression (w/ Feature Selection) foba YES
Support Vector Machines (kernlab package) svmLinear NO
Support Vector Machines (e1071 package) svmLinear2 NO
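
As an illustration (a minimal sketch, assuming the carotAg dataset object created later in Section 3.3), any of the “Method” values in the table can be passed directly to specmine’s train_models_performance function, which hands it to caret:

ml_res = train_models_performance(carotAg, c("pls"), "TCCHPLC", "repeatedcv",
                                  num.folds = 5, compute.varimp = F)
ml_res$performance   # RMSE, Rsquared and the respective standard deviations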

2.2 Auxiliary functions

The following function is used to retrieve the model name given the “method” value.

getModelName <- function(model) {
  if (model == 'lasso') name = 'Lasso'
  else if (model == 'ridge') name = 'Ridge Regression'
  else if (model == 'foba') name = 'Ridge Regression (w/ FS)'
  else if (model == 'rf') name = 'Random Forest'
  else if (model == 'cforest') name = 'Conditional Inference Random Forest'
  else if (model == 'enet') name = 'Elastic Net'
  else if (model == 'pls') name = 'Partial Least Squares (pls)'
  else if (model == 'kernelpls') name = 'Partial Least Squares (kernelpls)'
  else if (model == 'simpls') name = 'Partial Least Squares (simpls)'
  else if (model == 'widekernelpls') name = 'Partial Least Squares (widekernelpls)'
  else if (model == 'rpart') name = 'Decision Trees'
  else if (model == 'ctree') name = 'Conditional Inference Tree'
  else if (model == 'svmLinear') name = 'Support Vector Machines (kernlab)'
  else if (model == 'svmLinear2') name = 'Support Vector Machines (e1071)'
  else if (model == 'knn') name = 'K-Nearest Neighbors'
  else if (model == 'lm') name = 'Linear Regression'
  else if (model == 'leapBackward') name = 'Linear Regression (w/ Backwards Selection)'
  else if (model == 'leapForward') name = 'Linear Regression (w/ Forward Selection)'
  else if (model == 'leapSeq') name = 'Linear Regression (w/ Stepwise Selection)'
  else return()
  return (name)
}
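
For example, given one of the “Method” values from Table 1:

getModelName('foba')
## [1] "Ridge Regression (w/ FS)"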

The following function returns a data frame with the result of applying one or more machine learning models to a selected dataset. The metadata variable for prediction must be supplied.

perform_ML <- function(dataset, models, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
  for (model in models) {
    name = getModelName(model)
    ml_res = train_models_performance(dataset, c(model), pred_var, "repeatedcv", 
                                      num.folds = 5, compute.varimp = F)
    res[name,] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared, 
                   ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    assign('res', res, envir = .GlobalEnv)
  }
  return(res)
}
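
A brief usage sketch (the analyses below use the full list of models, carotAg is the dataset created in Section 3.3, and res_example is just an illustrative name):

res_example = perform_ML(carotAg, c('pls', 'rf'), pred_var = 'TCCHPLC')
res_example   # one row per model with RMSE, Rsquared and their standard deviations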

The following function returns a data frame with the results of applying a machine learning model to a dataset after each of several preprocessing methods: scaling, smoothing interpolation, background, offset and baseline corrections, first derivative and multiplicative scatter correction. The metadata variable for prediction must be supplied.

perform_ML_preproc <- function(dataset, model, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
  
  ds.sc = specmine::scaling(dataset)
  ds.wavelens = get_x_values_as_num(dataset)
  x.axis.sm = seq(min(ds.wavelens), max(ds.wavelens),10)
  ds.smooth = smoothing_interpolation(dataset, method = "loess", x.axis = x.axis.sm)
  ds.bg = data_correction(dataset, 'background')
  ds.offset = data_correction(ds.bg, 'offset')
  ds.baseline = data_correction(ds.offset, 'baseline')
  ds.fd = first_derivative(dataset)
  ds.msc = msc_correction(dataset)
  
  datasets = list('No preprocessing' = dataset, 'Scaling' = ds.sc, 'Smoothing' = ds.smooth, 
                  'Background cor' = ds.bg, 'Background + Offset cors' = ds.offset, 
                  'Background + Offset + Baseline cors' = ds.baseline, 'First Derivative' = ds.fd,
                  'Multiplicative Scatter Cor' = ds.msc)
  i = 1
  for (ds in datasets) {
    ml_res = train_models_performance(ds, c(model), pred_var, "repeatedcv", num.folds = 5, compute.varimp = F)
    res[names(datasets)[i],] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
                                 ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    assign('res', res, envir = .GlobalEnv)
    i = i + 1
  }
  return(res)
}

3 UV Data

3.1 Read data from xlsx files

UV data is stored in 150 .xlsx files (3 replicates for each of the 50 genotypes), each file containing the absorbance values read between 200 and 700 nm.

files = list.files("data/UV")
datamat = matrix(nrow = 501, ncol = length(files))
rownames(datamat) = 200:700   #data recorded between 200-700nm
colnames(datamat) = gsub("\\.xlsx?$", "", files)

for (i in 1:length(files)){
  tab_excel = read.xlsx(paste("data/UV/", files[i], sep = ""), sheetIndex = 1, header = F)
  datamat[,i] = c(tab_excel[,2], rep(NA, 501-length(tab_excel[,2]))) 
}

datamat[1:6, 1:6]
##       101.1  101.2   101.3   102.1  102.2   102.3
## 200 0.08763 0.1863 0.10565 0.10565 0.1482 0.13221
## 201 0.09468 0.2184 0.13756 0.12944 0.1254 0.08732
## 202 0.06238 0.1792 0.08410 0.09159 0.1437 0.09159
## 203 0.11513 0.1776 0.13093 0.13497 0.1190 0.07799
## 204 0.11364 0.2038 0.05227 0.11364 0.1376 0.08368
## 205 0.13941 0.1820 0.10809 0.09691 0.1006 0.10809

3.2 Read metadata

Besides information regarding sample varieties and replicates, the metadata file also contains information about HPLC concentration measurements and CIELAB data.

file.metadata = "metadata/Carotenoides_Colorimetria.csv"
metadata = read_metadata(file.metadata)
description = "UV data for cassava cultivars - carotenoids"
label.x = "Wavelength"
label.values = "Absorbance"

head(metadata)
##     Varieties Replicates Cielab_L Cielab_A Cielab_B CarotenoidsContent_TCCS  Lutein Betacryptoxanthin
## 3.1         3          1    85.72    -2.70    22.28                   4.853 0.03248           0.06543
## 3.2         3          2    86.18    -2.48    21.39                   4.809 0.03248           0.06543
## 3.3         3          3    85.25    -2.64    22.38                   4.951 0.03248           0.06543
## 5.1         5          1    85.47    -1.76     6.74                   3.098 0.02598           0.07023
## 5.2         5          2    82.29    -2.00     7.02                   4.046 0.02598           0.07023
## 5.3         5          3    84.99    -1.86     7.25                   3.383 0.02598           0.07023
##     Alphacarotene Cisbetacarotene transbetacarotene Lycopene TCCHPLC
## 3.1       0.06021           2.250             3.269        0   5.678
## 3.2       0.06021           2.250             3.269        0   5.678
## 3.3       0.06021           2.250             3.269        0   5.678
## 5.1       0.08319           2.679             2.860        0   5.719
## 5.2       0.08319           2.679             2.860        0   5.719
## 5.3       0.08319           2.679             2.860        0   5.719

3.3 Create the dataset

After creating a matrix from the UV .xlsx files and reading the metadata, a dataset can be easily created.

Carotenoides_Colorimetria = create_dataset(type = "uvv-spectra", datamatrix = datamat, metadata = metadata, 
                                           label.x = label.x, label.values = label.values, 
                                           description = description)

sum_dataset(Carotenoides_Colorimetria)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids 
## Type of data:  uvv-spectra 
## Number of samples:  150 
## Number of data points 501 
## Number of metadata variables:  13 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  4224 
## Mean of data values:  0.3301 
## Median of data values:  0.1048 
## Standard deviation:  0.6824 
## Range of values:  -0.06964 4.191 
## Quantiles: 
##       0%      25%      50%      75%     100% 
## -0.06964  0.02003  0.10478  0.23166  4.19051

Because the majority of carotenoids exhibit absorption in the visible region of the spectrum, between 400 and 500 nm, a subset of the original dataset was created, keeping only the values within this wavelength interval. Also, because the dataset has some missing values, as can be seen in the summary above, missing values were replaced with the mean of the corresponding variables’ values.

carot_sub = subset_x_values_by_interval(Carotenoides_Colorimetria, 400, 500) # Absorbances between 400-500nm
carot_sub_nomissing = missingvalues_imputation(carot_sub, method = "mean")
sum_dataset(carot_sub_nomissing)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids; Missing value imputation with method mean 
## Type of data:  uvv-spectra 
## Number of samples:  150 
## Number of data points 101 
## Number of metadata variables:  13 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  0.2316 
## Median of data values:  0.187 
## Standard deviation:  0.1907 
## Range of values:  -0.002721 1.574 
## Quantiles: 
##        0%       25%       50%       75%      100% 
## -0.002721  0.130033  0.186963  0.261674  1.574271

The data was then aggregated so that there is a single sample per genotype instead of three replicates (150 samples -> 50 samples).

indexes = rep(seq(1, num_samples(carot_sub_nomissing)/3), each = 3)
carotAg = aggregate_samples(carot_sub_nomissing, indexes, meta.to.remove = c("Replicates"))
sum_dataset(carotAg)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids; Missing value imputation with method mean 
## Type of data:  uvv-spectra 
## Number of samples:  50 
## Number of data points 101 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  0.2316 
## Median of data values:  0.1871 
## Standard deviation:  0.188 
## Range of values:  0.00136 1.299 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## 0.00136 0.13380 0.18708 0.26038 1.29949

The dataset is now ready to be used in the subsequent analysis.

3.4 Machine Learning

The next step consisted in applying a variety of machine learning regression approaches to the data, testing different output variables and various preprocessing methods.

3.4.1 Testing Output Variables

To test model performance for the prediction of carotenoid content, the machine learning models mentioned above were applied to the created dataset using different output variables. The evaluation metric chosen to compare model performance was the Root-Mean-Square Error (RMSE), since it explicitly shows how much the model predictions deviate, on average, from the actual values in the dataset.
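
For reference, with observed values \(y_{i}\), predictions \(\hat{y}_{i}\) and \(n\) samples,

\[ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^{2}} \]

The reported \(R^{2}\) follows caret’s default definition (the squared correlation between observed and predicted values), and the RMSESD and RsquaredSD columns give the standard deviation of each metric across the cross-validation resamples.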

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
           'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')

#Using CarotenoidsContent_TCCS variable
res1 = perform_ML(carotAg, models, pred_var = 'CarotenoidsContent_TCCS') 
res1[order(res1$RMSE),] #ordered by RMSE values
##                                              RMSE Rsquared RMSESD RsquaredSD
## Ridge Regression                            3.361   0.9453  2.606    0.06067
## Partial Least Squares (widekernelpls)       3.392   0.9392  2.419    0.05604
## Partial Least Squares (kernelpls)           3.515   0.9498  2.366    0.04934
## Partial Least Squares (simpls)              3.563   0.9293  2.649    0.12503
## Linear Regression (w/ Backwards Selection)  3.587   0.8794  2.809    0.17217
## Elastic Net                                 3.750   0.9244  3.028    0.12689
## Partial Least Squares (pls)                 3.824   0.9238  2.884    0.15686
## Ridge Regression (w/ FS)                    3.826   0.9353  2.642    0.09911
## Random Forest                               3.838   0.9696  2.215    0.03146
## Support Vector Machines (e1071)             3.860   0.9205  3.112    0.15125
## Support Vector Machines (kernlab)           4.342   0.9228  3.345    0.14426
## Linear Regression (w/ Forward Selection)    4.355   0.8581  3.628    0.21425
## Linear Regression (w/ Stepwise Selection)   4.761   0.8179  4.176    0.24028
## K-Nearest Neighbors                         5.245   0.8721  3.902    0.15636
## Lasso                                       5.369   0.8270  4.485    0.23804
## Conditional Inference Random Forest         6.764   0.7787  3.095    0.12982
## Conditional Inference Tree                  7.552   0.6522  3.576    0.19633
## Decision Trees                              7.647   0.6644  3.198    0.19817
## Linear Regression                          18.372   0.5572 31.137    0.34458
mean(get_metadata(carotAg)$CarotenoidsContent_TCCS) # CarotenoidsContent_TCCS variable mean values
## [1] 10.67

The results using the “CarotenoidsContent_TCCS” variable show that the models that achieved the lowest RMSE values for the given data included ridge regression with an RMSE of 3.361 and partial least squares (“widekernelpls” and “kernelpls” methods) with RMSEs of 3.392 and 3.515, respectively. These errors are still considerable, however, given that the mean value of the “CarotenoidsContent_TCCS” variable is 10.67.

Overall, the coefficient of determination (\(R^{2}\)) shows a good fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 18.372 and an \(R^{2}\) of 0.5572.

#Using TCCHPLC variable
res2 = perform_ML(carotAg, models, pred_var = 'TCCHPLC')
res2[order(res2$RMSE),] #ordered by RMSE values
##                                               RMSE Rsquared   RMSESD RsquaredSD
## Partial Least Squares (kernelpls)            5.725   0.5707    4.038     0.3318
## Partial Least Squares (simpls)               5.770   0.5962    3.751     0.3275
## Partial Least Squares (widekernelpls)        5.843   0.5930    3.948     0.3269
## Support Vector Machines (e1071)              5.881   0.5235    3.937     0.3119
## Partial Least Squares (pls)                  5.888   0.5992    4.034     0.3227
## Elastic Net                                  5.899   0.5939    3.557     0.3148
## Ridge Regression (w/ FS)                     6.018   0.6326    4.017     0.3127
## Support Vector Machines (kernlab)            6.263   0.6171    4.362     0.2797
## Linear Regression (w/ Backwards Selection)   6.415   0.4996    3.838     0.3113
## K-Nearest Neighbors                          6.557   0.4233    4.029     0.2852
## Conditional Inference Random Forest          6.715   0.5135    3.936     0.3083
## Ridge Regression                             6.855   0.5322    4.317     0.2988
## Conditional Inference Tree                   7.079   0.4570    3.794     0.2994
## Random Forest                                7.105   0.3762    3.339     0.3058
## Decision Trees                               7.375   0.4819    3.346     0.2853
## Linear Regression (w/ Stepwise Selection)    7.735   0.4760    6.605     0.3510
## Linear Regression (w/ Forward Selection)     8.303   0.4804    6.755     0.2756
## Lasso                                       18.403   0.2409   12.110     0.2671
## Linear Regression                          513.237   0.2688 1649.554     0.2781
mean(get_metadata(carotAg)$TCCHPLC) # TCCHPLC variable mean values
## [1] 10.84

The results using the “TCCHPLC” variable show that, overall, RMSE values increased compared to those obtained with the “CarotenoidsContent_TCCS” variable. The models that achieved the lowest RMSE values for the given data included partial least squares with methods “kernelpls”, “simpls” and “widekernelpls”, with RMSEs of 5.725, 5.770 and 5.843, respectively, support vector machines (e1071 package) with an RMSE of 5.881 and elastic net with an RMSE of 5.899.

Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 513.237 and an \(R^{2}\) of 0.2688. Lasso regression also performed poorly in this case, with an RMSE of 18.403.

#Using transbetacarotene variable
res3 = perform_ML(carotAg, models, pred_var = 'transbetacarotene')
res3[order(res3$RMSE),] #ordered by RMSE values
##                                               RMSE Rsquared  RMSESD RsquaredSD
## Ridge Regression (w/ FS)                     4.051  0.40198   3.970     0.3178
## Elastic Net                                  4.084  0.42169   4.135     0.3501
## Partial Least Squares (pls)                  4.137  0.45437   4.172     0.3346
## Partial Least Squares (kernelpls)            4.169  0.51105   4.183     0.3267
## Partial Least Squares (simpls)               4.217  0.49752   4.278     0.3177
## Ridge Regression                             4.253  0.32796   4.184     0.3446
## Support Vector Machines (e1071)              4.344  0.42478   4.306     0.3365
## Partial Least Squares (widekernelpls)        4.362  0.42517   4.322     0.3125
## Support Vector Machines (kernlab)            4.389  0.50181   4.218     0.3303
## K-Nearest Neighbors                          4.536  0.22342   4.089     0.2083
## Conditional Inference Random Forest          4.724  0.39563   3.985     0.2772
## Linear Regression (w/ Backwards Selection)   4.918  0.27839   4.177     0.2350
## Conditional Inference Tree                   4.929  0.24248   3.954     0.2621
## Linear Regression (w/ Forward Selection)     5.023  0.34750   4.157     0.3227
## Decision Trees                               5.133  0.08755   4.003     0.1218
## Random Forest                                5.641  0.22644   3.829     0.2584
## Linear Regression (w/ Stepwise Selection)    5.782  0.30538   4.320     0.2974
## Lasso                                       16.450  0.17465  14.823     0.2256
## Linear Regression                          271.132  0.25855 482.988     0.2680
mean(get_metadata(carotAg)$transbetacarotene) # transbetacarotene variable mean values
## [1] 5.897

Transbetacarotene concentrations were also used, considering that this was the carotenoid with the highest concentration levels. The results using the “transbetacarotene” variable show that, overall, RMSE values increased compared to those obtained with the “CarotenoidsContent_TCCS” variable and decreased compared to those obtained with the “TCCHPLC” variable. The models that achieved the lowest RMSE values for the given data included ridge regression (w/ feature selection) with an RMSE of 4.051, elastic net with an RMSE of 4.084 and partial least squares (“pls” method) with an RMSE of 4.137.

Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, as in the previous cases, with an RMSE of 271.132 and an \(R^{2}\) of 0.25855. Lasso regression also performed poorly in this case, with an RMSE of 16.450.

All the results above point to better model performance when using the “CarotenoidsContent_TCCS” metadata variable. This was somewhat expected, since these concentrations were calculated from the UV data using the Lambert-Beer formula. However, in this report the variable used in the subsequent analysis is “transbetacarotene”.
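
For reference, the Lambert-Beer law relates the measured absorbance \(A\) to the analyte concentration \(c\) through the molar absorption coefficient \(\varepsilon\) and the optical path length \(l\):

\[ A = \varepsilon \, l \, c \quad \Rightarrow \quad c = \frac{A}{\varepsilon \, l} \]

Since the “CarotenoidsContent_TCCS” values are a direct linear function of the same absorbances that make up the UV dataset, it is natural that they are easier to predict from these spectra than the HPLC-derived variables.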

3.4.2 Variable Importance

For the best models from the previous analysis (when using the “transbetacarotene” metadata variable) the variable importance was calculated. Those models were ridge regression (w/ feature selection), elastic net and partial least squares (“pls” method).

# Ridge regression (w/ feature selection)
varImp1 = train_models_performance(carotAg, c('foba'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Elastic Network
varImp2 = train_models_performance(carotAg, c('enet'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Partial Least Squares
varImp3 = train_models_performance(carotAg, c('pls'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Top 20 variables: Ridge regression | Elastic Network | Partial Least Squares
div = rep(' | ', dim(varImp1$vips[[1]])[1])
cbind(varImp1$vips[[1]], div, varImp2$vips[[1]], div, varImp3$vips[[1]])[1:20,]
##     Overall   Mean div Overall   Mean div Overall   Mean
## 472  100.00 100.00  |   100.00 100.00  |   100.00 100.00
## 471   99.94  99.94  |    99.94  99.94  |    95.04  95.04
## 473   99.91  99.91  |    99.91  99.91  |    90.35  90.35
## 469   99.12  99.12  |    99.12  99.12  |    83.18  83.18
## 474   99.12  99.12  |    99.12  99.12  |    78.01  78.01
## 470   98.68  98.68  |    98.68  98.68  |    70.01  70.01
## 468   97.77  97.77  |    97.77  97.77  |    64.82  64.82
## 475   97.64  97.64  |    97.64  97.64  |    57.09  57.09
## 467   96.74  96.74  |    96.74  96.74  |    52.61  52.61
## 479   96.65  96.65  |    96.65  96.65  |    46.24  46.24
## 466   96.03  96.03  |    96.03  96.03  |    42.95  42.95
## 476   95.35  95.35  |    95.35  95.35  |    39.93  39.93
## 480   94.55  94.55  |    94.55  94.55  |    39.60  39.60
## 465   94.05  94.05  |    94.05  94.05  |    39.44  39.44
## 477   93.81  93.81  |    93.81  93.81  |    38.59  38.59
## 464   91.31  91.31  |    91.31  91.31  |    38.44  38.44
## 481   91.24  91.24  |    91.24  91.24  |    38.17  38.17
## 478   90.00  90.00  |    90.00  90.00  |    37.42  37.42
## 463   89.46  89.46  |    89.46  89.46  |    37.15  37.15
## 462   87.80  87.80  |    87.80  87.80  |    36.76  36.76

The results for variable importance show that the predictors with the most impact are the ones around the 470 nm wavelength, the most important variable being the one corresponding to the 472 nm wavelength.

3.4.3 Preprocessed Data

The next step consisted in testing the best models from the analysis using the “transbetacarotene” metadata variable, namely ridge regression (w/ feature selection), elastic net and partial least squares (“pls” method), on preprocessed datasets, to see whether model performance improved.

# Ridge regression (w/ feature selection)
res4 = perform_ML_preproc(carotAg, 'foba', 'transbetacarotene')
res4[order(res4$RMSE),] #ordered by RMSE values
##                                      RMSE Rsquared RMSESD RsquaredSD
## First Derivative                    3.651   0.4247  1.035     0.3939
## Background + Offset cors            4.258   0.4539  4.212     0.3544
## Background cor                      4.369   0.3611  4.164     0.3330
## Scaling                             4.389   0.3468  4.193     0.3247
## No preprocessing                    4.393   0.3600  4.205     0.3266
## Smoothing                           4.464   0.3418  4.025     0.3048
## Background + Offset + Baseline cors 4.754   0.3985  4.250     0.3097
## Multiplicative Scatter Cor          6.296   0.3188  5.128     0.2791

Applying the ridge regression (w/ feature selection) model to the preprocessed datasets showed improved model performance when using the first derivative (RMSE 3.651), a combination of background and offset corrections (RMSE 4.258), background correction alone (RMSE 4.369) and scaling (RMSE 4.389) as preprocessing methods.

# Elastic network
res5 = perform_ML_preproc(carotAg, 'enet', 'transbetacarotene')
res5[order(res5$RMSE),] #ordered by RMSE values
##                                      RMSE Rsquared RMSESD RsquaredSD
## Scaling                             4.100   0.4513  4.156     0.3341
## No preprocessing                    4.109   0.4158  3.895     0.3545
## Background cor                      4.148   0.4497  4.201     0.3329
## Smoothing                           4.164   0.4310  4.124     0.3389
## Background + Offset cors            4.383   0.4407  3.904     0.3346
## Background + Offset + Baseline cors 4.574   0.3391  4.051     0.3081
## First Derivative                    6.705   0.3191  5.300     0.3265
## Multiplicative Scatter Cor          8.377   0.1888  6.880     0.2420

Applying the elastic net model to the preprocessed datasets showed improved model performance only when scaling the dataset (RMSE 4.100).

# Partial Least Squares
res6 = perform_ML_preproc(carotAg, 'pls', 'transbetacarotene')
res6[order(res6$RMSE),] #ordered by RMSE values
##                                      RMSE Rsquared RMSESD RsquaredSD
## No preprocessing                    4.353   0.4189  4.093     0.3050
## Smoothing                           4.376   0.4165  4.190     0.2794
## Background cor                      4.402   0.3936  4.211     0.2926
## Background + Offset cors            4.407   0.3973  4.171     0.3016
## Scaling                             4.441   0.4079  3.973     0.2910
## Background + Offset + Baseline cors 4.558   0.3881  4.252     0.2776
## First Derivative                    5.026   0.3081  4.175     0.2547
## Multiplicative Scatter Cor          5.866   0.2485  4.658     0.2991

Applying the partial least squares model to the preprocessed dataset showed no improvement in model performance when using any of the preprocessing methods.

3.4.4 Filtered Data

The data was also filtered in order to determine whether feature selection could improve model performance. A flat pattern filter with the inter-quartile range as filter function was applied to the dataset, filtering out 80%, 60% and 40% of the variables in turn.
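
Conceptually, the filter computes a variability measure (here the inter-quartile range) for each variable across samples and discards the requested percentage of variables with the flattest profiles. A simplified sketch of the 80% case, not the specmine implementation, assuming the dataset structure used throughout this report (variables as rows of $data):

iqr.vals = apply(carotAg$data, 1, IQR)         # one IQR value per wavelength
keep = iqr.vals >= quantile(iqr.vals, 0.80)    # keep the ~20% most variable wavelengths
filtered.data = carotAg$data[keep, , drop = FALSE]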

#Filtering 80% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 80)

res7 = perform_ML(carotAg.filt, models, 'transbetacarotene')
# Results of 80% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res7-res3
res7_3 = cbind(round(res7,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res7_3[order(res7_3$RMSE),]
##                                              RMSE Rsquared RMSESD RsquaredSD  div       RMSE Rsquared
## Ridge Regression (w/ FS)                    3.216  0.50101  2.629    0.30427    |   -0.83539  0.09902
## Support Vector Machines (e1071)             4.156  0.47259  4.311    0.32108    |   -0.18735  0.04782
## Support Vector Machines (kernlab)           4.310  0.46897  4.385    0.32987    |   -0.07832 -0.03285
## Elastic Net                                 4.627  0.43365  3.942    0.30884    |    0.54239  0.01196
## Partial Least Squares (widekernelpls)       4.635  0.39989  4.164    0.30880    |    0.27252 -0.02529
## Ridge Regression                            4.635  0.39935  4.224    0.31891    |    0.38230  0.07138
## Partial Least Squares (simpls)              4.656  0.47520  3.997    0.26854    |    0.43893 -0.02232
## Partial Least Squares (pls)                 4.679  0.42499  4.203    0.28594    |    0.54156 -0.02938
## K-Nearest Neighbors                         4.685  0.26368  4.333    0.23468    |    0.14939  0.04026
## Partial Least Squares (kernelpls)           4.703  0.41374  4.229    0.27793    |    0.53325 -0.09732
## Conditional Inference Random Forest         4.805  0.41761  4.109    0.30500    |    0.08018  0.02198
## Conditional Inference Tree                  4.950  0.00729  3.819    0.00791    |    0.02176 -0.23519
## Lasso                                       5.001  0.25549  3.893    0.28308    |  -11.44858  0.08084
## Decision Trees                              5.120  0.06824  3.938    0.07566    |   -0.01369 -0.01931
## Linear Regression (w/ Backwards Selection)  5.258  0.21770  3.944    0.24775    |    0.34047 -0.06068
## Linear Regression (w/ Stepwise Selection)   5.375  0.29623  3.941    0.29042    |   -0.40775 -0.00915
## Linear Regression (w/ Forward Selection)    5.381  0.30270  4.051    0.29856    |    0.35851 -0.04480
## Random Forest                               5.658  0.15240  3.857    0.20071    |    0.01747 -0.07404
## Linear Regression                          11.379  0.22750  5.501    0.25548    | -259.75350 -0.03106

Filtering 80% of the data showed mixed results, with model performance increasing or decreasing depending on the model, in comparison to the results obtained with the original dataset. However, it massively improved the performance of the linear model (without selection), decreasing its RMSE by about 259 units. Ridge regression (w/ FS) (RMSE 3.216), the SVMs (RMSE 4.156 and 4.310) and elastic net (RMSE 4.627) had the best performance.

#Filtering 60% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 60)

res8 = perform_ML(carotAg.filt, models, 'transbetacarotene')
# Results of 60% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res8-res3
res8_3 = cbind(round(res8,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res8_3[order(res8_3$RMSE),]
##                                               RMSE Rsquared   RMSESD RsquaredSD  div      RMSE Rsquared
## Ridge Regression (w/ FS)                     3.857  0.40095    3.927    0.30102    |  -0.19381 -0.00103
## Support Vector Machines (kernlab)            4.191  0.47469    4.254    0.32990    |  -0.19764 -0.02712
## Support Vector Machines (e1071)              4.303  0.45114    4.458    0.30532    |  -0.04048  0.02636
## K-Nearest Neighbors                          4.550  0.22615    4.276    0.21575    |   0.01407  0.00273
## Elastic Net                                  4.569  0.45515    4.015    0.31017    |   0.48495  0.03346
## Ridge Regression                             4.579  0.39625    4.131    0.29045    |   0.32670  0.06829
## Partial Least Squares (widekernelpls)        4.630  0.43855    4.302    0.30985    |   0.26802  0.01337
## Partial Least Squares (kernelpls)            4.632  0.44113    4.178    0.31995    |   0.46241 -0.06992
## Partial Least Squares (pls)                  4.643  0.38746    4.143    0.26302    |   0.50542 -0.06691
## Conditional Inference Random Forest          4.663  0.38400    4.019    0.26385    |  -0.06145 -0.01163
## Partial Least Squares (simpls)               4.711  0.45436    4.168    0.30475    |   0.49394 -0.04316
## Decision Trees                               5.018  0.17522    3.767    0.18968    |  -0.11492  0.08767
## Conditional Inference Tree                   5.044  0.01171    3.868    0.00634    |   0.11572 -0.23077
## Linear Regression (w/ Stepwise Selection)    5.116  0.23261    4.053    0.26651    |  -0.66645 -0.07276
## Linear Regression (w/ Backwards Selection)   5.353  0.31545    4.897    0.28876    |   0.43548  0.03706
## Linear Regression (w/ Forward Selection)     5.640  0.39462    4.559    0.34223    |   0.61763  0.04712
## Random Forest                                5.951  0.19547    3.799    0.23954    |   0.30999 -0.03097
## Lasso                                       12.325  0.28306   19.120    0.34952    |  -4.12458  0.10841
## Linear Regression                          498.929  0.37062 1422.767    0.33383    | 227.79654  0.11206

Filtering 60% of the data also showed mixed results, with model performance increasing or decreasing depending on the model, in comparison to the results obtained with the original dataset. Here, ridge regression (w/ FS) had the best performance, with an RMSE of 3.857. The SVM models also performed well, with RMSEs of 4.191 and 4.303 for the kernlab and e1071 packages, respectively.

#Filtering 40% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)

res9 = perform_ML(carotAg.filt, models, 'transbetacarotene')
# Results of 40% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res9-res3
res9_3 = cbind(round(res9,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res9_3[order(res9_3$RMSE),]
##                                               RMSE Rsquared  RMSESD RsquaredSD  div      RMSE Rsquared
## Support Vector Machines (kernlab)            4.118  0.49265   4.381    0.33108    |  -0.27066 -0.00916
## Support Vector Machines (e1071)              4.252  0.49738   4.466    0.32575    |  -0.09194  0.07261
## Ridge Regression (w/ FS)                     4.268  0.39253   4.138    0.29393    |   0.21669 -0.00945
## Elastic Net                                  4.488  0.43447   3.832    0.32264    |   0.40366  0.01278
## Ridge Regression                             4.532  0.31906   4.139    0.29795    |   0.27883 -0.00890
## Partial Least Squares (widekernelpls)        4.567  0.44318   4.115    0.31425    |   0.20509  0.01801
## K-Nearest Neighbors                          4.573  0.22989   4.163    0.23887    |   0.03643  0.00647
## Partial Least Squares (kernelpls)            4.587  0.43197   4.240    0.28393    |   0.41756 -0.07909
## Partial Least Squares (pls)                  4.603  0.43710   4.121    0.30858    |   0.46592 -0.01727
## Partial Least Squares (simpls)               4.609  0.46393   4.116    0.33188    |   0.39225 -0.03358
## Conditional Inference Random Forest          4.791  0.41726   4.082    0.26592    |   0.06683  0.02163
## Conditional Inference Tree                   4.820  0.02649   3.947    0.03402    |  -0.10838 -0.21599
## Decision Trees                               5.106  0.16324   3.745    0.18865    |  -0.02708  0.07569
## Linear Regression (w/ Forward Selection)     5.138  0.38590   4.598    0.30334    |   0.11586  0.03840
## Linear Regression (w/ Backwards Selection)   5.177  0.30938   4.294    0.32440    |   0.25971  0.03099
## Random Forest                                5.737  0.21476   4.189    0.25318    |   0.09633 -0.01168
## Linear Regression (w/ Stepwise Selection)    6.096  0.33878   4.658    0.33445    |   0.31336  0.03341
## Lasso                                       10.649  0.30848   7.090    0.30926    |  -5.80082  0.13383
## Linear Regression                          203.134  0.36460 307.890    0.31247    | -67.99830  0.10605

Filtering 40% of the data showed similar results to the previous cases, with model performance increasing or decreasing depending on the model, in comparison to the results obtained with the original dataset. Here, the best RMSE values were achieved by the SVMs (kernlab and e1071 packages), with RMSEs of 4.118 and 4.252, respectively, and ridge regression (w/ FS), with an RMSE of 4.268.

4 CIELAB Data

A machine learning analysis using the CIELAB data was also performed.

4.1 Create dataset

The CIELAB data is stored in the metadata file. Therefore, it needs to be extracted first in order to create the CIELAB dataset.

color.values = t(get_metadata(carotAg)[2:4]) #L a b
filtered.meta = get_metadata(carotAg)[5:12]

carotCielab = create_dataset(datamatrix = color.values, metadata = filtered.meta, label.x = "cielab",
                             label.values = "color values", description = "Dataset from cielab values")
head(carotCielab$data)[,1:12] #Cielab values for first 12 samples
##           101.1  102.1 103.1 105.1  11.1  119.1  123.1  125.1   21.1   23.1   27.1    3.1
## Cielab_L 77.670 85.017 81.25 69.25 83.59 69.510 82.893 68.563 74.113 70.240 83.983 85.717
## Cielab_A -3.397 -3.663 -4.46 -4.95 -3.44 -5.457 -2.123 -4.733 -4.277 -1.437 -2.140 -2.607
## Cielab_B 16.493 18.477 18.49 31.96 16.81 37.693  8.213 36.790 20.107 16.160  8.683 22.017
sum_dataset(carotCielab) # Dataset summary
## Dataset summary:
## Valid dataset
## Description:  Dataset from cielab values 
## Type of data:  undefined 
## Number of samples:  50 
## Number of data points 3 
## Number of metadata variables:  8 
## Label of x-axis values:  cielab 
## Label of data points:  color values 
## Number of missing values in data:  0 
## Mean of data values:  31.99 
## Median of data values:  18.69 
## Standard deviation:  35.84 
## Range of values:  -5.457 88.28 
## Quantiles: 
##     0%    25%    50%    75%   100% 
## -5.457 -3.070 18.685 75.292 88.283

4.2 Machine Learning

The same machine learning models used for the UV dataset were applied to the CIELAB dataset, with the exception of the linear regression models with feature selection, since it does not make sense to use these when the dataset has only 3 features (the L, a and b values). The metadata variable used for prediction was “transbetacarotene”.

4.2.1 Unprocessed data

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls',
           'widekernelpls', 'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm')

#Using transbetacarotene variable
res10 = perform_ML(carotCielab, models, pred_var = 'transbetacarotene')
# Results w/ CIELAB data and difference to unprocessed UV data results (Two last columns)
diff = res10-res3[-c(17,18,19),]
res10_3 = cbind(round(res10,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res10_3[order(res10_3$RMSE),]
##                                        RMSE Rsquared RMSESD RsquaredSD  div       RMSE Rsquared
## Partial Least Squares (widekernelpls) 4.551   0.2794  3.847    0.26530    |    0.18909 -0.14573
## Partial Least Squares (pls)           4.667   0.2446  3.881    0.25617    |    0.52949 -0.20981
## Conditional Inference Random Forest   4.667   0.2066  3.872    0.18505    |   -0.05693 -0.18902
## Partial Least Squares (simpls)        4.731   0.2371  4.058    0.24897    |    0.51431 -0.26037
## Partial Least Squares (kernelpls)     4.785   0.2278  3.883    0.23562    |    0.61592 -0.28330
## Elastic Net                           4.787   0.1840  3.882    0.21587    |    0.70242 -0.23774
## Ridge Regression (w/ FS)              4.802   0.2020  3.739    0.21105    |    0.75108 -0.20000
## Lasso                                 4.826   0.1539  4.069    0.18745    |  -11.62383 -0.02078
## Support Vector Machines (e1071)       4.829   0.1506  3.974    0.20949    |    0.48553 -0.27414
## Support Vector Machines (kernlab)     4.878   0.2043  4.162    0.24571    |    0.48898 -0.29747
## Ridge Regression                      4.886   0.1774  3.817    0.20406    |    0.63283 -0.15055
## Conditional Inference Tree            4.934   0.1105  3.699    0.09806    |    0.00557 -0.13193
## Linear Regression                     4.937   0.2424  3.570    0.18929    | -266.19504 -0.01620
## K-Nearest Neighbors                   4.997   0.2036  3.861    0.20608    |    0.46096 -0.01987
## Decision Trees                        5.015   0.2880  3.661    0.23662    |   -0.11834  0.20042
## Random Forest                         5.148   0.1532  3.714    0.17119    |   -0.49215 -0.07326

From the results above it is clear that there is an overall decrease in model performance when using CIELAB data in comparison to UV data, with increased RMSE values. However, the linear model performed much better than in the UV data case, with an RMSE of 4.937. Lasso regression also performed better than with the UV data, with an RMSE of 4.826. The best performance was achieved by partial least squares (“widekernelpls” and “pls” methods), with RMSEs of 4.551 and 4.667, respectively, and conditional inference random forest, with an RMSE of 4.667.

4.2.2 Variable Importance

The variable importance was calculated for the models that achieved the best performance using CIELAB data. These models were partial least squares (“widekernelpls” and “pls” methods) and conditional inference random forest.

# Partial Least Squares ("widekernelpls")
varImp4 = train_models_performance(carotCielab, c('widekernelpls'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Partial Least Squares ("pls")
varImp5 = train_models_performance(carotCielab, c('pls'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Conditional inference random forest
varImp6 = train_models_performance(carotCielab, c('cforest'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Variable Importance: PLS ("widekernelpls") | PLS ("pls") | Conditional inference random forest
div = rep(' | ', dim(varImp4$vips[[1]])[1])
cbind(varImp4$vips[[1]], div, varImp5$vips[[1]], div, varImp6$vips[[1]])
##          Overall   Mean div Overall   Mean div Overall    Mean
## Cielab_B  100.00 100.00  |   100.00 100.00  |  100.000 100.000
## Cielab_L   58.19  58.19  |    58.19  58.19  |    9.722   9.722
## Cielab_A    0.00   0.00  |     0.00   0.00  |    0.000   0.000

The results for variable importance show that the predictor with most impact on results is the CIELAB b value.

4.2.3 Scaled data

The dataset was then scaled to test whether scaling the CIELAB data could improve results.

carotCielab.sc = specmine::scaling(carotCielab)
sum_dataset(carotCielab.sc)
## Dataset summary:
## Valid dataset
## Description:  Dataset from cielab values; Scaling with method auto 
## Type of data:  undefined 
## Number of samples:  50 
## Number of data points 3 
## Number of metadata variables:  8 
## Label of x-axis values:  cielab 
## Label of data points:  color values 
## Number of missing values in data:  0 
## Mean of data values:  1.49e-16 
## Median of data values:  0.06326 
## Standard deviation:  0.9933 
## Range of values:  -2.187 3.695 
## Quantiles: 
##       0%      25%      50%      75%     100% 
## -2.18663 -0.49244  0.06326  0.52084  3.69515
res11 = perform_ML(carotCielab.sc, models, pred_var = 'transbetacarotene')
# Results w/ scaled CIELAB data and difference to unprocessed CIELAB data results (Two last columns)
diff = res11-res10
res11_10 = cbind(round(res11,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res11_10[order(res11_10$RMSE),]
##                                        RMSE Rsquared RMSESD RsquaredSD  div     RMSE Rsquared
## Elastic Net                           4.690   0.2121  3.741     0.2559    | -0.09717  0.02811
## Support Vector Machines (kernlab)     4.745   0.2016  4.027     0.2366    | -0.13278 -0.00273
## Partial Least Squares (simpls)        4.781   0.1943  3.921     0.2214    |  0.05033 -0.04280
## Conditional Inference Random Forest   4.782   0.2230  3.818     0.2185    |  0.11426  0.01639
## Lasso                                 4.793   0.1824  3.777     0.2107    | -0.03334  0.02849
## Support Vector Machines (e1071)       4.800   0.1554  3.985     0.2039    | -0.02938  0.00472
## Partial Least Squares (kernelpls)     4.815   0.1624  3.987     0.2180    |  0.02924 -0.06539
## Ridge Regression                      4.848   0.2382  3.897     0.2210    | -0.03791  0.06075
## Partial Least Squares (widekernelpls) 4.857   0.1706  3.832     0.2382    |  0.30591 -0.10882
## Partial Least Squares (pls)           4.859   0.1645  4.008     0.2278    |  0.19266 -0.08001
## Conditional Inference Tree            4.929   0.1303  3.733     0.1356    | -0.00476  0.01980
## Linear Regression                     4.945   0.2200  3.741     0.2224    |  0.00796 -0.02233
## Ridge Regression (w/ FS)              4.951   0.2381  3.705     0.2267    |  0.14908  0.03607
## K-Nearest Neighbors                   4.956   0.1536  3.879     0.1839    | -0.04106 -0.04991
## Decision Trees                        5.000   0.2977  3.820     0.2277    | -0.01538  0.00975
## Random Forest                         5.393   0.1497  4.042     0.2072    |  0.24458 -0.00352

Applying the machine learning models to the scaled CIELAB data showed mixed results, with model performance increasing or decreasing depending on the model used. These changes were, however, small.

5 UV and CIELAB Data Fusion

A machine learning analysis using fused UV and CIELAB data was also performed.

5.1 Create dataset

Two datasets were created: one fusing the entire UV data with the CIELAB data, and another fusing the UV data with 40% of its variables filtered out.

# Not filtered
carot.fus = low_level_fusion(list(carotAg, carotCielab))
sum_dataset(carot.fus)
## Dataset summary:
## Valid dataset
## Description:  Data integration from types: uvv-spectra,undefined 
## Type of data:  integrated-data 
## Number of samples:  50 
## Number of data points 104 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  1.148 
## Median of data values:  0.1881 
## Standard deviation:  8.069 
## Range of values:  -5.457 88.28 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## -5.4567  0.1335  0.1881  0.2673 88.2833
# 40% data filtered
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)
carot.fus.filt = low_level_fusion(list(carotAg.filt, carotCielab))
sum_dataset(carot.fus.filt)
## Dataset summary:
## Valid dataset
## Description:  Data integration from types: uvv-spectra,undefined 
## Type of data:  integrated-data 
## Number of samples:  50 
## Number of data points 63 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  1.782 
## Median of data values:  0.217 
## Standard deviation:  10.32 
## Range of values:  -5.457 88.28 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## -5.4567  0.1700  0.2170  0.3074 88.2833

5.2 Machine Learning

The same machine learning models applied to the UV dataset were used for the fused UV and CIELAB datasets. The metadata variable used for prediction was “transbetacarotene”.

5.2.1 Unprocessed data

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
           'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')

# Using unfiltered dataset
res12 = perform_ML(carot.fus, models, pred_var = 'transbetacarotene')
# Results w/ unfiltered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res12-res3
res12_3 = cbind(round(res12,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res12_3[order(res12_3$RMSE),]
##                                               RMSE Rsquared  RMSESD RsquaredSD  div     RMSE Rsquared
## Support Vector Machines (e1071)              4.353   0.3551   4.250     0.3107    |  0.00985 -0.06969
## Support Vector Machines (kernlab)            4.436   0.4241   4.261     0.3113    |  0.04756 -0.07773
## Elastic Net                                  4.450   0.3256   4.086     0.2937    |  0.36575 -0.09606
## Partial Least Squares (widekernelpls)        4.592   0.2804   4.034     0.2734    |  0.23010 -0.14481
## Conditional Inference Random Forest          4.645   0.3540   4.073     0.2351    | -0.07892 -0.04160
## Partial Least Squares (pls)                  4.652   0.2530   4.154     0.2529    |  0.51453 -0.20133
## Partial Least Squares (kernelpls)            4.681   0.2745   3.964     0.2656    |  0.51153 -0.23659
## Partial Least Squares (simpls)               4.685   0.2545   4.028     0.2627    |  0.46842 -0.24302
## Ridge Regression (w/ FS)                     4.730   0.3430   4.114     0.3034    |  0.67862 -0.05898
## Linear Regression (w/ Forward Selection)     4.860   0.2492   4.095     0.2565    | -0.16216 -0.09835
## Conditional Inference Tree                   4.870   0.2706   4.030     0.2714    | -0.05825  0.02809
## Linear Regression (w/ Stepwise Selection)    4.909   0.2406   3.983     0.2518    | -0.87333 -0.06477
## K-Nearest Neighbors                          4.996   0.1359   3.897     0.1837    |  0.45952 -0.08747
## Linear Regression (w/ Backwards Selection)   5.179   0.2966   4.147     0.2610    |  0.26124  0.01821
## Decision Trees                               5.221   0.2181   3.744     0.2196    |  0.08775  0.13058
## Ridge Regression                             5.627   0.2783   3.886     0.3185    |  1.37378 -0.04969
## Random Forest                                6.101   0.1621   4.067     0.2230    |  0.46009 -0.06436
## Lasso                                       13.821   0.1487   7.389     0.1694    | -2.62904 -0.02593
## Linear Regression                          264.936   0.2363 457.315     0.3017    | -6.19636 -0.02223

The machine learning analysis with the unprocessed fusion data showed a decrease in model performance, with an overall increase in RMSE values compared to the unprocessed UV data results. The best performance was achieved by the SVM models (e1071 and kernlab packages), with RMSEs of 4.353 and 4.436, respectively, and elastic net, with an RMSE of 4.450.

5.2.2 Variable Importance

The variable importance was calculated for the models that achieved the best performance using the unprocessed fusion data. These models were support vector machines (e1071 and kernlab packages) and elastic net.

# Support Vector Machines (e1071 package)
varImp7 = train_models_performance(carot.fus, c('svmLinear2'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Support Vector Machines (kernlab package)
varImp8 = train_models_performance(carot.fus, c('svmLinear'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Elastic Network
varImp9 = train_models_performance(carot.fus, c('enet'), 'transbetacarotene', "repeatedcv",
                                      num.folds = 5, compute.varimp = T)
# Variable Importance: SVMs (e1071 package) | SVMs (kernlab package) | Elastic Network
div = rep(' | ', dim(varImp7$vips[[1]])[1])
cbind(varImp7$vips[[1]], div, varImp8$vips[[1]], div, varImp9$vips[[1]])
##          Overall   Mean div Overall   Mean div Overall   Mean
## 472       100.00 100.00  |   100.00 100.00  |   100.00 100.00
## 471        99.96  99.96  |    99.96  99.96  |    99.96  99.96
## 473        99.94  99.94  |    99.94  99.94  |    99.94  99.94
## 469        99.39  99.39  |    99.39  99.39  |    99.39  99.39
## 474        99.39  99.39  |    99.39  99.39  |    99.39  99.39
## 470        99.08  99.08  |    99.08  99.08  |    99.08  99.08
## 468        98.45  98.45  |    98.45  98.45  |    98.45  98.45
## 475        98.36  98.36  |    98.36  98.36  |    98.36  98.36
## 467        97.74  97.74  |    97.74  97.74  |    97.74  97.74
## 479        97.68  97.68  |    97.68  97.68  |    97.68  97.68
## 466        97.25  97.25  |    97.25  97.25  |    97.25  97.25
## 476        96.77  96.77  |    96.77  96.77  |    96.77  96.77
## 480        96.22  96.22  |    96.22  96.22  |    96.22  96.22
## 465        95.87  95.87  |    95.87  95.87  |    95.87  95.87
## 477        95.70  95.70  |    95.70  95.70  |    95.70  95.70
## 464        93.97  93.97  |    93.97  93.97  |    93.97  93.97
## 481        93.92  93.92  |    93.92  93.92  |    93.92  93.92
## 478        93.06  93.06  |    93.06  93.06  |    93.06  93.06
## 463        92.68  92.68  |    92.68  92.68  |    92.68  92.68
## 462        91.53  91.53  |    91.53  91.53  |    91.53  91.53
## 459        91.30  91.30  |    91.30  91.30  |    91.30  91.30
## 460        90.88  90.88  |    90.88  90.88  |    90.88  90.88
## 461        90.09  90.09  |    90.09  90.09  |    90.09  90.09
## 458        89.76  89.76  |    89.76  89.76  |    89.76  89.76
## 457        88.25  88.25  |    88.25  88.25  |    88.25  88.25
## 482        87.84  87.84  |    87.84  87.84  |    87.84  87.84
## 494        86.27  86.27  |    86.27  86.27  |    86.27  86.27
## 456        85.51  85.51  |    85.51  85.51  |    85.51  85.51
## 495        85.49  85.49  |    85.49  85.49  |    85.49  85.49
## 455        84.02  84.02  |    84.02  84.02  |    84.02  84.02
## 483        83.89  83.89  |    83.89  83.89  |    83.89  83.89
## 486        83.82  83.82  |    83.82  83.82  |    83.82  83.82
## 444        82.87  82.87  |    82.87  82.87  |    82.87  82.87
## 487        82.71  82.71  |    82.71  82.71  |    82.71  82.71
## 445        82.41  82.41  |    82.41  82.41  |    82.41  82.41
## 489        82.36  82.36  |    82.36  82.36  |    82.36  82.36
## 443        82.16  82.16  |    82.16  82.16  |    82.16  82.16
## 446        82.07  82.07  |    82.07  82.07  |    82.07  82.07
## 488        82.06  82.06  |    82.06  82.06  |    82.06  82.06
## 454        81.36  81.36  |    81.36  81.36  |    81.36  81.36
## 442        81.27  81.27  |    81.27  81.27  |    81.27  81.27
## 447        81.16  81.16  |    81.16  81.16  |    81.16  81.16
## 484        80.81  80.81  |    80.81  80.81  |    80.81  80.81
## 448        80.72  80.72  |    80.72  80.72  |    80.72  80.72
## 440        80.69  80.69  |    80.69  80.69  |    80.69  80.69
## 441        80.44  80.44  |    80.44  80.44  |    80.44  80.44
## 485        80.43  80.43  |    80.43  80.43  |    80.43  80.43
## 453        80.26  80.26  |    80.26  80.26  |    80.26  80.26
## 452        79.83  79.83  |    79.83  79.83  |    79.83  79.83
## 439        79.79  79.79  |    79.79  79.79  |    79.79  79.79
## 451        79.74  79.74  |    79.74  79.74  |    79.74  79.74
## 449        79.73  79.73  |    79.73  79.73  |    79.73  79.73
## 493        79.49  79.49  |    79.49  79.49  |    79.49  79.49
## 450        79.04  79.04  |    79.04  79.04  |    79.04  79.04
## 490        77.55  77.55  |    77.55  77.55  |    77.55  77.55
## 496        76.75  76.75  |    76.75  76.75  |    76.75  76.75
## 438        76.60  76.60  |    76.60  76.60  |    76.60  76.60
## 492        76.50  76.50  |    76.50  76.50  |    76.50  76.50
## 491        76.39  76.39  |    76.39  76.39  |    76.39  76.39
## 437        75.56  75.56  |    75.56  75.56  |    75.56  75.56
## 436        73.16  73.16  |    73.16  73.16  |    73.16  73.16
## 435        71.03  71.03  |    71.03  71.03  |    71.03  71.03
## 497        70.22  70.22  |    70.22  70.22  |    70.22  70.22
## 425        69.75  69.75  |    69.75  69.75  |    69.75  69.75
## 424        69.53  69.53  |    69.53  69.53  |    69.53  69.53
## 426        69.04  69.04  |    69.04  69.04  |    69.04  69.04
## 434        68.98  68.98  |    68.98  68.98  |    68.98  68.98
## 433        68.81  68.81  |    68.81  68.81  |    68.81  68.81
## 427        68.63  68.63  |    68.63  68.63  |    68.63  68.63
## 418        68.55  68.55  |    68.55  68.55  |    68.55  68.55
## 423        68.53  68.53  |    68.53  68.53  |    68.53  68.53
## 419        68.00  68.00  |    68.00  68.00  |    68.00  68.00
## 432        67.65  67.65  |    67.65  67.65  |    67.65  67.65
## 422        67.64  67.64  |    67.64  67.64  |    67.64  67.64
## 428        67.50  67.50  |    67.50  67.50  |    67.50  67.50
## 429        67.17  67.17  |    67.17  67.17  |    67.17  67.17
## 431        67.15  67.15  |    67.15  67.15  |    67.15  67.15
## 430        67.13  67.13  |    67.13  67.13  |    67.13  67.13
## 417        66.71  66.71  |    66.71  66.71  |    66.71  66.71
## 421        66.44  66.44  |    66.44  66.44  |    66.44  66.44
## 416        66.40  66.40  |    66.40  66.40  |    66.40  66.40
## 415        65.73  65.73  |    65.73  65.73  |    65.73  65.73
## 414        64.25  64.25  |    64.25  64.25  |    64.25  64.25
## 420        63.74  63.74  |    63.74  63.74  |    63.74  63.74
## 413        62.04  62.04  |    62.04  62.04  |    62.04  62.04
## Cielab_B   61.22  61.22  |    61.22  61.22  |    61.22  61.22
## 498        60.58  60.58  |    60.58  60.58  |    60.58  60.58
## 412        59.64  59.64  |    59.64  59.64  |    59.64  59.64
## 499        57.96  57.96  |    57.96  57.96  |    57.96  57.96
## 411        57.41  57.41  |    57.41  57.41  |    57.41  57.41
## 500        54.35  54.35  |    54.35  54.35  |    54.35  54.35
## Cielab_A   54.33  54.33  |    54.33  54.33  |    54.33  54.33
## 410        53.95  53.95  |    53.95  53.95  |    53.95  53.95
## 409        52.99  52.99  |    52.99  52.99  |    52.99  52.99
## 408        51.17  51.17  |    51.17  51.17  |    51.17  51.17
## 407        48.08  48.08  |    48.08  48.08  |    48.08  48.08
## 406        43.24  43.24  |    43.24  43.24  |    43.24  43.24
## 405        41.29  41.29  |    41.29  41.29  |    41.29  41.29
## 404        38.71  38.71  |    38.71  38.71  |    38.71  38.71
## 403        37.47  37.47  |    37.47  37.47  |    37.47  37.47
## 402        34.33  34.33  |    34.33  34.33  |    34.33  34.33
## 401        33.05  33.05  |    33.05  33.05  |    33.05  33.05
## 400        30.58  30.58  |    30.58  30.58  |    30.58  30.58
## Cielab_L    0.00   0.00  |     0.00   0.00  |     0.00   0.00

The variable importance results show that the predictors with the greatest impact are those around the 470 nm wavelength, the single most important variable being the one corresponding to 472 nm.
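
As an illustration, the highest-ranked predictors could be pulled directly from the importance table printed above. The object name imp.df below is hypothetical and stands in for that table (row names are the predictor variables, numeric columns the importance scores); this is only a sketch under that assumption.

# Sketch only: 'imp.df' is a hypothetical name for the importance table
# printed above (row names = predictor variables, numeric columns = scores).
imp.sorted = imp.df[order(imp.df[, 1], decreasing = TRUE), , drop = FALSE]
head(rownames(imp.sorted), 10)   # ten highest-ranked predictors (expected near 470 nm)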

5.2.3 Filtered data

# Using dataset w/ 40% data filtered
res13 = perform_ML(carot.fus.filt, models, pred_var = 'transbetacarotene')
# Results w/ 40% filtered fusion data and difference to unprocessed UV data results (two last columns)
diff = res13-res3
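# drop columns 3 and 4 of diff (RMSESD, RsquaredSD) so only the RMSE and Rsquared differences are shown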
res13_3 = cbind(round(res13,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res13_3[order(res13_3$RMSE),]
##                                               RMSE Rsquared   RMSESD RsquaredSD  div      RMSE Rsquared
## Support Vector Machines (e1071)              4.341  0.43004    4.260    0.31603    |  -0.00278  0.00527
## Support Vector Machines (kernlab)            4.458  0.44320    4.251    0.31300    |   0.06938 -0.05861
## Partial Least Squares (simpls)               4.577  0.26942    3.864    0.25090    |   0.36021 -0.22809
## Partial Least Squares (pls)                  4.604  0.25776    3.975    0.25330    |   0.46726 -0.19661
## Partial Least Squares (widekernelpls)        4.642  0.29400    3.987    0.28253    |   0.27956 -0.13118
## Partial Least Squares (kernelpls)            4.664  0.25297    3.996    0.23998    |   0.49425 -0.25808
## Elastic Net                                  4.698  0.22024    3.925    0.24457    |   0.61387 -0.20145
## Conditional Inference Random Forest          4.709  0.31113    4.092    0.22349    |  -0.01546 -0.08450
## Ridge Regression (w/ FS)                     4.845  0.21226    3.876    0.24484    |   0.79396 -0.18973
## K-Nearest Neighbors                          4.881  0.20095    3.765    0.20428    |   0.34527 -0.02248
## Linear Regression (w/ Forward Selection)     4.924  0.28887    3.990    0.28127    |  -0.09826 -0.05863
## Decision Trees                               4.951  0.29144    3.724    0.26478    |  -0.18239  0.20389
## Conditional Inference Tree                   4.970  0.02511    3.916    0.02603    |   0.04152 -0.21737
## Ridge Regression                             5.010  0.16075    4.024    0.16434    |   0.75689 -0.16721
## Linear Regression (w/ Backwards Selection)   5.095  0.33564    3.928    0.32173    |   0.17730  0.05725
## Random Forest                                5.912  0.17026    3.727    0.21035    |   0.27105 -0.05618
## Linear Regression (w/ Stepwise Selection)    5.945  0.30782    4.398    0.31041    |   0.16268  0.00244
## Lasso                                       11.800  0.25968   10.428    0.28364    |  -4.65051  0.08503
## Linear Regression                          586.915  0.27394 2747.641    0.29894    | 315.78277  0.01539

The machine learning analysis with filtered fusion data showed an overall decrease in model performance compared with the results obtained with unprocessed UV data (higher RMSE values). The best performance was achieved by the support vector machine models (e1071 and kernlab packages), with RMSE values of 4.341 and 4.458, respectively.
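
To make this comparison explicit, the per-model RMSE differences already computed above can be summarised directly; the short sketch below assumes res13 and res3 exist exactly as defined in this document.

# Sketch: how many models improved (lower RMSE) on the filtered fusion
# data relative to the unprocessed UV results (assumes res13 and res3 from above).
rmse.diff = res13$RMSE - res3$RMSE
sum(rmse.diff < 0)               # number of models with lower RMSE
rownames(res13)[rmse.diff < 0]   # which models improved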

5.2.4 Scaled data

Both the filtered and unfiltered datasets were scaled, and the machine learning models were applied to these scaled datasets.

# Using unfiltered dataset
carot.fus.sc = specmine::scaling(carot.fus)
res14 = perform_ML(carot.fus.sc, models, pred_var = 'transbetacarotene')
# Results w/ unfiltered scaled fusion data and difference to unprocessed UV data results (two last columns)
diff = res14-res3
res14_3 = cbind(round(res14,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res14_3[order(res14_3$RMSE),]
##                                               RMSE Rsquared   RMSESD RsquaredSD  div      RMSE Rsquared
## Support Vector Machines (e1071)              4.409   0.4003    4.438     0.3390    |   0.06584 -0.02448
## Support Vector Machines (kernlab)            4.418   0.3873    4.236     0.3288    |   0.02988 -0.11448
## Elastic Net                                  4.460   0.3272    4.102     0.3089    |   0.37561 -0.09450
## Partial Least Squares (widekernelpls)        4.514   0.4432    4.214     0.3161    |   0.15209  0.01806
## Ridge Regression (w/ FS)                     4.519   0.3453    3.991     0.2899    |   0.46776 -0.05667
## Partial Least Squares (pls)                  4.522   0.3912    4.157     0.3054    |   0.38480 -0.06320
## Partial Least Squares (simpls)               4.529   0.4455    4.083     0.3245    |   0.31228 -0.05206
## Partial Least Squares (kernelpls)            4.561   0.4649    4.232     0.3154    |   0.39112 -0.04610
## K-Nearest Neighbors                          4.709   0.2221    4.098     0.2119    |   0.17331 -0.00136
## Conditional Inference Random Forest          4.752   0.3363    4.240     0.2308    |   0.02744 -0.05929
## Linear Regression (w/ Forward Selection)     4.894   0.2738    4.110     0.2860    |  -0.12881 -0.07374
## Conditional Inference Tree                   4.915   0.2458    4.056     0.2403    |  -0.01313  0.00334
## Decision Trees                               5.111   0.2385    3.766     0.2313    |  -0.02184  0.15100
## Ridge Regression                             5.315   0.2682    4.093     0.2836    |   1.06263 -0.05971
## Linear Regression (w/ Backwards Selection)   5.348   0.2695    4.003     0.2436    |   0.43029 -0.00890
## Linear Regression (w/ Stepwise Selection)    5.540   0.2226    4.222     0.2761    |  -0.24252 -0.08276
## Random Forest                                5.879   0.1874    3.932     0.2465    |   0.23885 -0.03904
## Lasso                                       16.820   0.2086   15.214     0.2308    |   0.37003  0.03396
## Linear Regression                          442.244   0.2519 1153.205     0.2742    | 171.11181 -0.00662

The machine learning analysis with scaled fusion data showed a decrease in model performance, with increased RMSE values compared with the unprocessed UV data results. The best performance was achieved by the SVM models (e1071 and kernlab packages), with RMSE values of 4.409 and 4.418, respectively.

# Using dataset w/ 40% data filtered
carot.fus.filt.sc = specmine::scaling(carot.fus.filt)
res15 = perform_ML(carot.fus.filt.sc, models, pred_var = 'transbetacarotene')
# Results w/ 40% filtered and scaled fusion data and difference to unprocessed UV data results (two last columns)
diff = res15-res3
res15_3 = cbind(round(res15,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res15_3[order(res15_3$RMSE),]
##                                               RMSE Rsquared   RMSESD RsquaredSD  div      RMSE Rsquared
## Support Vector Machines (e1071)              4.187   0.3731    4.231    0.32637    |  -0.15695 -0.05166
## Support Vector Machines (kernlab)            4.341   0.3911    4.254    0.31080    |  -0.04792 -0.11073
## Partial Least Squares (widekernelpls)        4.595   0.4442    4.115    0.29971    |   0.23263  0.01900
## K-Nearest Neighbors                          4.603   0.2077    4.050    0.20361    |   0.06682 -0.01569
## Partial Least Squares (kernelpls)            4.626   0.4058    4.296    0.29314    |   0.45646 -0.10523
## Partial Least Squares (simpls)               4.727   0.4132    4.173    0.31029    |   0.51009 -0.08435
## Elastic Net                                  4.728   0.2311    3.972    0.23533    |   0.64333 -0.19063
## Conditional Inference Random Forest          4.737   0.3699    4.067    0.24418    |   0.01266 -0.02578
## Partial Least Squares (pls)                  4.744   0.4098    4.311    0.31637    |   0.60660 -0.04456
## Ridge Regression                             5.007   0.1578    3.904    0.16725    |   0.75378 -0.17013
## Conditional Inference Tree                   5.013   0.0089    3.939    0.01087    |   0.08435 -0.23358
## Ridge Regression (w/ FS)                     5.086   0.1747    4.134    0.19989    |   1.03500 -0.22732
## Decision Trees                               5.179   0.2505    3.834    0.23159    |   0.04596  0.16292
## Linear Regression (w/ Forward Selection)     5.187   0.3103    3.995    0.29537    |   0.16485 -0.03722
## Linear Regression (w/ Stepwise Selection)    5.469   0.3291    4.114    0.29630    |  -0.31317  0.02377
## Random Forest                                5.638   0.2566    4.006    0.24846    |  -0.00253  0.03017
## Linear Regression (w/ Backwards Selection)   5.963   0.2701    4.914    0.27612    |   1.04565 -0.00825
## Lasso                                       11.350   0.2755    7.541    0.27930    |  -5.09964  0.10083
## Linear Regression                          585.131   0.2704 2300.894    0.29489    | 313.99835  0.01184

Using filtered and scaled fusion data resulted in an overall decrease in model performance compared with the unprocessed UV data results (higher RMSE values). The best performance was achieved by the SVM models (e1071 and kernlab packages), with RMSE values of 4.187 and 4.341, respectively.
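
Since both the filtered-only results (res13) and the filtered-plus-scaled results (res15) are available, the effect of scaling alone on the filtered fusion data can also be inspected; the sketch below is one possible way to do so, assuming both result tables list the models in the same row order.

# Sketch: isolate the effect of scaling on the filtered fusion data by
# comparing res15 (filtered + scaled) with res13 (filtered only); assumes
# both tables keep the models in the same row order.
sc.effect = res15$RMSE - res13$RMSE
names(sc.effect) = rownames(res15)
sort(round(sc.effect, 3))   # negative values mean scaling lowered the RMSE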

6 Results Summary

UV Data:

CIELAB Data:

Fusion Data: