1 Introduction

The aim of this work is to validate a quantification method for carotenoid content in roots of M. esculenta from colorimetric data using the CIE L*a*b* system. The underlying assumption is that statistical techniques of predictive analysis, as well as machine learning, can correlate colorimetric data, easily obtained in the field, with the levels obtained through traditional quantification techniques, such as UV-visible spectrophotometry or HPLC, and, from this, build prediction models of carotenoid content for this type of biomass.

Roots of fifty M. esculenta genotypes belonging to EPAGRI’s germplasm bank were sampled in the 2014/2015 season. Carotenoids were extracted from fresh roots and the absorbances of the organosolvent extracts were collected on a UV-visible spectrophotometer using a spectral window from 200 to 700 nm. Aliquots (10 µL) of the extracts were also injected into a liquid chromatograph. The color attributes of the samples were measured by a colorimeter and the results were expressed according to the CIELAB color space scale.

2 Necessary tools

To run this script the following packages are necessary:

library(specmine)
library(xlsx)
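
If these packages are missing, they can be installed first (assuming both are available on CRAN; note that xlsx also depends on a working Java/rJava setup):

# Run once if needed
# install.packages(c("specmine", "xlsx"))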

Setting working directory:

setwd("C:/Users/Telma/Desktop/CassavaCarotenoids")
set.seed(12345)   # fix the RNG seed so the cross-validation results are reproducible

2.1 Models Used

The machine learning models used in this analysis are listed in the table below. These belong to the caret package, which is used by specmine.

Table 1 - Machine learning models used in this analysis. The first column shows the model’s name, the second column shows the value that should be given to the function and the third column indicates whether or not the model has built-in feature selection. For more information on any of the models visit https://topepo.github.io/caret/available-models.html
Model                                         “Method” Value                          Built-in Feature Selection
Conditional Inference Random Forest           cforest                                 YES
Conditional Inference Tree                    ctree                                   YES
Decision Trees                                rpart                                   YES
Elastic Net                                   enet                                    YES
K-Nearest Neighbors                           knn                                     NO
Lasso Regression                              lasso                                   YES
Linear Regression                             lm                                      NO
Linear Regression (w/ Backwards Selection)    leapBackward                            YES
Linear Regression (w/ Forward Selection)      leapForward                             YES
Linear Regression (w/ Stepwise Selection)     leapSeq                                 YES
Partial Least Squares                         kernelpls, pls, simpls, widekernelpls   YES
Random Forest                                 rf                                      YES
Ridge Regression                              ridge                                   NO
Ridge Regression (w/ Feature Selection)       foba                                    YES
Support Vector Machines (kernlab package)     svmLinear                               NO
Support Vector Machines (e1071 package)       svmLinear2                              NO
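
If needed, the tuning parameters that caret optimizes for any of these “method” values can be inspected with caret’s modelLookup() function, for example:

library(caret)
modelLookup("enet")   # lists the tuning parameters for the elastic net method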

2.2 Auxiliary functions

The following function is used to retrieve the model name given the “method” value.

getModelName <- function(model) {
  # Lookup table mapping caret "method" values to the readable model names
  # used in the results tables below
  model.names = c(
    'lasso' = 'Lasso',
    'ridge' = 'Ridge Regression',
    'foba' = 'Ridge Regression (w/ FS)',
    'rf' = 'Random Forest',
    'cforest' = 'Conditional Inference Random Forest',
    'enet' = 'Elastic Net',
    'pls' = 'Partial Least Squares (pls)',
    'kernelpls' = 'Partial Least Squares (kernelpls)',
    'simpls' = 'Partial Least Squares (simpls)',
    'widekernelpls' = 'Partial Least Squares (widekernelpls)',
    'rpart' = 'Decision Trees',
    'ctree' = 'Conditional Inference Tree',
    'svmLinear' = 'Support Vector Machines (kernlab)',
    'svmLinear2' = 'Support Vector Machines (e1071)',
    'knn' = 'K-Nearest Neighbors',
    'lm' = 'Linear Regression',
    'leapBackward' = 'Linear Regression (w/ Backwards Selection)',
    'leapForward' = 'Linear Regression (w/ Forward Selection)',
    'leapSeq' = 'Linear Regression (w/ Stepwise Selection)'
  )
  if (!model %in% names(model.names)) return(NULL)  # unknown "method" value
  unname(model.names[model])
}
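
For example:

getModelName('foba')
## [1] "Ridge Regression (w/ FS)"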

The following function returns a data frame with the results of applying one or more machine learning models to a given dataset. The metadata variable to be predicted must be supplied.

perform_ML <- function(dataset, models, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
  for (model in models) {
    name = getModelName(model)
    # train and evaluate the model with 5-fold repeated cross-validation (specmine/caret)
    ml_res = train_models_performance(dataset, c(model), pred_var, "repeatedcv", 
                                      num.folds = 5, compute.varimp = F)
    res[name,] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared, 
                   ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    # keep the partial results in the global environment in case a model fails mid-run
    assign('res', res, envir = .GlobalEnv)
  }
  return(res)
}

The following function returns a data frame with the results of applying a machine learning model to several preprocessed versions of a dataset, covering scaling, smoothing interpolation, background, offset and baseline corrections, first derivative and multiplicative scatter correction. The metadata variable to be predicted must be supplied.

perform_ML_preproc <- function(dataset, model, pred_var) {
  res = data.frame(RMSE = numeric(0), Rsquared = numeric(0), RMSESD = numeric(0), RsquaredSD = numeric(0))
  
  # build the preprocessed variants of the dataset
  ds.sc = specmine::scaling(dataset)
  ds.wavelens = get_x_values_as_num(dataset)
  x.axis.sm = seq(min(ds.wavelens), max(ds.wavelens), 10)  # smoothing grid: every 10 nm
  ds.smooth = smoothing_interpolation(dataset, method = "loess", x.axis = x.axis.sm)
  ds.bg = data_correction(dataset, 'background')
  ds.offset = data_correction(ds.bg, 'offset')       # offset correction on top of background
  ds.baseline = data_correction(ds.offset, 'baseline')
  ds.fd = first_derivative(dataset)
  ds.msc = msc_correction(dataset)
  
  datasets = list('No preprocessing' = dataset, 'Scaling' = ds.sc, 'Smoothing' = ds.smooth, 
                  'Background cor' = ds.bg, 'Background + Offset cors' = ds.offset, 
                  'Background + Offset + Baseline cors' = ds.baseline, 'First Derivative' = ds.fd,
                  'Multiplicative Scatter Cor' = ds.msc)
  i = 1
  for (ds in datasets) {
    ml_res = train_models_performance(ds, c(model), pred_var, "repeatedcv", num.folds = 5, compute.varimp = F)
    res[names(datasets)[i],] = c(ml_res$performance$RMSE, ml_res$performance$Rsquared,
                                 ml_res$performance$RMSESD, ml_res$performance$RsquaredSD)
    # keep the partial results in the global environment in case a model fails mid-run
    assign('res', res, envir = .GlobalEnv)
    i = i + 1
  }
  return(res)
}

3 UV Data

3.1 Read data from xlsx files

The UV data is stored in 150 .xlsx files (3 replicates for each of the 50 genotypes), each containing the absorbance values read between 200 and 700 nm.

files = list.files("data/UV")
datamat = matrix(nrow = 501, ncol = length(files))
rownames(datamat) = 200:700   # data recorded between 200 and 700 nm
colnames(datamat) = gsub("\\.xlsx?$", "", files)   # strip the .xls/.xlsx extension

for (i in 1:length(files)){
  tab_excel = read.xlsx(paste("data/UV/", files[i], sep = ""), sheetIndex = 1, header = F)
  # pad shorter spectra with NA so every column has 501 values
  datamat[,i] = c(tab_excel[,2], rep(NA, 501 - length(tab_excel[,2]))) 
}

datamat[1:6, 1:6]
##       101.1  101.2   101.3   102.1  102.2   102.3
## 200 0.08763 0.1863 0.10565 0.10565 0.1482 0.13221
## 201 0.09468 0.2184 0.13756 0.12944 0.1254 0.08732
## 202 0.06238 0.1792 0.08410 0.09159 0.1437 0.09159
## 203 0.11513 0.1776 0.13093 0.13497 0.1190 0.07799
## 204 0.11364 0.2038 0.05227 0.11364 0.1376 0.08368
## 205 0.13941 0.1820 0.10809 0.09691 0.1006 0.10809

3.2 Read metadata

Besides information regarding sample varieties and replicates, the metadata file also contains information about HPLC concentration measurements and CIELAB data.

file.metadata = "metadata/Carotenoides_Colorimetria.csv"
metadata = read_metadata(file.metadata)
description = "UV data for cassava cultivars - carotenoids"
label.x = "Wavelength"
label.values = "Absorbance"

head(metadata)
##     Varieties Replicates Cielab_L Cielab_A Cielab_B CarotenoidsContent_TCCS  Lutein Betacryptoxanthin
## 3.1         3          1    85.72    -2.70    22.28                   4.853 0.03248           0.06543
## 3.2         3          2    86.18    -2.48    21.39                   4.809 0.03248           0.06543
## 3.3         3          3    85.25    -2.64    22.38                   4.951 0.03248           0.06543
## 5.1         5          1    85.47    -1.76     6.74                   3.098 0.02598           0.07023
## 5.2         5          2    82.29    -2.00     7.02                   4.046 0.02598           0.07023
## 5.3         5          3    84.99    -1.86     7.25                   3.383 0.02598           0.07023
##     Alphacarotene Cisbetacarotene transbetacarotene Lycopene TCCHPLC
## 3.1       0.06021           2.250             3.269        0   5.678
## 3.2       0.06021           2.250             3.269        0   5.678
## 3.3       0.06021           2.250             3.269        0   5.678
## 5.1       0.08319           2.679             2.860        0   5.719
## 5.2       0.08319           2.679             2.860        0   5.719
## 5.3       0.08319           2.679             2.860        0   5.719

3.3 Create the dataset

After creating a matrix from the UV .xlsx files and reading the metadata, a dataset can be easily created.

Carotenoides_Colorimetria = create_dataset(type = "uvv-spectra", datamatrix = datamat, metadata = metadata, 
                                           label.x = label.x, label.values = label.values, 
                                           description = description)

sum_dataset(Carotenoides_Colorimetria)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids 
## Type of data:  uvv-spectra 
## Number of samples:  150 
## Number of data points 501 
## Number of metadata variables:  13 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  4224 
## Mean of data values:  0.3301 
## Median of data values:  0.1048 
## Standard deviation:  0.6824 
## Range of values:  -0.06964 4.191 
## Quantiles: 
##       0%      25%      50%      75%     100% 
## -0.06964  0.02003  0.10478  0.23166  4.19051

Because the majority of carotenoids absorb in the visible region of the spectrum, between 400 and 500 nm, a subset of the original dataset was created containing only the data points within this wavelength interval. Also, because the dataset has some missing values, as shown in the summary above, these were replaced with the mean of the corresponding variable’s values.

carot_sub = subset_x_values_by_interval(Carotenoides_Colorimetria, 400, 500) # Absorbances between 400-500nm
carot_sub_nomissing = missingvalues_imputation(carot_sub, method = "mean")
sum_dataset(carot_sub_nomissing)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids; Missing value imputation with method mean 
## Type of data:  uvv-spectra 
## Number of samples:  150 
## Number of data points 101 
## Number of metadata variables:  13 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  0.2316 
## Median of data values:  0.187 
## Standard deviation:  0.1907 
## Range of values:  -0.002721 1.574 
## Quantiles: 
##        0%       25%       50%       75%      100% 
## -0.002721  0.130033  0.186963  0.261674  1.574271
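
For reference, mean imputation replaces each missing absorbance with the mean of its variable (wavelength row) across all samples; a minimal sketch of the idea, not specmine’s internal code:

# Conceptual sketch of mean imputation on the data matrix
# (rows = wavelengths/variables, columns = samples)
m = carot_sub$data
for (i in 1:nrow(m)) {
  m[i, is.na(m[i, ])] = mean(m[i, ], na.rm = TRUE)
}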

The replicates were then aggregated, so that there is a single sample per genotype (150 samples -> 50 samples).

indexes = rep(seq(1, num_samples(carot_sub_nomissing)/3), each = 3)
carotAg = aggregate_samples(carot_sub_nomissing, indexes, meta.to.remove = c("Replicates"))
sum_dataset(carotAg)
## Dataset summary:
## Valid dataset
## Description:  UV data for cassava cultivars - carotenoids; Missing value imputation with method mean 
## Type of data:  uvv-spectra 
## Number of samples:  50 
## Number of data points 101 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  0.2316 
## Median of data values:  0.1871 
## Standard deviation:  0.188 
## Range of values:  0.00136 1.299 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## 0.00136 0.13380 0.18708 0.26038 1.29949

The dataset is now ready to be used in the subsequent analysis.

3.4 Machine Learning

The next step consisted of using a variety of machine learning regression approaches to determine which model and/or variables could best predict carotenoid content in roots of M. esculenta.

3.4.1 Select Output Variable

To determine which of the metadata variables would perform best in the prediction of carotenoid content, the machine learning models listed above were applied to the created dataset using different output variables. The evaluation metric chosen to compare model performance was the Root-Mean-Square Error (RMSE), since it explicitly shows how much the model predictions deviate, on average, from the actual values in the dataset.
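
For \(n\) samples with observed values \(y_{i}\) and model predictions \(\hat{y}_{i}\), the RMSE is defined as

\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^{2}}\]

so lower values indicate more accurate predictions, expressed on the same scale as the output variable.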

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
           'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')

#Using CarotenoidsContent_TCCS variable
res1 = perform_ML(carotAg, models, pred_var = 'CarotenoidsContent_TCCS') 
res1[order(res1$RMSE),] #ordered by RMSE values
##                                               RMSE Rsquared  RMSESD RsquaredSD
## Partial Least Squares (simpls)               3.492   0.9208   2.760    0.11351
## Support Vector Machines (e1071)              3.709   0.9316   2.823    0.08047
## Partial Least Squares (widekernelpls)        3.732   0.9238   3.197    0.14257
## Random Forest                                3.768   0.9483   2.224    0.05348
## Elastic Net                                  3.793   0.9185   3.539    0.13289
## Partial Least Squares (pls)                  3.800   0.9529   2.098    0.06209
## Ridge Regression (w/ FS)                     3.855   0.9478   2.506    0.04542
## Ridge Regression                             3.877   0.9283   3.344    0.08096
## Support Vector Machines (kernlab)            3.928   0.9409   2.743    0.06560
## Partial Least Squares (kernelpls)            4.096   0.8962   3.502    0.18642
## Linear Regression (w/ Stepwise Selection)    4.158   0.9192   3.211    0.11126
## Linear Regression (w/ Forward Selection)     4.178   0.8883   3.865    0.17517
## Linear Regression (w/ Backwards Selection)   4.392   0.8711   2.935    0.13775
## K-Nearest Neighbors                          4.732   0.9224   5.058    0.08588
## Lasso                                        5.207   0.8174   4.008    0.25323
## Conditional Inference Random Forest          6.713   0.7917   3.604    0.12296
## Conditional Inference Tree                   7.363   0.7114   3.053    0.16803
## Decision Trees                               7.582   0.6833   3.051    0.20625
## Linear Regression                          109.408   0.5563 378.967    0.32466
mean(get_metadata(carotAg)$CarotenoidsContent_TCCS) # CarotenoidsContent_TCCS variable mean values
## [1] 10.67

The results using the “CarotenoidsContent_TCCS” variable show that the models achieving the lowest RMSE values for the given data included partial least squares (simpls and widekernelpls), with RMSE of 3.492 and 3.732, support vector machines (from the e1071 package), with RMSE of 3.709, and random forests, with RMSE of 3.768. These values nevertheless leave room for improvement, considering that the mean of the “CarotenoidsContent_TCCS” variable is only 10.67.

Overall, the coefficient of determination (\(R^{2}\)) shows a good fit of the predictions to the observations. The linear regression model without feature selection showed by far the worst results, with an RMSE of 109.408 and an \(R^{2}\) of 0.5563.

#Using TCCHPLC variable
res2 = perform_ML(carotAg, models, pred_var = 'TCCHPLC') 
res2[order(res2$RMSE),] #ordered by RMSE values
##                                               RMSE Rsquared   RMSESD RsquaredSD
## Partial Least Squares (pls)                  5.643   0.5971    4.049     0.3146
## Partial Least Squares (widekernelpls)        5.779   0.5701    3.840     0.3298
## Partial Least Squares (simpls)               5.789   0.5721    3.877     0.3213
## Support Vector Machines (e1071)              5.844   0.5975    4.099     0.2965
## Partial Least Squares (kernelpls)            5.878   0.5661    3.877     0.3498
## Ridge Regression (w/ FS)                     5.880   0.6038    3.791     0.3322
## Support Vector Machines (kernlab)            5.907   0.5892    4.306     0.3088
## Elastic Net                                  5.934   0.6340    3.795     0.2997
## K-Nearest Neighbors                          6.277   0.4451    3.985     0.2909
## Linear Regression (w/ Backwards Selection)   6.373   0.5226    3.921     0.2853
## Decision Trees                               6.795   0.4736    3.947     0.3026
## Conditional Inference Random Forest          6.806   0.5588    4.034     0.3079
## Conditional Inference Tree                   6.916   0.4805    3.808     0.2880
## Random Forest                                7.275   0.3596    3.351     0.2736
## Ridge Regression                             7.282   0.6163    4.579     0.2862
## Linear Regression (w/ Stepwise Selection)    8.341   0.5265    5.628     0.3311
## Linear Regression (w/ Forward Selection)     8.783   0.4716    6.292     0.3254
## Lasso                                       17.508   0.2494   14.130     0.2657
## Linear Regression                          863.264   0.2830 3171.947     0.2985
mean(get_metadata(carotAg)$TCCHPLC) # TCCHPLC variable mean values
## [1] 10.84

The results using the “TCCHPLC” variable show that, overall, RMSE values increased compared to those obtained with the “CarotenoidsContent_TCCS” variable. The models that achieved the lowest RMSE values were partial least squares with methods “pls”, “widekernelpls” and “simpls”, with RMSE of 5.643, 5.779 and 5.789, respectively.

Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, with an RMSE of 863.264 and an \(R^{2}\) of 0.2830. Lasso regression also performed much more poorly in this case, with an RMSE of 17.508.

#Using transbetacarotene variable
res3 = perform_ML(carotAg, models, pred_var = 'transbetacarotene') 
res3[order(res3$RMSE),] #ordered by RMSE values
##                                               RMSE Rsquared  RMSESD RsquaredSD
## Ridge Regression (w/ FS)                     4.159  0.35640   4.057     0.3325
## Elastic Net                                  4.191  0.41274   4.202     0.3430
## Partial Least Squares (kernelpls)            4.211  0.42217   4.273     0.3081
## Support Vector Machines (e1071)              4.218  0.39924   4.330     0.3249
## Support Vector Machines (kernlab)            4.230  0.46608   4.219     0.3147
## Partial Least Squares (pls)                  4.265  0.47090   4.131     0.3096
## Partial Least Squares (simpls)               4.309  0.36296   4.257     0.3156
## Partial Least Squares (widekernelpls)        4.324  0.45308   4.215     0.2861
## Ridge Regression                             4.407  0.31655   4.219     0.3184
## K-Nearest Neighbors                          4.597  0.22467   4.043     0.2204
## Conditional Inference Random Forest          4.703  0.36963   4.124     0.2694
## Conditional Inference Tree                   4.894  0.28851   4.103     0.2487
## Linear Regression (w/ Forward Selection)     5.142  0.31153   4.534     0.3017
## Decision Trees                               5.189  0.05344   3.750     0.0612
## Linear Regression (w/ Backwards Selection)   5.355  0.27887   4.194     0.2421
## Random Forest                                5.753  0.23993   3.882     0.2553
## Linear Regression (w/ Stepwise Selection)    6.135  0.20603   5.378     0.2999
## Lasso                                       16.145  0.18959  14.917     0.2694
## Linear Regression                          329.642  0.28887 621.822     0.2943
mean(get_metadata(carotAg)$transbetacarotene) # transbetacarotene variable mean values
## [1] 5.897

Trans-beta-carotene concentrations were also used, as this was the carotenoid with the highest concentration levels. The results using the “transbetacarotene” variable show that, overall, RMSE values increased compared to those obtained with the “CarotenoidsContent_TCCS” variable. The models that achieved the lowest RMSE values included ridge regression (w/ feature selection) with RMSE of 4.159, elastic net with RMSE of 4.191 and partial least squares (kernelpls) with RMSE of 4.211.

Overall, the coefficient of determination shows a poor fit of the predictions to the observations. The linear regression model without feature selection showed the worst results, as in the previous cases, with an RMSE of 329.642 and an \(R^{2}\) of 0.28887. Lasso regression also performed much more poorly in this case, with an RMSE of 16.145.

All the results above point to a better model performance when using the “CarotenoidsContent_TCCS” metadata variable. This was, therefore, the variable chosen for the subsequent analyses.

3.4.2 Preprocessed Data

The next step consisted of testing the best models from the previous analysis (using the “CarotenoidsContent_TCCS” metadata variable) on preprocessed datasets, to see whether model performance improved. Those models were partial least squares, support vector machines and random forests.

# Partial least squares
res4 = perform_ML_preproc(carotAg, 'simpls', 'CarotenoidsContent_TCCS')
res4[order(res4$RMSE),] #ordered by RMSE values
##                                      RMSE Rsquared RMSESD RsquaredSD
## No preprocessing                    3.592   0.9326  2.848    0.08055
## Background + Offset + Baseline cors 3.639   0.9253  3.201    0.07748
## Background cor                      3.748   0.9068  3.173    0.15577
## Smoothing                           3.915   0.9411  2.828    0.08749
## Scaling                             4.023   0.9367  2.613    0.07894
## Background + Offset cors            4.231   0.9138  2.860    0.08263
## First Derivative                    5.114   0.7738  4.123    0.33141
## Multiplicative Scatter Cor          9.180   0.2886  7.773    0.30671

Applying the partial least squares model to the preprocessed datasets showed no improvement in model performance with any of the preprocessing methods.

# Support vector machines
res5 = perform_ML_preproc(carotAg, 'svmLinear2', 'CarotenoidsContent_TCCS')
res5[order(res5$RMSE),] #ordered by RMSE values
##                                      RMSE Rsquared RMSESD RsquaredSD
## No preprocessing                    3.954   0.8900  3.797     0.1847
## First Derivative                    3.973   0.8591  4.009     0.2296
## Scaling                             4.049   0.9084  3.430     0.1812
## Background cor                      4.189   0.8999  3.596     0.1776
## Smoothing                           4.365   0.9250  3.116     0.1089
## Background + Offset cors            5.350   0.8479  4.842     0.2611
## Background + Offset + Baseline cors 5.846   0.7873  5.248     0.2638
## Multiplicative Scatter Cor          8.431   0.5065  4.978     0.3216

Applying the support vector machines model to the preprocessed datasets showed no improvement in model performance with any of the preprocessing methods.

# Random forests
res6 = perform_ML_preproc(carotAg, 'rf', 'CarotenoidsContent_TCCS')
res6[order(res6$RMSE),] #ordered by RMSE values
##                                      RMSE Rsquared RMSESD RsquaredSD
## Smoothing                           3.664   0.9617  2.027    0.03688
## Scaling                             3.678   0.9501  2.616    0.05652
## Background cor                      3.808   0.9580  2.509    0.03858
## No preprocessing                    3.810   0.9673  2.707    0.03217
## Background + Offset cors            4.175   0.9310  2.263    0.06727
## Background + Offset + Baseline cors 4.338   0.8986  2.998    0.09572
## First Derivative                    4.638   0.8910  3.619    0.10289
## Multiplicative Scatter Cor          5.958   0.5513  2.815    0.37261

Applying the random forests model to the preprocessed datasets showed a slight improvement in model performance when using smoothing interpolation (RMSE of 3.664), scaling (RMSE of 3.678) and background correction (RMSE of 3.808) as preprocessing methods.

3.4.3 Filtered Data

The data was also filtered in order to determine whether feature selection could improve model performance. A flat pattern filter with the inter-quartile range (IQR) as the filter function was applied to the dataset, removing 80%, 60% and 40% of the variables each time.
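
Conceptually (not necessarily specmine’s exact implementation), the filter ranks the variables by their inter-quartile range and drops the flattest ones; for red.value = 80, only the 20% of wavelengths with the largest IQR are kept:

# Conceptual sketch of the flat pattern filter with the IQR filter function
iqrs = apply(carotAg$data, 1, IQR)       # IQR of each variable (row)
keep = iqrs > quantile(iqrs, 0.80)       # drop the flattest 80% of variables
filtered.sketch = carotAg$data[keep, ]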

#Filtering 80% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 80)

res7 = perform_ML(carotAg.filt, models, 'CarotenoidsContent_TCCS')
# Results of 80% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res7-res1
res7_1 = cbind(round(res7,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res7_1[order(res7_1$RMSE),]
##                                             RMSE Rsquared RMSESD RsquaredSD  div       RMSE Rsquared
## Support Vector Machines (e1071)            3.775   0.9272  2.865    0.10695    |    0.06607 -0.00439
## Ridge Regression (w/ FS)                   3.784   0.9609  2.880    0.04962    |   -0.07122  0.01313
## Elastic Net                                3.819   0.9200  2.784    0.10532    |    0.02604  0.00153
## Support Vector Machines (kernlab)          4.004   0.9354  2.985    0.09008    |    0.07612 -0.00551
## Ridge Regression                           4.026   0.9179  2.921    0.11112    |    0.14857 -0.01042
## Partial Least Squares (widekernelpls)      4.029   0.9165  2.783    0.10537    |    0.29741 -0.00733
## Random Forest                              4.068   0.8764  3.461    0.17183    |    0.29960 -0.07194
## Partial Least Squares (pls)                4.105   0.9278  2.537    0.09569    |    0.30487 -0.02514
## Partial Least Squares (simpls)             4.218   0.9288  2.595    0.09155    |    0.72582  0.00805
## Partial Least Squares (kernelpls)          4.290   0.9470  2.748    0.07729    |    0.19427  0.05079
## Linear Regression (w/ Backwards Selection) 4.413   0.8663  3.434    0.16604    |    0.02065 -0.00480
## Linear Regression (w/ Stepwise Selection)  4.618   0.8835  3.240    0.15953    |    0.46010 -0.03570
## Linear Regression                          4.895   0.7794  3.662    0.24235    | -104.51391  0.22306
## Lasso                                      5.181   0.7879  4.131    0.22043    |   -0.02559 -0.02956
## K-Nearest Neighbors                        5.346   0.8538  3.898    0.16235    |    0.61380 -0.06857
## Linear Regression (w/ Forward Selection)   5.574   0.7897  3.579    0.33281    |    1.39583 -0.09862
## Conditional Inference Random Forest        6.696   0.7714  2.901    0.12981    |   -0.01732 -0.02031
## Decision Trees                             7.163   0.7370  2.819    0.15669    |   -0.41977  0.05371
## Conditional Inference Tree                 7.293   0.7095  2.774    0.16195    |   -0.06946 -0.00189

Filtering 80% of the data showed an overall decrease in model performance, with RMSE values increasing in comparison to the results on the original dataset. It did, however, massively improve the performance of the linear model (without selection), decreasing its RMSE by about 104 units.

#Filtering 60% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 60)

res8 = perform_ML(carotAg.filt, models, 'CarotenoidsContent_TCCS')
# Results of 60% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res8-res1
res8_1 = cbind(round(res8,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res8_1[order(res8_1$RMSE),]
##                                              RMSE Rsquared RMSESD RsquaredSD  div      RMSE Rsquared
## Elastic Net                                 2.975   0.9325  2.465    0.10707    |  -0.81828  0.01401
## Ridge Regression                            3.082   0.9597  2.287    0.07438    |  -0.79492  0.03133
## Partial Least Squares (widekernelpls)       3.115   0.9515  2.494    0.04963    |  -0.61633  0.02769
## Partial Least Squares (pls)                 3.146   0.9432  2.406    0.06523    |  -0.65462 -0.00970
## Support Vector Machines (kernlab)           3.184   0.9426  2.817    0.10225    |  -0.74389  0.00166
## Ridge Regression (w/ FS)                    3.269   0.9549  2.696    0.08504    |  -0.58561  0.00713
## Partial Least Squares (simpls)              3.330   0.9412  2.612    0.06337    |  -0.16275  0.02041
## Support Vector Machines (e1071)             3.414   0.9108  3.349    0.18377    |  -0.29580 -0.02077
## Partial Least Squares (kernelpls)           3.549   0.9412  2.946    0.10128    |  -0.54735  0.04500
## Random Forest                               3.764   0.9592  2.564    0.03871    |  -0.00474  0.01090
## Linear Regression (w/ Stepwise Selection)   4.408   0.8678  3.784    0.17481    |   0.25013 -0.05146
## Linear Regression (w/ Backwards Selection)  4.571   0.8832  3.584    0.18819    |   0.17868  0.01204
## K-Nearest Neighbors                         4.624   0.9353  4.362    0.06769    |  -0.10839  0.01297
## Linear Regression (w/ Forward Selection)    4.785   0.9189  4.420    0.15484    |   0.60672  0.03060
## Lasso                                       6.578   0.9142 12.959    0.12363    |   1.37104  0.09679
## Conditional Inference Random Forest         6.669   0.7811  3.427    0.12917    |  -0.04438 -0.01059
## Decision Trees                              7.255   0.7333  3.127    0.15816    |  -0.32758  0.04998
## Conditional Inference Tree                  7.482   0.7537  3.436    0.16905    |   0.11911  0.04232
## Linear Regression                          35.529   0.6011 49.871    0.36117    | -73.87949  0.04477

Filtering 60% of the data, on the other hand, showed an overall increase in model performance, with RMSE values decreasing in comparison to the results on the original dataset. Here, the elastic net model showed the lowest RMSE value so far, 2.975.

#Filtering 40% of the data
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)

res9 = perform_ML(carotAg.filt, models, 'CarotenoidsContent_TCCS')
# Results of 40% data filtering w/ difference to unprocessed dataset results (Two last columns)
diff = res9-res1
res9_1 = cbind(round(res9,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res9_1[order(res9_1$RMSE),]
##                                              RMSE Rsquared  RMSESD RsquaredSD  div      RMSE Rsquared
## Elastic Net                                 2.788   0.9573   2.122    0.04566    |  -1.00560  0.03886
## Partial Least Squares (widekernelpls)       3.022   0.9350   2.444    0.12042    |  -0.70960  0.01115
## Linear Regression (w/ Forward Selection)    3.032   0.9488   2.332    0.05595    |  -1.14588  0.06053
## Partial Least Squares (pls)                 3.202   0.9672   2.095    0.04005    |  -0.59860  0.01428
## Ridge Regression                            3.287   0.9468   2.412    0.04975    |  -0.59062  0.01848
## Ridge Regression (w/ FS)                    3.387   0.9429   2.267    0.10049    |  -0.46729 -0.00480
## Partial Least Squares (kernelpls)           3.399   0.9236   2.990    0.14775    |  -0.69725  0.02743
## Random Forest                               3.556   0.9311   2.533    0.09180    |  -0.21273 -0.01720
## Partial Least Squares (simpls)              3.620   0.9412   3.000    0.13589    |   0.12811  0.02045
## Support Vector Machines (kernlab)           3.725   0.9360   3.091    0.12235    |  -0.20346 -0.00496
## Support Vector Machines (e1071)             3.817   0.9128   3.203    0.17841    |   0.10740 -0.01876
## Linear Regression (w/ Stepwise Selection)   4.016   0.8875   3.789    0.18218    |  -0.14253 -0.03173
## Linear Regression (w/ Backwards Selection)  4.096   0.8863   3.186    0.11189    |  -0.29628  0.01516
## Lasso                                       4.282   0.9266   3.412    0.13305    |  -0.92460  0.10920
## K-Nearest Neighbors                         4.994   0.9290   3.993    0.06097    |   0.26195  0.00662
## Conditional Inference Random Forest         6.734   0.7649   2.720    0.12479    |   0.02156 -0.02687
## Decision Trees                              7.168   0.7257   2.969    0.14286    |  -0.41461  0.04241
## Conditional Inference Tree                  7.264   0.7414   3.145    0.17168    |  -0.09836  0.03003
## Linear Regression                          85.785   0.5815 406.272    0.36570    | -23.62301  0.02515

Filtering 40% of the data showed even better results than the previous case, with an overall increase in model performance and RMSE values decreasing in comparison to the results on the original dataset. Here, the elastic net model reached an even lower RMSE of 2.788.

4 CIELAB Data

A machine learning analysis using the CIELAB data was also performed.

4.1 Create dataset

The CIELAB data is stored in the metadata file, so it first needs to be extracted in order to create the CIELAB dataset.

color.values = t(get_metadata(carotAg)[2:4]) #L a b
filtered.meta = get_metadata(carotAg)[5:12]

carotCielab = create_dataset(datamatrix = color.values, metadata = filtered.meta, label.x = "cielab",
                             label.values = "color values", description = "Dataset from cielab values")
head(carotCielab$data)[,1:12] #Cielab values for first 12 samples
##           101.1  102.1 103.1 105.1  11.1  119.1  123.1  125.1   21.1   23.1   27.1    3.1
## Cielab_L 77.670 85.017 81.25 69.25 83.59 69.510 82.893 68.563 74.113 70.240 83.983 85.717
## Cielab_A -3.397 -3.663 -4.46 -4.95 -3.44 -5.457 -2.123 -4.733 -4.277 -1.437 -2.140 -2.607
## Cielab_B 16.493 18.477 18.49 31.96 16.81 37.693  8.213 36.790 20.107 16.160  8.683 22.017
sum_dataset(carotCielab) # Dataset summary
## Dataset summary:
## Valid dataset
## Description:  Dataset from cielab values 
## Type of data:  undefined 
## Number of samples:  50 
## Number of data points 3 
## Number of metadata variables:  8 
## Label of x-axis values:  cielab 
## Label of data points:  color values 
## Number of missing values in data:  0 
## Mean of data values:  31.99 
## Median of data values:  18.69 
## Standard deviation:  35.84 
## Range of values:  -5.457 88.28 
## Quantiles: 
##     0%    25%    50%    75%   100% 
## -5.457 -3.070 18.685 75.292 88.283

4.2 Machine Learning

The same machine learning models used on the UV dataset were applied to the CIELAB dataset, with the exception of the linear regression models with selection, since it makes little sense to use these with only 3 features in the dataset (the L, a and b values). The metadata variable used for prediction was “CarotenoidsContent_TCCS”.

4.2.1 Unprocessed data

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 
           'widekernelpls', 'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm')

#Using CarotenoidsContent_TCCS variable
res10 = perform_ML(carotCielab, models, pred_var = 'CarotenoidsContent_TCCS') 
# Results w/ CIELAB data and difference to unprocessed UV data results (Two last columns)
diff = res10-res1[-c(17,18,19),]
res10_1 = cbind(round(res10,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res10_1[order(res10_1$RMSE),]
##                                        RMSE Rsquared RMSESD RsquaredSD  div     RMSE Rsquared
## Linear Regression                     6.295   0.5933  2.050     0.3114    | -103.113  0.03702
## Lasso                                 6.412   0.5503  2.191     0.2915    |    1.205 -0.26716
## Ridge Regression                      6.417   0.5681  1.874     0.3271    |    2.540 -0.36027
## Elastic Net                           6.456   0.5785  2.354     0.3005    |    2.663 -0.33996
## K-Nearest Neighbors                   6.636   0.5336  3.660     0.3659    |    1.904 -0.38878
## Ridge Regression (w/ FS)              6.638   0.5628  2.200     0.3159    |    2.783 -0.38498
## Random Forest                         6.647   0.5124  4.079     0.3346    |    2.879 -0.43592
## Partial Least Squares (pls)           6.939   0.5916  2.270     0.2887    |    3.139 -0.36136
## Partial Least Squares (simpls)        6.990   0.6022  2.410     0.2777    |    3.498 -0.31859
## Support Vector Machines (e1071)       7.015   0.5350  3.394     0.3027    |    3.306 -0.39664
## Partial Least Squares (kernelpls)     7.121   0.5827  2.315     0.2755    |    3.024 -0.31353
## Partial Least Squares (widekernelpls) 7.125   0.6221  2.691     0.2882    |    3.394 -0.30171
## Support Vector Machines (kernlab)     7.294   0.5040  3.719     0.3179    |    3.366 -0.43688
## Conditional Inference Random Forest   8.162   0.4385  4.159     0.2706    |    1.449 -0.35320
## Conditional Inference Tree            9.388   0.3063  3.570     0.2011    |    2.026 -0.40503
## Decision Trees                        9.990   0.2679  3.170     0.2505    |    2.408 -0.41536

From the results above it is clear that there is an overall decrease in model performance when using CIELAB data instead of UV data, with increased RMSE values. However, the linear model performed better than any other model, with an RMSE of 6.295, unlike with the UV data, where it performed worst in almost every case.

4.2.2 Scaled data

The dataset was then scaled to test whether CIELAB data scaling could improve the results.
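
As the summary below confirms, specmine’s default scaling method is auto-scaling, which centers each variable and divides it by its standard deviation; a conceptual one-liner, not specmine’s internal code:

# Conceptual sketch of auto-scaling (rows = variables, columns = samples)
sc.sketch = t(scale(t(carotCielab$data)))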

carotCielab.sc = specmine::scaling(carotCielab)
sum_dataset(carotCielab.sc)
## Dataset summary:
## Valid dataset
## Description:  Dataset from cielab values; Scaling with method auto 
## Type of data:  undefined 
## Number of samples:  50 
## Number of data points 3 
## Number of metadata variables:  8 
## Label of x-axis values:  cielab 
## Label of data points:  color values 
## Number of missing values in data:  0 
## Mean of data values:  1.49e-16 
## Median of data values:  0.06326 
## Standard deviation:  0.9933 
## Range of values:  -2.187 3.695 
## Quantiles: 
##       0%      25%      50%      75%     100% 
## -2.18663 -0.49244  0.06326  0.52084  3.69515
res11 = perform_ML(carotCielab.sc, models, pred_var = 'CarotenoidsContent_TCCS') 
# Results w/ scaled CIELAB data and difference to unprocessed CIELAB data results (Two last columns)
diff = res11-res10
res11_10 = cbind(round(res11,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res11_10[order(res11_10$RMSE),]
##                                        RMSE Rsquared RMSESD RsquaredSD  div     RMSE Rsquared
## Ridge Regression (w/ FS)              6.469   0.6087  2.199     0.2984    | -0.16921  0.04594
## Ridge Regression                      6.497   0.5909  2.223     0.3051    |  0.08049  0.02284
## Elastic Net                           6.515   0.5736  2.280     0.2943    |  0.05852 -0.00493
## Linear Regression                     6.651   0.5587  2.207     0.2954    |  0.35576 -0.03466
## Lasso                                 6.757   0.5759  2.867     0.2945    |  0.34593  0.02567
## Partial Least Squares (widekernelpls) 6.771   0.5416  2.285     0.3121    | -0.35489 -0.08055
## Partial Least Squares (kernelpls)     6.865   0.5404  2.318     0.3196    | -0.25558 -0.04225
## Support Vector Machines (kernlab)     6.919   0.5284  4.051     0.3021    | -0.37585  0.02440
## Partial Least Squares (simpls)        7.043   0.5433  2.955     0.2718    |  0.05302 -0.05884
## Partial Least Squares (pls)           7.085   0.5385  2.695     0.2821    |  0.14515 -0.05306
## Support Vector Machines (e1071)       7.136   0.5000  3.350     0.3098    |  0.12065 -0.03496
## K-Nearest Neighbors                   7.267   0.5257  4.555     0.3774    |  0.63094 -0.00786
## Random Forest                         7.280   0.4481  4.149     0.3327    |  0.63345 -0.06426
## Conditional Inference Random Forest   8.021   0.4546  4.547     0.2727    | -0.14069  0.01608
## Conditional Inference Tree            9.636   0.3393  3.778     0.2672    |  0.24707  0.03295
## Decision Trees                        9.737   0.3168  3.303     0.2785    | -0.25366  0.04889

Applying the machine learning models to the scaled CIELAB data showed mixed results, with model performance increasing or decreasing depending on the model used. These changes were, however, small.

5 UV and CIELAB Data Fusion

A machine learning analysis using fused UV and CIELAB data was also performed.

5.1 Create dataset

Two datasets were created: one fusing the CIELAB data with the full UV data, and another fusing it with the UV data after filtering out 40% of the variables.

# Not filtered
carot.fus = low_level_fusion(list(carotAg, carotCielab))
sum_dataset(carot.fus)
## Dataset summary:
## Valid dataset
## Description:  Data integration from types: uvv-spectra,undefined 
## Type of data:  integrated-data 
## Number of samples:  50 
## Number of data points 104 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  1.148 
## Median of data values:  0.1881 
## Standard deviation:  8.069 
## Range of values:  -5.457 88.28 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## -5.4567  0.1335  0.1881  0.2673 88.2833
# 40% data filtered
carotAg.filt = flat_pattern_filter(carotAg, "iqr", by.percent = TRUE, red.value = 40)
carot.fus.filt = low_level_fusion(list(carotAg.filt, carotCielab))
sum_dataset(carot.fus.filt)
## Dataset summary:
## Valid dataset
## Description:  Data integration from types: uvv-spectra,undefined 
## Type of data:  integrated-data 
## Number of samples:  50 
## Number of data points 63 
## Number of metadata variables:  12 
## Label of x-axis values:  Wavelength 
## Label of data points:  Absorbance 
## Number of missing values in data:  0 
## Mean of data values:  1.782 
## Median of data values:  0.217 
## Standard deviation:  10.32 
## Range of values:  -5.457 88.28 
## Quantiles: 
##      0%     25%     50%     75%    100% 
## -5.4567  0.1700  0.2170  0.3074 88.2833
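
Low-level fusion essentially concatenates the variables of the input datasets for the same samples (specmine also merges the metadata); a rough sketch of the idea:

# Conceptual sketch: stack the UV absorbances and the CIELAB values sample-wise
fused.sketch = rbind(carotAg$data, carotCielab$data[, colnames(carotAg$data)])
dim(fused.sketch)   # 104 variables x 50 samples, matching the first summary above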

5.2 Machine Learning

The same machine learning models applied to the UV dataset were used for the UV and CIELAB fusion datasets. The metadata variable used for prediction was “CarotenoidsContent_TCCS”.

5.2.1 Unprocessed data

models = c('lasso', 'ridge', 'foba', 'rf', 'cforest', 'enet', 'pls', 'kernelpls', 'simpls', 'widekernelpls',
           'rpart', 'ctree', 'svmLinear', 'svmLinear2', 'knn', 'lm', 'leapBackward', 'leapForward', 'leapSeq')

# Using unfiltered dataset
res12 = perform_ML(carot.fus, models, pred_var = 'CarotenoidsContent_TCCS') 
# Results w/ unfiltered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res12-res1
res12_1 = cbind(round(res12,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res12_1[order(res12_1$RMSE),]
##                                              RMSE Rsquared  RMSESD RsquaredSD  div      RMSE Rsquared
## Ridge Regression (w/ FS)                    3.570   0.9298   2.570    0.09139    |  -0.28492 -0.01790
## Partial Least Squares (pls)                 3.682   0.8931   2.208    0.16247    |  -0.11817 -0.05986
## Partial Least Squares (simpls)              3.706   0.8746   2.115    0.22102    |   0.21384 -0.04617
## Random Forest                               3.758   0.9444   3.193    0.06612    |  -0.01067 -0.00397
## Elastic Net                                 3.775   0.9179   3.135    0.12661    |  -0.01812 -0.00055
## Partial Least Squares (kernelpls)           3.804   0.8312   2.036    0.25565    |  -0.29205 -0.06502
## Support Vector Machines (e1071)             3.875   0.8887   2.945    0.13634    |   0.16551 -0.04289
## Partial Least Squares (widekernelpls)       4.017   0.9247   3.009    0.09406    |   0.28545  0.00087
## Linear Regression (w/ Backwards Selection)  4.479   0.8020   3.793    0.28868    |   0.08673 -0.06914
## Support Vector Machines (kernlab)           4.612   0.8800   3.530    0.17235    |   0.68415 -0.06088
## Linear Regression (w/ Stepwise Selection)   4.718   0.7973   3.942    0.26786    |   0.55925 -0.12196
## Linear Regression (w/ Forward Selection)    4.829   0.8743   4.892    0.18620    |   0.65048 -0.01400
## Ridge Regression                            4.839   0.8510   3.693    0.20439    |   0.96199 -0.07729
## Lasso                                       4.983   0.8076   3.935    0.25721    |  -0.22357 -0.00988
## K-Nearest Neighbors                         6.412   0.6320   3.724    0.32438    |   1.67962 -0.29035
## Conditional Inference Random Forest         6.663   0.7671   3.738    0.12083    |  -0.05004 -0.02466
## Conditional Inference Tree                  7.566   0.6697   3.125    0.14793    |   0.20316 -0.04168
## Decision Trees                              8.021   0.6997   3.349    0.21125    |   0.43851  0.01642
## Linear Regression                          37.304   0.5489 106.894    0.35402    | -72.10456 -0.00743

The machine learning analysis with the unprocessed fusion data showed mixed results, with model performance increasing or decreasing depending on the model, when compared to the unprocessed UV data results. The best performance was achieved by the ridge regression model (with selection), with an RMSE of 3.570.

5.2.2 Filtered data

# Using dataset w/ 40% data filtered
res13 = perform_ML(carot.fus.filt, models, pred_var = 'CarotenoidsContent_TCCS') 
# Results w/ 40% filtered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res13-res1
res13_1 = cbind(round(res13,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res13_1[order(res13_1$RMSE),]
##                                              RMSE Rsquared RMSESD RsquaredSD  div      RMSE Rsquared
## Partial Least Squares (widekernelpls)       3.084   0.9319  2.047    0.08297    |  -0.64814  0.00804
## Ridge Regression (w/ FS)                    3.270   0.9474  2.222    0.04949    |  -0.58469 -0.00033
## Elastic Net                                 3.297   0.8981  3.157    0.15709    |  -0.49655 -0.02037
## Partial Least Squares (kernelpls)           3.439   0.9337  2.324    0.07546    |  -0.65670  0.03750
## Partial Least Squares (pls)                 3.497   0.9492  2.314    0.05237    |  -0.30310 -0.00371
## Partial Least Squares (simpls)              3.508   0.9028  3.208    0.14473    |   0.01595 -0.01804
## Ridge Regression                            3.581   0.9175  2.596    0.12414    |  -0.29581 -0.01084
## Random Forest                               3.634   0.9422  2.461    0.06038    |  -0.13417 -0.00615
## Support Vector Machines (kernlab)           3.698   0.9109  2.622    0.13331    |  -0.22996 -0.03005
## Support Vector Machines (e1071)             3.822   0.9046  2.336    0.12510    |   0.11254 -0.02704
## Linear Regression (w/ Forward Selection)    3.973   0.8968  3.630    0.15522    |  -0.20577  0.00848
## Linear Regression (w/ Stepwise Selection)   3.973   0.9137  3.111    0.14123    |  -0.18510 -0.00557
## Linear Regression (w/ Backwards Selection)  4.216   0.8961  3.468    0.17868    |  -0.17696  0.02496
## Lasso                                       4.520   0.8970  3.549    0.13102    |  -0.68690  0.07961
## K-Nearest Neighbors                         6.445   0.5573  3.881    0.36184    |   1.71233 -0.36511
## Conditional Inference Random Forest         6.813   0.8037  2.580    0.13109    |   0.10060  0.01198
## Decision Trees                              7.398   0.7618  3.077    0.16124    |  -0.18477  0.07850
## Conditional Inference Tree                  7.471   0.7600  2.738    0.16739    |   0.10842  0.04860
## Linear Regression                          34.045   0.5631 87.680    0.35951    | -75.36335  0.00675

The machine learning analysis with the filtered fusion data showed an overall increase in model performance when compared to the results obtained with unprocessed UV data. The best performance was achieved by the partial least squares model (“widekernelpls”), with an RMSE of 3.084.

5.2.3 Scaled data

Both the filtered and unfiltered fusion datasets were scaled and the machine learning models were applied to these scaled datasets.

# Using unfiltered dataset
carot.fus.sc = specmine::scaling(carot.fus)
res14 = perform_ML(carot.fus.sc, models, pred_var = 'CarotenoidsContent_TCCS') 
# Results w/ unfiltered scaled fusion data and difference to unprocessed UV data results (Two last columns)
diff = res14-res1
res14_1 = cbind(round(res14,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res14_1[order(res14_1$RMSE),]
##                                              RMSE Rsquared  RMSESD RsquaredSD  div      RMSE Rsquared
## Ridge Regression (w/ FS)                    3.363   0.9562   2.444    0.04904    |  -0.49190  0.00846
## Elastic Net                                 3.555   0.9134   3.354    0.14244    |  -0.23861 -0.00503
## Random Forest                               3.702   0.9431   2.547    0.07687    |  -0.06679 -0.00520
## Partial Least Squares (widekernelpls)       3.821   0.9270   2.940    0.09650    |   0.08971  0.00315
## Partial Least Squares (simpls)              3.828   0.9047   2.888    0.13734    |   0.33613 -0.01612
## Linear Regression (w/ Forward Selection)    3.955   0.8887   3.734    0.16237    |  -0.22347  0.00037
## Partial Least Squares (kernelpls)           4.018   0.9099   2.484    0.14167    |  -0.07785  0.01372
## Partial Least Squares (pls)                 4.044   0.9445   2.562    0.06779    |   0.24382 -0.00844
## Support Vector Machines (e1071)             4.149   0.9021   3.013    0.12431    |   0.43919 -0.02955
## Ridge Regression                            4.277   0.8650   3.677    0.17924    |   0.39998 -0.06330
## Support Vector Machines (kernlab)           4.285   0.9174   2.871    0.09592    |   0.35657 -0.02351
## Linear Regression (w/ Stepwise Selection)   4.501   0.8917   5.755    0.17352    |   0.34253 -0.02755
## Linear Regression (w/ Backwards Selection)  4.625   0.7971   3.493    0.25177    |   0.23213 -0.07401
## K-Nearest Neighbors                         5.085   0.9319   3.843    0.06019    |   0.35300  0.00957
## Lasso                                       5.380   0.7681   4.480    0.26323    |   0.17360 -0.04938
## Conditional Inference Random Forest         6.675   0.7819   2.961    0.14885    |  -0.03773 -0.00982
## Conditional Inference Tree                  7.430   0.7408   2.923    0.18633    |   0.06707  0.02940
## Decision Trees                              8.279   0.6797   2.552    0.26684    |   0.69642 -0.00359
## Linear Regression                          57.460   0.5357 185.598    0.32405    | -51.94857 -0.02059

The machine learning analysis with the scaled fusion data showed mixed results, with model performance increasing or decreasing depending on the model, when compared to the unprocessed UV data results. The best performance was achieved by the ridge regression model (with selection), with an RMSE of 3.363.

# Using dataset w/ 40% data filtered
carot.fus.filt.sc = specmine::scaling(carot.fus.filt)
res15 = perform_ML(carot.fus.filt.sc, models, pred_var = 'CarotenoidsContent_TCCS') 
# Results w/ scaled 40% filtered fusion data and difference to unprocessed UV data results (Two last columns)
diff = res15-res1
res15_1 = cbind(round(res15,5), div = rep('   |', nrow(diff)), round(diff[-c(3,4)],5))
res15_1[order(res15_1$RMSE),]
##                                              RMSE Rsquared RMSESD RsquaredSD  div      RMSE Rsquared
## Ridge Regression (w/ FS)                    3.291   0.9358  2.390    0.10474    |  -0.56366 -0.01194
## Partial Least Squares (pls)                 3.295   0.9213  2.160    0.09737    |  -0.50543 -0.03162
## Partial Least Squares (widekernelpls)       3.354   0.9260  2.448    0.11008    |  -0.37727  0.00213
## Ridge Regression                            3.413   0.9142  2.451    0.11500    |  -0.46427 -0.01414
## Partial Least Squares (simpls)              3.421   0.9255  2.393    0.08205    |  -0.07146  0.00472
## Partial Least Squares (kernelpls)           3.459   0.9032  2.261    0.10488    |  -0.63719  0.00695
## Linear Regression (w/ Stepwise Selection)   3.490   0.9463  2.800    0.05425    |  -0.66874  0.02704
## Elastic Net                                 3.499   0.9349  3.058    0.11903    |  -0.29455  0.01644
## Linear Regression (w/ Forward Selection)    3.695   0.9222  2.430    0.08538    |  -0.48370  0.03385
## Random Forest                               3.712   0.9458  2.497    0.06147    |  -0.05596 -0.00253
## Support Vector Machines (e1071)             3.795   0.9365  2.946    0.08364    |   0.08592  0.00492
## Support Vector Machines (kernlab)           4.023   0.8901  2.892    0.17257    |   0.09496 -0.05087
## Linear Regression (w/ Backwards Selection)  4.130   0.8808  3.474    0.15742    |  -0.26264  0.00972
## Lasso                                       4.456   0.8864  3.621    0.19410    |  -0.75126  0.06900
## K-Nearest Neighbors                         4.966   0.9238  3.985    0.08355    |   0.23403  0.00144
## Conditional Inference Random Forest         6.536   0.7824  3.561    0.12064    |  -0.17719 -0.00927
## Decision Trees                              7.042   0.7087  2.832    0.15897    |  -0.54042  0.02545
## Conditional Inference Tree                  7.487   0.7142  1.911    0.16668    |   0.12455  0.00279
## Linear Regression                          23.035   0.5023 35.928    0.34654    | -86.37294 -0.05395

Using the filtered and scaled fusion data resulted in an overall increase in model performance when compared to the unprocessed UV data results. The best performance was achieved by the ridge regression model (with selection), with an RMSE of 3.291.

6 Results Summary

UV Data: with the unprocessed dataset the best model was partial least squares (simpls), with an RMSE of 3.492. Preprocessing brought no clear gains, while flat pattern filtering did: elastic net reached the lowest RMSE of the whole analysis (2.788) when 40% of the variables were filtered out.

CIELAB Data: model performance was clearly worse than with UV data. The best model was plain linear regression, with an RMSE of 6.295; scaling produced only small, mixed changes.

Fusion Data: fusing UV and CIELAB data gave results comparable to UV data alone. The best performance was obtained with the 40% filtered fusion dataset, where partial least squares (widekernelpls) reached an RMSE of 3.084 (3.291 for ridge regression with feature selection after scaling).
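
As a convenience, the lowest-RMSE row of each results table can be extracted programmatically, e.g.:

# Helper: best (lowest RMSE) model from a results data frame
best = function(res) res[which.min(res$RMSE), , drop = FALSE]
best(res1)    # UV data, unprocessed
best(res9)    # UV data, 40% filtered
best(res10)   # CIELAB data, unprocessed
best(res13)   # fusion data, 40% filtered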