Cassava IR no replicates

The data is read from .dx files and converted to a dataset list. Also “ppds”, which is numeric, was converted to factor.

setwd("D:/Dropbox/Metabolomics-package")
source("scripts/init.R")
source("Datasets/CassavaPPD/IR/scripts/ir-cassavappd-metadata.R")

ir.cassppd.data.folder = "Datasets/CassavaPPD/IR/data"
ir.cassppd.metadata.file = "Datasets/CassavaPPD/IR/metadata/ir-cassavappd-metadata.csv"

get.metadata.ir.cassavappd(ir.cassppd.data.folder, ir.cassppd.metadata.file)

ir.cassava.ppd = read.dataset.dx(ir.cassppd.data.folder, ir.cassppd.metadata.file, 
                                 type = "ir-spectra", description = "IR for Cassava - PPD",
                                 label.x = "Wavenumber", label.values = "transmittance")

ir.cassava.ppd = convert.to.factor(ir.cassava.ppd, "ppds")
index.vector = get.indexes.replicates(ir.cassava.ppd)
ir.cassava.noreps = aggregate.samples(ir.cassava.ppd, index.vector, aggreg.fn = "mean", meta.to.remove = "replicates")

To reduce the number of features in order to make the analysis’ algorithms run faster, smoothing interpolation was used.

sum.dataset(ir.cassava.noreps)
## Dataset summary:
## Valid dataset
## Description:  IR for Cassava - PPD 
## Type of data:  ir-spectra 
## Number of samples:  16 
## Number of data points 3735 
## Number of metadata variables:  2 
## Label of x-axis values:  Wavenumber 
## Label of data points:  transmittance 
## Number of missing values in data:  0 
## Mean of data values:  0.5943 
## Median of data values:  0.6148 
## Standard deviation:  0.1519 
## Range of values:  0.1129 0.828 
## Quantiles: 
##     0%    25%    50%    75%   100% 
## 0.1129 0.4897 0.6148 0.7226 0.8280
ir.cassava.noreps = smoothing.interpolation(ir.cassava.noreps, "bin", 10)
## Warning: Last data point averages only 5 points.
sum.dataset(ir.cassava.noreps)
## Dataset summary:
## Valid dataset
## Description:  IR for Cassava - PPD-smoothed with hyperSpec spc.bin 
## Type of data:  undefined 
## Number of samples:  16 
## Number of data points 374 
## Number of metadata variables:  2 
## Label of x-axis values:  Wavenumber 
## Label of data points:  transmittance 
## Number of missing values in data:  0 
## Mean of data values:  0.5942 
## Median of data values:  0.6148 
## Standard deviation:  0.1518 
## Range of values:  0.1135 0.8272 
## Quantiles: 
##     0%    25%    50%    75%   100% 
## 0.1135 0.4897 0.6148 0.7227 0.8272

No preprocessing used.

UNIVARIATE ANALYSIS

An analysis of variance (ANOVA) was conducted over the data with tukey test also, and this is the top 10 results ordered by p-value:

anova.cassava.varieties = univariate.analysis(ir.cassava.noreps, type = "anova", "varieties")
anova.cassava.varieties[1:10,]
##                  pvalues  logs    fdr   tukey
## 3388.4456809052  0.01469 1.833 0.1183 SAN-BRA
## 3398.08964180503 0.01472 1.832 0.1183 SAN-BRA
## 3407.73360270487 0.01473 1.832 0.1183 SAN-BRA
## 3417.37756360471 0.01486 1.828 0.1183 SAN-BRA
## 3378.80172000536 0.01489 1.827 0.1183 SAN-BRA
## 3427.02152450455 0.01513 1.820 0.1183 SAN-BRA
## 3369.15775910552 0.01521 1.818 0.1183 SAN-BRA
## 3359.51379820568 0.01566 1.805 0.1183 SAN-BRA
## 3436.66548540439 0.01568 1.805 0.1183 SAN-BRA
## 3349.86983730584 0.01616 1.792 0.1183 SAN-BRA
anova.cassava.ppd = univariate.analysis(ir.cassava.noreps, type = "anova", "ppds")
anova.cassava.ppd[1:10,]
##                  pvalues   logs    fdr tukey
## 2356.54186462239  0.1929 0.7147 0.8992      
## 2366.18582552223  0.2031 0.6922 0.8992      
## 2337.25394282271  0.2656 0.5757 0.8992      
## 2327.60998192287  0.3256 0.4873 0.8992      
## 2346.89790372255  0.3286 0.4833 0.8992      
## 2317.96602102303  0.4691 0.3287 0.8992      
## 2375.82978642207  0.5033 0.2982 0.8992      
## 967.811495045528  0.5375 0.2696 0.8992      
## 977.455455945367  0.5470 0.2620 0.8992      
## 2308.32206012319  0.5805 0.2362 0.8992

CLUSTERING

Hierarchical clustering with euclidean distance and complete method was performed on the data and the resulting dendrogram is shown below:

hc.cassava = clustering(ir.cassava.noreps, method = "hc", distance = "euclidean")
dendrogram.plot.col(ir.cassava.noreps, hc.cassava, "ppds")

plot of chunk hc

dendrogram.plot.col(ir.cassava.noreps, hc.cassava, "varieties")

plot of chunk hc

K-Means was performed on the data also with 4 centers and the results and the plot giving for each cluster the median of the samples in blue, and in grey the values of all samples in that cluster are shown below:

kmeans.cassava = clustering(ir.cassava.noreps, method = "kmeans", num.clusters = 4)
kmeans.plot(ir.cassava.noreps, kmeans.cassava)

plot of chunk kmeans

kmeans.df = kmeans.result.df(kmeans.cassava, 4)
kmeans.df
##   cluster                                 samples
## 1       1 BRA8_1 IAC_1 IAC3_1 IAC8_1 ORI8_1 SAN_1
## 2       2                           ORI3_1 ORI5_1
## 3       3                     BRA_1 BRA3_1 BRA5_1
## 4       4       IAC5_1 ORI_1 SAN3_1 SAN5_1 SAN8_1

PCA

Principal components analysis was performed on the data and some plots are shown below:

pca.analysis.result = pca.analysis.dataset(ir.cassava.noreps)

pca.pairs.plot(ir.cassava.noreps, "varieties", pca.analysis.result)

plot of chunk pca

pca.pairs.plot(ir.cassava.noreps, "ppds", pca.analysis.result)

plot of chunk pca

pca.screeplot(pca.analysis.result)

plot of chunk pca_2

pca.scoresplot2D(ir.cassava.noreps, pca.analysis.result, "varieties", ellipses = T)

plot of chunk pca_2

pca.scoresplot2D(ir.cassava.noreps, pca.analysis.result, "ppds", ellipses = T)

plot of chunk pca_2

pca.kmeans.plot2D(ir.cassava.noreps, pca.analysis.result, kmeans.result = kmeans.cassava, ellipses = T)

plot of chunk pca_2

MACHINE LEARNING

For classification models and prediction the following parameters were used: - models: PLS, SVM and Random Forests - validation method: repeated cross-validation - number of folds: 10 - number of repeats: 10

Below are some results with the best tune for each model:

ml.cassava.varieties = train.models.performance(ir.cassava.noreps, c("pls", "svmLinear", "rf"), "varieties", "repeatedcv", num.folds = 16, num.repeats = 10, tunelength = 10)
ml.cassava.varieties$performance
##           Accuracy Kappa AccuracySD KappaSD
## pls         0.5625     0     0.4976       0
## svmLinear   0.4375     0     0.4976       0
## rf          0.5000     0     0.5016       0
ml.cassava.ppds = train.models.performance(ir.cassava.noreps, c("pls",
"svmLinear", "rf"), "ppds", "repeatedcv", num.folds = 16, num.repeats = 10, tunelength = 10)
ml.cassava.ppds$performance
##           Accuracy Kappa AccuracySD KappaSD
## pls         0.5625     0     0.4976       0
## svmLinear   0.3125     0     0.4650       0
## rf          0.2625     0     0.4414       0

And the variable importance in varieties and ppds models:

summary.var.importance(ml.cassava.varieties, 10)
## 
## [1] "pls"
##                     BRA    IAC   ORI   SAN  Mean
## X2356.54186462239 18.76 100.00 72.40 9.590 50.19
## X2366.18582552223 18.19  96.32 69.97 9.095 48.39
## X2337.25394282271 16.13  76.23 55.70 7.656 38.93
## X2346.89790372255 14.88  67.02 48.82 6.951 34.42
## X2327.60998192287 14.97  65.84 48.33 6.869 34.00
## X2317.96602102303 12.45  43.37 32.15 5.235 23.30
## X977.455455945367 16.26  34.57 30.34 8.899 22.52
## X2375.82978642207 11.75  42.15 30.92 5.001 22.45
## X929.23565144617  10.80  37.81 30.18 5.787 21.14
## X919.591690546331 10.23  37.74 29.54 5.590 20.77
## 
## [1] "svmLinear"
##                   BRA IAC ORI SAN  Mean
## X3513.81717260311 100 100  75 100 93.75
## X3504.17321170327 100 100  75 100 93.75
## X3494.52925080343 100 100  75 100 93.75
## X3484.88528990359 100 100  75 100 93.75
## X3475.24132900375 100 100  75 100 93.75
## X3465.59736810391 100 100  75 100 93.75
## X3455.95340720407 100 100  75 100 93.75
## X3446.30944630423 100 100  75 100 93.75
## X3436.66548540439 100 100  75 100 93.75
## X3427.02152450455 100 100  75 100 93.75
## 
## [1] "rf"
##                   Overall   Mean
## X1739.32836703267  100.00 100.00
## X2366.18582552223   42.29  42.29
## X2327.60998192287   31.66  31.66
## X2337.25394282271   30.46  30.46
## X2356.54186462239   24.55  24.55
## X1729.68440613283   22.71  22.71
## X3465.59736810391   21.14  21.14
## X3446.30944630423   20.44  20.44
## X1652.53271893412   20.39  20.39
## X3407.73360270487   20.08  20.08
summary.var.importance(ml.cassava.ppds, 10)
## 
## [1] "pls"
##                       0      3     5     8  Mean
## X2356.54186462239 46.41 100.00 97.34 29.42 68.29
## X2366.18582552223 43.75  96.34 92.92 27.10 65.03
## X2337.25394282271 36.80  78.20 76.77 25.63 54.35
## X987.099416845206 36.15  69.81 65.93 28.97 50.22
## X1565.73707083556 20.40  34.98 58.62 83.32 49.33
## X996.743377745045 28.64  76.07 70.34 20.27 48.83
## X2327.60998192287 32.72  67.32 67.02 24.61 47.92
## X2346.89790372255 31.16  67.77 66.15 21.77 46.71
## X1613.95687533476 35.99  57.81 58.22 32.84 46.21
## X1575.3810317354  22.83  25.49 51.63 83.22 45.79
## 
## [1] "svmLinear"
##                   X0  X3  X5  X8 Mean
## X2366.18582552223 80 100 100 100   95
## X2356.54186462239 80 100 100 100   95
## X3996.01521759507 80 100  80 100   90
## X3986.37125669523 80 100  80 100   90
## X3976.72729579539 80 100  80 100   90
## X3967.08333489555 80 100  80 100   90
## X3957.43937399571 80 100  80 100   90
## X3947.79541309588 80 100  80 100   90
## X3938.15145219604 80 100  80 100   90
## X3928.5074912962  80 100  80 100   90
## 
## [1] "rf"
##                   Overall   Mean
## X2366.18582552223  100.00 100.00
## X2356.54186462239   67.77  67.77
## X1662.17667983396   42.16  42.16
## X2346.89790372255   38.50  38.50
## X2337.25394282271   37.44  37.44
## X2327.60998192287   28.94  28.94
## X967.811495045528   24.05  24.05
## X958.167534145688   23.86  23.86
## X401.228792179968   20.68  20.68
## X2317.96602102303   15.34  15.34

FEATURE SELECTION

Using recursive feature selection, various subsets were used with random forests classifier. The results are shown below:

feature.selection.result.varieties = feature.selection(ir.cassava.noreps, "varieties", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result.varieties
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          2    0.344 0.0517      0.388   0.337         
##          4    0.369 0.0575      0.396   0.342         
##          8    0.429 0.0746      0.412   0.406         
##         16    0.461 0.1243      0.423   0.404         
##         32    0.486 0.1231      0.424   0.424         
##         64    0.496 0.1427      0.431   0.441        *
##        374    0.496 0.1525      0.431   0.442         
## 
## The top 5 variables (out of 64):
##    X1633.24479713444, X1652.53271893412, X1613.95687533476, X3320.93795460632, X1604.31291443492
plot(feature.selection.result.varieties, type=c("g","o"))

plot of chunk rfe

feature.selection.result.ppds = feature.selection(ir.cassava.noreps, "ppds", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result.ppds
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy   Kappa AccuracySD KappaSD Selected
##          2    0.187 -0.0548      0.315   0.205         
##          4    0.163 -0.0758      0.316   0.219         
##          8    0.171 -0.0747      0.316   0.229         
##         16    0.187 -0.0446      0.315   0.193         
##         32    0.234 -0.0265      0.357   0.278         
##         64    0.202 -0.0208      0.322   0.226         
##        374    0.278  0.0217      0.347   0.291        *
## 
## The top 5 variables (out of 374):
##    X3754.91619509909, X3889.93164769684, X3783.84807779861, X2375.82978642207, X3996.01521759507
plot(feature.selection.result.ppds, type=c("g","o"))

plot of chunk rfe

Also selection by filter was used with the results shown below:

feature.selection.result2.varieties = feature.selection(ir.cassava.noreps, "varieties", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2.varieties
## 
## Selection By Filter
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance:
## 
##  Accuracy  Kappa AccuracySD KappaSD
##     0.316 0.0624      0.371   0.256
## 
## Using the training set, 81 variables were selected:
##    X3552.39301620246, X3542.74905530262, X3533.10509440279, X3523.46113350295, X3513.81717260311...
## 
## During resampling, the top 5 selected variables (out of a possible 374):
##    X3407.73360270487 (90.7%), X3369.15775910552 (88.4%), X3378.80172000536 (88.4%), X3388.4456809052 (88.4%), X3398.08964180503 (88.4%)
## 
## On average, 86.6 variables were selected (min = 0, max = 369)
feature.selection.result2.ppds = feature.selection(ir.cassava.noreps, "ppds", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2.ppds
## 
## Selection By Filter
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance:
## 
##  Accuracy   Kappa AccuracySD KappaSD
##         0 -0.0144          0  0.0662
## 
## No variables were selected from the training set.
## 
## During resampling, the top 4 selected variables (out of a possible 4):
##    X2356.54186462239 (7%), X2366.18582552223 (7%), X2337.25394282271 (4.7%), X2346.89790372255 (2.3%)
## 
## On average, 0.2 variables were selected (min = 0, max = 4)