The data are read from .dx files and converted to a dataset list. The metadata variable “ppds”, originally numeric, is also converted to a factor.
setwd("D:/Dropbox/Metabolomics-package")
source("scripts/init.R")
source("Datasets/CassavaPPD/IR/scripts/ir-cassavappd-metadata.R")
ir.cassppd.data.folder = "Datasets/CassavaPPD/IR/data"
ir.cassppd.metadata.file = "Datasets/CassavaPPD/IR/metadata/ir-cassavappd-metadata.csv"
get.metadata.ir.cassavappd(ir.cassppd.data.folder, ir.cassppd.metadata.file)
ir.cassava.ppd = read.dataset.dx(ir.cassppd.data.folder, ir.cassppd.metadata.file,
type = "ir-spectra", description = "IR for Cassava - PPD",
label.x = "Wavenumber", label.values = "transmittance")
ir.cassava.ppd = convert.to.factor(ir.cassava.ppd, "ppds")
index.vector = get.indexes.replicates(ir.cassava.ppd)
ir.cassava.noreps = aggregate.samples(ir.cassava.ppd, index.vector, aggreg.fn = "mean", meta.to.remove = "replicates")
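The replicate aggregation step can be sketched in base R. This is an illustrative reconstruction, not the aggregate.samples implementation; it assumes spectra are stored as a data-points × samples matrix and that replicates of the same sample share a group index:

```r
# Illustrative sketch (base R, not the package internals): average
# replicate spectra that share the same group index, using the mean.
set.seed(1)
spectra <- matrix(rnorm(12), nrow = 3,
                  dimnames = list(NULL, c("s1_r1", "s1_r2", "s2_r1", "s2_r2")))
groups <- c(1, 1, 2, 2)  # one group index per sample column
averaged <- sapply(split(seq_along(groups), groups),
                   function(cols) rowMeans(spectra[, cols, drop = FALSE]))
dim(averaged)  # 3 data points x 2 aggregated samples
```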
To reduce the number of features, so that the analysis algorithms run faster, smoothing interpolation was applied.
sum.dataset(ir.cassava.noreps)
## Dataset summary:
## Valid dataset
## Description: IR for Cassava - PPD
## Type of data: ir-spectra
## Number of samples: 16
## Number of data points 3735
## Number of metadata variables: 2
## Label of x-axis values: Wavenumber
## Label of data points: transmittance
## Number of missing values in data: 0
## Mean of data values: 0.5943
## Median of data values: 0.6148
## Standard deviation: 0.1519
## Range of values: 0.1129 0.828
## Quantiles:
## 0% 25% 50% 75% 100%
## 0.1129 0.4897 0.6148 0.7226 0.8280
ir.cassava.noreps = smoothing.interpolation(ir.cassava.noreps, "bin", 10)
## Warning: Last data point averages only 5 points.
sum.dataset(ir.cassava.noreps)
## Dataset summary:
## Valid dataset
## Description: IR for Cassava - PPD-smoothed with hyperSpec spc.bin
## Type of data: undefined
## Number of samples: 16
## Number of data points 374
## Number of metadata variables: 2
## Label of x-axis values: Wavenumber
## Label of data points: transmittance
## Number of missing values in data: 0
## Mean of data values: 0.5942
## Median of data values: 0.6148
## Standard deviation: 0.1518
## Range of values: 0.1135 0.8272
## Quantiles:
## 0% 25% 50% 75% 100%
## 0.1135 0.4897 0.6148 0.7227 0.8272
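The effect of the binning can be reproduced in base R. A minimal sketch, assuming bin size 10 over the 3735 original points; it also explains the warning, since 3735 = 373 × 10 + 5, so the last bin has only 5 points:

```r
# Sketch of bin-based smoothing: average consecutive groups of 10 points.
# The last, incomplete bin averages only the remaining 5 points, which is
# what the warning reports.
x <- rnorm(3735)               # stand-in for one spectrum
bin.size <- 10
bins <- ceiling(seq_along(x) / bin.size)
binned <- tapply(x, bins, mean)
length(binned)  # 374, matching the summary after smoothing
```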
No further preprocessing was applied.
UNIVARIATE ANALYSIS
An analysis of variance (ANOVA), followed by Tukey's HSD post-hoc test, was conducted on the data; the top 10 results ordered by p-value are shown below:
anova.cassava.varieties = univariate.analysis(ir.cassava.noreps, type = "anova", "varieties")
anova.cassava.varieties[1:10,]
## pvalues logs fdr tukey
## 3388.4456809052 0.01469 1.833 0.1183 SAN-BRA
## 3398.08964180503 0.01472 1.832 0.1183 SAN-BRA
## 3407.73360270487 0.01473 1.832 0.1183 SAN-BRA
## 3417.37756360471 0.01486 1.828 0.1183 SAN-BRA
## 3378.80172000536 0.01489 1.827 0.1183 SAN-BRA
## 3427.02152450455 0.01513 1.820 0.1183 SAN-BRA
## 3369.15775910552 0.01521 1.818 0.1183 SAN-BRA
## 3359.51379820568 0.01566 1.805 0.1183 SAN-BRA
## 3436.66548540439 0.01568 1.805 0.1183 SAN-BRA
## 3349.86983730584 0.01616 1.792 0.1183 SAN-BRA
anova.cassava.ppd = univariate.analysis(ir.cassava.noreps, type = "anova", "ppds")
anova.cassava.ppd[1:10,]
## pvalues logs fdr tukey
## 2356.54186462239 0.1929 0.7147 0.8992
## 2366.18582552223 0.2031 0.6922 0.8992
## 2337.25394282271 0.2656 0.5757 0.8992
## 2327.60998192287 0.3256 0.4873 0.8992
## 2346.89790372255 0.3286 0.4833 0.8992
## 2317.96602102303 0.4691 0.3287 0.8992
## 2375.82978642207 0.5033 0.2982 0.8992
## 967.811495045528 0.5375 0.2696 0.8992
## 977.455455945367 0.5470 0.2620 0.8992
## 2308.32206012319 0.5805 0.2362 0.8992
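Per feature, the computation behind these tables can be sketched with base R's aov, TukeyHSD and p.adjust. An illustrative sketch on random data; in the real analysis, the FDR correction is applied jointly across all features:

```r
# Sketch of a per-feature one-way ANOVA with Tukey HSD (base R, random data):
set.seed(1)
values <- rnorm(16)                                 # one feature, 16 samples
group  <- factor(rep(c("BRA", "IAC", "ORI", "SAN"), each = 4))
fit    <- aov(values ~ group)
pval   <- summary(fit)[[1]][["Pr(>F)"]][1]          # the "pvalues" column
logp   <- -log10(pval)                              # the "logs" column
tukey  <- TukeyHSD(fit)$group                       # pairwise comparisons
# Across many features, the p-values would be adjusted together:
fdr    <- p.adjust(pval, method = "fdr")
```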
CLUSTERING
Hierarchical clustering with Euclidean distance and complete linkage was performed on the data; the resulting dendrograms, colored by “ppds” and by “varieties”, are shown below:
hc.cassava = clustering(ir.cassava.noreps, method = "hc", distance = "euclidean")
dendrogram.plot.col(ir.cassava.noreps, hc.cassava, "ppds")
dendrogram.plot.col(ir.cassava.noreps, hc.cassava, "varieties")
K-means clustering was also performed on the data, with 4 centers. The plot below shows, for each cluster, the median of its samples in blue and the values of all samples in that cluster in grey:
kmeans.cassava = clustering(ir.cassava.noreps, method = "kmeans", num.clusters = 4)
kmeans.plot(ir.cassava.noreps, kmeans.cassava)
kmeans.df = kmeans.result.df(kmeans.cassava, 4)
kmeans.df
## cluster samples
## 1 1 BRA8_1 IAC_1 IAC3_1 IAC8_1 ORI8_1 SAN_1
## 2 2 ORI3_1 ORI5_1
## 3 3 BRA_1 BRA3_1 BRA5_1
## 4 4 IAC5_1 ORI_1 SAN3_1 SAN5_1 SAN8_1
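The per-cluster median profile drawn in blue by kmeans.plot can be computed as follows; a sketch on random stand-in data, assuming samples as rows of a numeric matrix:

```r
# Sketch: k-means followed by the per-cluster median profile (the blue
# line in the plot), on random stand-in data with samples as rows.
set.seed(2)
data <- matrix(rnorm(16 * 374), nrow = 16)    # 16 samples x 374 points
km <- kmeans(data, centers = 4)
medians <- t(sapply(split(seq_len(nrow(data)), km$cluster),
                    function(rows) apply(data[rows, , drop = FALSE], 2, median)))
dim(medians)  # 4 clusters x 374 data points
```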
PCA
Principal component analysis (PCA) was performed on the data; some of the resulting plots are shown below:
pca.analysis.result = pca.analysis.dataset(ir.cassava.noreps)
pca.pairs.plot(ir.cassava.noreps, "varieties", pca.analysis.result)
pca.pairs.plot(ir.cassava.noreps, "ppds", pca.analysis.result)
pca.screeplot(pca.analysis.result)
pca.scoresplot2D(ir.cassava.noreps, pca.analysis.result, "varieties", ellipses = T)
pca.scoresplot2D(ir.cassava.noreps, pca.analysis.result, "ppds", ellipses = T)
pca.kmeans.plot2D(ir.cassava.noreps, pca.analysis.result, kmeans.result = kmeans.cassava, ellipses = T)
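Underneath, the PCA amounts to base R's prcomp on the centered data matrix. A minimal sketch of what the scree plot and the 2D scores plots display, on random stand-in data:

```r
# Sketch of the PCA computation on random stand-in data (samples as rows):
set.seed(3)
data <- matrix(rnorm(16 * 374), nrow = 16)
pca <- prcomp(data, center = TRUE, scale. = FALSE)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)  # per-PC variance (scree plot)
scores <- pca$x[, 1:2]                         # PC1 vs PC2 (2D scores plot)
```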
MACHINE LEARNING
For training and evaluating classification models, the following parameters were used:
- models: PLS, SVM (linear kernel) and random forests
- validation method: repeated cross-validation
- number of folds: 16
- number of repeats: 10
Below are the results with the best tuning parameters for each model:
ml.cassava.varieties = train.models.performance(ir.cassava.noreps, c("pls", "svmLinear", "rf"), "varieties", "repeatedcv", num.folds = 16, num.repeats = 10, tunelength = 10)
ml.cassava.varieties$performance
## Accuracy Kappa AccuracySD KappaSD
## pls 0.5625 0 0.4976 0
## svmLinear 0.4375 0 0.4976 0
## rf 0.5000 0 0.5016 0
ml.cassava.ppds = train.models.performance(ir.cassava.noreps, c("pls", "svmLinear", "rf"), "ppds", "repeatedcv", num.folds = 16, num.repeats = 10, tunelength = 10)
ml.cassava.ppds$performance
## Accuracy Kappa AccuracySD KappaSD
## pls 0.5625 0 0.4976 0
## svmLinear 0.3125 0 0.4650 0
## rf 0.2625 0 0.4414 0
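The "repeatedcv" validation can be sketched in base R: the samples are partitioned into folds, each fold serves once as the test set, and the whole procedure is repeated with new random partitions. An illustrative sketch, not the package internals; note that 16 folds on 16 samples amounts to leave-one-out within each repeat:

```r
# Sketch of repeated cross-validation indexing: k folds, r repeats.
make.folds <- function(n, k)
  split(sample(seq_len(n)), rep(seq_len(k), length.out = n))
set.seed(4)
folds <- replicate(10, make.folds(16, 16), simplify = FALSE)  # 10 repeats
length(folds)                 # 10 repeats of 16 folds
sum(lengths(folds[[1]]))      # each repeat partitions all 16 samples
```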
The variable importance for the varieties and ppds models is shown below:
summary.var.importance(ml.cassava.varieties, 10)
##
## [1] "pls"
## BRA IAC ORI SAN Mean
## X2356.54186462239 18.76 100.00 72.40 9.590 50.19
## X2366.18582552223 18.19 96.32 69.97 9.095 48.39
## X2337.25394282271 16.13 76.23 55.70 7.656 38.93
## X2346.89790372255 14.88 67.02 48.82 6.951 34.42
## X2327.60998192287 14.97 65.84 48.33 6.869 34.00
## X2317.96602102303 12.45 43.37 32.15 5.235 23.30
## X977.455455945367 16.26 34.57 30.34 8.899 22.52
## X2375.82978642207 11.75 42.15 30.92 5.001 22.45
## X929.23565144617 10.80 37.81 30.18 5.787 21.14
## X919.591690546331 10.23 37.74 29.54 5.590 20.77
##
## [1] "svmLinear"
## BRA IAC ORI SAN Mean
## X3513.81717260311 100 100 75 100 93.75
## X3504.17321170327 100 100 75 100 93.75
## X3494.52925080343 100 100 75 100 93.75
## X3484.88528990359 100 100 75 100 93.75
## X3475.24132900375 100 100 75 100 93.75
## X3465.59736810391 100 100 75 100 93.75
## X3455.95340720407 100 100 75 100 93.75
## X3446.30944630423 100 100 75 100 93.75
## X3436.66548540439 100 100 75 100 93.75
## X3427.02152450455 100 100 75 100 93.75
##
## [1] "rf"
## Overall Mean
## X1739.32836703267 100.00 100.00
## X2366.18582552223 42.29 42.29
## X2327.60998192287 31.66 31.66
## X2337.25394282271 30.46 30.46
## X2356.54186462239 24.55 24.55
## X1729.68440613283 22.71 22.71
## X3465.59736810391 21.14 21.14
## X3446.30944630423 20.44 20.44
## X1652.53271893412 20.39 20.39
## X3407.73360270487 20.08 20.08
summary.var.importance(ml.cassava.ppds, 10)
##
## [1] "pls"
## 0 3 5 8 Mean
## X2356.54186462239 46.41 100.00 97.34 29.42 68.29
## X2366.18582552223 43.75 96.34 92.92 27.10 65.03
## X2337.25394282271 36.80 78.20 76.77 25.63 54.35
## X987.099416845206 36.15 69.81 65.93 28.97 50.22
## X1565.73707083556 20.40 34.98 58.62 83.32 49.33
## X996.743377745045 28.64 76.07 70.34 20.27 48.83
## X2327.60998192287 32.72 67.32 67.02 24.61 47.92
## X2346.89790372255 31.16 67.77 66.15 21.77 46.71
## X1613.95687533476 35.99 57.81 58.22 32.84 46.21
## X1575.3810317354 22.83 25.49 51.63 83.22 45.79
##
## [1] "svmLinear"
## X0 X3 X5 X8 Mean
## X2366.18582552223 80 100 100 100 95
## X2356.54186462239 80 100 100 100 95
## X3996.01521759507 80 100 80 100 90
## X3986.37125669523 80 100 80 100 90
## X3976.72729579539 80 100 80 100 90
## X3967.08333489555 80 100 80 100 90
## X3957.43937399571 80 100 80 100 90
## X3947.79541309588 80 100 80 100 90
## X3938.15145219604 80 100 80 100 90
## X3928.5074912962 80 100 80 100 90
##
## [1] "rf"
## Overall Mean
## X2366.18582552223 100.00 100.00
## X2356.54186462239 67.77 67.77
## X1662.17667983396 42.16 42.16
## X2346.89790372255 38.50 38.50
## X2337.25394282271 37.44 37.44
## X2327.60998192287 28.94 28.94
## X967.811495045528 24.05 24.05
## X958.167534145688 23.86 23.86
## X401.228792179968 20.68 20.68
## X2317.96602102303 15.34 15.34
FEATURE SELECTION
Recursive feature elimination was performed with a random forest classifier over various subset sizes. The results are shown below:
feature.selection.result.varieties = feature.selection(ir.cassava.noreps, "varieties", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result.varieties
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 2 0.344 0.0517 0.388 0.337
## 4 0.369 0.0575 0.396 0.342
## 8 0.429 0.0746 0.412 0.406
## 16 0.461 0.1243 0.423 0.404
## 32 0.486 0.1231 0.424 0.424
## 64 0.496 0.1427 0.431 0.441 *
## 374 0.496 0.1525 0.431 0.442
##
## The top 5 variables (out of 64):
## X1633.24479713444, X1652.53271893412, X1613.95687533476, X3320.93795460632, X1604.31291443492
plot(feature.selection.result.varieties, type=c("g","o"))
feature.selection.result.ppds = feature.selection(ir.cassava.noreps, "ppds", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result.ppds
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 2 0.187 -0.0548 0.315 0.205
## 4 0.163 -0.0758 0.316 0.219
## 8 0.171 -0.0747 0.316 0.229
## 16 0.187 -0.0446 0.315 0.193
## 32 0.234 -0.0265 0.357 0.278
## 64 0.202 -0.0208 0.322 0.226
## 374 0.278 0.0217 0.347 0.291 *
##
## The top 5 variables (out of 374):
## X3754.91619509909, X3889.93164769684, X3783.84807779861, X2375.82978642207, X3996.01521759507
plot(feature.selection.result.ppds, type=c("g","o"))
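Conceptually, recursive feature elimination re-ranks the remaining features at each step and keeps only the strongest, working down through the subset sizes. A toy sketch, with a correlation-based ranking standing in for the random forest importance used above:

```r
# Toy RFE sketch: rank remaining features, keep the top 'size', repeat.
# Correlation with a numeric response stands in for RF importance here.
set.seed(5)
X <- matrix(rnorm(16 * 64), nrow = 16)
y <- rnorm(16)
keep <- seq_len(ncol(X))
for (size in rev(2^(1:6))) {                 # 64, 32, 16, 8, 4, 2
  ranks <- abs(cor(X[, keep, drop = FALSE], y))
  keep  <- keep[order(ranks, decreasing = TRUE)][seq_len(size)]
}
length(keep)  # 2 features survive the final elimination step
```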
Selection by filter was also applied, with the results shown below:
feature.selection.result2.varieties = feature.selection(ir.cassava.noreps, "varieties", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2.varieties
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## Accuracy Kappa AccuracySD KappaSD
## 0.316 0.0624 0.371 0.256
##
## Using the training set, 81 variables were selected:
## X3552.39301620246, X3542.74905530262, X3533.10509440279, X3523.46113350295, X3513.81717260311...
##
## During resampling, the top 5 selected variables (out of a possible 374):
## X3407.73360270487 (90.7%), X3369.15775910552 (88.4%), X3378.80172000536 (88.4%), X3388.4456809052 (88.4%), X3398.08964180503 (88.4%)
##
## On average, 86.6 variables were selected (min = 0, max = 369)
feature.selection.result2.ppds = feature.selection(ir.cassava.noreps, "ppds", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2.ppds
##
## Selection By Filter
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance:
##
## Accuracy Kappa AccuracySD KappaSD
## 0 -0.0144 0 0.0662
##
## No variables were selected from the training set.
##
## During resampling, the top 4 selected variables (out of a possible 4):
## X2356.54186462239 (7%), X2366.18582552223 (7%), X2337.25394282271 (4.7%), X2346.89790372255 (2.3%)
##
## On average, 0.2 variables were selected (min = 0, max = 4)
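Selection by filter, in contrast to RFE, scores each feature once with a univariate test and keeps those passing a threshold before fitting the model. A sketch using a per-feature ANOVA p-value filter on random stand-in data:

```r
# Sketch of selection by filter: univariate ANOVA p-value per feature,
# keeping features below a significance threshold (random stand-in data).
set.seed(6)
X <- matrix(rnorm(16 * 50), nrow = 16)                # 16 samples x 50 features
group <- factor(rep(c("0", "3", "5", "8"), each = 4)) # the "ppds" levels
pvals <- apply(X, 2, function(f)
  summary(aov(f ~ group))[[1]][["Pr(>F)"]][1])
selected <- which(pvals < 0.05)  # with random data, few or none pass
```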