NMR Propolis pipeline 3

Propolis peak list data was read and stored in a list, containing 2 elements, the dataset consisting in a list with the samples with their ppm intensities being the elements. The propolis metadata consists on the agroregions and regions.

setwd("~/Dropbox")
library(metabolomicsUM)
source("Datasets/Propolis/NMR/scripts/propolis_metadata.R")

prop.nmr.metadata.file = "Datasets/Propolis/NMR/metadata/metadata_propolis_agro.csv"
prop.nmr.data.folder = "Datasets/Propolis/NMR/data"

get.metadata.agro(prop.nmr.data.folder, write.file = TRUE, file.name = prop.nmr.metadata.file)
prop.nmr.metadata = read.metadata(prop.nmr.metadata.file)

peaks.lists = read.csvs.folder(prop.nmr.data.folder)

Agroregions metadata used, own grouping peaks algorithm used, removed peak groups with less than 25% of values, missing values imputation with low value, log transformation and autoscaling.

PREPROCESSING

Own grouping peaks algorithm used with step = 0.03:

# removing resonances in selected regions
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 0, 0.19)
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 3.29, 3.31)
peaks.lists = remove.peaks.interval.sample.list(peaks.lists, 4.85, 5)

#group peaks
prop.nmr.ds = group.peaks(peaks.lists, type = "nmr-peaks", metadata = prop.nmr.metadata, description = "NMR propolis", label.x = "ppm", label.values = "intensity")
sum.dataset(prop.nmr.ds)

## Dataset summary:
## Valid dataset
## Description:  NMR propolis 
## Type of data:  nmr-peaks 
## Number of samples:  59 
## Number of data points 293 
## Number of metadata variables:  2 
## Label of x-axis values:  ppm 
## Label of data points:  intensity 
## Number of missing values in data:  5376 
## Mean of data values:  0.09016594 
## Median of data values:  0.0287 
## Standard deviation:  0.1904829 
## Range of values:  0 10 
## Quantiles: 
##      0%     25%     50%     75%    100% 
##  0.0000  0.0081  0.0287  0.0929 10.0000

Peak groups with less than 25% of values were removed:

nsamps = num.samples(prop.nmr.ds)
prop.nmr.ds = remove.variables.by.nas(prop.nmr.ds,  0.75*nsamps)

There are 2659 missing values found in the dataset, which will be replaced with a low value (0.00005).

prop.nmr.na = missingvalues.imputation(prop.nmr.ds, method="value", value = 0.00005)

The data was transformed by log transformation and scaled by autoscaling:

prop.nmr.log = transform.data(prop.nmr.na, method="log")
prop.nmr.log.scal = scaling(prop.nmr.log, "auto")

UNIVARIATE TESTS

An analysis of variance (ANOVA) was conducted over the data with tukey test also, and this is the top 10 results ordered by p-value:

anova.prop.nmr.log.scal = aov.all.vars(prop.nmr.log.scal, "agroregions")
anova.prop.nmr.log.scal[1:20,]

##           pvalues     logs          fdr
## 2.11 3.778168e-06 5.422719 0.0009143166
## 5.17 9.140972e-06 5.039008 0.0011060577
## 2.2  3.759972e-05 4.424815 0.0023624243
## 5.42 4.314793e-05 4.365040 0.0023624243
## 2.86 4.881042e-05 4.311487 0.0023624243
## 2.98 2.108492e-04 3.676028 0.0060283467
## 1.49 2.134870e-04 3.670629 0.0060283467
## 2.5  2.232885e-04 3.651134 0.0060283467
## 1.73 2.241947e-04 3.649375 0.0060283467
## 1.87 3.294498e-04 3.482211 0.0079726846
## 6.13 3.854040e-04 3.414084 0.0084788889
## 6.17 5.532303e-04 3.257094 0.0106587346
## 5.62 5.725766e-04 3.242166 0.0106587346
## 0.56 6.991081e-04 3.155456 0.0120845832
## 9.51 8.282134e-04 3.081858 0.0130755406
## 7.09 8.644986e-04 3.063236 0.0130755406
## 0.66 9.664493e-04 3.014821 0.0137576897
## 2.02 1.550503e-03 2.809528 0.0208456451
## 1.18 1.811065e-03 2.742066 0.0215165417
## 5.88 1.860076e-03 2.730469 0.0215165417
##                                                  tukey
## 2.11                    Plain-Highlands; Plateau-Plain
## 5.17                    Plain-Highlands; Plateau-Plain
## 2.2                     Plain-Highlands; Plateau-Plain
## 5.42                    Plain-Highlands; Plateau-Plain
## 2.86                    Plain-Highlands; Plateau-Plain
## 2.98                    Plain-Highlands; Plateau-Plain
## 1.49                    Plain-Highlands; Plateau-Plain
## 2.5  Plain-Highlands; Plateau-Highlands; Plateau-Plain
## 1.73                    Plain-Highlands; Plateau-Plain
## 1.87                    Plain-Highlands; Plateau-Plain
## 6.13 Plain-Highlands; Plateau-Highlands; Plateau-Plain
## 6.17                    Plain-Highlands; Plateau-Plain
## 5.62                    Plain-Highlands; Plateau-Plain
## 0.56                    Plain-Highlands; Plateau-Plain
## 9.51                Plain-Highlands; Plateau-Highlands
## 7.09                    Plain-Highlands; Plateau-Plain
## 0.66                    Plain-Highlands; Plateau-Plain
## 2.02                    Plain-Highlands; Plateau-Plain
## 1.18                    Plain-Highlands; Plateau-Plain
## 5.88                    Plain-Highlands; Plateau-Plain

A heatmap with the correlations between all the variables is shown below:

correl.prop.nmr.log.scal = correlations.dataset(prop.nmr.log.scal, method = "pearson")
heatmap(correl.prop.nmr.log.scal, col =  topo.colors(256))

CLUSTERING

Hierarchical clustering with euclidean distance and complete method was performed on the data and the resulting dendrogram is shown below:

hc.prop.nmr.log.scal = clustering(prop.nmr.log.scal, method = "hc", distance = "euclidean")
dendrogram.plot.col(prop.nmr.log.scal, hc.prop.nmr.log.scal, "agroregions")

K-Means was performed on the data also with 4 centers and the results and the plot giving for each cluster the median of the samples in blue, and in grey the values of all samples in that cluster are shown below:

kmeans.prop.nmr.log.scal = clustering(prop.nmr.log.scal, method = "kmeans", num.clusters = 4)
kmeans.plot(prop.nmr.log.scal, kmeans.prop.nmr.log.scal)

kmeans.df = kmeans.result.df(kmeans.prop.nmr.log.scal, 4)
kmeans.df

##   cluster
## 1       1
## 2       2
## 3       3
## 4       4
##                                                                                                                     samples
## 1         AC_au DC_au JB_au SA_au SJ_au UR_au VR_au DC_sm SA_sm DC_sp FP_sp JB_sp SA_sp SJ_sp UR_sp VR_sp AN_wi SJ_wi UR_wi
## 2 BR_au CE_au CN_au IT_au SJC_au XX_au PU_sm XX_sm BR_sp CE_sp CN_sp IT_sp PU_sp SJC_sp BR_wi CE_wi CN_wi PU_wi SA_wi XX_wi
## 3                                AN_au PU_au AC_sm AN_sm BR_sm CE_sm CN_sm FP_sm IT_sm JB_sm SJC_sm SJ_sm UR_sm VR_sm AN_sp
## 4                                                                                             AC_sp AC_wi DC_wi FP_wi JB_wi

PCA

Principal components analysis was performed on the data and some plots are shown below:

pca.analysis.result = pca.analysis.dataset(prop.nmr.log.scal)

pca.pairs.plot(prop.nmr.log.scal, pca.analysis.result, "agroregions")

pca.screeplot(pca.analysis.result)

pca.scoresplot2D(prop.nmr.log.scal, pca.analysis.result, "agroregions", ellipses = T)

pca.kmeans.plot2D(prop.nmr.log.scal, pca.analysis.result, kmeans.result = kmeans.prop.nmr.log.scal, ellipses = T)

MACHINE LEARNING

For classification models and prediction the following parameters were used: - models: PLS, J48, JRip, SVM and Random Forests - validation method: repeated cross-validation - number of folds: 5 - number of repeats: 10

Below are some results with the best tune for each model:

ml.prop.nmr = train.models.performance(prop.nmr.log.scal, c("pls", "J48", "JRip", "svmLinear", "rf"), "agroregions", "repeatedcv", num.folds = 10, num.repeats = 10, tunelength = 20, metric = "ROC")
ml.prop.nmr$performance

##            Accuracy     Kappa Sensitivity Specificity       ROC AccuracySD
## pls       0.7720000 0.5297329   0.6777778   0.8369444 0.8256389  0.1487945
## J48       0.5981310 0.2369943   0.5080556   0.7438889 0.6286412  0.1692889
## JRip      0.5577143 0.1354773   0.4463889   0.7038889 0.6084329  0.1683469
## svmLinear 0.6443333 0.2086367   0.4644444   0.7301667 0.8070556  0.1529102
## rf        0.7461786 0.4294667   0.6163889   0.8002222 0.8045417  0.1479118
##             KappaSD SensitivitySD SpecificitySD     ROCSD
## pls       0.3235259     0.2208501     0.1097571 0.1863722
## J48       0.3167571     0.2204013     0.1072811 0.1911771
## JRip      0.3016261     0.1905885     0.1030871 0.1479537
## svmLinear 0.3234703     0.2036302     0.1047841 0.1582748
## rf        0.3686263     0.2410544     0.1215852 0.1700060

The full result of the performance of pls:

ml.prop.nmr$full.results$pls[,c("ncomp","Accuracy","Kappa","Sensitivity","Specificity","ROC","AccuracySD","KappaSD","SensitivitySD","SpecificitySD","ROCSD")]

##    ncomp  Accuracy     Kappa Sensitivity Specificity       ROC AccuracySD
## 1      1 0.6524286 0.1758946   0.4461111   0.7152222 0.7358148  0.1177438
## 2      2 0.6980714 0.3830767   0.5950000   0.7893889 0.7503519  0.1768781
## 3      3 0.7558571 0.4833418   0.6541667   0.8178889 0.8200463  0.1594352
## 4      4 0.7767024 0.5296972   0.6811111   0.8343333 0.8117824  0.1593143
## 5      5 0.7720000 0.5297329   0.6777778   0.8369444 0.8256389  0.1487945
## 6      6 0.7611429 0.5143948   0.6677778   0.8363889 0.8107870  0.1612733
## 7      7 0.7207619 0.4251518   0.6086111   0.8049444 0.8081759  0.1593284
## 8      8 0.6907143 0.3889779   0.5936111   0.7928333 0.7847778  0.1683903
## 9      9 0.6655119 0.3495750   0.5769444   0.7798889 0.7847454  0.1818755
## 10    10 0.6563571 0.3348202   0.5655556   0.7748889 0.7789444  0.1838657
## 11    11 0.6552619 0.3271642   0.5627778   0.7715556 0.7853009  0.1769049
## 12    12 0.6481667 0.3164771   0.5627778   0.7666667 0.7849167  0.1733165
## 13    13 0.6507381 0.3221489   0.5613889   0.7685000 0.7844259  0.1754903
## 14    14 0.6521667 0.3286804   0.5675000   0.7697778 0.7834815  0.1808382
## 15    15 0.6513571 0.3272461   0.5658333   0.7695556 0.7859630  0.1830154
## 16    16 0.6551190 0.3348398   0.5738889   0.7714444 0.7847407  0.1797120
## 17    17 0.6444524 0.3186429   0.5652778   0.7665000 0.7826389  0.1817485
## 18    18 0.6464524 0.3265364   0.5711111   0.7686111 0.7815278  0.1808086
## 19    19 0.6408810 0.3155650   0.5661111   0.7656667 0.7795556  0.1794319
## 20    20 0.6375476 0.3086302   0.5594444   0.7641667 0.7780556  0.1804774
##      KappaSD SensitivitySD SpecificitySD     ROCSD
## 1  0.2783756     0.1619141    0.08055263 0.1037939
## 2  0.3616482     0.2360386    0.12132642 0.1893380
## 3  0.3509152     0.2359460    0.11787383 0.1699823
## 4  0.3489976     0.2385774    0.11769596 0.1980044
## 5  0.3235259     0.2208501    0.10975711 0.1863722
## 6  0.3386702     0.2381668    0.11386699 0.1772910
## 7  0.3463269     0.2338463    0.11525087 0.1675483
## 8  0.3385686     0.2293217    0.11489162 0.1697134
## 9  0.3536320     0.2411242    0.11992739 0.1631684
## 10 0.3511904     0.2377606    0.11783388 0.1620538
## 11 0.3476358     0.2336971    0.11675423 0.1549021
## 12 0.3372807     0.2306760    0.11291466 0.1594823
## 13 0.3350715     0.2278775    0.11134111 0.1530157
## 14 0.3405732     0.2298174    0.11381207 0.1483722
## 15 0.3411991     0.2346303    0.11419535 0.1517615
## 16 0.3368699     0.2302905    0.11337761 0.1518426
## 17 0.3365839     0.2298893    0.11337881 0.1539674
## 18 0.3339047     0.2272619    0.11263925 0.1559761
## 19 0.3336176     0.2284747    0.11314739 0.1580363
## 20 0.3344078     0.2311284    0.11344182 0.1571108

Also the confusion matrices and a plot using the first 3 PCs, showing the separation of the classes (agroregions) are shown below:

ml.prop.nmr$confusion.matrices

## $pls
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##            Reference
## Prediction  Highlands Plain Plateau
##   Highlands       9.9   1.4     2.6
##   Plain           0.0  11.8     3.1
##   Plateau        10.3   5.5    55.5
## 
## 
## $J48
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##            Reference
## Prediction  Highlands Plain Plateau
##   Highlands       9.6   2.6     9.7
##   Plain           1.0   5.8     7.0
##   Plateau         9.7  10.1    44.4
## 
## 
## $JRip
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##            Reference
## Prediction  Highlands Plain Plateau
##   Highlands       4.1   0.5     5.8
##   Plain           1.2   7.7    11.3
##   Plateau        14.9  10.5    44.0
## 
## 
## $svmLinear
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##            Reference
## Prediction  Highlands Plain Plateau
##   Highlands       3.8   0.0     3.7
##   Plain           1.6   6.0     2.9
##   Plateau        14.7  12.7    54.6
## 
## 
## $rf
## Cross-Validated (10 fold, repeated 10 times) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##            Reference
## Prediction  Highlands Plain Plateau
##   Highlands      10.2   0.0     1.4
##   Plain           0.0   8.0     3.4
##   Plateau        10.0  10.6    56.5

pls.model = ml.prop.nmr$final.models$pls
pca.plot.3d(prop.nmr.log.scal, pls.model, "agroregions")

And the variable importance in the four season classes for all models:

summary.var.importance(ml.prop.nmr, 10)

## $pls
##      Highlands    Plain   Plateau     Mean
## 7.5   75.48420 51.23543  97.01789 74.57917
## 5.31  74.40055 26.65833 100.00000 67.01962
## 1.15  64.29241 49.76154  78.72362 64.25919
## 3.01  69.12025 38.40697  85.08916 64.20546
## 9.51  74.19247 21.87819  95.78411 63.95159
## 1.73  54.09655 76.29485  60.50966 63.63369
## 2.65  80.13652 23.21344  87.22189 63.52395
## 2.86  54.33817 54.78950  75.70554 61.61107
## 0.48  65.05210 36.16675  78.21459 59.81115
## 3.19  75.91522 22.95475  77.32528 58.73175
## 
## $J48
##      Highlands     Plain  Plateau     Mean
## 2.11 100.00000 100.00000 88.92950 96.30983
## 5.17  96.24021  96.24021 85.16971 92.55004
## 1.87 100.00000 100.00000 77.44125 92.48042
## 1.9   97.49347  97.49347 72.56745 89.18480
## 5.28  94.98695  94.98695 77.44125 89.13838
## 5.42  96.24021  96.24021 74.72585 89.06876
## 5.62  96.24021  96.24021 73.05483 88.51175
## 2.2   93.73368  93.73368 78.06789 88.51175
## 2.86  75.56136  91.64491 91.64491 86.28372
## 0.66  86.21410  86.21410 86.21410 86.21410
## 
## $JRip
##      Overall Mean
## 2.11     100  100
## 3.22     100  100
## 3.9      100  100
## 3.93     100  100
## 0.27       0    0
## 0.31       0    0
## 0.34       0    0
## 0.41       0    0
## 0.44       0    0
## 0.48       0    0
## 
## $svmLinear
##      Highlands     Plain  Plateau     Mean
## 2.11 100.00000 100.00000 88.92950 96.30983
## 5.17  96.24021  96.24021 85.16971 92.55004
## 1.87 100.00000 100.00000 77.44125 92.48042
## 1.9   97.49347  97.49347 72.56745 89.18480
## 5.28  94.98695  94.98695 77.44125 89.13838
## 5.42  96.24021  96.24021 74.72585 89.06876
## 5.62  96.24021  96.24021 73.05483 88.51175
## 2.2   93.73368  93.73368 78.06789 88.51175
## 2.86  75.56136  91.64491 91.64491 86.28372
## 0.66  86.21410  86.21410 86.21410 86.21410
## 
## $rf
##        Overall      Mean
## 2.86 100.00000 100.00000
## 2.11  88.35864  88.35864
## 5.17  75.94536  75.94536
## 5.31  67.34051  67.34051
## 6.13  65.52832  65.52832
## 6.17  63.51306  63.51306
## 2.62  55.25098  55.25098
## 5.21  55.20977  55.20977
## 0.69  53.09080  53.09080
## 6.75  52.87027  52.87027

FEATURE SELECTION

Using recursive feature selection, various subsets were used with random forests classifier. The results are shown below:

feature.selection.result = feature.selection(prop.nmr.log.scal, "agroregions", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result

## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          2   0.5561 0.1481     0.2051  0.3479         
##          4   0.5732 0.1667     0.1683  0.3144         
##          8   0.6635 0.3122     0.1681  0.3505         
##         16   0.6877 0.3594     0.1678  0.3524         
##         32   0.7243 0.4178     0.1589  0.3619         
##         64   0.7385 0.4395     0.1638  0.3729        *
##        242   0.7310 0.4259     0.1649  0.3703         
## 
## The top 5 variables (out of 64):
##    X2.11, X6.17, X2.65, X5.31, X6.13

plot(feature.selection.result, type=c("g","o"))

Also selection by filter was used with the results shown below:

feature.selection.result2 = feature.selection(prop.nmr.log.scal, "agroregions", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2

## 
## Selection By Filter
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance:
## 
##  Accuracy  Kappa AccuracySD KappaSD
##    0.7361 0.4187     0.1578  0.3773
## 
## Using the training set, 70 variables were selected:
##    X0.5, X0.56, X0.61, X0.66, X0.99...
## 
## During resampling, the top 5 selected variables (out of a possible 127):
##    X0.5 (100%), X0.56 (100%), X0.61 (100%), X0.66 (100%), X1.18 (100%)
## 
## On average, 65.6 variables were selected (min = 46, max = 90)