Propolis UV pipeline

setwd("~/Dropbox")
library(metabolomicsUM)
source("Datasets/Propolis/UVV/scripts/propolis_uv_metadata.R")

# reading data and metadata
uv.prop.metadata.file = "Datasets/Propolis/UVV/metadata/prop_uv_metadata.csv"
uv.prop.data.file = "Datasets/Propolis/UVV/data/uvv.csv"

get.metadata.uv(uv.prop.data.file, write.file = TRUE, file.name= uv.prop.metadata.file)

# need to check labels of x and values
label.x = "wavelength(nm)"
label.val = "absorbance"
uv.prop.ds = read.dataset.csv(uv.prop.data.file, uv.prop.metadata.file, description = "UV data for propolis", type = "uvv-spectra", label.x = label.x, label.values = label.val)
uv.prop.ds$metadata$years = factor(uv.prop.ds$metadata$years)
sum.dataset(uv.prop.ds)
## Dataset summary:
## Valid dataset
## Description:  UV data for propolis 
## Type of data:  uvv-spectra 
## Number of samples:  136 
## Number of data points 501 
## Number of metadata variables:  3 
## Label of x-axis values:  wavelength(nm) 
## Label of data points:  absorbance 
## Number of missing values in data:  0 
## Mean of data values:  71.93233 
## Median of data values:  0.3097 
## Standard deviation:  18529.24 
## Range of values:  1e-04 4836665 
## Quantiles: 
##           0%          25%          50%          75%         100% 
## 1.000000e-04 1.035000e-01 3.097000e-01 1.009725e+00 4.836665e+06

The regions metadata variable is used as the class label throughout; no normalization or correction was applied.

PREPROCESSING

Plotting the full spectra shows an outlier between wavelengths 500 and 600 nm (sample ACpri11, at 529 nm). Samples without a regions label are removed first; the outlying value is then set to missing and replaced by linear interpolation.

# remove samples that have no regions label
no.regions = which(is.na(uv.prop.ds$metadata$regions))
uv.prop.ds = remove.samples(uv.prop.ds, no.regions)
plot.spectra(uv.prop.ds, "regions")

# mark the outlying absorbance (sample ACpri11, wavelength 529) as missing
# and fill it by linear interpolation
uv.prop.ds = replace.data.value(uv.prop.ds, 529, "ACpri11", NA)
uv.prop.ds = missingvalues.imputation(uv.prop.ds, method = "linapprox")
plot.spectra(uv.prop.ds, "regions")
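
For intuition, the linapprox imputation presumably does, per spectrum, what base R's approx does: it fills each missing value linearly from its neighbouring wavelengths. A minimal sketch with made-up numbers, not the package implementation:

# sketch: fill one NA in a spectrum by linear interpolation (base R approx)
wavelengths = 200:210
spectrum = c(0.31, 0.33, 0.34, NA, 0.38, 0.40, 0.41, 0.43, 0.44, 0.46, 0.47)
ok = !is.na(spectrum)
filled = approx(wavelengths[ok], spectrum[ok], xout = wavelengths)$y
filled[4]   # interpolated between 0.34 and 0.38, giving 0.36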

sum.dataset(uv.prop.ds)
## Dataset summary:
## Valid dataset
## Description:  ; Missing value imputation with method linapprox 
## Type of data:  uvv-spectra 
## Number of samples:  128 
## Number of data points 501 
## Number of metadata variables:  3 
## Label of x-axis values:  wavelength(nm) 
## Label of data points:  absorbance 
## Number of missing values in data:  0 
## Mean of data values:  0.9385933 
## Median of data values:  0.3095 
## Standard deviation:  1.290586 
## Range of values:  1e-04 4.4118 
## Quantiles: 
##       0%      25%      50%      75%     100% 
## 0.000100 0.103600 0.309500 0.979825 4.411800

UNIVARIATE TESTS

A one-way analysis of variance (ANOVA) was conducted over the data for each variable, together with Tukey's post-hoc test; the top 10 results, ordered by p-value, are shown below:

uv.prop.anova = aov.all.vars(uv.prop.ds, "regions")
uv.prop.anova[1:10,]
##          pvalues     logs          fdr
## 404 1.296138e-10 9.887349 2.213986e-08
## 406 1.754917e-10 9.755743 2.213986e-08
## 408 2.299238e-10 9.638416 2.213986e-08
## 403 2.385925e-10 9.622343 2.213986e-08
## 405 2.490582e-10 9.603699 2.213986e-08
## 407 2.829145e-10 9.548345 2.213986e-08
## 401 3.347938e-10 9.475223 2.213986e-08
## 402 3.535306e-10 9.451573 2.213986e-08
## 409 4.092460e-10 9.388016 2.278136e-08
## 399 6.228816e-10 9.205595 2.943330e-08
##                                                 tukey
## 404 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 406 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 408 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 403 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 405 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 407 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 401 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 402 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 409 Planicie-Planalto; Serra-Planalto; Serra-Planicie
## 399 Planicie-Planalto; Serra-Planalto; Serra-Planicie
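
Per variable, aov.all.vars presumably reduces to a one-way ANOVA followed by Tukey's HSD test; a minimal base-R sketch with simulated data (the class names mirror the real regions, the values do not):

# sketch: one-way ANOVA + Tukey HSD for a single wavelength (base R)
set.seed(1)
region = factor(rep(c("Planalto", "Planicie", "Serra"), each = 20))
absorbance = rnorm(60, mean = c(0.3, 0.5, 0.4)[as.integer(region)], sd = 0.05)
fit = aov(absorbance ~ region)
summary(fit)    # overall p-value for the region effect
TukeyHSD(fit)   # pairwise comparisons, e.g. Planicie-Planalto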

A heatmap with the correlations between all the variables is shown below:

uv.prop.correl = correlations.dataset(uv.prop.ds, method = "pearson")
heatmap(uv.prop.correl, col =  topo.colors(256))

CLUSTERING

Hierarchical clustering with Euclidean distance and complete linkage was performed on the data, and the resulting dendrogram is shown below:

uv.prop.hc = clustering(uv.prop.ds, method = "hc", distance = "euclidean")
dendrogram.plot.col(uv.prop.ds, uv.prop.hc, "regions")
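
Assuming clustering() with method = "hc" wraps base R's hclust, and that the dataset stores variables in rows and samples in columns (so uv.prop.ds$data and the transpose below are assumptions), the equivalent call is roughly:

# sketch: hierarchical clustering of the samples (base R)
d = dist(t(uv.prop.ds$data), method = "euclidean")   # samples as rows
hc = hclust(d, method = "complete")                  # complete linkage
plot(hc)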

K-means clustering was also performed on the data, with 3 centers. The plot below shows, for each cluster, the median of its samples in blue and the values of all samples in the cluster in grey; the cluster membership table follows:

uv.prop.kmeans = clustering(uv.prop.ds, method = "kmeans", num.clusters = 3)
kmeans.plot(uv.prop.ds, uv.prop.kmeans)

kmeans.result.df(uv.prop.kmeans, 3)
##   cluster
## 1       1
## 2       2
## 3       3
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                samples
## 1                                                                              ANout10 BGout10 BR1out10 JBout10 SJ1out10 SJ2out10 URout10 ACinv10 ANinv10 BGinv10 BR2inv10 FPinv10 PUinv10 URinv10 ANpri10 BGpri10 FPpri10 SJ1pri10 SJ2pri10 URpri10 BGver10 BR1ver10 BR2ver10 JBver10 BGout11 CEout11 FPout11 SJ2out11 BR2inv11 PUinv11 BGpri11 BR2pri11 BR1pri11 ITpri11 JBpri11 SJ1pri11 SJCpri11 URpri11 ANver11 SJ2inv11 BGver11 FPver11 SJ1ver11 URver11 VRver11
## 2                                                                                                                                                                                                                                ACout10 BR2out10 CEout10 DCinv10 ACpri10 CEpri10 CNpri10 PUpri10 SJCpri10 ANver10 CEver10 CNver10 ITver10 PUver10 SJCver10 ITout11 BR1inv11 CEinv11 CNinv11 ACpri11 CEpri11 VRpri11 SJCinv11 BR1ver11 CEver11 CNver11 ITver11 PUver11
## 3 CNout10 CNOout10 DCout10 FPout10 ITout10 PUout10 SAout10 SJCout10 VRout10 CEinv10 CNinv10 CNOinv10 ITinv10 JBinv10 SAinv10 SJCinv10 BR1pri10 BR2pri10 DCpri10 ITpri10 JBpri10 SApri10 VRpri10 ACver10 DCver10 FPver10 SAver10 SJ2ver10 URver10 VRver10 ACout11 ANout11 CNout11 CNOout11 DCout11 PUout11 SAout11 SJCout11 VRout11 ACinv11 ANinv11 BGinv11 CNOinv11 DCinv11 FPinv11 SAinv11 CNpri11 DCpri11 FPpri11 PUpri11 BR2ver11 JBver11 SAver11 SJ2ver11 SJCver11
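
The k-means step itself presumably maps to base R's kmeans on the same sample-by-wavelength matrix; a hedged sketch under the same data-layout assumption as above:

# sketch: k-means with 3 centers on the samples (base R)
set.seed(42)   # kmeans uses random starts, so fix the seed
km = kmeans(t(uv.prop.ds$data), centers = 3, nstart = 25)
table(km$cluster)   # number of samples per cluster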

PCA

Principal component analysis (PCA) was performed on the centered and scaled data, and some plots are shown below:

uv.prop.pca = pca.analysis.dataset(uv.prop.ds, scale = T, center = T)

pca.pairs.plot(uv.prop.ds, uv.prop.pca, "regions")

pca.screeplot(uv.prop.pca)

pca.scoresplot2D(uv.prop.ds, uv.prop.pca, "regions", ellipses = T)

pca.kmeans.plot2D(uv.prop.ds, uv.prop.pca,  ellipses = T, num.clusters = 4, labels = T)
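
Assuming pca.analysis.dataset wraps base R's prcomp (and the same data layout as above), the centered and scaled PCA is roughly:

# sketch: PCA on the samples with centering and scaling (base R)
pca = prcomp(t(uv.prop.ds$data), center = TRUE, scale. = TRUE)
summary(pca)   # variance explained per component
plot(pca$x[, 1:2], col = uv.prop.ds$metadata$regions)   # PC1 vs PC2 scores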

MACHINE LEARNING

For classification models and prediction, the following parameters were used (a sketch of the assumed equivalent caret control follows the list):

- models: PLS, J48, JRip, SVM and Random Forests
- validation method: repeated cross-validation
- number of folds: 10
- number of repeats: 10
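
The method codes (pls, J48, JRip, svmLinear, rf) suggest that train.models.performance wraps the caret package; under that assumption, the resampling setup corresponds roughly to the control below. train.models.performance builds this internally, so the sketch is illustrative only:

# sketch: the assumed equivalent caret resampling control
library(caret)
ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
                    classProbs = TRUE)   # class probabilities needed for ROC-type metrics
# single-model example: train(x, y, method = "pls", trControl = ctrl)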

Below are the performance results with the best tune for each model:

uv.prop.ml = train.models.performance(uv.prop.ds, c("pls","J48","JRip","svmLinear", "rf"), "regions", "repeatedcv", metric = "ROC")
uv.prop.ml$performance
##            Accuracy     Kappa Sensitivity Specificity       ROC AccuracySD
## pls       0.7395513 0.5408433   0.6757143   0.8420370 0.8286658  0.1051958
## J48       0.5877739 0.3059385   0.5254762   0.7719074 0.6386539  0.1539209
## JRip      0.6280128 0.3256663   0.5284921   0.7728519 0.6519123  0.1179619
## svmLinear 0.6409615 0.3147714   0.5257143   0.7629630 0.8042566  0.1117128
## rf        0.6809382 0.4407172   0.6111905   0.8106296 0.8081781  0.1290248
##             KappaSD SensitivitySD SpecificitySD      ROCSD
## pls       0.1899917     0.1277435    0.06486824 0.09528838
## J48       0.2489608     0.1677807    0.08482742 0.14429359
## JRip      0.2181684     0.1417128    0.07234238 0.11536971
## svmLinear 0.2308857     0.1431069    0.07485559 0.09401367
## rf        0.2260363     0.1520793    0.07427144 0.11260362

And the variable importance for the three region classes (Planalto, Planicie, Serra), showing the top 10 variables for each model:

summary.var.importance(uv.prop.ml, 10)
## $pls
##     Planalto  Planicie    Serra     Mean
## 371 41.74751 100.00000 21.21009 54.31920
## 372 44.86123  97.58377 16.69064 53.04522
## 356 41.11914  82.93864 18.35101 47.46960
## 362 41.96546  80.19991 16.96353 46.37630
## 364 42.74618  77.87263 17.05271 45.89050
## 352 35.24938  78.24318 23.08332 45.52529
## 366 31.11150  73.76028 18.52420 41.13199
## 369 31.80509  71.81182 16.54724 40.05472
## 394 35.96522  64.93490 15.72653 38.87555
## 343 28.64213  72.54719 13.13527 38.10820
## 
## $J48
##      Planalto Planicie     Serra     Mean
## 406  98.20972 75.36232  98.20972 90.59392
## 404  98.20972 74.33930  98.20972 90.25291
## 397 100.00000 69.90622 100.00000 89.96874
## 398  99.57374 70.75874  99.57374 89.96874
## 408  96.84569 76.04433  96.84569 89.91191
## 399  98.72123 71.95226  98.72123 89.79824
## 405  97.10145 74.59506  97.10145 89.59932
## 407  96.50469 74.85081  96.50469 89.28673
## 403  97.18670 73.14578  97.18670 89.17306
## 402  96.84569 72.97528  96.84569 88.88889
## 
## $JRip
##     Overall Mean
## 404     100  100
## 452     100  100
## 200       0    0
## 201       0    0
## 202       0    0
## 203       0    0
## 204       0    0
## 205       0    0
## 206       0    0
## 207       0    0
## 
## $svmLinear
##      Planalto Planicie     Serra     Mean
## 406  98.20972 75.36232  98.20972 90.59392
## 404  98.20972 74.33930  98.20972 90.25291
## 397 100.00000 69.90622 100.00000 89.96874
## 398  99.57374 70.75874  99.57374 89.96874
## 408  96.84569 76.04433  96.84569 89.91191
## 399  98.72123 71.95226  98.72123 89.79824
## 405  97.10145 74.59506  97.10145 89.59932
## 407  96.50469 74.85081  96.50469 89.28673
## 403  97.18670 73.14578  97.18670 89.17306
## 402  96.84569 72.97528  96.84569 88.88889
## 
## $rf
##       Overall      Mean
## 399 100.00000 100.00000
## 407  99.16986  99.16986
## 404  96.35450  96.35450
## 412  92.71855  92.71855
## 417  91.56584  91.56584
## 400  90.02786  90.02786
## 405  86.10115  86.10115
## 414  84.09790  84.09790
## 397  78.54116  78.54116
## 408  78.00702  78.00702

FEATURE SELECTION

Recursive feature elimination with a random forest classifier was run over subsets of 2, 4, 8, 16, 32 and 64 variables. The results are shown below:

feature.selection.result = feature.selection(uv.prop.ds, "regions", method="rfe", functions = rfFuncs, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          2   0.5699 0.2643     0.1494  0.2549         
##          4   0.6236 0.3632     0.1360  0.2289         
##          8   0.6656 0.4341     0.1324  0.2174         
##         16   0.6909 0.4741     0.1483  0.2473         
##         32   0.7062 0.4978     0.1253  0.2080         
##         64   0.7233 0.5287     0.1206  0.2077        *
##        501   0.6874 0.4602     0.1169  0.1992         
## 
## The top 5 variables (out of 64):
##    X404, X406, X402, X407, X403
plot(feature.selection.result, type=c("g","o"))
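
Assuming feature.selection with method = "rfe" delegates to caret's rfe (rfFuncs is a caret object), the underlying call is roughly the following; the x/y extraction relies on the same assumed dataset layout as the earlier sketches:

# sketch: the assumed equivalent caret RFE call
library(caret)
x = t(uv.prop.ds$data)            # samples as rows (assumed layout)
y = uv.prop.ds$metadata$regions   # class labels
rfe.ctrl = rfeControl(functions = rfFuncs, method = "repeatedcv",
                      number = 10, repeats = 5)
rfe.res = rfe(x, y, sizes = 2^(1:6), rfeControl = rfe.ctrl)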

Selection by filter was also applied, with the results shown below:

feature.selection.result2 = feature.selection(uv.prop.ds, "regions", method="filter", functions = rfSBF, validation = "repeatedcv", repeats = 5, subsets = 2^(1:6))
feature.selection.result2
## 
## Selection By Filter
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance:
## 
##  Accuracy  Kappa AccuracySD KappaSD
##    0.6842 0.4588     0.1251  0.2207
## 
## Using the training set, 259 variables were selected:
##    X200, X201, X202, X203, X204...
## 
## During resampling, the top 5 selected variables (out of a possible 419):
##    X200 (100%), X201 (100%), X202 (100%), X203 (100%), X205 (100%)
## 
## On average, 264.1 variables were selected (min = 230, max = 336)
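
Likewise, the filter approach presumably maps to caret's sbf with the rfSBF functions; a hedged sketch reusing x and y from the RFE sketch above:

# sketch: the assumed equivalent caret selection-by-filter call
library(caret)
sbf.ctrl = sbfControl(functions = rfSBF, method = "repeatedcv",
                      number = 10, repeats = 5)
sbf.res = sbf(x, y, sbfControl = sbf.ctrl)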