Monday, September 15, 2014

PCA / EOF for data with missing values - a comparison of accuracy

Not all Principal Component Analysis (PCA) (also called Empirical Orthogonal Function analysis, EOF) approaches are equal when it comes to dealing with a data field that contain missing values (i.e. "gappy"). The following post compares several methods by assessing the accuracy of the derived PCs to reconstruct the "true" data set, as was similarly conducted by Taylor et al. (2013).

The gappy EOF methods to be compared are:
  1. LSEOF - "Least-Squares Empirical Orthogonal Functions" - The traditional approach, which modifies the covariance matrix used for the EOF decomposition by the number of paired observations, and further scales the projected PCs by these same weightings (see Björnsson and Venegas 1997, von Storch and Zweiers 1999 for details).
  2. RSEOF - "Recursively Subtracted Empirical Orthogonal Functions" - This approach modifies the LSEOF approach by recursively solving for the leading EOF, whose reconstructed field is then subtracted from the original field. This recursive subtraction is done until a given stopping point (i.e. number of EOFs, % remaining variance, etc.) (see Taylor et al. 2013 for details)
  3. DINEOF - "Data Interpolating Empirical Orthogonal Functions" - This approach gradually solves for EOFs by means of an iterative algorothm to fit EOFS to a given number of non-missing value reference points (small percentage of observations) via RMSE minimization (see Beckers and Rixen 2003 for details).
I have introduced both the LSEOF [link] and DINEOF [link] methods in the past, but have never directly compared them for the blog. The purpose of this post is to make this comparison and to also introduce a more general EOF function that is capable of conducting RSEOF. All analyses can be reproduced following installation of the "sinkr" package: https://github.com/menugget/sinkr

The basic problem comes down to the difficulties of decomposing a matrix that is not "positive-definite", i.e. the estimated covariance matrix from a gappy data set. DINEOF entirely avoids this issue by first interpolating the values to create a full data field, while LSEOF and RSEOF rely on decomposing this estimation. A known problem is that the trailing EOFs derived from such a matrix are amplified in their singular values, which can consequently amplify errors in field reconstructions when included. The RSEOF approach thus attempts to remedy these issues by recursively solving for only leading EOFs. In the following examples, I show the performance of the three approaches in terms of reconstructing the data field (including the "true" values).

Example 1 - Synthetic data set:
The first example uses a synthetic data set used by Beckers and Rixen (2003) in their introduction of the DINEOF approach. The accuracy of the reconstruction is dependent on the number of  EOFs used. In a non-gappy example, a perfect reconstruction should be possible using this full set of EOFs - In fact it only takes 9 EOFs when using the non-noisy true field, since it is a composite of 9 signals. In the case of the noisy, gappy data sets, reconstructions with trailing EOFs may increase errors. This can be seen in the figure at the top of the post showing RMSE vs the number of EOFs used in the reconstruction.

The figure shows the DINEOF approach to be the most accurate. The LSEOF approach has a clear RSME minimum with 4 EOFs, while the RSEOF approach was largely able to remedy the amplification of error when using trailing EOFs. The problem of error amplification is even more dramatic when viewed visually, as in the following where the full set of EOFs have been used:

Tuesday, September 2, 2014

"sinkr" - a collection of functions featured on "me nugget"

The R package sinkr (version 1.0) has now been released:  https://github.com/menugget/sinkr

I have finally gotten around to learning how to create an R package and decided to start by bundling functions that I have featured on the blog. Thanks to the R Studio team for making this so easy (in combination with the R packages roxygen2 and devtools). In addition to the great tips on the R Studio website,  I found the following Youtube videos helpful along the way:
Being new to the world of R packages (and due to the eclectic nature of the functions in sinkr), I'm not confident to upload this to CRAN. But, one can easily install the package using devtools and the following code (see https://github.com/hadley/devtools for OS-specific package updating tips):

library(devtools)

install_github("sinkr", "menugget")

For those using functions that were featured on this site in the past, the present versions may differ slightly, especially in the case of function names. For example, function names like plot.stacked were getting associated with the general plot function during the package build and thus "." characters have been removed from all function names (e.g. plotStacked).

sinkr functions include:

Tuesday, August 5, 2014

Rotated axis labels in R plots


It's somehow amazing to me that the option for slanted or rotated axes labels is not an option within the basic plot() or axis() functions in R.  The advantage is mainly in saving plot area space when long labels are needed (rather than as a means of preventing excessive head tilting). The topic is briefly covered in this FAQ, and the solution is a bit tricky, especially for a new R user. Below is an example of this procedure.

To reproduce example:

Wednesday, July 23, 2014

Flood fill a region of an active device in R


The following is a function to "flood fill" a region on the active plotting device. Once called, the user will be asked to click on the desired target region. The flood fill algorithm then searches neighbors in 4 directions of the target cell (down, left, up, right) and checks for similar colors to the target cell. If neighboring cells are of the same color, their color is changed to a defined replacement color, and the cell number is added to a "queue" for further searches of neighbors. Once a cell has been checked, its position is added to a list of completed cells. This algorithm is referred to as "Four-way flood fill using a queue for storage".

Here's a visualization of the Four-way flood fill from Wikimedia Commons:

http://commons.wikimedia.org/wiki/File:Wfm_floodfill_animation_stack.gif
This is kind of a pointless exercise given that any basic image editing programs (e.g. Microsoft Paint) can do this much more efficiently; Nevertheless, I felt compelled to figure out a way of programming this in R (I was originally interested in filling in land areas on a map that I created in R). You'll see from my example above that I didn't quite get it right - there is still some blank white space within the regions that I filled. Part of this problem is remedied by exporting a higher resolution image (floodfill argument "res"), but this slows things down considerably.

In order to have this function work directly on an open graphics device, I exported a PNG image and then re-imported it and trimmed off the margins. What remains is an image of the plot region itself  which I convert to a matrix and look-up dataframe, where each cell's color and neighboring cells are defined. It is this dataframe that forms the basis of my searching algorithm. I'm guessing I have made some sort of small mistake in how I trimmed the margins of the image, thus creating the slight offset in the filled region. Anyway, feel free to suggest improvements!

Function:

Monday, May 19, 2014

Automated determination of distribution groupings - A StackOverflow collaboration


For those of you not familiar with StackOverflow (SO), it's a coder's help forum on the StackExchange website. It's one of the best resources for R-coding tips that I know of, due entirely to the community of users that routinely give expert advise (assuming you show that you have done your homework and provide a clear question and a reproducible example). It's hard to believe that users spend time to offer this help for nothing more than virtual reputation points. I think a lot of coders are probably puzzle fanatics at heart, and enjoy the challenge of a given problem, but I'm nevertheless amazed by the depth of some of the R-related answers. The following is a short example of the value of this community (via SO), which helped me find a solution to a tricky problem.

I have used figures like the one above (left) in my work at various times. It present various distributions in the form of a boxplot, and uses differing labels (in this case, the lowercase letters) to denote significant differences; i.e. levels sharing a label are not significantly different. This type of presentation is common when showing changes in organism condition indices over time (e.g. Figs 3 & 4, Bullseye puffer fish in Mexico).

In the example above, a Kruskal-Wallis rank sum test is used to test differences across all levels, followed by pairwise Mann-Whitney rank tests. The result is a matrix of p-values showing significant differences in distribution. So far so good, but it's not always clear how the grouping relationships should be labelled. In this relatively simple example, the tricky part is that level 1 should be grouped with 3 and 5, but 3 and 5 should not be grouped; Therefore, two labeling codes should be designated, with level 1 sharing both. I have wondered, for some time, if there might be some way to do this in an automated fashion using an algorithm. After many attempts on my own, I finally decided to post a question to SO.

So, my first question "Algorithm for automating pairwise significance grouping labels in R" led me to the concept of the "clique cover problem", and "graph theory" in general, via SO user "David Eisenstat". While I didn't completely understand his recommendation at first, it got me pointed in the right direction - I ultimately found the R package igraph for analyzing and plotting these types of problems.

The next questions were a bit more technical. I figured out that I could return the "cliques" of my grouping relationships network using the cliques function of the igraph package, but my original attempt was giving me a list all relationships in my matrix. It was obvious to me that I would need to identify groupings where all levels were fully connected (i.e. each node in the clique connects to all others). So, my next question "How to identify fully connected node clusters with igraph [in R]" got me a tip from SO user "majom", who showed me that these fully connected cliques could be identified by first reordering the starting nodes in my list of connections (before use in the graph.data.frame function), and then subjecting the resulting igraph object to the function maximal.cliques. So, the first suggestions from David were right on, even though they didn't include code. The result shows nicely all those groupings in the above example (right plot) with fully connected cliques [i.e. (1, 3), (1, 5), (2), (4, 6), (7)].

The final piece of the puzzle was more cosmetic - "How to order a list of vectors based on the order of values contained within [in R]". A bit vague, I know, but what I was trying to do was to label groups in a progressive way so that earlier levels received their labels first. I think this leads to more legible labeling, especially when levels represent some process of progression. At the time of this posting, I have received a single negative (-1) vote on this question... This may have to do with the clarity of the question - I seem to have confused some of the respondents based on follow up comments for clarification - or, maybe someone thought I hadn't shown enough effort on my own. There's no way to know without an accompanying comment. In any case, I got a robust approach from SO user "MrFlick", and I can safely say that I would never have come up with such an eloquent solution on my own.

In all, this solution seems to work great. I have tried it out on larger problems involving more levels and it appears to give correct results. Here is an example with 20 levels (a problem that would have been an amazing headache to do manually):
Any comments are welcome. There might be other ways of doing this (clustering?), but searching for similar methods seems to be limited by my ability to articulate the problem. Who would have thought this was an example of a "clique cover problem"? Thanks again to all those that provided help on SO!

Code to reproduce the example:

Saturday, May 3, 2014

Evaluating model performance - A practical example of the effects of overfitting and data size on prediction


Following my last post on decision making trees and machine learning, where I presented some tips gathered from the "Pragmatic Programming Techniques" blog, I have again been impressed by its clear presentation of strategies regarding the evaluation of model performance. I have seen some of these topics presented elsewhere - especially graphics showing the link between model complexity and prediction error (i.e. "overfitting") - but this particular presentation made me want to go back to this topic and try to make a practical example in R that I could use when teaching.

Effect of overfitting on prediction
The above graph shows polynomial fitting of various degrees to an artificial data set - The "real" underlying model is a 3rd-degree polynomial (y ~ b3*x^3 + b2*x^2 + b1*x + a). One gets a good idea that the higher degree models are incorrect give the single-term removal significance tests provided by the summary function (e.g. 5th-degree polynomial model):

Call:
lm(formula = ye ~ poly(x, degree = 5), data = df)
Residuals:
Min 1Q Median 3Q Max
-4.4916 -2.0382 -0.4417 2.2340 8.1518
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.3696 0.4304 68.242 < 2e-16 ***
poly(x, degree = 5)1 74.4980 3.0432 24.480 < 2e-16 ***
poly(x, degree = 5)2 54.0712 3.0432 17.768 < 2e-16 ***
poly(x, degree = 5)3 23.5394 3.0432 7.735 9.72e-10 ***
poly(x, degree = 5)4 -3.0043 3.0432 -0.987 0.329
poly(x, degree = 5)5 1.1392 3.0432 0.374 0.710
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.043 on 44 degrees of freedom
Multiple R-squared: 0.9569, Adjusted R-squared: 0.952
F-statistic: 195.2 on 5 and 44 DF, p-value: < 2.2e-16


Nevertheless, a more robust analysis of prediction error is through a cross-validation - by splitting the data into training and validation sub-sets. The following example does this split at 50% training and 50% validation, with 500 permutations.


So, here we have the typical trend of increasing prediction error with model complexity (via cross-validation - CV) when the model is overfit (i.e. > 3rd-degree polynomial, vertical grey dashed line). As reference, the horizontal grey dashed line shows the original amount of error added, which is where the CV error reaches a minimum.

Effect of data size on prediction
Another interesting aspect presented in the post is the use of CV in estimating the relationship between prediction error and the amount of data used in the model fitting (credit given to Andrew Ng from Stanford). This is helpful concept when determining what the benefit in prediction would be following an invest in more data sampling:


Here we see that, given a fixed model complexity, training error and CV error converges. Again, the horizontal grey dashed line indicates the actual measurement error of the response variable. So, in this example, there is not much improvement in prediction error following a data size of ca. 100. Interestingly, the example also demonstrates that even with an overfit model containing a 7th-degree polynomial, the increased prediction error is overcome with a larger data set. For comparison, the same exercise done with the correct 3rd-degree model shows that even the smaller data set achieves a relatively low prediction error even when the data size is small (2.6 MAE in 3rd-degree poly. vs 3.7 MAE in 7th-degree poly.):


Code to reproduce example:

Wednesday, April 30, 2014

Decision making trees and machine learning resources for R

I have recently come across Ricky Ho's blog "Pragmatic Programming Techniques", which seems to be excellent resource for all sorts of aspects regarding data exploration and predictive modelling. The post "Six steps in data science" provides a nice overview to some of the topics covered in the blog. For some reason, this blog does not seem to be listed on R-Bloggers (attn: Tal Galili!).

I was drawn to the page from my interest in understanding Classification and Regression Tree (CART) models, and quickly became amazed by the blog's nice review of some available methods. I was specifically looking for a example that uses Edgar Anderson's iris data set as I find it to be a very understandable example for this type of problem - i.e. Can we develop a model to predict the iris species based on it's morphological characteristics?

Below is a sample of just two methods that were presented using the rpart and randomForest packages. rpart is a referred to as a "Decision Tree" method, while randomForest is an example of a "Tree Ensemble" method. The blog explains many of the pros and cons for each method, and a further post show even further examples of predictive analytics, including Neural Network, Support Vector Machine, Naive Bayes and Nearest Neighbor approaches (all using the iris data set). I would love to know a bit more about the comparative predictive powers of each of these methods. For the meantime, the example below shows a cross-validation comparison of prediction accuracy for the rpart and randomForest methods using 100 permutations. Half the data set is used as the training set and the other half is used as the validation set.



The results show a slight improvement in accuracy for the randomForest method, especially for the species versicolor and virginica - which are more similar in morphology. This can be see in the degree of overlap in the plot of the first 2 principle components (explaining ~98% of the variance):




The setosa species is different enough that there is perfect (100%) accuracy in it's prediction. I'm looking forward to continuing with this comparison for the other methods as well.

Example code:

Sunday, March 23, 2014

The power of PCA strikes again!

Amazing study using genetic markers to predict principle components of facial features:
New Scientist article -  Genetic mugshot recreates faces from nothing but DNA - life - 20 March 2014 - New Scientist
Original article - (PLoS Genetics, DOI: 10.1371/journal.pgen.1004224)

Saturday, January 25, 2014

Importing bathymetry and coastline data in R


After noticing some frustrating inaccuracies with the high-resolution world coastlines and national boundaries database found in worldHires from the package mapdata (based on CIA World Data Bank II data), I decided to look into other options. Although listed as "depreciated", the data found in NOAAs online "Coastline Extractor" is a big step forward. There seem to be more up-to-date products, but this served my needs for the moment, and I thought I'd pass along the address to other R users. I exported the data in ASCII "Matlab" format, which is basically just a 2-column text file (.dat) with NaN's in the rows that separate line segments.

I've also discovered the bathymetry / topography data from GEBCO. Again, very easy to import into R from the netCDF files.

The above map of the Galapagos Archipelago illustrates the quality of both datasets. It also shows the comparison of coastline accuracy between World Vector Shoreline (1:250,000), world (map package), and worldHires (mapdata package) datasets. Obviously, the low-resolution world data only makes sense for quick plotting at large scales, but the high-resolution data is as much as 1/10° off in some locations. I noticed these errors for the first time when trying to map some data for smaller coastal bays. It drove me crazy trying to figure out where the errors were - in my data locations or the map itself. Bathymetry used in the map was 30 arc-second resolution GEBCO data.

[EDIT: The comparison of coastline data now includes the high resolution data from the rworldmap package.]

A more detailed description export settings:
  • Coastline data (from 'Coastline Extractor') :
    • Coastline database: World Vector Shoreline (1:250,000)
    • Compression method for extracted ASCII data: None
    • Coast Format options: Matlab
    • Coast Preview options: GMT Plot
  • Bathymetry / topography data [link]:
    • General Bathymetric Chart of the Oceans (GEBCO) : GEBCO_08 Grid (30 arc-second resolution)
Here's another example of bathymetry / topography data for the western Pacific (1 minute resolution GEBCO data):



For both maps, I took inspiration for the color palettes from GMT. The rgb color levels of these palettes have got to be documented somwhere, but I gave up looking after a while and managed to hack their levels from color scales contained in .png files [link].

Below is the R code to reproduce the figures.

GMT standard color palettes

GMT (Generic Mapping Tools) (http://gmt.soest.hawaii.edu/) is a great mapping tool. I'm hoping to use it more in the future, but for the meantime I wanted to recreate some of the it's standard color palettes in R. Unfortunately, I couldn't find documentation of the precise rgb color levels used, so I ended up "stealing" them from the .png images on this website: http://www.geos.ed.ac.uk/it/howto/GMT/CPT/palettes.html

Here's the result:




Here's how I extracted the color levels from the .png images: