Monday, November 28, 2011

Another aspect of speeding up loops in R

Any frequent reader of R-bloggers will have come across several posts concerning the optimization of code - in particular, the avoidance of loops.

Here's another aspect of the same issue. If you have experience programming in other languages besides R, this is probably a no-brainer, but for laymen like myself, the following example was a total surprise. Basically, every time you change the size of an object in R, you are also reallocating its memory - and this takes some time. It's not necessarily a lot of time, but if it happens during every iteration of a loop, it can really slow things down.

The following example shows three versions of a loop that creates random numbers and stores them in a results object. The first version (Ex. 1) demonstrates the wrong approach: concatenating the results onto the results object ("x"), thereby changing the size of x at every iteration. The second approach (Ex. 2) is about 150x faster - x is pre-defined as a matrix of NAs of the final size, which is gradually filled (by row) during the loop. The third example (Ex. 3) shows another possibility for when one does not know in advance what the size of the results from each iteration will be: an empty list is created with length equal to the number of iterations, and its elements are gradually filled with the loop results. Again, this is at least 150x faster than Ex. 1 (and I'm actually surprised to see that it may even be faster than Ex. 2).
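A minimal sketch of the three approaches (the object names, iteration count, and use of rnorm() below are my own illustration, not the original benchmark code):

# Ex. 1: growing the results object by concatenation (slow)
set.seed(1)
n <- 1e4
system.time({
  x <- numeric(0)
  for(i in 1:n){
    x <- c(x, rnorm(1))    # x is re-sized (re-allocated) at every iteration
  }
})

# Ex. 2: pre-allocated matrix of NAs, filled by row (fast)
system.time({
  x <- matrix(NA, nrow = n, ncol = 1)
  for(i in 1:n){
    x[i, ] <- rnorm(1)     # no re-sizing; memory is allocated once
  }
})

# Ex. 3: pre-allocated list, one element per iteration (fast)
system.time({
  x <- vector("list", n)
  for(i in 1:n){
    x[[i]] <- rnorm(1)     # element size may vary; list length is fixed
  }
})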

Thursday, November 24, 2011

Define intermediate color steps for colorRampPalette

The following function, color.palette(), is a wrapper for colorRampPalette() and allows increased flexibility in defining the spacing between main color levels. One defines both the main color levels (as with colorRampPalette()) and an optional vector containing the number of color levels that should be placed in between each pair, at equal distances.
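A sketch of how such a wrapper might look (the argument names steps and n.steps.between are my assumptions; the original color.palette() may differ in detail):

color.palette <- function(steps, n.steps.between = NULL, ...){
  if(is.null(n.steps.between)) n.steps.between <- rep(0, length(steps) - 1)
  if(length(n.steps.between) != length(steps) - 1)
    stop("Must have one fewer n.steps.between value than steps")
  # positions of the main colors once intermediate slots are reserved
  fill.steps <- cumsum(rep(1, length(steps)) + c(0, n.steps.between))
  RGB <- matrix(NA, nrow = 3, ncol = fill.steps[length(fill.steps)])
  RGB[, fill.steps] <- col2rgb(steps)
  # linearly interpolate RGB values into the intermediate slots
  for(i in which(n.steps.between > 0)){
    col.start <- RGB[, fill.steps[i]]
    col.end   <- RGB[, fill.steps[i + 1]]
    for(j in seq(3)){
      vals <- seq(col.start[j], col.end[j],
                  length.out = n.steps.between[i] + 2)
      RGB[j, (fill.steps[i] + 1):(fill.steps[i + 1] - 1)] <-
        vals[2:(length(vals) - 1)]
    }
  }
  new.steps <- rgb(RGB[1, ], RGB[2, ], RGB[3, ], maxColorValue = 255)
  colorRampPalette(new.steps, ...)
}

# e.g., 5 main colors with extra levels between blue-cyan and yellow-red:
# pal <- color.palette(c("blue", "cyan", "white", "yellow", "red"),
#                      n.steps.between = c(5, 1, 1, 5))
# pal(100)  # returns 100 hex colors, like a colorRampPalette() function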

The above figure shows the effect on a color scale (see image.scale) containing 5 main colors (blue, cyan, white, yellow, and red). The result of colorRampPalette() (upper) produces an equal number of levels between the main colors. By increasing the number of intermediate colors between blue-cyan and yellow-red (lower), the number of color levels in the near-white range is reduced. The resulting palette, for example, was better at highlighting the positive and negative values of an Empirical Orthogonal Function (EOF) mode.


Empirical Orthogonal Function (EOF) Analysis for gappy data

[Updates]: The following approach has serious shortcomings, which I have recently become aware of. In a comparison of gappy EOF approaches, Taylor et al. (2013) [pdf] show that this traditional approach is not as accurate as others. Specifically, the approach of DINEOF (Data Interpolating Empirical Orthogonal Functions) proved to be the most accurate. I have outlined the DINEOF algorithm in another post [link] and show a comparison of gappy EOF methods here: http://menugget.blogspot.de/2014/09/pca-eof-for-data-with-missing-values.html. The R package "sinkr" now contains a version of the function ("eof") for easy installation: https://github.com/menugget/sinkr

-----------------
The following is a function for the calculation of Empirical Orthogonal Functions (EOFs). For those who come from a more biologically-oriented background and are familiar with Principal Component Analysis (PCA), the methods are similar. In the climate sciences, the method is usually used to decompose a data field into its dominant spatio-temporal modes.
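As a rough sketch of this traditional approach (the function name eof.sketch and its simplified gap handling are my own; see the sinkr package, linked above, for the maintained eof() function):

# Traditional gappy EOF: covariance from pairwise-complete observations,
# then eigen decomposition. F1 is a time x space field, possibly with NAs.
eof.sketch <- function(F1){
  F1 <- scale(F1, center = TRUE, scale = FALSE)  # remove column (spatial) means
  C  <- cov(F1, use = "pairwise.complete.obs")   # gappy covariance estimate
  E  <- eigen(C)                                 # eigenvectors = spatial EOF modes
  # expansion coefficients (PC time series); NAs are zeroed so each
  # time step is projected using only the available values
  F1.zeroed <- replace(F1, is.na(F1), 0)
  A <- F1.zeroed %*% E$vectors
  list(EOF = E$vectors, A = A, lambda = E$values)
}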

Friday, November 11, 2011

Propagation of error

At the outset, this was strictly an exercise of my own curiosity, and I didn't imagine writing it down in any form at all. As someone who has done some modelling work in the past, I'm embarrassed to say that I had never fully grasped how one can gauge the error of a model output without having to do some sort of Monte Carlo simulation, whereby the model parameters are repeatedly randomized within a given confidence interval. It's relatively easy to imagine that a model containing many parameters, each with an associated error, will tend to propagate these errors throughout. Without getting too far over my head here, I will just say that there are defined methods for calculating the error of a variable if one knows the underlying error of the functions that define it (and I have tried out only a very simple one here!).
In the example below, I have three main variables (x, y, and z) and two functions that define the relationships y~x and z~y. The question is: given these functions, what would be the error of a predicted z value for an initial x value? The basic rule, assuming uncorrelated errors, seems to be:
     error(z~x)^2 = error(y~x)^2 + error(z~y)^2
However, correlated errors require an additional covariance term, 2*cov(error(y~x), error(z~y)) (see Wikipedia: Propagation of uncertainty). The following example accounts for this by simulating correlated error terms using the MASS package's function mvrnorm().

Example:
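A minimal sketch, assuming simple additive error models for y~x and z~y (the sample size, standard deviations, and correlation below are arbitrary choices of mine, not the original settings):

library(MASS)

set.seed(1)
n   <- 10000
rho <- 0.5   # correlation between the two error terms
sd1 <- 1     # sd of the y~x error
sd2 <- 2     # sd of the z~y error

# draw correlated error terms with mvrnorm()
Sigma <- matrix(c(sd1^2,       rho*sd1*sd2,
                  rho*sd1*sd2, sd2^2), 2, 2)
err <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)

x <- runif(n, 0, 10)
y <- x + err[, 1]   # y ~ x with error
z <- y + err[, 2]   # z ~ y with error

# observed error of z ~ x
sd(z - x)
# propagated estimate, including the covariance term:
sqrt(sd1^2 + sd2^2 + 2 * rho * sd1 * sd2)

With these settings, both quantities should come out near sqrt(7) (about 2.65), whereas the uncorrelated rule alone would give sqrt(5).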