Sunday, March 23, 2014

The power of PCA strikes again!

Amazing study using genetic markers to predict principle components of facial features:
New Scientist article -  Genetic mugshot recreates faces from nothing but DNA - life - 20 March 2014 - New Scientist
Original article - (PLoS Genetics, DOI: 10.1371/journal.pgen.1004224)

Saturday, January 25, 2014

Importing bathymetry and coastline data in R


After noticing some frustrating inaccuracies with the high-resolution world coastlines and national boundaries database found in worldHires from the package mapdata (based on CIA World Data Bank II data), I decided to look into other options. Although listed as "depreciated", the data found in NOAAs online "Coastline Extractor" is a big step forward. There seem to be more up-to-date products, but this served my needs for the moment, and I thought I'd pass along the address to other R users. I exported the data in ASCII "Matlab" format, which is basically just a 2-column text file (.dat) with NaN's in the rows that separate line segments.

I've also discovered the bathymetry / topography data from GEBCO. Again, very easy to import into R from the netCDF files.

The above map of the Galapagos Archipelago illustrates the quality of both datasets. It also shows the comparison of coastline accuracy between World Vector Shoreline (1:250,000), world (map package), and worldHires (mapdata package) datasets. Obviously, the low-resolution world data only makes sense for quick plotting at large scales, but the high-resolution data is as much as 1/10° off in some locations. I noticed these errors for the first time when trying to map some data for smaller coastal bays. It drove me crazy trying to figure out where the errors were - in my data locations or the map itself. Bathymetry used in the map was 30 arc-second resolution GEBCO data.

[EDIT: The comparison of coastline data now includes the high resolution data from the rworldmap package.]

A more detailed description export settings:
  • Coastline data (from 'Coastline Extractor') :
    • Coastline database: World Vector Shoreline (1:250,000)
    • Compression method for extracted ASCII data: None
    • Coast Format options: Matlab
    • Coast Preview options: GMT Plot
  • Bathymetry / topography data [link]:
    • General Bathymetric Chart of the Oceans (GEBCO) : GEBCO_08 Grid (30 arc-second resolution)
Here's another example of bathymetry / topography data for the western Pacific (1 minute resolution GEBCO data):



For both maps, I took inspiration for the color palettes from GMT. The rgb color levels of these palettes have got to be documented somwhere, but I gave up looking after a while and managed to hack their levels from color scales contained in .png files [link].

Below is the R code to reproduce the figures.

GMT standard color palettes

GMT (Generic Mapping Tools) (http://gmt.soest.hawaii.edu/) is a great mapping tool. I'm hoping to use it more in the future, but for the meantime I wanted to recreate some of the it's standard color palettes in R. Unfortunately, I couldn't find documentation of the precise rgb color levels used, so I ended up "stealing" them from the .png images on this website: http://www.geos.ed.ac.uk/it/howto/GMT/CPT/palettes.html

Here's the result:




Here's how I extracted the color levels from the .png images:

Monday, December 9, 2013

Data mountains and streams - stacked area plots in R



Below are two functions for producing stacked area plots. The first is the more typical approach where sequential series are stacked on top of another (function: plot.stacked), while the second approach is the more aesthetically-oriented version called a "stream plot" (function: plot.stream), which alternates series on either side of a meandering baseline (see here for the motivation, and here for the inspiration). 

Arguments are similar for both functions regarding the input of x and y series and polygon attributes (fill color, border color, border line width). The stream plot also requires that the degree of meandering for the baseline be defined by the arguments frac.rand and spar; frac.rand, controls the meander amplitude (uniform random numbers added to baseline as a fraction of the total y range) and spar controls the amount of smoothing (as fit by the function smooth.spline).

The plot above colors the series with a color gradient of when the first appear in the series, while the plot below colors series by their maximum value. The order of the plotting of the series can also affect the the emphasis on the plot. By default, plotting order is sequential by column, although two ordering options are built-in to the functions: order by maximum value, and order by first appearance.



The plot.stacked function:

Thursday, December 5, 2013

New version of image.scale function


Below is an updated version of the image.scale function. In the old version, one had to constantly use additional arguments to suppress axes and their labels. The new version contains the additional arguments axis.pos (1, 2, 3, or 4) for defining the side of the axis, and add.axis (TRUE or FALSE), for defining whether the axis is plotted. Based on the position of the axis, the scale color levels are automatically drawn in a horizontal (axis.pos = 1[bottom] or 3[top]) or vertical (axis.pos = 2[left] or 4[right]) orientation. For the right plot above, the argument add.axis=FALSE so that additional control over axis ticks and labels could be added in an additional step with axis(). The function mtext() can be used to add additional labels to the scale.


The image.scale function:

Friday, November 8, 2013

Working with hdf files in R - Example: Pathfinder SST data



Following  a question that I posted on stackoverflow.com, I recieved the great advice to use the Bioconductor rhdf5 package to work with HDF5 files. The package is not located on CRAN, but can be sourced from the Bioconductor website:

source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
Created by Pretty R at inside-R.org

As an example, I use the package to extract Pathfinder sea surface temperature (SST) data, available in netCDF-4 format (the features of netCDF-4 are a subset of the features of HDF5). This type of file is not readable by the netCDF package ncdf.The result is the above plot of a subarea from one of the daily data sets.

To reproduce the figure, you will need the image.scale and val2col functions found on this blog.

To reproduce example:

Tuesday, October 29, 2013

A first attempt at an individual-based model in R



I have been curious for a while as to how R might be used for the construction of an individually-based model (IBM), or agent-based model (ABM). In particular, what R objects lend themselves best to storing information on individuals, and allow for new individuals to be added or subtracted throughout the simulation?

Thursday, April 25, 2013

A plea for less word clouds

Word cloud of DOMA hearing transcripts

I must admit, there is something appealing about the word cloud - that is, until you try to understand what it actually means...

Word clouds are pervasive - even in the science world. I was somewhat spurred to write this given the incredibly wasteful summaries of EGU General Assembly survey results that include several useless word clouds (link to document). Capitalization of words isn't even considered; e.g. "Nice" vs."nice". I have been hesitant to equate word clouds to the hilariously labeled "mullets of the internet" but, on second thought, it is entirely appropriate. They were once fad, but seem reluctant to die...

Oh, and yes, a "tag cloud" is a type of word cloud - I have fallen into the trap myself by including such a thing on this blog! I honestly didn't make the connection at first, because, at least, it had the function of showing the relative importance of terms that I personally defined as topics - not an arbitrary puking up of all the words that I have ever written here. Nevertheless, I think it must be removed now - I can't tell you how many times that I have wanted to go to a specific blog post by clicking on a tag, only to be forced to search into the nether regions of (extremely) small font size. Simple alphabetical arrangement probably makes more sense.

There are some attempts at making word clouds with R (most notable the "wordcloud" package), but they don't seem to be as visually appealing as those easily produced by sites such as Wordle. Nevertheless, you continue to see such things produced - just do a search for "word cloud" on R-bloggers for many examples.

I decided to give Wordle a try, and chose the Defence of Marriage Act (DOMA) hearing transcripts as a source for text. The above word cloud shows the results (with some beautiful patriotic colonial-looking font to boot!). It doesn't reveal much to me. An initial attempt caught me off-guard in that the dominant word was "justice" (below), which would have possibly been insightful if it hadn't been a construct of the prevalence of the speakers titles (i.e. "Justice Kagan"):

An even more worthless word cloud of DOMA hearing transcripts

Anyway, I'm glad I'm not alone in this thinking - I have come across many discussions along the same lines; in particular, the nice article Jacob Harris. Unfortunately, it seems they are here to stay, and I will just have to learn to better avert my eyes from their alluring power in the future...

Monday, January 28, 2013

My template for controlling publication quality figures



The following is a template that I usually start with when producing figures for publication. It allows me to control:
  1. The overall size of the figure (in inches) (WIDTH, HEIGHT)
  2. The layout of figure subplots (using the layout() function) (LO)
  3. The resolution of the figure (for a .png file) (RESO)
I define the overall dimensions of the figure in units of measurement (e.g. inches or centimeters) in order to control how the figure will look on the printed page. For example, a typical journal page might have ~8 inches of space for a 2 column figure and ~4 inches for a 1 column figure.

I define margins (mar, oma) in terms of point size (ps), since this relates to the height of text, which allows of control of axis labeling. By defining the outer margins (OMA) and point size (PS) before calling layout, you will have these margins incorporated. Then, by running the x11() device (after the #), you can check your figure layout with layout.show(n):



I learned recently that the layout() function will adjust the character expansion size (par()$cex) depending on how your device is split up. For that reason, I usually include another line of code resetting par(cex=1) before proceeding with individual plots.

Finally, the three different device types included in the template are:

  1. x11(), for initial tweaking of the layout and general functionality of the plotting code
  2. png(), for producing a compact figure useful in pasting into Word documents, and for cases where the figure contains a lot of information and would be slow to loading as a .pdf
  3. pdf(), for a vector-based figure that is fully scalable / zoomable. When not too big, these figures look the best, and can also be embedded in LaTeX documents
I have been able to use this template to successfully control my figures to the formatting requirements of specific journals or other publications (e.g. overall size, point size, resolution, etc.).


Figure template:

Friday, January 18, 2013

Choosing colors visually with 'getcolors'


When plotting, I am constantly defaulting to the "main" colors in R - In other words, the colors that one can quickly call by number (1="black", 2="red", 3="green", 4="blue", ... etc.) . In my opinion, these colors do not lend themselves well to compelling graphics. I imagine this is the reason for the inclusion of the much more pleasing color palettes used by default in the popular graphical package ggplot2. I try and choose better colors for final figure versions for publishing, but it is usually a tedious process of trial and error with functions like rgb(). There are some nice alternate color palettes out there probably more in line with color theory, and one has a lot of flexibility with functions like colorRampPalette(), but I wanted to have a function where I can choose colors visually in order to speed up the process. Below is the function getcolors(), which allows for this selection by using a simplified color swatch to allow selection with a mouse using the locator() function (above, top plot). Following selection, a second plot opens showing how these colors look next to each other and on a background gradient of black to white. The function uses an RGB color model: Red increases on the y-axis, Green increases on the x-axis, and Blue is a repeated sequence of levels across the x-axis).

For the example, I chose 4 colors, which are saved in a vector. These colors were subsequently used to make the following line plot:


the getcolors function: