Monday, December 19, 2011

Maximal Information Coefficient (MIC)


Pearson r correlation coefficients for various distributions of paired data (Credit: Denis Boigelot, Wikimedia Commons)

A paper published this week in Science outlines a new statistic called the maximal information coefficient (MIC), which is designed to describe the strength of association between paired variables equally well whether the relationship is linear or nonlinear. In other words, just as Pearson's r reflects the amount of noise around a linear fit, MIC should give similar scores to equally noisy relationships regardless of their functional form.
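
To make the contrast concrete, here is a minimal base-R sketch. It is an illustration only, not the authors' algorithm: it compares Pearson's r with a naive normalized mutual information computed on a single fixed equal-frequency grid. The helper `grid_nmi` and the choice of 8 bins are assumptions made for this example; the actual MIC searches over many grids and is considerably more involved.

```r
# Naive stand-in for illustration only: one fixed equal-frequency grid,
# NOT the grid search that defines MIC.
grid_nmi <- function(x, y, bins = 8) {
  n  <- length(x)
  # Assign each point to one of `bins` equal-frequency bins via its rank
  bx <- ceiling(rank(x, ties.method = "first") * bins / n)
  by <- ceiling(rank(y, ties.method = "first") * bins / n)
  p    <- unclass(table(bx, by)) / n       # joint cell probabilities
  pxpy <- outer(rowSums(p), colSums(p))    # product of the marginals
  nz   <- p > 0                            # skip empty cells
  mi   <- sum(p[nz] * log(p[nz] / pxpy[nz]))
  mi / log(bins)                           # a perfect 1:1 grid scores 1
}

x      <- seq(-1, 1, length.out = 200)
y_line <- 2 * x + 1   # noiseless linear relationship
y_par  <- x^2         # noiseless parabola

r_line   <- cor(x, y_line)
r_par    <- cor(x, y_par)
nmi_line <- grid_nmi(x, y_line)
nmi_par  <- grid_nmi(x, y_par)

print(c(r_line = r_line, r_par = r_par))
print(c(nmi_line = nmi_line, nmi_par = nmi_par))
```

Pearson's r is essentially zero for the parabola even though y is fully determined by x, while the grid-based score stays well above zero. It falls short of 1 on this particular fixed grid, which is one reason a search over grids is needed.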

The authors stress that the equitable nature of MIC makes it appropriate for comparing a variety of relationships. In the paper, they demonstrate its use in explorations of several large data sets on global health, gene expression, human gut microbiota, and (an R-bloggers favorite!) major-league baseball.

Instructions for its use in R can be found on the authors' website (under Downloads): http://www.exploredata.net/

Congrats to the authors!

[UPDATE: the R package, minerva, now provides an easy way to implement Maximal Information-Based Nonparametric Exploration (MINE) statistics, including MIC. An example of the package using a data set from the Science article can be found in this post: http://menugget.blogspot.de/2014/09/maximal-information-coefficient-part-ii.html]
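
For reference, a minimal sketch of calling minerva on simulated data. It assumes the package is installed from CRAN; the noisy parabola, the seed, and the variable names are made up for illustration and are not taken from the Science article.

```r
# Assumes the CRAN package 'minerva' is installed; skipped otherwise.
if (requireNamespace("minerva", quietly = TRUE)) {
  set.seed(1)
  x <- runif(200, -1, 1)
  y <- x^2 + rnorm(200, sd = 0.05)   # noisy parabola (simulated data)

  res <- minerva::mine(x, y)         # list of MINE statistics
  print(res$MIC)                     # maximal information coefficient
  print(cor(x, y))                   # Pearson's r, for comparison
} else {
  message("Package 'minerva' not available; run install.packages('minerva') first.")
}
```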



16 comments:

  1. Nice, but it only works for bivariate data, which I think is a strong limitation.

  2. ...we need extensions of MIC(X,Y) to MIC(X,Y|Z). We will want to know how much data are needed to get stable estimates of MIC, how susceptible it is to outliers, what three- or higher-dimensional relationships it will miss, and more. MIC is a great step forward, but there are many more steps to take.

    http://www.sciencemag.org/content/334/6062/1502.full

    I think this is the real issue to be solved!

  3. I don't know if you have a similar problem, but when I run that R code, R just shuts down...

  4. I had the same problem, until I realized that the instructions are poorly worded. The statement:

    MINE("example.csv","all.pairs")

    means you need to use the PATH to YOUR CSV file. There is no "example.csv". You can download one of theirs, put it in your working directory, and then use "your filehere.csv" where it says "example.csv". PS - it is REALLY just a shell; it outputs the results to a CSV file. It could be expanded to accept data frames or matrices and to return the same.

  5. Thanks for all the comments clarifying the example. The link to the Perspectives article in Science (2nd comment post) provides a nice background into the significance of the work and its future prospects. Thanks for adding that. - Marc

  6. Yep, I've changed the filename to one of their examples, but it's still not working; same problem.

  7. Same here, by the way. Running the code from a command line shows a Java formatting error in how it handles the first data point, if I read it correctly. I sent a message regarding this, but have yet to get a reply. The same error shows up under both Windows Vista and Ubuntu Linux.

  8. Looks like they've posted supplemental material with pseudocode here:
    http://www.sciencemag.org/content/334/6062/1518/suppl/DC1

  9. Problem solved. The data set cannot contain any character variables.

  10. Can you explain what you did to get one of the provided files (e.g. MLB2008.csv) to run? I don't see any character variable in the records. Can you explain further what is meant?

  11. Is it not telling that this was published in a non-statistical journal whose statistical peer review we are unsure of? This problem was solved by Hoeffding in 1948 (Annals of Mathematical Statistics 19:546), who developed a straightforward algorithm requiring neither binning nor multiple steps. Hoeffding's work was not even referenced in the Science article (according to the references in the online supplement; I don't have access to the main article). This has been in the R hoeffd function in the Hmisc package for many years. Here's an example (type example(hoeffd) in R):

    # Hoeffding's test can detect even one-to-many dependency
    library(Hmisc)                       # provides hoeffd()
    set.seed(1)
    x <- seq(-10, 10, length.out = 200)
    y <- x * sign(runif(200, -1, 1))     # randomly flip the sign of each point
    plot(x, y)                           # an X
    hoeffd(x, y)

    D
         x    y
    x 1.00 0.06
    y 0.06 1.00

    n= 200

    P
      x  y
    x     0    # P-value is very small
    y  0

    hoeffd uses an efficient Fortran implementation of Hoeffding's method. The basic idea of his test is to consider the difference between joint ranks of X and Y and the product of the marginal rank of X and the marginal rank of Y, suitably scaled.

    Frank Harrell

  12. No source code, just a non-free (CC Non-Commercial) wodge of Java byte code, and an R wrapper that calls the Java. If anyone thinks this is a useful technique, then have a go at re-implementing it from the published paper.

  13. @Frank Harrell, on their page http://www.exploredata.net/Technical-information, there is a link to the reprint of the article. They have some impressive names in their acknowledgment list, but a comparison to Hoeffding would have been nice.

  14. I had a look at the published paper. It does not contain much technical information about the algorithm, but the supplemental material (which is free) contains the details. The method seems to be very ad hoc, with a lot of technicalities. It would be hard to develop an independent implementation based on the published material. After reading the paper, I think the method is basically an application of the mutual information concept from information theory with some complex binning and aggregation methods (e.g. quantization).

  15. Hoeffding's D has also been used successfully to study gene expression profiles, which is one of Reshef's example applications. It will be interesting to see how MIC compares with D.

    http://www.mendeley.com/research/comparing-pearson-spearman-and-hoeffdings-d-measure-for-gene-expression-association-analysis/

  16. Michael Clark has a simulation study that compares these dependence measures. It may be a little artificial, and using data with some actual application in mind would be nice, but I find it interesting nonetheless.

    http://www3.nd.edu/~mclark19/learn/CorrelationComparison.pdf
