me nugget: Maximal Information Coefficient (MIC)

Monday, December 19, 2011

Maximal Information Coefficient (MIC)

Pearson r correlation coefficients for various distributions of paired data (Credit: Denis Boigelot, Wikimedia Commons)

A paper published this week in Science outlines a new statistic called the maximal information coefficient (MIC), which describes the strength of association between paired variables equally well whether the relationship is linear or nonlinear. In other words, just as Pearson's r measures the noise around a linear relationship, MIC should give similar scores to equally noisy relationships regardless of their type.
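To make the motivation concrete, here is a minimal Python sketch (not code from the paper) that computes Pearson's r for two noiseless relationships. The function name `pearson_r` and the test data are illustrative assumptions. Both relationships are perfectly deterministic, yet r only picks up the linear one.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# 101 evenly spaced points on [-1, 1], symmetric about zero (illustrative data)
x = [-1 + 2 * i / 100 for i in range(101)]
line = [2 * v + 1 for v in x]    # noiseless linear relationship
parabola = [v ** 2 for v in x]   # noiseless nonlinear relationship

r_line = pearson_r(x, line)
r_parabola = pearson_r(x, parabola)
print(f"line:     r = {r_line:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```

The parabola is as noise-free as the line, but its r is essentially zero because the positive and negative halves cancel. That gap is what an equitable statistic is meant to close.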

The authors stress that the equitable nature of MIC makes it appropriate for comparing a wide variety of relationships. In the paper, they demonstrate its use in explorations of several large data sets on global health, gene expression, human gut microbiota, and (an R-bloggers favorite!) major-league baseball.

Instructions for its use in R can be found on the authors' website (under Downloads): http://www.exploredata.net/

Congrats to the authors!

[UPDATE: the R package minerva now provides an easy way to compute Maximal Information-Based Nonparametric Exploration (MINE) statistics, including MIC. An example of the package using a data set from the Science article can be found in this post: http://menugget.blogspot.de/2014/09/maximal-information-coefficient-part-ii.html]

16 comments:

Anonymous, December 19, 2011 at 4:53 PM
Nice, but it only works for bivariate data, which I think is a strong limitation.

Anonymous, December 19, 2011 at 5:35 PM
...we need extensions of MIC(X,Y) to MIC(X,Y|Z). We will want to know how much data are needed to get stable estimates of MIC, how susceptible it is to outliers, what three- or higher-dimensional relationships it will miss, and more. MIC is a great step forward, but there are many more steps to take.

http://www.sciencemag.org/content/334/6062/1502.full

I think this is the real issue to be solved!

Anonymous, December 19, 2011 at 6:15 PM
I don't know if you have a similar problem, but when I run that R code, my R session just shuts down...

Anonymous, December 19, 2011 at 8:03 PM
I had the same problem, until I realized that the instructions are poorly worded. The statement:

MINE("example.csv","all.pairs")

means that you need to use the PATH to YOUR CSV file. There is no "example.csv". You can download one of theirs, put it in your working directory, and then use "yourfilehere.csv" where it says "example.csv". PS: it is REALLY just a shell; it writes its output to a CSV file. It could be expanded to accept data frames or matrices and to return the same.

Marc in the box, December 19, 2011 at 8:49 PM
Thanks for all the comments clarifying the example. The link to the Perspectives article in Science (2nd comment post) provides a nice background into the significance of the work and its future prospects. Thanks for adding that. - Marc

Anonymous, December 19, 2011 at 10:12 PM
Yep, I've changed the filename to one of their examples, but it's still not working. Same problem.

Anonymous, December 20, 2011 at 1:02 AM
Same here, by the way. Running the code from a command line shows a Java formatting error in how it handles the first data point, if I read it correctly. I sent a message about this but have yet to get a reply. The same error shows up under both Windows Vista and Ubuntu Linux.

Anonymous, December 20, 2011 at 4:10 AM
Looks like they've posted supplemental material with pseudocode here:
http://www.sciencemag.org/content/334/6062/1518/suppl/DC1

Anonymous, December 20, 2011 at 8:19 AM
Problem solved. The data set cannot contain any character variables.
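A quick pre-flight check along these lines can flag the offending columns before the file is handed to MINE. This is a minimal Python sketch under stated assumptions: the function name `non_numeric_columns` is hypothetical, and the layout (a header row followed by one record per row) is an assumption that may not match the layout the MINE wrapper expects.

```python
import csv
import io

def non_numeric_columns(csv_file):
    """Return names of columns that contain at least one non-numeric cell.

    Assumes a header row followed by one record per row (an assumption,
    not something the MINE instructions state). Empty cells are treated
    as missing values and skipped.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    bad = []
    for row in reader:
        for name, cell in zip(header, row):
            cell = cell.strip()
            if not cell or name in bad:
                continue
            try:
                float(cell)
            except ValueError:
                bad.append(name)
    return bad

# Made-up sample data; the first column holds character values
sample = io.StringIO("team,wins,salary\nBOS,95,133.4\nNYY,89,209.1\n")
print(non_numeric_columns(sample))  # prints ['team']
```

For a file on disk, open it with `open("yourfilehere.csv", newline="")` and pass the file object in. That call also fails loudly if the path is wrong, which covers the "example.csv" pitfall mentioned earlier in the thread.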

Anonymous, December 21, 2011 at 2:00 AM
Can you explain what you did to get one of the provided files (e.g. MLB2008.csv) to run? I don't see any character variable in the records. Can you explain further what is meant?

Frank Harrell, December 23, 2011 at 4:37 PM
Is it not telling that this was published in a non-statistical journal whose statistical peer review we are unsure of? This problem was solved by Hoeffding in 1948 (Annals of Mathematical Statistics 19:546), who developed a straightforward algorithm requiring neither binning nor multiple steps. Hoeffding's work was not even referenced in the Science article (according to the references in the online supplement; I don't have access to the main article). This has been in the R hoeffd function in the Hmisc package for many years. Here's an example (type example(hoeffd) in R):

library(Hmisc)  # provides hoeffd()

# Hoeffding's test can detect even one-to-many dependency
set.seed(1)
x <- seq(-10, 10, length=200)
y <- x*sign(runif(200, -1, 1))
plot(x, y)  # an X
hoeffd(x, y)

D
     x    y
x 1.00 0.06
y 0.06 1.00

n= 200

P
  x  y
x     0   # P-value is very small
y  0

hoeffd uses an efficient Fortran implementation of Hoeffding's method. The basic idea of his test is to consider the difference between joint ranks of X and Y and the product of the marginal rank of X and the marginal rank of Y, suitably scaled.

Frank Harrell
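
The rank-based idea described above can be sketched in plain Python. This is a naive O(n^2) illustration and not the Fortran routine behind hoeffd. It uses the commonly documented formula for D with the scaling by 30, and handles ties with weights of 1/2 and 1/4; the function names and the seeded random signs are illustrative assumptions, so the printed value need not match the R output.

```python
import random

def _rank_weight(a, b):
    """Contribution of value a to the rank of b: 1 if a < b, 0.5 if tied."""
    if a < b:
        return 1.0
    if a == b:
        return 0.5
    return 0.0

def hoeffding_d(x, y):
    """Hoeffding's D, scaled by 30 so that perfect monotone dependence gives 1."""
    n = len(x)
    if n < 5:
        raise ValueError("Hoeffding's D needs at least 5 observations")
    d1 = d2 = d3 = 0.0
    for i in range(n):
        r = s = q = 1.0  # marginal rank of x, marginal rank of y, joint rank
        for j in range(n):
            if j == i:
                continue
            wx = _rank_weight(x[j], x[i])
            wy = _rank_weight(y[j], y[i])
            r += wx
            s += wy
            q += wx * wy  # 1 if both smaller; 1/2 or 1/4 when ties occur
        d1 += (q - 1) * (q - 2)
        d2 += (r - 1) * (r - 2) * (s - 1) * (s - 2)
        d3 += (r - 2) * (s - 2) * (q - 1)
    numerator = (n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3
    denominator = n * (n - 1) * (n - 2) * (n - 3) * (n - 4)
    return 30.0 * numerator / denominator

# X-shaped data in the spirit of the R example (signs drawn with a fixed seed)
rng = random.Random(1)
xs = [-10 + 20 * i / 199 for i in range(200)]
ys = [v if rng.random() < 0.5 else -v for v in xs]
d_cross = hoeffding_d(xs, ys)
print(f"D for the X-shaped data: {d_cross:.3f}")
```

For strictly increasing or strictly decreasing data the function returns 1, which is a handy sanity check on the scaling.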

Barry R, December 24, 2011 at 9:51 AM
No source code, just a non-free (CC Non-Commercial) wodge of Java byte code, and an R wrapper that calls the Java. If anyone thinks this is a useful technique, then have a go at re-implementing it from the published paper.

mpiktas, December 27, 2011 at 4:05 PM
@Frank Harrell, on their page http://www.exploredata.net/Technical-information, there is a link to a reprint of the article. They have some impressive names in their acknowledgment list, but a comparison to Hoeffding would have been nice.

James Li, January 9, 2012 at 2:01 AM
I had a look at the published paper. It does not contain much technical information about the algorithm, but the supplemental material (which is free) contains the details. The method seems to be very ad hoc, with a lot of technicalities. It would be hard to develop an independent implementation based on the published material. After reading the paper, I think the method is basically an application of the mutual information concept from information theory, with some complex binning and aggregation methods (e.g. quantization).
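
The "mutual information plus binning" idea can be illustrated with a deliberately crude Python sketch. It is not the MINE algorithm: it only tries equal-frequency grids, whereas the published method also searches over where the grid lines are placed. The normalization by log(min(kx, ky)) and the grid-size budget of n^0.6 follow the paper's definition as I understand it; treat both, and all function names here, as assumptions.

```python
import math
from collections import Counter

def equal_frequency_bins(values, k):
    """Label each value with one of k bins holding roughly equal counts.

    Ties are split arbitrarily by sort order, which is fine for a sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels

def mutual_information(bx, by):
    """Mutual information (in nats) of two sequences of bin labels."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def grid_score(x, y, exponent=0.6):
    """Largest normalized mutual information over small equal-frequency grids."""
    budget = len(x) ** exponent  # assumed cap on kx * ky
    best = 0.0
    for kx in range(2, int(budget // 2) + 1):
        bx = equal_frequency_bins(x, kx)
        for ky in range(2, int(budget // kx) + 1):
            by = equal_frequency_bins(y, ky)
            mi = mutual_information(bx, by)
            best = max(best, mi / math.log(min(kx, ky)))
    return best

# Illustrative noiseless data: a line and a parabola on [-1, 1]
x = [-1 + 2 * i / 100 for i in range(101)]
score_line = grid_score(x, [2 * v + 1 for v in x])
score_parabola = grid_score(x, [v ** 2 for v in x])
print(f"line:     {score_line:.3f}")
print(f"parabola: {score_parabola:.3f}")
```

Even this simplified version scores the noiseless parabola close to the noiseless line, which is the behavior the paper calls equitability. The technicalities mentioned above are largely about searching the grid placements efficiently.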

Jack Lewis, January 29, 2012 at 2:12 AM
Hoeffding's D has also been used successfully to study gene expression profiles, which is one of Reshef's example applications. It will be interesting to see how MIC compares with D.

http://www.mendeley.com/research/comparing-pearson-spearman-and-hoeffdings-d-measure-for-gene-expression-association-analysis/

Chamon, September 19, 2014 at 2:57 AM
Michael Clark has a simulation study that compares these dependence measures. It may be a little artificial, and using data with some actual application in mind would be nice, but I find it interesting nonetheless.

http://www3.nd.edu/~mclark19/learn/CorrelationComparison.pdf