Knowing the Score on Relationships

By jhalper on March 5, 2012.

What would you say are the strongest three factors associated with the salaries of major-league baseball players? According to a popular, well-established algorithm, the main influential factors are walks, intentional walks and runs batted in.

How much does he earn?

But a paper recently published in Science reports on a new data analysis tool, which is able to find interesting relationships and trends in complex data sets - relationships that are invisible to other types of statistical analyses.

This could be a big deal: Large data sets with thousands of variables are increasingly common in fields as diverse as genomics, physics, political science, economics and more, so there is an increasing need for data analysis tools to make sense of such complex data sets.

It all started when Yakir Reshef, who is now a visiting Fulbright Scholar at the Weizmann Institute, was an undergraduate at Harvard University. Together with his older brother, David Reshef, then a master's student at MIT, he became interested in large data sets containing relationships whose type is unknown. In a collaboration that began in North America and crossed the Atlantic Ocean as Yakir moved to Israel, the two developed a new algorithm that could discover unexpected yet important relationships that would otherwise go unnoticed.

The tool the two developed - under the guidance of advisers Michael Mitzenmacher of the Harvard University School of Engineering and Applied Sciences and Pardis Sabeti of the Broad Institute - is named the maximal information coefficient, or MIC, and it scores pairs of variables based on how closely related they are. Researchers can calculate MIC on each pair of variables in their data set, rank the pairs by their scores (the higher the score, the more related the pair), and then examine the top-scoring pairs - that is, the pairs that are most strongly associated with each other.
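For readers who like to see the idea in code, here is a rough sketch of the score-every-pair-and-rank workflow described above. It is not the published MIC algorithm: it is a simplified stand-in that lays equal-frequency grids over each scatter plot and keeps the best normalized mutual information, whereas the real method searches over grid placements far more carefully. The function names, the toy columns and the cap on grid size (the sample size raised to the power 0.6) are all assumptions made for this illustration.

```python
import math
import random
from collections import Counter
from itertools import combinations


def equal_frequency_bins(values, n_bins):
    """Label each value with one of n_bins bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def mutual_information(xb, yb):
    """Mutual information, in bits, between two lists of bin labels."""
    n = len(xb)
    joint, px, py = Counter(zip(xb, yb)), Counter(xb), Counter(yb)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def grid_score(x, y, exponent=0.6):
    """Simplified MIC-like score: best normalized MI over small grids.

    Assumption: grids are equal-frequency on both axes, which is cruder
    than the grid search used by the real MIC.
    """
    max_cells = max(4, int(len(x) ** exponent))
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        for ny in range(2, max_cells // nx + 1):
            mi = mutual_information(equal_frequency_bins(x, nx),
                                    equal_frequency_bins(y, ny))
            best = max(best, mi / math.log2(min(nx, ny)))
    return best


def rank_pairs(columns):
    """Score every pair of columns and sort from most to least related."""
    scored = [(grid_score(columns[a], columns[b]), a, b)
              for a, b in combinations(columns, 2)]
    return sorted(scored, reverse=True)


# Toy data, invented for this sketch: a line, a parabola and pure noise.
rng = random.Random(0)
x = list(range(100))
columns = {
    "x": x,
    "linear": [2 * v + 1 for v in x],
    "parabola": [(v - 50.3) ** 2 for v in x],
    "noise": [rng.random() for _ in x],
}
ranked = rank_pairs(columns)
for score, a, b in ranked:
    print(f"{a:>8} vs {b:<8} {score:.2f}")
```

In this toy run, both the straight line and the parabola should land near the top of the ranking, while pairs involving the noise column should fall to the bottom. A correlation coefficient, by contrast, would rate the parabola as barely related to x.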

Associations between bacterial species in the gut microbiota of "humanized" mice

To test whether the algorithm actually works, Yakir and David worked with Ph.D. student Hilary Finucane, of the Weizmann Institute's Mathematics Department (and while we are on the subject of relationships, Yakir's fiancée). The three applied MIC to both known and novel data sets in global health, gene expression, the human gut microbiota, and - you guessed it - major-league baseball, and compared the results to those of current methods.

In one example, they examined data from the World Health Organization, covering 200 countries and containing 357 data variables per country. One interesting relationship they found was between female obesity and income: In the Pacific Islands, obesity increases monotonically with income, a finding that contrasts with results from other countries. Was this an anomaly? On the contrary - obesity is considered a sign of status in the Pacific Islands. But while most methods would treat this separate trend as noise, MIC is able to identify relationships, such as this one, that include more than one trend.

The researchers explain that two attributes set MIC apart from other data analysis tools: It assigns high scores to a wide variety of relationship types hidden in large data sets, and it gives similar scores to relationships with comparable amounts of noise, whatever their type. In other words, they say, it can find "cool things going on" that are unexpected and therefore difficult to detect with other types of analyses.

So what about baseball? MIC's results differ from those of the traditional algorithm: Rather than walks, intentional walks and runs batted in, it ranks hits, total bases and how many runs a player generates for a team as the factors most strongly associated with salary. So, which of the methods is correct? The researchers have wisely opted to step aside, leaving it to baseball enthusiasts to decide which of these statistics are - or should be - more strongly tied to salary!

Hilary Finucane and Yakir Reshef

Is the claim that the algorithm was able, without assistance, to classify certain nations as belonging to Polynesia? If so, you really ought to be writing about that. If not, if it just depends on human intervention to assign some relevance to a correlation that is positive in some cases and negative in others, then it's just spotting relationships within arbitrary subsets of a total population while ignoring the rest - in other words, a conspiracy theory generator!

Actually, it found a correlation between obesity and economic status in Polynesia in a large data set on global health. In other words, it identified a true (and known) trend that would probably have been discarded as noise in such a large pile of information using other statistical methods.
