Sunday, 2 August 2015

Please sponsor me for City2Surf (one week to go!)

This time next week, I’ll be running the 14km City2Surf fun run.

I’m not going to pretend that I signed up for this “for charity” - it’s a fitness motivator - but, at the same time, I’d love to raise some money for a cause that I care about and will be “running for the panda” to support wildlife conservation.

Before signing up, I’d never run 14km before, and the City2Surf route also includes the notorious 2km “heartbreak hill” in the middle.

Donating to my supporter page will really help me get up that hill!

Saturday, 1 August 2015

The Day the Earth Smiled

This somehow passed me by when it happened (perhaps I was caught up in the upcoming move to Australia), but the latest episode of the Infinite Monkey Cage podcast (series 12, episode 4) featured a short segment on “The Day the Earth Smiled”.

From the Cassini Imaging website:

On July 19, 2013, in an event celebrated the world over, NASA’s Cassini spacecraft slipped into Saturn’s shadow and turned to image the planet, seven of its moons, its inner rings – and, in the background, our home planet, Earth.

The CICLOPS site has the full picture. This section is from the Wikipedia page and has Earth marked with an arrow.

Pretty humbling stuff.

There’s more at CICLOPS including a higher resolution image of the Earth and Moon:

Sunday, 12 July 2015

Developments in high throughput sequencing (June 2015 Edition)

This is nearly a month old now, but a post on Keith Bradnam’s ACGT blog drew my attention to the June 2015 edition of Lex Nederbragt’s Developments in high throughput sequencing, in which he plots Gigabases* per run against (log) read length (*the human genome is about 3Gb):

I’m particularly excited by the two technologies on the right of this graph, which represent the latest single molecule “long read” sequencing technologies, both of which we now have access to through the Ramaciotti Centre for Genomics. In fact, we got our first data from the PacBio RS II (right) and it’s looking good! (More on that later.)

Despite being a bioinformatician with a background in genetics, I have been keeping my distance a bit from “next generation sequencing” as the technical challenges of dealing with short read data far eclipse the scientific interest. (For me, that is - the kinds of things that I am most interested in do not suit short read data.) The new long read technologies are a real game changer, and I see a lot more genomics in my (and this blog’s) future.

Thursday, 9 July 2015

Sydney Sunset

Today, I met up with an ex-student who moved to Sydney this week. After lunch at Coogee and a bit of the afternoon at UNSW, we headed into the city and ended up at Circular Quay for sunset.

I’ve said it before and I’ll say it again: Sydney does do a good sunset.

Thursday, 25 June 2015

What's Really Warming the World?

It’s hard to believe in 2015 that there are people out there who have still not accepted the reality of man-made climate change. But then some people still use homeopathic medicine and think that vaccination causes autism. Sigh. Anyway, if you have any doubts about the causes of increasing global temperatures, or just really like slick infographics, you can do a lot worse than check out Bloomberg’s page on What’s Really Warming the World?

Sunday, 21 June 2015

The importance of knowing how your data are scaled

A few weeks ago, there was a post on WEIT, The correlation between rejection of evolution and rejection of environmental regulation: what does it mean? It was triggered by a tweet from the Washington Post about a graph comparing attitudes to the environment and attitudes to evolution, broken down by religious affiliation:

We’ll get to the tweet later. First, the graph. It was from a US National Center for Science Education blog post (by Robineau), based on 2007 data from the Pew Religious Landscape Study, examining two binary choice statements:

y-axis. Stricter environmental laws and regulations cost too many jobs and hurt the economy; or Stricter environmental laws and regulations are worth the cost.

x-axis. Evolution is the best explanation for the origins of human life on earth. (Agree/disagree)

Data were normalised onto a percentile scale, with each circle representing (1) by position, the normalised percentile of that group’s response, and (2) by area, the size of that group. (36,000 people were surveyed in total.)

The percentile normalisation method was based on a previous analysis of different Pew questions by Toby Grant, who explains it thus:

Geek note on measurement

The range of each dimension ranges from zero to 100. These scores were calculated by calculating the percentage of each religion giving each answer. The percentages were then subtracted (e.g., percent saying “smaller government” minus percent saying “bigger government”). The scores were then standardized using the mean and standard deviation for all of the scores. Finally, I converted the standardized scores into percentiles by mapping the standardized scores onto the standard Gaussian/normal distribution. The result is a score that represents the group’s average graded on the curve, literally.
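As I read it, Grant’s pipeline amounts to the following (a minimal sketch with made-up group numbers, not the survey data):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-group net responses: percent giving one answer minus
# percent giving the other, one value per religious group.
net = np.array([-60.0, -20.0, 0.0, 15.0, 40.0, 80.0])

# Standardise using the mean and standard deviation across groups.
z = (net - net.mean()) / net.std()

# Map the z-scores onto the standard normal CDF to get "percentiles".
percentile = norm.cdf(z) * 100
```

Note that after this mapping, all absolute information about the responses is gone: only each group’s position relative to the others survives.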

A few things annoy me about this:

  1. This is not simply a “Geek note”. Knowing what was done to data is vital for understanding what a plot means. To be fair to Grant, he does mention that he is plotting percentiles in the graph legend. (As far as I can see, Robineau does not mention it anywhere!)
  2. By first normalising to the mean and then converting everything to percentiles, there is a double loss of quantitative information. Following the first normalisation, all you can do is compare groups - there is no absolute information about responses. Following the second, you cannot even compare the degree of difference. What this plot is basically doing is pulling in the outliers to make them look more similar to the mean, and spreading out those similar to the mean to make them look more different.
  3. When converting to percentiles, the additional normalisations seem pointless. Unless I've misunderstood, if the data is truly normally distributed then the percentile of the fitted data should be the same as the percentile of the raw data. If not, you shouldn’t do the normalisation in the first place. Either way, I think you are just adding error and confusion. (There is no data presented to support the fact that these opinions are normally distributed.)

It is also worth noting that, to the unwary, the circle sizes could be misleading. The bigger the circle, the more data and the more accurate the estimation of the value. The small circles might have much more random sampling error in their positions. (Under a null model where all groups are the same, you would expect the large circles to gravitate towards the mean, while the smaller circles should be the outliers.) Most importantly, circles that overlap are not more similar than circles that do not.

It would be more useful to have estimated standard errors plotted for each group. Again, because we have lost the quantitative information, we cannot tell whether a small difference in responses (possibly within measurement error) would make a big difference in percentiles. There are 36,000 people in total, but some of the groups make up less than 0.5% of the sample and therefore have fewer than 200 people.
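To put a rough number on that sampling noise (a back-of-the-envelope sketch, assuming simple random sampling and a worst-case 50/50 split):

```python
import math

# A group that is 0.5% of a 36,000-person survey has about 180 respondents.
n = 0.005 * 36000          # ~180 people
p = 0.5                    # worst case for a binary question
se = math.sqrt(p * (1 - p) / n)
print(round(se * 100, 1))  # standard error in percentage points: 3.7
```

So for the smallest groups, the raw percentages could easily be off by several percentage points either way before any normalisation is applied.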

Robineau’s plot uses the same method although he:

“didn’t rescale to the 0-100 scale, since I didn’t want this to seem like a percentage when it isn’t.”

It's not a percentage but it is a percentile, so 0-100 is entirely appropriate. Leaving it as -1.0 to +1.0 is in fact very misleading, as it implies that people are positive or negative with respect to the questions. In reality, positive just means “above average” and negative is “below average”. I have an above average number of arms: two. This does not mean that I have lots of arms, it just means that some people have fewer arms than me.

These things aside, Robineau asks:

“So what does this tell us?”

Thanks to the scaling, the only thing this graph tells us is that (a) there is a rank correlation between the answers to the two questions, and (b) some religious groups (particularly evangelical Christians) appear to agree with these statements less than average, while other groups (notably non-Christians) tend to agree with these statements more than average.

These observations could still be of interest. The real problem comes when people start interpreting this graph as if the normalisations and rescaling have not been done to it. Robineau first:

“First, look at all those groups whose members support evolution. There are way more of them than there are of the creationist groups, and those circles are bigger. We need to get more of the pro-evolution religious out of the closet.

“Second, look at all those religious groups whose members support climate change action. Catholics fall a bit below the zero line on average, but I have to suspect that the forthcoming papal encyclical on the environment will shake that up.”

This in turn was apparently interpreted by the Washington Post to mean this:

The fact is, the normalisation has removed all hope of actually knowing whether there is conflict or not. The percentile scaling removes almost all of the quantitative info on the axes, so proximity on the scale means nothing with respect to proximity of answer. All the groups inside the small top right cluster could have >90% support for the scientific evidence and all of the groups outside <10% support, and you could still get that plot. (It’s hard to tell, but the top-right cluster looks closer to 1.0 than the bottom-left groups are to -1.0, indicating that they might deviate much more from the mean thanks to the mapping onto a normal distribution. This implies that the data was not normally distributed in the first place and is probably a heavy-tailed or bimodal distribution instead.)
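This point is easy to demonstrate with a small sketch (hypothetical numbers, not the survey data): any two sets of raw responses related by a positive linear rescaling produce identical percentile scores, so wildly different absolute levels of support become indistinguishable after the transformation.

```python
import numpy as np
from scipy.stats import norm

def to_percentile(raw):
    """Grant-style scaling: standardise, then map through the normal CDF."""
    z = (raw - raw.mean()) / raw.std()
    return norm.cdf(z) * 100

# Percent agreement for six groups in two very different worlds:
# in world A every group is above 90%; in world B every group is below 1%.
world_a = np.array([91.0, 92.0, 94.0, 95.0, 97.0, 99.0])
world_b = world_a / 10 - 9.0   # same shape, shifted to 0.1-0.9%

# The scaled scores are identical, so the plot cannot tell them apart.
print(np.allclose(to_percentile(world_a), to_percentile(world_b)))  # True
```

Both worlds produce exactly the same set of points on the chart, even though one shows near-unanimous agreement and the other near-unanimous disagreement.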

Critically, it is impossible to conclude that any groups “support evolution” or “support climate change action”. As the graph is scaled by percentiles, 0.0 is essentially the point where 50% are above and 50% below. Because the vast majority of groups are religious, of course there are many religious groups above the line. There essentially have to be, unless all religious groups were identical (in which case they would group very slightly below 0.0).

To many, the stand-out thing is that atheists and agnostics are all in the top right. This graph could easily have been branded “the conflict between science and religion in one chart”! But it cannot even really say that: every group could disagree with the two statements and thus be in conflict with the scientific evidence. You would still get the same plot after the rescaling.

My big question from all of this is: why not make the plot using the raw percentage responses? What do the normalisations actually achieve?

And my big take home message: if you are going to infer things from plots, make sure that you understand how the data were scaled.

Saturday, 20 June 2015

MapTime lives!


It’s fair to say that MapTime has been somewhat neglected in the past couple of years, now that the core team is spread over three continents. However, having given the website a long-overdue look over today (after it went down a while ago), I am pleased to report that it still works! I even added a new TimePoint to the Organic Evolution TimeLine.