Wednesday, April 18, 2018

Risky Business: Making your own diagnoses, backed by hundreds of false positives.

For the same reasons that I am hesitant to waste time on silly studies, I think it is worthwhile not to let these things go unchecked, and I'm not seeing anyone else trying to debunk this foolishness. So, in addition to more traditional studies, I will turn what I think is a much-needed critical eye on a study like this one:

Genome-wide study identifies 611 loci associated with risk tolerance and risky behaviors

I think far more of it is on the way, because 23andMe and Ancestry.com have massive databases available to scientists, who can devise any premise, no matter how ridiculous (e.g., finds raisins to be tasty, enjoys jazz, reads mystery novels, thinks Britney Spears is smarter than Justin Bieber, likes strawberry milkshakes more than vanilla, etc.), and create Genome-wide Association Studies (GWAS) that will crank out hundreds of false positive loci or single nucleotide polymorphisms (SNP's), which they can then claim are linked to whatever it is they are studying.  Generally, these studies seem to me to be more of a window into the psychological assumptions of the authors than anything else: in this case, their perception of what risk-taking is, what may cause it, and even what is considered risky.
Rather than re-explain my primary premise (that these GWAS studies produce little more than false positive findings, which for that reason will never be consistently replicated), I will ask you to take a look at my previous post about this subject here.  Before I look at some of the more specific details of the study, let me make a few points right off the bat:
1. There is no psychological diagnosis specifically for risk-taking behavior.  Some diagnoses, like Borderline Personality Disorder and Antisocial Personality Disorder, might list such a symptom or trait among many others, as might, say, a manic episode, but in and of itself it is not a diagnosis.  I find it disturbing (consider my examples above) that virtually anything that can be formulated on a questionnaire (as this was) can be taken as a serious basis for a genetic linkage study.
2. The study, like all of these studies, provides no clear mechanism for how the trait/traits in question become phenotypes for hundreds of different flagged genes, most of which have no clear relation to each other.  
3. Like most GWAS studies, this one makes some hay about the trait being heritable to some extent, presumably to bolster the idea that the findings might correlate to something.  Yet, to date, no one that I am aware of has been able to explain how hundreds of different genes working together could possibly convey any significant heritability, since the more genes involved, as I understand it, the more watered down your heritability.  I request that anyone who can construct even a hypothetical mathematical model explaining how polygenic mechanisms could have high heritability please provide it, or a link, in the comments.  Thus, even if you accept their premise, they provide no explanation of how these SNP's could work to produce risky behavior, nor how that is even theoretically possible.
4. The risk behaviors actually named, "general risk tolerance, adventurousness, and risky behaviors in the driving, drinking, smoking, and sexual domains," are, in many cases, so disparate and marginally related that only someone who has watched too many detective movies from the 1940's would put them together in a single package.  To go on to assume that these "behaviors" are going to have common genes is a bit beyond the pale.  The authors appear to unwittingly confirm this, noting: "Our estimates of the genetic correlations between general risk tolerance and the supplementary risky behaviors are substantially higher than the corresponding phenotypic correlations."  This they blame on "environmental factors."

Now, let's look at some of the specific details of the study:

They looked at one million study participants, half from the UK Biobank and half from 23andMe, and used a few questions about risk-taking (cleverly, the main one in the UK Biobank is “Would you describe yourself as someone who takes risks?  Yes / No.”).  For the primary risk-taking behavior they try to "replicate" their findings by comparing them to 10 previous cohorts.  They don't do this with the other "related" behaviors.
The questions in the two primary databases are not exactly the same, and the databases are not a perfect match in other ways, but the authors make the case that they are reasonably correlated.  The questions in the 10 cohorts are also a bit different.  So my first question here is: why bother to combine both these databases, and why bother using the cohorts?  Aren't 500,000 people from one or the other of the primary databases enough?  Wouldn't a purer database be better than doubling your n with two slightly different ones?
Although I'll get into their claimed positive findings shortly, let me start with the one negative result that they shared:  They didn't replicate the previous SNP correlations, and their results were not consistent with previously proposed neuropathways for risk-taking behavior.  Let's pause right here, because I believe we might have another example of the GWAS shell-game phenomenon (my own term).  GWAS results find a correlation between a trait and a SNP or locus, which might even lead researchers to propose a neuropathway based on the function of the genes involved.  Then we do another GWAS that doesn't confirm the previous findings.  One might expect that people who were going around touting those previous findings as proof of both a genetic component for risk-taking behavior and a particular mechanism might have some kind of reckoning.  "All I thought I knew was nothing but false positives.  How foolish I've been..."  Of course, no such reckoning ever takes place, because with this study, which has effectively negated all the studies that preceded it, we now have a whole new crop of false positives to hang our hats on and brand new neuropathways to work with.  Out with the old and in with the new (until the next GWAS).
That's correct: the study finds other neurologically focused genes, of which there are going to be plenty among over 600 "positive" findings, and assumes brand new mechanisms based simply on which regions of the brain are most affected by these particular SNP's.  This is even looked at as some sort of proof that they are on the right track, despite the fact that they are inventing these neuropathways on the spot.
Now I want to look more closely at some of the claimed positive correlations, but first I need to say something about controls.  All GWAS studies are a bit different in terms of which and how many loci are being looked at, how many people are in the study, how many SNP's exist, etc.  So it is impossible to really know in advance how many correlated loci or SNP's you might expect to get at random.  In this case, the authors brush this fact aside with the following: "This empirical replication record matches theoretical projections that take into account sampling variation and the winner’s curse."  As I understand it, they are trying to say here that their results from the "replication" are better than you would expect randomly (correct me if I am misreading that).  In any case, a much cleaner control could be used.
What you should want to establish in a study like this (particularly one with such a shaky, non-scientific premise) is that you aren't just getting a bunch of random positive findings.  In this case, someone might argue that getting 600+ positive findings establishes that.  On the contrary, I think it should make someone a bit suspicious of the findings themselves.  That said, the authors do not say how many loci were looked at to get 600 positive findings, so we really have no sure way to know whether that is a lot.  We could try to work our expectations out mathematically, but when you consider how many different gene frequencies and variations we have in a dataset, I would say that even a rough approximation would be difficult. (Addendum:  I have now made a more specific suggestion regarding testing for randomness, which I am referring to as The Pittelli Test)
A simpler approach would be to re-divide the data into two random groups (using the same ratio between those with the trait and those without found in the initial study) and see how many SNP "correlations" you find, since you would know that these are all simply false positives (unless you want to examine your dataset to find some trait the "positives" have in common, like finding raisins to be tasty).  You could, in fact, do this several times to get a more exact expected rate of false positives.  If possible, I challenge the authors to do so as an addendum to their current study.  I'm guessing they will find a similar number of "positive" correlations.
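
The shuffled-label control described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the per-SNP test is a simple z-test on mean allele count (a real GWAS would typically use regression with covariates), the data layout (one list of allele counts per SNP) is made up for the example, and the function names are hypothetical. Shuffling the existing labels keeps the ratio of people with and without the trait exactly as it was.

```python
import math
import random


def snp_p_value(allele_counts, labels):
    """Two-sided p-value from a z-test on mean allele count, trait vs. no trait.

    Assumption: this simple test stands in for whatever test a real GWAS uses.
    """
    with_trait = [a for a, y in zip(allele_counts, labels) if y]
    without = [a for a, y in zip(allele_counts, labels) if not y]
    n = len(allele_counts)
    mean_all = sum(allele_counts) / n
    var_all = sum((a - mean_all) ** 2 for a in allele_counts) / (n - 1)
    if var_all == 0 or not with_trait or not without:
        return 1.0  # no variation or an empty group: no evidence either way
    diff = sum(with_trait) / len(with_trait) - sum(without) / len(without)
    z = diff / math.sqrt(var_all * (1 / len(with_trait) + 1 / len(without)))
    return math.erfc(abs(z) / math.sqrt(2))


def count_hits(snps, labels, alpha):
    """Number of SNPs whose p-value falls below the threshold alpha."""
    return sum(snp_p_value(column, labels) < alpha for column in snps)


def shuffled_hit_counts(snps, labels, alpha, rounds, seed=0):
    """Hit counts after randomly reassigning the trait labels `rounds` times.

    Every hit found this way is a false positive by construction.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    counts = []
    for _ in range(rounds):
        rng.shuffle(shuffled)  # same labels, new owners: ratio is preserved
        counts.append(count_hits(snps, shuffled, alpha))
    return counts
```

Comparing `count_hits(snps, labels, alpha)` on the real labels against the list returned by `shuffled_hit_counts` would show whether the real analysis turns up more hits than chance relabeling does.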

It should be noted that the positive correlations are spread amongst the different traits they looked at, risk-taking behavior being primary, then “adventurousness”, “automobile speeding propensity”, “drinks per week”, “ever smoker”, and “number of sexual partners”.  There is a bit of a smile on my face as I type this out...  I don't know how much time to spend picking apart the premise that these are going to be "genetically related" by some unknown mechanism.  I would hardly call smokers particularly adventurous or assume they have a lot of sexual partners, for example, so I really have a hard time understanding why anything other than the initial risk-taking trait is being looked at.  It merely confounds the data.
Getting into the results, they find 124 "lead SNP's" correlating to their primary risk-taking behavior trait.  They attempt what they call a "replication" of that, which I'll get into shortly, but they do not "replicate" the other 6 traits.  I'd like to pause here to point out an interesting bit of math.  I'll quote the paper: "We identified a total of 865 lead SNPs across the seven GWAS."  Now, let's say we assume that all of these were nothing more than random, false positive findings.  If we found 124 lead SNP's related to the first trait, and they were entirely random, how many might we expect to find in all 7?  By my math, 124 × 7 = 868.  So almost the exact total you would expect if you assume the first trait findings are random false positives!  That is one hell of a coincidence...
That, of course, assumes that these samples don't have many exact matches between the traits.  Do they?  The paper doesn't note any.   If they don't, that should really tell you something and I hope this is just an omission, because it otherwise appears to disprove their entire premise and makes me wonder what they are touting.  I would expect to find at least a few matches even at random.  How many, you ask?  Well, if they did my control suggestion, I'd give you a precise answer.  In any case, I welcome clarification of this point from the authors.
That said, they do at least attempt to say that there is some genetic overlap: "There is substantial overlap across the results of our GWAS. For example, 72 of the 124 general risk-tolerance lead SNPs are in loci that also contain lead SNPs for at least one of the other GWAS."  So no exact matches are mentioned, but there are a lot of SNP's for one trait that are in the same loci as different SNP's found for other traits.  I assume the thought here would be that, since they are near each other, they might correspond to similar functions.  This is incredibly speculative.  Is that figure (72 out of 124) better than what you might find at random?  Again, a control similar to what I've suggested might clarify.
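
For anyone who wants to check the back-of-the-envelope numbers in the last few paragraphs, here is the arithmetic spelled out. The four input figures come from the paper as quoted above; the variable names are just labels for this sketch, and the "expected total" rests on the assumption stated in the text, namely that each of the seven GWAS throws up the same number of lead SNP's as the first and that none are shared.

```python
# Figures quoted from the paper
lead_snps_primary = 124   # lead SNPs for general risk tolerance
n_gwas = 7                # number of GWAS run
reported_total = 865      # lead SNPs reported across all seven
overlapping = 72          # primary-trait lead SNPs in loci shared with another GWAS

# Assumption from the text: every GWAS yields the same count, with no shared SNPs
expected_total = lead_snps_primary * n_gwas        # 868
shortfall = expected_total - reported_total        # 3
overlap_fraction = overlapping / lead_snps_primary # roughly 0.58

print(expected_total, shortfall, round(overlap_fraction, 2))
```

Whether 58% locus overlap is more than chance would produce is exactly the kind of question the shuffled-label control is meant to answer.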

There is a good portion of the paper that looks at the distribution of SNP's for the primary trait and the 6 others and produces mathematical models to claim that this distribution is not random (that the SNP's are grouped in ways that suggest a correlation).  I admit that it would be difficult for me to fairly address these claims, but I'm thinking again that using such math in a study of this nature is going to be too approximate, considering the variations we are going to see in frequency of SNP's across the entire genome.  In any case, not to beat a dead horse, a control sure would have been handy in making this point.

The study also uses some interesting math to claim that they effectively replicated risk-taking behavior between their data set and the 10 cohorts.  I'm going to quote them and then raise an uncomfortable point:
"The genetic correlation between the discovery and replication GWAS is 0.83 (SE = 0.13). 123 of the 124 lead SNPs were available or well proxied by an available SNP in the replication GWAS results. Out of the 123 SNPs, 94 have a concordant sign (P = 1.7×10⁻⁹) and 23 are significant at the 5% level in one-tailed t tests (P = 4.5×10⁻⁸) (Extended Data Fig. 5.1). This empirical replication record matches theoretical projections that take into account sampling variation and the winner’s curse."

Here's the uncomfortable point: None of the 124 SNP's flagged in their original study were found in the 10 cohorts that were used for "replication".  None. 
How many would it take for a valid replication?  120?  80?  or 124?  Who can say?  The bottom line is that there was never anything to replicate in the first place.  I object strongly to these malleable uses of the term replication.  Pick one of the 124 SNP's you found in this study and see if it is also found in a new, independent, follow-up database.  If you do that, then you can at least make a case for replication.
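
For readers who want to see what figures like the quoted "94 of 123" and "23 of 123" are being measured against, here is a minimal sketch. It assumes, and this is my assumption rather than anything stated in the quoted passage, that the yardstick is a one-sided binomial tail: a coin flip per SNP (probability 0.5) for a matching sign, and a 5% chance per SNP of passing the significance cutoff by luck. `binom_tail` is a hypothetical helper name.

```python
import math


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more successes."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


# Assumed null: each SNP's sign matches by coin flip
sign_tail = binom_tail(94, 123, 0.5)

# Assumed null: each SNP passes the 5% cutoff by luck alone
signif_tail = binom_tail(23, 123, 0.05)

print(f"sign concordance tail: {sign_tail:.2e}")
print(f"5%-level tail:         {signif_tail:.2e}")
```

Note what this kind of calculation does and does not count: it asks only whether the direction (or nominal significance) of an effect matches more often than chance, not whether any individual SNP was independently flagged in the replication cohorts.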
I also want to point out what I will call speculative psychological associations, which I often see sprinkled into studies to bolster their case, this one included.  For brevity, I'll choose one example:
"After Bonferroni correction, we also find significant positive genetic correlations with the neuropsychiatric phenotypes ADHD, bipolar disorder, and schizophrenia. Viewed in light of the genetic correlations we find with risky behaviors classified as externalizing (e.g., substance use, elevated sexual behavior, and fast driving), these results suggest the hypothesis that the overlap with the neuropsychiatric phenotypes is driven by their externalizing component."
In my view, "externalizing" is not a useful or valid grouping for these mental disorders.  The closest parallel I can draw would be the designation of positive and negative symptoms of schizophrenia.  There is simply no reason to believe that all of these mental disorders are specifically genetically related to risk-taking behavior, much less to each other by the nebulous concept of "externalizing".

In Conclusion:
1. This study does not adequately demonstrate that it is anything more than random false positives.
2. It does not, in fact, replicate anything.
3. It does not demonstrate any real relationship between the 7 traits it studies.
4. It essentially contradicts previous studies related to genes and "risk-taking" behavior, both in terms of not matching the previous SNP's and in terms of not being consistent with the neuropathways previously proposed for this trait.
5. The study argues for new mechanisms in a circular fashion (for neuropathways and genetic correlations) under the assumption that their data is something more than random false positives.  In other words, they assume the results are correct and use these results to alter previous assumptions about risky behavior.
6. I believe the results to be random as I have outlined above.

