Monday, April 23, 2018

The Pittelli Test for Non-Randomness in a GWAS

I have been making a suggestion for how genome-wide association studies (GWAS) could be checked upfront to determine the likelihood that their alleged genetic correlations are not simply random false positives.  As I've pointed out previously, when you look at hundreds of thousands of loci or SNPs with potentially millions of study participants, it is very likely that some (if not all) of your "significant" p-values are only random false positives.  I want to formalize my suggestion (and, for the fun of it, name it after myself).
Here is my proposal:
When a GWAS is performed, a control should be done checking for random results.  Instead of dividing the study participants by whether or not they possess the trait (or one of the traits) being studied, the participants should be randomly assigned to groups of the same sizes as the real groups, and the study should be performed again to see if "significant" variants are found.  This will work as a control.  If the number of "significant" variants in the randomized run is reasonably close to the number found in the actual study protocol, then we have a very good indication that the SNPs or loci we found were random artifacts.
To give an example, suppose we have a million participants in a study, 100,000 of them possess the trait we are studying, and the GWAS finds 20 significant SNPs for that trait.  We would then randomize the participants into one group of 900,000 and one group of 100,000 and follow the same protocol AS IF we were performing the original study.  If, when doing this, we come up with somewhere near 20 "significant" loci or SNPs, we have a pretty good idea that the original correlations we found were likely random.  This is going to be far more accurate than attempts to mathematically model likelihoods, since all loci have different numbers of variants and frequencies.  If more than one trait is being looked at, we should randomize according to the numbers in each group (if 50,000 had one trait, 20,000 another and 30,000 another, we would create 3 random groups of those sizes to represent that).
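The randomization step in the example above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not a real GWAS pipeline: the function names, the simulated genotypes, the simple allele-count z-test, and the alpha of 0.05 are all hypothetical choices made for the sketch (real studies use dedicated software and a much stricter threshold, conventionally 5e-8). The one part taken directly from the proposal is that the trait labels are shuffled, so the randomized groups keep exactly the original sizes.

```python
import math
import random


def allele_test_p(case_alt, case_n, ctrl_alt, ctrl_n):
    """Two-sided two-proportion z-test on alternate-allele counts.

    Assumed stand-in for whatever association test the study uses.
    Each participant contributes two alleles per SNP.
    """
    p_case = case_alt / (2 * case_n)
    p_ctrl = ctrl_alt / (2 * ctrl_n)
    pooled = (case_alt + ctrl_alt) / (2 * (case_n + ctrl_n))
    se = math.sqrt(pooled * (1 - pooled) * (1 / (2 * case_n) + 1 / (2 * ctrl_n)))
    if se == 0:
        return 1.0  # no variation at this SNP, so no evidence either way
    z = (p_case - p_ctrl) / se
    return math.erfc(abs(z) / math.sqrt(2))


def count_significant(genotypes, is_case, alpha):
    """Count SNPs with p < alpha. genotypes[i][j] is 0, 1 or 2."""
    n_snps = len(genotypes[0])
    case_n = sum(is_case)
    ctrl_n = len(is_case) - case_n
    case_alt = [0] * n_snps
    ctrl_alt = [0] * n_snps
    for row, case in zip(genotypes, is_case):
        target = case_alt if case else ctrl_alt
        for j, g in enumerate(row):
            target[j] += g
    return sum(
        allele_test_p(case_alt[j], case_n, ctrl_alt[j], ctrl_n) < alpha
        for j in range(n_snps)
    )


def randomized_labels(is_case, rng):
    """Shuffle trait labels; group sizes stay the same as in the real study."""
    shuffled = list(is_case)
    rng.shuffle(shuffled)
    return shuffled


# Toy data (hypothetical): 1,000 participants, 50 SNPs, 100 "cases".
rng = random.Random(0)
n_people, n_snps, n_cases = 1000, 50, 100
genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_snps)] for _ in range(n_people)]
is_case = [i < n_cases for i in range(n_people)]

original_hits = count_significant(genotypes, is_case, alpha=0.05)
randomized_hits = count_significant(genotypes, randomized_labels(is_case, rng), alpha=0.05)
print("original:", original_hits, "randomized:", randomized_hits)
```

Because the toy genotypes here are generated at random, both counts reflect chance alone; with real data, the comparison of the two counts is the point of the control.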
As with p-values, we still have to come up with an admittedly arbitrary standard for how many randomly generated "significant" results would lead us to reject the study as nothing more than random.  I am going to make a case for the original study to have at least 5 times the number of significant loci/SNPs as the randomized version (if the original study has fewer than 5 significant results, then I'll say that the randomized version must have no false positives).
Anytime that a study of this nature is performed, we try to eliminate any confounding factors, but I think that a 5 to 1 ratio should give us confidence that we have a non-random result.  That doesn't necessarily prove the validity of the findings.  It just helps us feel confident that we are looking at results that are not random.
So if we had 4 positive results in our original study and, say, two positive results in the randomized version, I would say that this doesn't pass the Pittelli Test and there is a fairly good probability that we are dealing with false positives.  If, on the other hand, we have 50 positive results in the original and only 5 in the randomized version, then I think we can be very comfortable with the fact that we are not dealing with an entirely random result and we can proceed from there.
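The 5-to-1 standard and the two examples above can be written as a small check. This is a minimal sketch: the function name and the `min_ratio` parameter are hypothetical, and the case of zero hits in the original study is not addressed in the proposal, so the sketch simply applies the "fewer than 5" rule to it.

```python
def passes_pittelli_test(original_hits, randomized_hits, min_ratio=5):
    """Return True if the original study clears the proposed ratio standard.

    - Fewer than min_ratio original hits: the randomized run must have none.
    - Otherwise: original hits must be at least min_ratio times randomized hits.
    """
    if original_hits < min_ratio:
        return randomized_hits == 0
    return original_hits >= min_ratio * randomized_hits


# The two examples from the text.
print(passes_pittelli_test(4, 2))   # 4 original vs. 2 randomized
print(passes_pittelli_test(50, 5))  # 50 original vs. 5 randomized
```

The first call fails the standard and the second passes it, matching the two cases described above.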
In fact, this same random data-set can be used as a basis for comparison for other elements I see in GWAS studies.  For example, many studies try to perform what I will call "Hindsight Replication," in which they correlate their positive findings to previous findings in other studies.  This would have more value if the randomized results could not produce such a correlation and the actual results could.
I am going to contend that most of these GWAS studies will not pass the Pittelli Test.  I will also contend that previously performed studies claiming a number of correlated variants will also generally not pass the Pittelli Test.
In my view, all GWAS studies performed from this point forward should use this or a similar method to determine whether they have non-random results and, if possible, all studies performed to date should add such a test as an addendum.  If they don't pass the Pittelli Test, they should be retracted.
(Addendum: I have submitted this to a journal for consideration for publication.  Can't say I'm expecting them to bite, but it would be interesting.)

Addendum:  I have also come up with a test to determine whether population stratification, the other confounding possibility in GWAS studies, is a likely problem for the GWAS.  This test uses previous GWAS studies for the same trait.  That can be found here.

3 comments:

  1. Interesting idea. Another one to consider is “nonrandom” groups of the same size that are clearly nongenetic. E.g., take the first 100,000 people alphabetically by first name, phone number, zip code, etc. (depending on what’s available). Yes, these will have associations with genetics because geography, names, etc. have some genetic structure due to immigration and so on, but that’s precisely the point! Attributing to genetics patterns that actually have social/economic/geographic causes is likely to be a huge problem with massive datasets and data mining technology that can pick up any correlation for any reason. There are attempts to correct for some confounding variables in GWAS, but there is little reason to think that, e.g., a Socioeconomic Status variable actually fully captures the complex reality of such things.

    Replies
    1. Yes, I get mixed messages related to population stratification issues. That, of course, is another potential reason for false positives that wouldn't be accounted for in my test. My expectation is that there will be a bit of pop/strat, so the number of false positives will be a bit lower than those in the original study. In any case, since my test points to a random false positive rate, I will try to come up with something like you mention above that could also rule out pop/strat false positives.

    2. Okay, I put another test together to check for population stratification. Check it out here and tell me what you think:
      http://unwashedgenes.blogspot.com/2018/06/a-method-for-assessing-population.html
