Sunday, April 22, 2018

Educational Attainment genes or a whole lot of nothing

I took a look at this study, because I was told that there are some really fantastic studies related to genetics and intelligence/educational attainment that have come out since 2015.  This one is from 2016.  I am very unimpressed:

Genome-wide association study identifies 74 loci associated with educational attainment

Each of the GWAS studies I've looked at so far has a fundamental problem right up front, and this one is no exception.
That problem is that they don't test their datasets to determine how many variants would come up "positive" at random.  The study reports 74 loci (not SNPs, mind you) with a positive correlation to educational attainment out of 9.3 million SNPs, if I am reading this correctly.  (For the sake of brevity, I'm going to leave aside the question of how they quantify educational attainment as a single value, but I reserve the right to return to it in the future.)  When I read that, I was immediately struck by the possibility that this is just random.  So let's stop right here and examine this.  Let's say we randomized the data, so that the individuals were no longer grouped by educational attainment or, really, anything at all, but were assigned at random to groups of the same sizes.  Would we expect to get zero correlations out of 9.3 million?  Does anyone believe that?
Of course not.  Common sense would tell us that we would get some false positives.  How many false positives would we get?  Well, if the authors weren't hellbent on finally proving the unprovable, they could have tested this themselves.  I'm guessing we would get somewhere in the neighborhood of, oh, say, 74 false positives.  If they took up my suggestion and came up with a number anywhere close to 74, then there would really be no need to continue.  We would know that the data is largely, if not exclusively, false positives.  But without knowing the likely number of false positives, we can't get even a rough idea of how many of the 74 loci are likely to be the real deal (I'm guessing zero, but it's not in my court to prove it).  This is not a small problem, because everything that is done from that point on operates under the assumption that none of these 74 loci, or at most a negligible number of them, are false positives.  This is an assumption that I think anyone can see is already on shaky ground.  (Addendum:  I have quantified this with my own suggestion for a test of non-randomness, which I refer to as The Pittelli Test.)
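A minimal sketch of the kind of random control described above, on made-up data.  Everything here is an assumption for illustration: the function names (count_hits, permutation_null), the synthetic genotypes and phenotype, and the use of a plain correlation cutoff in place of the regression and p-value threshold an actual GWAS would use.  The idea is simply to shuffle the phenotype so it no longer lines up with anyone's genotype, rerun the same scan, and see how many variants still come up "positive."

```python
import numpy as np


def count_hits(genotypes, phenotype, threshold):
    """Count variants whose absolute Pearson correlation with the
    phenotype exceeds `threshold`. Rows are individuals, columns are variants."""
    g = genotypes - genotypes.mean(axis=0)   # center each variant
    p = phenotype - phenotype.mean()         # center the phenotype
    denom = np.sqrt((g ** 2).sum(axis=0) * (p ** 2).sum())
    r = (g.T @ p) / denom                    # one correlation per variant
    return int((np.abs(r) > threshold).sum())


def permutation_null(genotypes, phenotype, threshold, n_perm, seed=0):
    """Shuffle the phenotype across individuals n_perm times and record
    how many variants pass the threshold each time (chance hits only)."""
    rng = np.random.default_rng(seed)
    return [
        count_hits(genotypes, rng.permutation(phenotype), threshold)
        for _ in range(n_perm)
    ]


# Hypothetical toy data: 200 individuals, 500 variants coded 0/1/2,
# and a phenotype generated independently of every variant.
rng = np.random.default_rng(42)
genotypes = rng.binomial(2, 0.3, size=(200, 500)).astype(float)
phenotype = rng.normal(size=200)

observed = count_hits(genotypes, phenotype, threshold=0.2)
null_counts = permutation_null(genotypes, phenotype, threshold=0.2, n_perm=20)

print("hits in the real ordering:", observed)
print("hits after shuffling, per permutation:", null_counts)
```

If the count from the real ordering sits inside the range of counts from the shuffled runs, the scan has not shown anything beyond what chance produces at that cutoff.  The cutoff of 0.2 is arbitrary and chosen only so the toy example produces a few hits.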
This isn't a new problem, by the way.  As I've mentioned here, I pointed this out in 2002 and made the same suggestion.  One might ask why they continue to ignore my suggestion?  I have a theory about it, which has to do with author bias and the unconscious mind, but I welcome a response from any of the authors of this or other studies of this nature in which no random control is performed.
The authors would like to believe that this large number of new loci (at least 71 of the 74 had not been found in previous studies) is due to an increase in n, from 100,000 in a previous study to closer to 300,000 in the combined data, which is what they refer to as a meta-analysis, if I understand their premise.  The problem with this is that the larger the database, the more false positives we are going to get, whereas the number of true positives should, in theory, level off at the actual number of true positives (assuming such true positives exist).  So a large increase in the number of positive correlations is probably more likely due to false positives.
If this isn't clear, let me illustrate by example:  Let's say we have a trait that has an actual, full total of 50 SNPs associated with it.  At some point, we should pick up all 50 of these if our database is large enough.  After that point, we should not see an increase in the number of correlated SNPs, because we would have reached the full total, no matter how much larger our database gets.  That's not the case with false positives, however, which could keep growing well past the number of truly associated SNPs.  So if we do a study in which we expand our dataset and come up with a very large increase in the number of positive correlations, we are probably looking at false positives.
I'm tempted to stop right here, since I think anyone reading this might be at least a little bit convinced that there might be some problems with this study, but I'll make a few more points.  First, the authors try to claim that there were 3 SNPs from a previous meta-analysis that were "replicated" in this study (can they please stop using the term replication for things that aren't a true replication?).  I'll quote them here:
"Our meta-analysis identified 74 approximately independent genome-wide significant loci. For each locus, we define the ‘lead SNP’ as the SNP in the genomic region that has the smallest P value (Supplementary Information section 1.6.1). Figure 1 shows a Manhattan plot with the lead SNPs highlighted. This includes the three SNPs that reached genome-wide significance in the discovery stage of our previous GWAS meta-analysis of educational attainment."
Anyone else notice a bit of a switcheroo here?  They started this paragraph with "loci" and finished it with "SNPs."  Another way to say this is that 3 of the 74 positive loci contained a SNP from a previous study.  Lead SNP or not, there is no proof that these are the same SNPs.  So we really have no definitive "replications" in our "replication."

Next, another misuse of the term replication:
"To further test the robustness of our findings, we examined the within-sample and out-of-sample replicability of SNPs reaching genome-wide significance (Supplementary Information sections 1.7–1.8). We found that SNPs identified in the previous educational attainment meta-analysis replicated in the new cohorts included here, and conversely, that SNPs reaching genome-wide significance in the new cohorts replicated in the old cohorts."
This tries to give the impression that there was an actual replication of the new dataset by the old dataset (in the hindsight method of replication that I am seeing so much of lately).  In fact, again, none of the SNPs or loci from the previous study matched this one.  This is an attempt to mathematically re-examine the non-significant loci from the previous study in order to make a case for some significance.

I would also like to address the fetal neural tissue claim, which appears to be an attempt to show that these positive loci are consistent with a neurological correlate:
"Candidate genes are preferentially expressed in neural tissue, especially during the prenatal period, and enriched for biological pathways involved in neural development. "
This, again, is a kind of proof by hindsight.  If the study had been an attempt to find out whether genes involved in prenatal brain development affect intelligence or educational attainment, then a claim like this would have perhaps a tiny bit of merit.  But this was merely a study to see if there were ANY genetic correlates to educational attainment.  So basically any genetic correlates that had some neurological function could have been presented here as "evidence" that they were valid.  There are any number of brain-related functions tied to various genes, so you are going to find some in your 74 loci.  We haven't even gotten to the stage of demonstrating that these aren't random false positives, much less how they function to affect the trait in question.

Lastly, for now, I'll address the alleged correlations with other traits:
"As shown in Fig. 2, based on overall summary statistics for associated variants, we find genetic covariance between increased educational attainment and increased cognitive performance (P = 9.9 × 10^−50), increased intracranial volume (P = 1.2 × 10^−6), increased risk of bipolar disorder (P = 7 × 10^−13), decreased risk of Alzheimer’s (P = 4 × 10^−4), and lower neuroticism (P = 2.8 × 10^−8). We also found positive, statistically significant, but very small, genetic correlations with height (P = 5.2 × 10^−15) and risk of schizophrenia (P = 3.2 × 10^−4)."
Since I have already called into question the validity of the variants produced in this study, and would assume similar questions about the validity of the variants tied to the traits noted above, it seems likely that these alleged correlations are no more than random data on top of random data.

In Conclusion:
1. No control has been done to determine the likely number of false positives in a study of this nature, and it is quite possible that many, if not all, of the 74 loci noted are nothing more than false positives.
2. Rather than taking a direct approach to determining whether these variants are valid, the authors attempt to demonstrate their validity by assuming that the apparent mechanism of the gene is consistent with the trait in question.
3. None of the SNPs noted in this study have been previously noted in other reasonably large studies of this trait, which suggests that they are being randomly generated.
4. I suggest the authors redo the study with a proper control as suggested above.
5. Any study that uses this one as part of its foundation is itself brought into question unless it can be demonstrated that the 74 loci in question are anything more than random false positives.
