Thursday, May 9, 2019

Conjuring some within-family pop strat

Are you ready to hear something? I want you to see if this sounds familiar: any time you try a decent crime, you got fifty ways you're gonna f**k up. If you think of twenty-five of them, then you're a genius... and you ain't no genius. 
-Mickey Rourke to William Hurt in Body Heat

In a recent Twitter discussion, in which I suggested that "Educational Attainment" GWAS/PGS are largely bogus, bolstered by population stratification and assortative mating, someone noted: "Sooo, you'll bet against EA3 explaining even 1% in any new European sample within families?" The implication is that, despite the many recent studies showing how much polygenic scores are subject to stratification issues, if even 1% is explained by a "within-family" PGS, then at least something is genetic. This is quite a lowering of the bar from the supposed 13% from EA3 (the third large educational attainment GWAS), which I assume they see dwindling away in light of some of the recent studies (here and here). Anyway, something is better than nothing, so this is a way to suggest "something" is there. Certainly, when looking at "within-family" GWAS/PGS, you would expect significantly less population stratification, since you are comparing individuals who share the same parents, the same upbringing and much of their DNA. For example, a recent study showed significant attenuation when looking "within family" (which, surprisingly, considering the authors, they attributed to socioeconomic confounders). Nevertheless, even a smaller percentage is still something, and now they might claim that, at the cost of some predictive dilution, they have eliminated all stratification issues by using "within-family" analysis. Well, I'm not so sure about that, as I will discuss below the fold.

The study noted above, looking at the UK Biobank and its likely stratification issues, got me thinking about the fact that we are not really able to get at every possible stratification issue. Honestly, I had just assumed that these small percentages were noise and never took them seriously, even though I couldn't come up with any specific reason, which I assumed was above my statistics pay grade. Now though, as the original optimistic results (if you really think 13 percent is optimistic) appear to be getting pared down by these studies noting pop strat and within-family discrepancies, the within-family figure is likely to be cited as a base level of genetic correlation.
I want to try to give a reasonable example of how pop strat could creep into a within-family PGS and show slight predictability for a sibling group, even with an "independently" generated polygenic score, so that such predictability does not necessarily indicate a genetic component. This recent study drove this possibility home for me because of stratification issues it noted for a variety of traits in the UK Biobank, based on age alone.
For example, for both BMI (body mass index) and educational attainment, the PGS predictability was better for the younger age groups than the older, which would seem unusual, all other things equal. They don't give a definitive explanation for this, although they present numerous possibilities. I was struck by this one, though:

SES differences will often be a problem for GWAS in which the sample is not representative of the population; for instance, the most recent major GWAS of educational attainment (Lee et al. 2018) included numerous medical data sets and the 23andMe data set, which are not representative of the national population.

In other words, these studies depend on who ends up in these genetic databases. The reasons someone gets a genetic test (medical need, philosophical interest, popularity within their social group, whether they can afford it, etc.) vary, and this is going to skew the population being tested. I'll refer to this as "junk stratification" (apologies if this or a similar term has already been coined; otherwise, dibs). Thus, there are a whole host of possible stratification issues that we might never think of (see the quote at the top of this post).
As an aside, considering the hundreds of studies that have been based wholly or in part on the UK Biobank, the fact that a paper has come out noting significant stratification issues for things as simple as age and sex should make a lot of people a bit uncomfortable about the results they cited in their studies, and it raises the question of why this wasn't noted sooner. I invite any authors of said studies to explain to me why it isn't a big deal.
So let me get back to my point. Because of this stratification, the results of the GWAS/PGS for educational attainment were significantly attenuated with the sibling-based PGS.
...for a range of social and behavioral traits, such as years of schooling completed, pack years of smoking and age at first sexual intercourse, the prediction accuracy of the sib-based PGS was substantially lower than that of the standard PGS (Fig. 3B). It was also significantly lower for two morphological traits, height and whole body water mass.
Clearly, far more than physical traits, "social and behavioral traits" are subject to stratification issues, which seems like common sense. So let's focus on educational attainment. As noted, the prediction accuracy dropped dramatically with the sibling-based PGS. On the plus side, ideally, this sibling-based PGS can rule out the "junk stratification" problem and find the "real" PGS predictability for a trait. This would be the case, at least, if there were no overall, appreciable difference between siblings (who have similar genetics) when applying these scores. I can think of at least one way this can be skewed.
Let's take the null and assume here that any SNPs we find are non-causal and are really a product of stratification/assortative mating issues (I always assume that).
We already know that younger people skew the PGS for EA higher in the UK Biobank, and I think it's fair to assume (and the authors of the study suggest as much) that this skewing would be the same in the sibling-based group, as they are likely giving DNA for similar reasons (one could think of exceptions, but let's go with that). If we assume that sibling-based analysis will filter out this skewing, we are operating under the assumption that siblings are, on average, likely to have the same EA. I suggest that this is wrong and that younger siblings are likely, on average, to have a higher EA than their older siblings. This is not about birth order. If you look at graphs of EA over the years, such as the graph at the top right of this link, you see that there has been a continuous rise, at least in the United States and I believe also in Europe, in the share of both high school and college graduates. It looks like about half a percentage point per year over the past 20 or 30 years. If the average siblings are, say, 3 years apart, then the younger one has about a one and a half percentage point better chance of having graduated (from either high school or college).
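The back-of-the-envelope arithmetic there, as a minimal sketch. The function name is made up for illustration, and the inputs are just the rough figures eyeballed above (about 0.5 percentage points per year, siblings about 3 years apart), not measured values.

```python
def expected_sibling_gap(trend_points_per_year, age_gap_years):
    """Expected extra graduation chance of the younger sibling, in percentage points.

    Assumes a steady linear rise in graduation rates over the years between
    the two siblings, which is only a rough reading of the trend graph.
    """
    return trend_points_per_year * age_gap_years


gap = expected_sibling_gap(0.5, 3)
print(f"{gap:.1f} percentage points")  # prints "1.5 percentage points"
```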
Given that, we have a PGS that we already know skews higher for younger people, so younger people must carry some of the scored SNPs at greater frequency than older individuals (and they used a fairly liberal cutoff of 10^-4), and this would extend to those in the sibling-based group. So the PGS would likely pick up on some of these same SNPs amongst the younger siblings, giving them better scores, even if the real reason for the discrepancy was just age. The junk stratification has carried over to the sibling-based group and is indeed going to be slightly predictive.
You might argue that this is a small amount, and I think you would be right, but if this bumps your PGS prediction by a couple of percentage points and these sibling-based PGS predictions are already low percentages, then we have already taken a chunk out of them with just this one example, which I thought of only because of the recent study showing age stratification. A few more and we have effectively eliminated any legitimate genetic correlation or PGS prediction for the trait (then there is the "red-haired kids" idea I recently saw mentioned by Eric Turkheimer).
I think, given this, one can make the claim that these GWAS/PGS studies have not definitively demonstrated ANY genetic role for educational attainment. I welcome any responses, particularly from stats people.
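Here is a toy simulation of that scenario, as a sketch under stated assumptions rather than a model of any real data set. Everything in it is hypothetical: the function names, the 30% baseline graduation rate, and above all the `pgs_age_slope` parameter, which simply builds in the premise argued above (that the score drifts upward with later birth year, and that this drift survives inside sibling pairs). The score has no causal effect on graduation in the simulation; graduation depends only on birth year.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def within_family_correlation(n_families=20000, gap_years=3.0,
                              trend_per_year=0.005, pgs_age_slope=0.05,
                              seed=0):
    """Correlate family-mean-centered PGS with family-mean-centered EA.

    Null model: graduation depends only on birth year (a secular trend).
    ASSUMPTION (hypothetical): the PGS is pure noise plus a component that
    rises by pgs_age_slope per year of later birth.
    """
    rng = random.Random(seed)
    pgs_dev, ea_dev = [], []
    for _ in range(n_families):
        older_birth = rng.uniform(0.0, 10.0)  # years after an arbitrary start
        pgs, ea = [], []
        for birth in (older_birth, older_birth + gap_years):
            # Graduation probability rises with birth year; no genetic term.
            p_grad = min(1.0, max(0.0, 0.30 + trend_per_year * birth))
            ea.append(1.0 if rng.random() < p_grad else 0.0)
            pgs.append(rng.gauss(0.0, 1.0) + pgs_age_slope * birth)
        # Subtract the family mean: the usual within-family comparison.
        pgs_mean, ea_mean = sum(pgs) / 2, sum(ea) / 2
        for score, outcome in zip(pgs, ea):
            pgs_dev.append(score - pgs_mean)
            ea_dev.append(outcome - ea_mean)
    return pearson(pgs_dev, ea_dev)


if __name__ == "__main__":
    # Exaggerated, purely illustrative values so the direction is visible.
    print(within_family_correlation(trend_per_year=0.03, pgs_age_slope=0.5))
    # Same trend, but a score with no age drift at all.
    print(within_family_correlation(trend_per_year=0.03, pgs_age_slope=0.0))
```

With the rough figures from this post (0.005 per year, 3 years apart) and a modest slope, the expected correlation is very small, and a single run of this size may not distinguish it from zero; the exaggerated values in the usage lines are there only to show the direction of the effect. Whether a real score can carry an age-related component within sibling pairs is the assumption doing all the work here, and the simulation does not test it.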










