Okbay et al. () are cautious about how their findings should be interpreted. Years of educational attainment is a complex phenomenon, and one cannot determine whether these genes are truly related to educational attainment or instead explain the selection process that led an individual to complete more schooling. Since there are more hypotheses of significant association than data points, one must correct for multiple testing, and the authors are careful to use an independent sample for the replication study. They take great care to convince the reader that the observed associations are unlikely to be spurious, both by utilizing the latest quality control protocols in the medical genetics literature (Winkler et al. ) and by carefully accounting for population stratification in their analysis. Specifically, the authors conduct a robustness check of their main analyses in which they (i) impose common support across samples by excluding dissimilar individuals, (ii) account for high levels of principal components as additional controls to capture potentially confounding genetic differences across samples, and (iii) include family fixed effects in the analysis. Their paper provides a comprehensive guide on how to undertake and report results from a GWAS.
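To make the multiple-testing concern concrete, the following is a minimal sketch of a Bonferroni correction, the simplest such adjustment. It is an illustration only and not necessarily the procedure used by Okbay et al.; the function names and the p-values are hypothetical. Under this rule, a family-wise error rate of 0.05 spread over one million tests gives a per-test threshold of 5e-8, which matches the genome-wide significance level conventionally used in GWAS.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at most alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected under a Bonferroni correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]


# Hypothetical p-values for five markers (illustrative numbers only).
p_values = [0.0004, 0.03, 0.2, 0.009, 0.6]
hits = significant_after_bonferroni(p_values)

# Per-test threshold implied by one million tests.
gwas_threshold = bonferroni_threshold(0.05, 1_000_000)
```

With five tests the per-test threshold is 0.01, so a marker with an unadjusted p-value of 0.03 would no longer be declared significant.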

Among economists who use genetic markers as instruments, there are major differences in how these variables are included in the first stage. Ding et al. () used a series of binary variables for each potential genetic polymorphism in the genes they investigated. A potential concern is that a many-instruments problem (Hausman et al. ) may result and, to date, no research has investigated using the LASSO in the first stage. Other researchers, rather than use a set of binary indicator variables, treat the genetic information as a continuous variable and include the count of risk alleles. We suggest that a count variable not only makes it more challenging for researchers to interpret first-stage relationships and assess whether they are consistent with the scientific literature, but also imposes a strong functional form restriction: that first-stage outcomes are linear in the number of risk alleles. We would argue for discrete indicator variables on two grounds. First, by allowing for nonlinear relationships, they let one easily test whether the linearity restriction is supported by the data. Second, they shed more light on which features are driving the estimated effect, giving one a better handle on whether the relationships mimic those hypothesized in the scientific literature.
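The linearity test described above can be sketched for a single marker with genotypes coded as 0, 1, or 2 risk alleles. The unrestricted first stage uses a separate mean for each genotype (equivalent to an intercept plus two indicator variables), the restricted first stage uses an intercept plus the allele count, and an F statistic with one restriction compares the two. This is a minimal sketch under stated assumptions: the function name, the simulated data, and the dominant-effect design are all hypothetical, and a real application would include additional controls.

```python
import random


def linearity_f_stat(outcome, allele_count):
    """F statistic for H0: the first stage is linear in the risk-allele count.

    Unrestricted model: a separate mean for each genotype (0, 1, 2 risk alleles).
    Restricted model: an intercept plus the allele count.
    """
    n = len(outcome)
    if n <= 3:
        raise ValueError("need more than three observations")

    # Unrestricted SSR: squared deviations from genotype-specific means.
    ssr_u = 0.0
    for g in (0, 1, 2):
        ys = [y for y, a in zip(outcome, allele_count) if a == g]
        if not ys:
            raise ValueError(f"no observations with genotype {g}")
        mean_g = sum(ys) / len(ys)
        ssr_u += sum((y - mean_g) ** 2 for y in ys)

    # Restricted SSR: simple linear regression of the outcome on the count.
    mean_a = sum(allele_count) / n
    mean_y = sum(outcome) / n
    sxx = sum((a - mean_a) ** 2 for a in allele_count)
    sxy = sum((a - mean_a) * (y - mean_y) for a, y in zip(allele_count, outcome))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_a
    ssr_r = sum((y - intercept - slope * a) ** 2
                for y, a in zip(outcome, allele_count))

    # One restriction; n - 3 residual degrees of freedom in the unrestricted model.
    return (ssr_r - ssr_u) / (ssr_u / (n - 3))


# Simulated data (hypothetical): a dominant effect, so carrying one or two
# risk alleles shifts the first-stage outcome by the same amount.
random.seed(0)
alleles = [(random.random() < 0.5) + (random.random() < 0.5) for _ in range(600)]
exposure = [1.0 * (a >= 1) + random.gauss(0.0, 1.0) for a in alleles]
f_stat = linearity_f_stat(exposure, alleles)
```

A value of `f_stat` well above roughly 3.84, the approximate 5% critical value of an F distribution with one numerator degree of freedom and a large number of denominator degrees of freedom, would lead one to reject the linear-in-count specification in favor of the indicator variables.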

A common refrain in many economics seminars is that one is constrained by data limitations. Indeed, much of the earliest work by economists using genetic data was limited by the genetic information collected within the data set being investigated. Generally, the initial genetic markers made available were those hypothesized to be of primary importance. These markers are called candidate genes, and they were generally chosen for genotyping either because they were located in a particular chromosome region suspected of being involved in the outcome or because their protein product suggested that they could influence the outcome being investigated.

This is a guest post by David Kirtley

