The Human Knockout Project

What does every gene in the human genome do?  Which genes can cause disease?

The mouse knockout project was started in 2003/2005 as a concerted effort to knock out (KO) every known mouse gene as a resource for helping us understand human disease.  (A knockout, by the by, essentially means that the gene is mutated in such a way that it can no longer function.)  For example, if we head over to the Jackson Lab website and search for the gene Nipbl, we'll find the following phenotype:  "Mice heterozygous for a gene trap allele exhibit increased mortality from birth to early adulthood, decreased weight, decreased adipose tissue, delayed ossification, craniofacial abnormalities, and abnormal hearing, behavior, and eye morphology."  Indeed, this gene is responsible for the human disease Cornelia de Lange Syndrome.  Gradually, as these KO mice get phenotyped, we will have a better understanding of the role each gene plays in development and disease.

A deliberate human knockout project would be, by any possible definition of amoral, staggeringly amoral.  But Mother Nature (amoral beast that she is) is making human KOs every second of every day, and it would be possible, via a large sequencing project, to discover (almost) every gene that can lead to a severe, Mendelian, autosomal dominant (AD) phenotype.  How many kids would we have to sequence to get a handle on these types of genes?  The math is fun, but some background first...

AD diseases require that only 1 of the 2 copies of a gene be dysfunctional to lead to disease.  Typically, severe AD diseases cause childhood disease and, if severe enough, prevent that child from having any children of their own.  Because such a mutation is never passed on, the only way these diseases can arise is by de novo mutation.  De novo mutations do not come from your parents but are "brand new" mutations that are pretty specific to you, and they are the foundational building blocks of evolution.  The number of de novo mutations that occur in each generation was originally estimated using evolutionary genomics: comparing areas of non-selected DNA in chimps and humans, and using the time of divergence of our two species plus a bit of guesswork.  Today, using high-throughput sequencing, we can directly measure the number of de novo mutations that occur in a child.  Although there is some difficulty in doing this accurately, especially at the whole-genome level, a pretty good estimate is 1 coding de novo mutation per generation (that is, in the 80 million coding bases you got from mom and dad, there is probably 1 "new" change).

OK, so every kid we sequence will (on average) have 1 new coding mutation.  How many of these mutations are deleterious?  Well, there are a lot of ways a mutation can be deleterious, but the most common (and the easiest to assay) are nonsense mutations.  These mutations essentially tell your cell that the coding region of the gene stops earlier than it should.  In any given gene, about 5% of all possible point mutations create a stop codon.  Now, not all stop mutations necessarily stop the gene from working, but let's say that a stop mutation in the first half of the gene IS deleterious.  That means only about 2.5% of all mutations in a gene will count as deleterious.
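The "about 5%" figure can be sanity-checked by brute force. The sketch below is not how the estimate above was derived; it simply enumerates every single-base substitution in every sense codon of the standard genetic code and counts how many produce a stop codon. It assumes all codons and all substitutions are equally likely, which real genes and real mutation processes do not obey, and the function name `count_stop_gaining_substitutions` is made up for illustration.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def count_stop_gaining_substitutions():
    """Count single-base changes that turn a sense codon into a stop codon.

    Assumption: every sense codon and every substitution is weighted equally.
    Returns (stop_gaining, total_substitutions).
    """
    stop_gaining = 0
    total = 0
    for codon in ("".join(c) for c in product(BASES, repeat=3)):
        if codon in STOP_CODONS:
            continue  # only mutate codons that currently encode an amino acid
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a mutation
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                if mutant in STOP_CODONS:
                    stop_gaining += 1
    return stop_gaining, total


stop_gaining, total = count_stop_gaining_substitutions()
print(f"{stop_gaining} of {total} substitutions create a stop codon "
      f"({stop_gaining / total:.1%})")
# prints: 23 of 549 substitutions create a stop codon (4.2%)
```

Under these simplified assumptions the answer is a little over 4%, in the same ballpark as the 5% used here.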

Now, there are 20,000 (ish) genes in the genome.  So any particular child has a 1/20K chance of having a mutation in a particular gene.  In turn, there is only a 2.5% chance that the mutation is deleterious.  If you do the math (1/20,000 × 2.5% = 1 in 800,000), you'd need to sequence about 800,000 children to get a knockout in every gene!  Turns out, that's a best-case scenario, as you have a good chance of getting multiple mutations in the same gene while other genes get none.  You're helped out a bit by the fact that there are other ways to get deleterious mutations (nonsynonymous mutations, splicing mutations, frameshifts) and that some genes will give you a phenotype even when only a small portion of the gene is destroyed, or when the gene is only slightly dysfunctional.  So, let's call it an even 1 million kids.
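The arithmetic above, and the reason 800,000 is a best case, can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a real study design: it assumes every gene is the same size and equally likely to be hit, counts only the nonsense mutations described above, and treats the number of hits per gene as Poisson distributed. All constant and function names are made up for illustration.

```python
import math

# Figures taken from the text; treating all genes as identical is an assumption.
N_GENES = 20_000
CODING_DE_NOVO_PER_CHILD = 1.0
FRACTION_DELETERIOUS = 0.025  # stop mutation in the first half of the gene


def knockout_rate_per_child_per_gene():
    """Chance that one child carries a de novo knockout in one given gene."""
    return CODING_DE_NOVO_PER_CHILD / N_GENES * FRACTION_DELETERIOUS


def children_for_one_expected_knockout():
    """Children needed so that each gene is hit once on average."""
    return 1.0 / knockout_rate_per_child_per_gene()


def fraction_of_genes_hit(n_children):
    """Expected share of genes with at least one knockout (Poisson model)."""
    return 1.0 - math.exp(-n_children * knockout_rate_per_child_per_gene())


def children_for_coverage(target_fraction):
    """Children needed for the expected share of genes hit to reach the target."""
    return -math.log(1.0 - target_fraction) / knockout_rate_per_child_per_gene()


if __name__ == "__main__":
    n = children_for_one_expected_knockout()
    print(f"One expected knockout per gene: {n:,.0f} children")
    print(f"Share of genes hit at that size: {fraction_of_genes_hit(n):.0%}")
    print(f"Share of genes hit at 1 million: {fraction_of_genes_hit(1_000_000):.0%}")
    print(f"Children for 95% of genes: {children_for_coverage(0.95):,.0f}")
```

With 800,000 children this model averages one knockout per gene, but because the hits land at random, only about 63% of genes (1 - 1/e) would be expected to have at least one. The other sources of deleterious mutation mentioned above push in the opposite direction, which this sketch deliberately leaves out.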

1 million sounds like a lot, but it's only about a quarter of the kids born each year in the USA.  Plus, it could probably be reasonably done for 500 million dollars (roughly $500 per child).  And this isn't some pipe dream; it is something that can be done now.  There is a proposal to sequence about 1 million veterans.  The problem with this proposal is that anyone with a severe childhood disease would not have "made it" to adulthood or old age.  Thus, all the really "bad" mutations have been weeded out by natural selection and are not detectable by this approach.  The sequence data from kids, in addition to elucidating de novo disease, will likely give us a plethora of information on recessive disease as well.  Coupling this data to the children's health records, and collecting DNA from the children's parents (which is a less likely possibility for veterans), would essentially crack the genome.  We'd know what each gene does, and how it is involved in disease.  And this information would be available for all future generations.  There are complications, of course: some mutations may only cause disease when coupled with other mutations, but these situations will be rare and will give future genomicists something to do with their time.

About the author

Matthew Bainbridge is President and CEO of Codified Genomics, analyst, and sometimes scientist.