Inducing molecular lesions in a genome is an effective approach to interrogate the genome for its functional elements. Chemical mutagens that introduce specific base changes, such as EMS, are a popular choice because they efficiently induce point mutations that can alter the activity of a gene product in many different ways. The high frequency of mutagenic events means that a mutant selected for a single-locus phenotype should also carry a number of EMS-induced “background mutations” elsewhere in the genome. Experimenters try to minimize the potential impact of background mutations by backcrossing to a wild-type genome. However, estimates of mutagenic rates and of backcrossing efficiency are largely theoretical and rest on observations at a limited number of loci; no genome-wide snapshots of sequences immediately after EMS mutagenesis and after outcrossing have yet been provided.

We took a broader look at 14 EMS-mutagenized genomes representing strains that were isolated based on specific phenotypic features (different types of neuronal specification defects). We sequenced each entire genome with an Illumina Genome Analyzer II. We find that these strains maintain a substantial mutational load in spite of outcrossing to wild-type animals (between 0 and 5 times, depending on the strain). Surprisingly, this “background” load includes presumptive loss-of-function alleles in many protein-coding genes and can lead to complex genetic interactions. One might assume that loss-of-function alleles would adversely affect fitness and be removed from the population after backcrossing. To look more closely, we filtered all variants uncovered by whole-genome sequencing (WGS) for such loss-of-function alleles and categorized them by molecular nature: nonsense, missense and splice site. To separate EMS-induced mutations from variants already present in our N2 or starting transgenic strain, we compared variants between individual datasets. Variants shared by two or more datasets were considered unrelated to the EMS mutagenesis, even though we cannot completely rule out the possibility that they are independent EMS hits.
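
The cross-dataset comparison described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the analysis: the function name `private_variants`, the strain names, and the representation of a variant as a `(chromosome, position, ref, alt)` tuple are all assumptions made for the example.

```python
from collections import Counter

def private_variants(variants_by_strain):
    """Return, for each strain, the variants seen in that dataset only.

    Variants shared by two or more datasets are dropped, on the
    assumption that they predate the EMS mutagenesis.
    """
    # Count in how many datasets each variant occurs (once per dataset).
    seen_in = Counter(
        variant
        for variants in variants_by_strain.values()
        for variant in set(variants)
    )
    return {
        strain: {v for v in variants if seen_in[v] == 1}
        for strain, variants in variants_by_strain.items()
    }

# Hypothetical toy data: (chromosome, position, reference base, variant base).
example = {
    "strainA": {("I", 1000, "G", "A"), ("X", 5000, "C", "T")},
    "strainB": {("I", 1000, "G", "A"), ("II", 2500, "G", "A")},
    "strainC": {("V", 7200, "C", "T")},
}

filtered = private_variants(example)
for strain in sorted(filtered):
    print(strain, sorted(filtered[strain]))
```

In this toy input the variant at I:1000 appears in two datasets and is removed from both, which also illustrates the caveat noted above: two independent EMS hits at the same position would be discarded in the same way.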

After filtering out such variants, we find that a single mutagenized strain contains, on average, 84 missense, 2 nonsense and 1 splice site mutation. This substantial number of background mutations was surprising because some of the sequenced genomes had been backcrossed several times, yet they nevertheless maintained a high mutational load. Non-backcrossed strains have an average mutational load almost identical to that of backcrossed strains: 84 missense, 4 nonsense and 1 splice site mutation. Moreover, the most heavily backcrossed strain (5x) contained 96 missense, 4 nonsense and 1 splice site mutation. Why would such a mutational load be maintained? Even if we disregard any variants linked to the specific phenotype we selected for, the numbers of loss-of-function mutations in backcrossed and non-backcrossed strains remain equivalent. Thus there is no immediate, trivial explanation. Perhaps specific sets of mutant combinations are balanced and thereby co-maintained. Indeed, in one strain that we analyzed in more depth, one premature stop codon resided in an essential gene, yet the strain was viable. We found this viability to be ensured by an unlinked suppressor mutation. These two variants are therefore fixed together in the strain.
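
To make clear why the retained load is unexpected, the naive expectation can be written down under simple assumptions: each background mutation is unlinked to the selected locus, selectively neutral, and has a one-in-two chance of surviving each round of outcrossing and re-isolation. This is a textbook approximation and not a model from this study; the function name `expected_retained` and the parameter `keep_prob` are illustrative.

```python
def expected_retained(initial_count, outcrosses, keep_prob=0.5):
    """Expected number of background mutations left after outcrossing.

    Assumes each mutation is unlinked, neutral, and independently
    retained with probability keep_prob per outcross round.
    """
    if outcrosses < 0:
        raise ValueError("outcrosses must be non-negative")
    return initial_count * keep_prob ** outcrosses

# Start from the average missense load reported for non-backcrossed strains.
for n in range(6):
    print(n, expected_retained(84, n))
```

Under these assumptions, five outcrosses would leave only a few of 84 starting missense mutations, far fewer than the 96 observed in the 5x backcrossed strain. Linkage to the selected locus raises the expectation, but only for mutations near that locus.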

Premature stop codons that we fortuitously retrieved – and the many more that surely anyone using whole genome sequencing will also fortuitously retrieve – are excellent candidates for genetic loss of function alleles. The strains that contain these alleles are therefore a valuable resource that complements the efforts of the C. elegans knockout consortia. All strains with these alleles will be made available through the CGC.

In sum, given the substantial background mutational load of EMS-mutagenized strains, it is important to rigorously validate that a given sequence variant is indeed the phenotype-causing one.