Whole genome sequence analysis for novices

Pete Okkema¹ and Adelaide Packard¹

¹Department of Biological Sciences and the Laboratory for Molecular Biology, University of Illinois at Chicago, Chicago IL

Correspondence to: Pete Okkema (okkema@uic.edu)

Oliver Hobert and his colleagues have pioneered the use of whole-genome sequencing (WGS) to identify lesions in C. elegans mutants, and they have produced the MAQGene software pipeline to analyze these data (Sarin et al., 2008) (http://maqweb.sourceforge.net). While MAQGene is excellent, it runs on Linux operating systems and requires a MySQL server, and these requirements are currently beyond our (and perhaps other C. elegans researchers') computer capabilities.

We are using WGS to identify a mutation, cu13, that enhances the lethal phenotype of a hypomorphic tbx-2 mutant. Illumina sequencing was used to sequence genomic DNA from several strains carrying either the cu13 or the wild-type allele. Freely available software packages were used to align sequence reads with the reference C. elegans genome, identify variants, and annotate these variants with their predicted effects on gene function. These analyses identified thousands of variants in each sequenced genome, and Microsoft Access was used to sort and compare the variants found in each genome. A small number of candidate lesions for cu13 were identified, and we are currently determining which of these causes the mutant phenotype. This approach is feasible for novices like us using a desktop computer and fairly rudimentary skills with the command line interface, and we thought others in the C. elegans community might be interested in trying this for themselves. The software packages generally have manuals and tutorials available, and we relied on these heavily.
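The genome-to-genome comparison described above does not strictly require a database program. As a minimal sketch (not the Access workflow used here), assume each genome's variants have already been reduced to (chromosome, position, reference, alternate) tuples. Python set operations can then list the variants shared by every cu13 strain and absent from every wild-type strain. The function name and the toy coordinates below are hypothetical.

```python
def candidate_variants(mutant_sets, control_sets):
    """Return variants found in every mutant strain and in no control strain."""
    if not mutant_sets:
        return set()
    # Variants common to all mutant strains.
    shared = set.intersection(*map(set, mutant_sets))
    # Any variant seen in at least one control strain is treated as background.
    background = set().union(*map(set, control_sets))
    return shared - background


# Toy data: hypothetical coordinates, not real variant calls.
cu13_a = {("III", 1000, "G", "A"), ("III", 2500, "C", "T"), ("X", 42, "T", "C")}
cu13_b = {("III", 1000, "G", "A"), ("III", 2500, "C", "T")}
wild_type = {("III", 2500, "C", "T"), ("V", 77, "A", "G")}

candidates = candidate_variants([cu13_a, cu13_b], [wild_type])
print(sorted(candidates))
```

Matching on exact position and alleles is the simplest possible rule; it ignores differences in coverage between strains, so a variant missed in one genome because of low read depth would be dropped or retained incorrectly.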

Sequence alignment: Bowtie 2 was used to index the C. elegans reference genome and to align our FASTQ sequencing reads to this reference (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml). Bowtie 2 is an ultrafast aligner that outputs a SAM (Sequence Alignment/Map) file used in subsequent analyses (Langmead and Salzberg, 2012; PMID 22388286), although Bowtie 2 may be less sensitive than the MAQ aligner used in MAQGene (Nielsen et al., 2011).
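The two Bowtie 2 steps (indexing, then alignment) can be sketched as command lines assembled in Python. This is an illustration rather than the exact commands run for this project: the file names are hypothetical, and the `-U` option assumes unpaired reads (paired-end data would use `-1` and `-2` instead). The script only prints the commands; passing each list to `subprocess.run(cmd, check=True)` would execute it if Bowtie 2 is installed.

```python
import shlex


def bowtie2_build_cmd(reference_fasta, index_base):
    """Command that builds a Bowtie 2 index from a reference FASTA file."""
    return ["bowtie2-build", reference_fasta, index_base]


def bowtie2_align_cmd(index_base, fastq, sam_out):
    """Command that aligns unpaired FASTQ reads and writes a SAM file."""
    return ["bowtie2", "-x", index_base, "-U", fastq, "-S", sam_out]


# Hypothetical file names, used for illustration only.
commands = [
    bowtie2_build_cmd("c_elegans.fa", "c_elegans"),
    bowtie2_align_cmd("c_elegans", "cu13_reads.fastq", "cu13.sam"),
]

for cmd in commands:
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```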

Variant identification: The SAMtools software package was used to identify variants and call genotypes based on SAM alignment files (http://samtools.sourceforge.net/) (Li et al., 2009). SAM files were first converted to their binary equivalent, the BAM format, and sorted using the ‘samtools view’ and ‘samtools sort’ commands. Information regarding sequence quality and possible genotypes was calculated using the ‘samtools mpileup’ command and stored in the BCF file format. Variants were called and written to a VCF (Variant Call Format) file using the ‘bcftools view’ and ‘vcfutils.pl’ commands. VCF is a widely used text file format that stores information regarding variant position and sequence, sequence quality, and predicted genotype.
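Because VCF is tab-delimited text, its core fields are easy to read without specialized software. The sketch below shows one way to pull the chromosome, position, reference and alternate alleles, and quality score from each record and to keep only higher-quality calls. The example records, the sample column name, and the quality cutoff of 20 are made up for illustration and are not values used in this analysis.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: Optional[float]  # None when the VCF QUAL field is "."


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping header lines that start with '#'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # The first six fixed VCF columns are CHROM, POS, ID, REF, ALT, QUAL.
        chrom, pos, _id, ref, alt, qual = line.split("\t")[:6]
        yield Variant(chrom, int(pos), ref, alt, None if qual == "." else float(qual))


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tcu13",
    "III\t1000\t.\tG\tA\t221.0\t.\tDP=30\tGT:PL\t1/1:255,90,0",
    "X\t42\t.\tT\tC\t12.5\t.\tDP=4\tGT:PL\t0/1:40,0,60",
]

variants = list(parse_vcf(example))
high_quality = [v for v in variants if v.qual is not None and v.qual >= 20]
print(high_quality)
```

With a real file, `parse_vcf(open("calls.vcf"))` would work the same way, where `calls.vcf` is a placeholder name.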

Variant annotation: C. elegans genome annotations were retrieved from the UCSC Genome Browser Annotation Database using the Perl-based software package ANNOVAR (http://www.openbioinformatics.org/annovar/) (Wang et al., 2010). ANNOVAR was used to convert our VCF files to ANNOVAR input files and to annotate variants using the ‘perl convert2annovar.pl’ and ‘perl annotate_variation.pl’ commands. ANNOVAR outputs one file that annotates all variants with the genomic features they hit, and a second file that gives the amino acid changes for exonic variants. For convenience, these files were combined into a single table using Microsoft Access.
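Combining the two annotation files amounts to a join on variant coordinates, which can also be done in a few lines of Python. The sketch below is not the Access procedure used here, and it rests on an assumption about column order: that the all-variants file lists region, gene, then chromosome, start, end, reference, and alternate, while the exonic file lists a line identifier, consequence, and amino acid change before the same five coordinate columns. Check these positions against your own ANNOVAR output before relying on them. The gene names and rows are invented.

```python
import csv
import io


def read_rows(text):
    """Parse tab-delimited text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def merge_annotations(all_rows, exonic_rows,
                      all_key=slice(2, 7), exonic_key=slice(3, 8)):
    """Attach exonic consequences to every variant row, joined on coordinates.

    all_key and exonic_key mark where chrom, start, end, ref, alt are assumed
    to sit in each file; adjust them if your files are laid out differently.
    """
    exonic = {tuple(r[exonic_key]): (r[1], r[2]) for r in exonic_rows}
    merged = []
    for r in all_rows:
        key = tuple(r[all_key])
        consequence, aa_change = exonic.get(key, ("", ""))
        merged.append({
            "region": r[0], "gene": r[1],
            "chrom": key[0], "start": key[1], "end": key[2],
            "ref": key[3], "alt": key[4],
            "consequence": consequence, "aa_change": aa_change,
        })
    return merged


# Invented rows for illustration only.
all_text = (
    "exonic\tgeneA\tIII\t1000\t1000\tG\tA\n"
    "intronic\tgeneB\tX\t42\t42\tT\tC\n"
)
exonic_text = (
    "line1\tnonsynonymous SNV\tgeneA:txA:exon2:c.G10A:p.V4I,"
    "\tIII\t1000\t1000\tG\tA\n"
)

merged = merge_annotations(read_rows(all_text), read_rows(exonic_text))
for row in merged:
    print(row)
```

Variants with no entry in the exonic file keep empty consequence fields, so the merged table still contains every variant, as a left join would in a database.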

Variant and sequence visualization: The Integrative Genomics Viewer (IGV) (http://www.broadinstitute.org/igv/home) was used to visualize variants and the underlying sequence reads (Thorvaldsdottir et al., 2012). Variants called in VCF files and sequence alignments in BAM files can be loaded into tracks in the IGV browser and rapidly viewed at a wide range of genomic scales.

There are a variety of software options available for each of the steps (Nielsen et al., 2011), and while the software described here works for us, we are evaluating other approaches for each of these steps.

References

Langmead B, and Salzberg SL. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357-359.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, and Durbin R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.

Nielsen R, Paul JS, Albrechtsen A, and Song YS. (2011). Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12, 443-451.

Sarin S, Prabhu S, O’Meara MM, Pe’er I, and Hobert O. (2008). Caenorhabditis elegans mutant allele identification by whole-genome sequencing. Nat. Methods 5, 865-867.

Thorvaldsdottir H, Robinson JT, and Mesirov JP. (2012). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. April 19 (Epub ahead of print).

Editor's note:
Articles submitted to the Worm Breeder's Gazette should not be cited in bibliographies. Material contained here should be treated as personal communication and cited as such only with the consent of the author.

The WBG

An online publication service of WormBook
