Worm Breeder's Gazette 14(2): 20 (February 1, 1996)

These abstracts should not be cited in bibliographies. Material contained herein should be treated as personal communication and should be cited as such only with the consent of the author.

The C. elegans genome sequencing project: A progress report

The C. elegans Genome Consortium

Genome Sequencing Center, Washington University School of Medicine, St. Louis, Missouri, USA and The Sanger Centre, Hinxton Hall, Cambridge, UK.

Over 30 megabases of sequence have now been finished, from 972 clones. This figure should increase rapidly, as the anticipated rate of finishing during 1996 is between 2.5 and 3.0 megabases per month. The breakdown of finished sequence per chromosome, in bases, is as follows:

	II=7,270,380  III=7,197,211  IV=3,909,405  X=11,871,238

With an additional 24 megabases of sequence in progress, the total amount of C. elegans DNA sequence available to researchers is now over 54 megabases. Sequencing is also currently underway on both chromosomes I and V.
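The totals quoted above can be checked directly from the per-chromosome figures. The short Python sketch below does so; the variable names are illustrative, and the 24 megabases in progress is taken as the rounded value given in the text.

```python
# Finished sequence per chromosome, in bases, as listed in the report.
finished = {
    "II": 7_270_380,
    "III": 7_197_211,
    "IV": 3_909_405,
    "X": 11_871_238,
}

# "An additional 24 megabases" in progress: a rounded figure from the text.
in_progress = 24_000_000

finished_total = sum(finished.values())
available_total = finished_total + in_progress

print(f"finished:  {finished_total:,} bases")
print(f"available: {available_total / 1e6:.1f} megabases")
```

The finished total comes to just over 30 megabases, and the combined figure to just over 54 megabases, in agreement with the text.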

The total number of predicted genes in the sequenced region is 6157, of which approximately 48% have significant similarity to genes previously characterised in other organisms. Around 30% of predicted genes are confirmed as having transcripts, that is, they are associated with one or more EST sequences. The predicted gene density on the X chromosome differs substantially from that of the autosomes. Currently the X chromosome has a predicted gene density of one gene per 6.2 kb (20% coding), whilst within the autosomes the density rises to one gene per 4.6 kb (31% coding). These figures may be slightly misleading, as the chromosomes are all at differing stages of completion and there is an over-representation of the gene-rich regions from the autosomes. All the predicted protein products from the genome project are contained within the WORMPEP database, which can be obtained from the Sanger Centre via FTP.
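As a rough illustration of what the gene statistics above imply, the Python sketch below converts the quoted percentages into gene counts and combines each density with its coding fraction to give a mean amount of coding sequence per gene. That last quantity is not stated in the report: it is derived here on the assumption that the density and the coding percentage refer to the same sequence, and all inputs are the rounded values from the text.

```python
predicted_genes = 6157

# Gene counts implied by the rounded percentages in the report.
with_similarity = round(predicted_genes * 0.48)  # similar to known genes
with_est = round(predicted_genes * 0.30)         # matched by one or more ESTs


def coding_bases_per_gene(kb_per_gene, coding_fraction):
    """Mean coding bases per gene implied by a density and a coding fraction."""
    return kb_per_gene * 1000 * coding_fraction


# Derived values (an assumption of this sketch, not figures from the report).
x_coding = coding_bases_per_gene(6.2, 0.20)
autosome_coding = coding_bases_per_gene(4.6, 0.31)

print(f"genes with similarity: ~{with_similarity}")
print(f"genes with EST match:  ~{with_est}")
print(f"X chromosome: ~{x_coding:.0f} coding bases per gene")
print(f"autosomes:    ~{autosome_coding:.0f} coding bases per gene")
```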

From the total number of expressed sequence tags derived from sequenced regions of the genome, and the number of genes predicted therein, it can be estimated that the C. elegans genome contains around 13,900 genes. Although this calculation should not be affected by changes in gene density, its accuracy relies heavily on our ability to predict the presence of genes correctly. Currently, although we have sequenced only 30% of the genome, this region is predicted to contain 44% of all C. elegans genes. This is because genome sequencing has so far taken place mainly within the gene-rich regions of the chromosomes, and significant amounts of sequence from the less gene-dense chromosome extremities have not yet been obtained.
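The arithmetic behind this estimate can be sketched in Python as follows. The underlying EST counts are not given in the report, so the sketch starts from the rounded 44% figure and is only approximate. The naive length-based extrapolation is included for contrast and is not a method used in the report.

```python
predicted_genes = 6157      # genes predicted in the sequenced region
fraction_of_genes = 0.44    # share of all genes in that region (EST-based)
fraction_of_genome = 0.30   # share of the genome sequenced so far

# Scale up by the share of genes the sequenced region is thought to hold.
est_based_total = predicted_genes / fraction_of_genes

# Naive alternative: scale up by sequence length alone. This ignores the
# bias towards gene-rich regions and therefore overestimates.
length_based_total = predicted_genes / fraction_of_genome

print(f"EST-based estimate:    ~{est_based_total:,.0f} genes")
print(f"length-based estimate: ~{length_based_total:,.0f} genes")
```

With the rounded inputs the EST-based figure comes to roughly 14,000, in line with the estimate of around 13,900, whereas the length-based figure exceeds 20,000. The gap illustrates why the concentration of sequencing in gene-rich regions matters.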

The genomic sequence data has also revealed a number of aberrant 5' splice sites in which the usual GT is replaced by GC. The rest of each splice site strongly matches the observed 5' splice site consensus, which may compensate for the presence of the GC and allow correct processing within the spliceosome.

Splice site   Gene                Evidence

AA/GCAAGTT    T04A8.16            EST data
AG/GCTAGTC    F11C1.5             EST data
AT/GCAAGTT    ZK1098.1            EST data
AG/GCAAGTT    let-653 [1]         cDNA sequence
TG/GCAAGTT    C07H4.2             homology
AG/GCAAGAT    F17E5.1 lin-2 [2]   cDNA sequence
----------------------------------------------
AG/GTAAGTT    5' splice site consensus
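The degree to which each aberrant site matches the consensus can be counted position by position. The Python sketch below does this for the six sites listed above; the function and constant names are illustrative only.

```python
CONSENSUS = "AG/GTAAGTT"

SITES = [
    "AA/GCAAGTT",
    "AG/GCTAGTC",
    "AT/GCAAGTT",
    "AG/GCAAGTT",
    "TG/GCAAGTT",
    "AG/GCAAGAT",
]

# Index (slash removed, 0-based) of the second intron base: the T of the
# usual GT, which is replaced by C in all of these sites.
GC_POSITION = 3


def mismatches(site, consensus=CONSENSUS):
    """Return the 0-based positions (slash removed) where site != consensus."""
    a = site.replace("/", "")
    b = consensus.replace("/", "")
    if len(a) != len(b):
        raise ValueError("site and consensus must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


for site in SITES:
    other = [i for i in mismatches(site) if i != GC_POSITION]
    print(f"{site}: {len(other)} mismatch(es) besides the GC")
```

Apart from the GC itself, no site differs from the consensus at more than two of the remaining eight positions, and the let-653 site matches at all of them.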

For further information on the C. elegans gene predictions and annotations from the sequencing project contact John Spieth (jspieth@watson.wustl.edu) or Steve Jones (sjj@sanger.ac.uk). For information on the distribution of ACEDB contact Richard Durbin (rd@sanger.ac.uk) or Jean Thierry-Mieg (mieg@kaa.cnrs-mop.fr). For information on sequencing plans or estimated completion times contact Richard Wilson (rwilson@watson.wustl.edu) or Alan Coulson (alan@sanger.ac.uk). All requests for cosmid clones should be sent to Alan Coulson (alan@sanger.ac.uk).

All of the C. elegans sequence data is available via anonymous FTP and the World Wide Web. Both the Cambridge and St. Louis groups now support on-line searching of the finished sequences and sequences in progress. Please note, however, that each site carries only its own unfinished data.

The FTP and WWW sites for St. Louis and the Sanger Centre are:

St. Louis:

FTP: genome.wustl.edu (directory: /pub/gsc1/sequence/st.louis/elegans)
WWW: http://genome.wustl.edu/gsc/gschmpg.html

Sanger Centre:

FTP: ftp.sanger.ac.uk (directory: /pub/databases/C.elegans_sequences)
WWW: http://www.sanger.ac.uk/

ACEDB data releases can be obtained from:
ncbi.nlm.nih.gov (130.14.20.10) in the USA, in repository/acedb
ftp.sanger.ac.uk (193.60.84.11) in the UK, in pub/acedb
lirmm.lirmm.fr (193.49.104.10) in France, in genome/acedb

[1] Jones and Baillie, MGG 248:719-726 (1995)
[2] Hoskins et al., Development, in press