Worm Breeder's Gazette 12(2): 26 (January 1, 1992)

These abstracts should not be cited in bibliographies. Material contained herein should be treated as personal communication and should be cited as such only with the consent of the author.

Update on cDNA Sequencing

The Sequencing Consortium

MRC Laboratory of Molecular Biology, Cambridge, CB2 2QH, England and Dept of Genetics, Washington University School of Medicine, St Louis, MO 63110, USA

This is another interim report on the state of the cDNA sequencing and mapping project that we are carrying out alongside the genomic sequencing effort. We have now sequenced 1222 of the 1743 clones in Chris Martin's library. Of these, 27 appear to be duplicates in the strong sense that their sequences look identical and they come from neighboring microtitre wells that were most likely cross-contaminated. The following results therefore come from the 1195 clones left after removing these duplicates. Remember that we are producing only a single gel reading from each clone, from the 5' end of the cDNA insert.

Similarities: 373 sequences (31%) have BLASTX scores against the public protein databases greater than 100, a threshold high enough that these similarities are very likely to be meaningful. A list of these is given on the following pages.

[Figures excluded due to low reproducibility]

Trans-spliced leaders: 166 sequences (14%) had trans-spliced leaders at the 5' end, 149 to SL1 and 17 to SL2.

Repeat isolations: A number of genes clearly have been cloned more than once. We can unambiguously detect this only when the 5' ends of the inserts overlap, which happens for 334 sequences that fall into 130 overlap groups. This is not out of line with Chris Martin's estimate of repeat frequency. It leaves us with 991 distinct sequences, of which 861 are represented only once. However, it is likely that there are additional repeats that we do not see this way because we have sequenced different parts of the same gene.
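The bookkeeping behind these counts can be sketched as follows. This is only an illustrative check of the arithmetic, assuming each overlap group corresponds to one distinct sequence; the function and variable names are hypothetical and not part of the project's software.

```python
# Counts taken from the report text; all names here are illustrative.
SEQUENCED = 1222           # clones sequenced so far
CROSS_CONTAMINATED = 27    # duplicates from neighboring wells, discarded
OVERLAPPING = 334          # sequences whose 5' ends overlap another sequence
OVERLAP_GROUPS = 130       # groups that the overlapping sequences fall into


def distinct_counts(sequenced, cross_contaminated, overlapping, overlap_groups):
    """Return (usable, distinct, singletons).

    Assumes every overlap group represents exactly one distinct sequence.
    """
    usable = sequenced - cross_contaminated   # clones kept for analysis
    singletons = usable - overlapping         # sequences seen only once
    distinct = singletons + overlap_groups    # one per singleton or group
    return usable, distinct, singletons


usable, distinct, singletons = distinct_counts(
    SEQUENCED, CROSS_CONTAMINATED, OVERLAPPING, OVERLAP_GROUPS
)
print(usable, distinct, singletons)  # 1195 991 861
```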

Hits to previously determined worm sequences: There were exact matches to 13 previously determined C. elegans sequences in the public databases, and to 3 of the 44 predicted genes on the first 4 genomic cosmids that we have sequenced. The latter figure allows us to generate an estimate of (44/3)*991, or about 14,500, genes in the worm, but the 95% confidence limits for this are 6,000 to 50,000, so we need more data to make this a useful approach to estimating gene density.
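The reasoning behind the estimate is that if 3 of 44 predicted genes are hit by our 991 distinct cDNA sequences, then those 991 sequences should represent roughly 3/44 of all genes. A minimal sketch of that calculation follows; the function name is hypothetical, and the loop over 2, 3 and 4 hits is only an illustration of how sensitive the figure is to so small a count. It does not attempt to reproduce the confidence limits quoted above.

```python
def estimate_gene_count(predicted_genes, genes_hit, distinct_cdnas):
    """Scale the distinct cDNA count by the inverse of the fraction of
    predicted genomic genes that were matched by a cDNA."""
    if genes_hit <= 0:
        raise ValueError("need at least one hit to form an estimate")
    return predicted_genes / genes_hit * distinct_cdnas


# Figures from the report: 3 of 44 predicted genes hit, 991 distinct cDNAs.
point_estimate = estimate_gene_count(44, 3, 991)
print(round(point_estimate))  # 14535, quoted in the text as about 14,500

# Illustration only: one hit more or fewer moves the estimate substantially.
for hits in (2, 3, 4):
    print(hits, round(estimate_gene_count(44, hits, 991)))
```

With only three hits, a single additional or missing match changes the estimate by several thousand genes, which is consistent with the wide limits reported.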

Mapping: We are now in the process of mapping all the clones using the YAC "polytene" filters. These data will be in ACEDB and will also be provided if you ask for a clone.

All the sequences, similarity and mapping data will be in ACEDB, and any clone is available along with more detailed data from the St. Louis lab.