Worm Breeder's Gazette 9(3): 11

These abstracts should not be cited in bibliographies. Material contained herein should be treated as personal communication and should be cited as such only with the consent of the author.

Of Computers, Strains and Worms

J. Thomas, L. Avery, B. Meyer, B. Horvitz and troupe

We have spent some time recently attempting to enter the information 
age with our strain and allele lists. Two aspects of this attempt 
might be of interest to others. First, the Horvitz and Meyer labs' 
strain and allele lists will soon be available on floppy disks for 
others to peruse as light bedtime reading. Second, we have written a 
variety of programs that deal with entry of information and searching 
the database, which might be useful to others (at least those with 
some computer programming skills).
Our strain and allele lists are presently being entered into 
computer files (this is not as big a job as it might seem if it is 
split amongst all members of a lab). The lists will describe about 
4,200 strains and about 1,400 alleles (including all of the n and y 
alleles) entered as ASCII text files. These lists will be most useful 
to someone who has a program or operating system that can perform 
sophisticated searches on files. However, even DOS provides the 'find' 
command which will find exact matches (for example, you can look up 
all lines that contain e61). The minimum hardware requirement is any 
computer, but a fast CPU (like those in a Macintosh or a PC/AT) and a 
hard disk will make information searches a great deal faster. With a 
PC/AT with a hard disk we get search times around 1.2 seconds per 
1,000 typical entries (for string searches).
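
For readers without a ready-made search tool, the kind of lookup described above can be sketched in a few lines. This is only an illustration in Python (our choice for the sketch, not something the lists require). The sample lines and the function names find_lines and find_allele are made up, and the real files may be laid out differently.

```python
import re

def find_lines(lines, needle):
    """Return every line that contains needle as an exact substring."""
    return [line for line in lines if needle in line]

def find_allele(lines, allele):
    """Like find_lines, but require word boundaries around the allele name,
    so that searching for e61 does not also return e610."""
    pattern = re.compile(r"\b" + re.escape(allele) + r"\b")
    return [line for line in lines if pattern.search(line)]

# Made-up sample entries, for illustration only (tab-separated).
sample = [
    "XX1\tdpy-5(e61) I",
    "XX2\tgene-x(e610) I",
    "XX3\tdpy-5(e61) gene-y(e51) I",
]

substring_hits = find_lines(sample, "e61")   # also picks up the e610 line
allele_hits = find_allele(sample, "e61")     # only the two true e61 lines
```

Note that a plain substring match, which is what an exact-match 'find' amounts to, will also report names that merely begin with the search string. The second function shows one way around that.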
As an aside, we have also typed into ASCII files and sent to the CGC 
a variety of other information that might be useful to people: a 
modified form of the gene list (with the addition of a number of 
alleles, but lacking most let genes and all of the descriptive text), 
the title, author and location of all of the Worm Breeder's Gazette 
articles to date, and the complete L1 and adult parts lists (with 
updated information and a few corrections). Mark Edgley informs us 
that he has typed in the complete (but not proofread) gene list as it 
appears in Jonathan Hodgkin's handout from the last worm meeting. If 
you send Mark a PC-DOS or MS-DOS formatted floppy diskette, you can get 
these files from the CGC.
With the advent of the personal computer age, it seems a good time 
to consider the possibility of labs sharing strain and allele 
information. This will be particularly useful for allele lists. While 
particular strains can be constructed, alleles cannot be. It might be 
that an allele generated in one lab would be useful only to another 
lab, a fact that at present is very difficult to recognize. Our lists 
will be available to anyone who sends us a request and a floppy disk. 
It would gladden our hearts and bring a song to our lips if other labs 
undertook to type their strain and allele lists into a computer and 
made them generally available. Everyone in the field will benefit from 
such an undertaking.
We have also written a variety of programs that permit us to enter, 
maintain, and use our database of strains and alleles. These programs 
require the UNIX operating system (or equivalent; we use XENIX for the 
PC/AT) running on a suitable machine (a PC or Mac won't do). Anyone is 
welcome to these programs, but they will require some fiddling to 
adapt them to your bookkeeping system. Anyone who knows how to write 
shell programs in UNIX or 
wants to learn (this is much easier than learning a real computer 
language) should not have too much trouble, but be warned that the 
documentation is fair to poor.
The structure of our database is very simple. There are two basic 
files: strains and alleles. Each contains one line (of any length) per 
entry and each line holds a variety of information organized into 
fields (a field in a database is the equivalent of a column in a 
table; it contains one type of information for each entry). Fields are 
separated by a special character that is not used anywhere else, in 
our case a tab character (the field separator allows the computer to 
keep track of the way information is organized in each entry). Each 
strain entry has fields containing genotype, source, date, freezer 
position, strains used in construction, and comments. Each allele 
entry contains information about source, mutagen, date, phenotype, 
gene, qualifiers (such as mat, ts, or sd), chromosome and comments. 
Dfs and Dps are treated as alleles. Each of these fields can be useful 
for searching the collection for specific sorts of information.
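
To make the one-line-per-entry layout concrete, here is a minimal Python sketch of how such a line might be split into named fields. The field order, the leading strain-name field, the sample entry, and the names STRAIN_FIELDS and parse_entry are all assumptions made for illustration. The text above lists which fields exist but not their order.

```python
# Assumed field order for a strain entry; the actual files may differ.
STRAIN_FIELDS = [
    "name", "genotype", "source", "date",
    "freezer_position", "parent_strains", "comments",
]

def parse_entry(line, fields):
    """Split one tab-separated line into a dict keyed by field name.

    Missing trailing fields are filled with empty strings.
    """
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(fields) - len(values))
    return dict(zip(fields, values))

# A made-up strain entry, for illustration only.
line = "XX3\tdpy-5(e61) I\tthis lab\t1986\tbox 2, slot 14\tXX1, XX2\n"
record = parse_entry(line, STRAIN_FIELDS)
```

Because the tab appears nowhere else in an entry, splitting on it is enough to recover the fields. A search can then be limited to one field, for example record["source"], instead of the whole line.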
The simplest way to search the database is using a flexible string 
search program (such as grep or egrep in XENIX). Early betting was 
that this would be too slow to be convenient. However, on our machine 
the search time on a sample database was about 7 seconds for 5,500 
strains. We judge that the speed of cheap computers will increase 
faster than the size of our collection. String searches have two 
advantages over more complicated database searches (which are faster): 
they are intuitively simple to understand, and they are much more 
flexible. We have also written (shell) programs that use sorted files 
to speed up the search. With our present strain collection size these 
searches are only slightly faster, but whereas the time required for a 
string search increases linearly with size, the time for these 
searches increases very slowly with size (in one trial, 8 seconds to 
search a database of 30,000 strains). Such speed may be important if 
one wants to search all the strains in many collections.
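
The difference in scaling can be illustrated with a small Python sketch. This is not our shell code, only one way a sorted list might be used. Entries are sorted once by allele name, and each lookup is then a binary search rather than a scan of every line. The sample pairs and the names build_index and lookup are hypothetical.

```python
import bisect

def build_index(pairs):
    """Sort (allele, strain) pairs once so lookups can use binary search."""
    return sorted(pairs)

def lookup(index, allele):
    """Return all strains carrying allele, using binary search on the index."""
    # The one-element tuple (allele,) sorts just before any (allele, strain),
    # so bisect_left lands on the first matching pair, if there is one.
    start = bisect.bisect_left(index, (allele,))
    strains = []
    for i in range(start, len(index)):
        key, strain = index[i]
        if key != allele:
            break
        strains.append(strain)
    return strains

# Made-up (allele, strain) pairs, for illustration only.
index = build_index([
    ("e61", "XX3"), ("e51", "XX2"), ("e61", "XX1"), ("e610", "XX9"),
])
strains_with_e61 = lookup(index, "e61")
```

A string search must read every entry, so its time grows in proportion to the size of the collection. A binary search on a sorted list needs only about log2(N) comparisons per lookup, which is why its time increases so slowly. The price is flexibility: the sorted list answers only the question it was sorted for.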