Worm Breeder's Gazette 9(3): 11
These abstracts should not be cited in bibliographies. Material contained herein should be treated as personal communication and should be cited as such only with the consent of the author.
We have spent some time recently attempting to enter the information age with our strain and allele lists. Two aspects of this attempt might be of interest to others. First, the Horvitz and Meyer labs' strain and allele lists will soon be available on floppy disks for others to peruse as light bedtime reading. Second, we have written a variety of programs for entering information and searching the database, which might be useful to others (at least those with some computer programming skills).
Our strain and allele lists are presently being entered into computer files (this is not as big a job as it might seem if it is split amongst all members of a lab). The lists will describe about 4,200 strains and about 1,400 alleles (including all of the n and y alleles), entered as ASCII text files. These lists will be most useful to someone who has a program or operating system that can perform sophisticated searches on files. However, even DOS provides the 'find' command, which finds exact matches (for example, you can look up all lines that contain e61). The minimum hardware requirement is any computer, but a fast CPU (like those in a Macintosh or a PC/AT) and a hard disk will make information searches a great deal faster. With a PC/AT with a hard disk we get search times of around 1.2 seconds per 1,000 typical entries (for string searches).
As an aside, we have also typed into ASCII files and sent to the CGC a variety of other information that might be useful to people: a modified form of the gene list (with the addition of a number of alleles, but lacking most let genes and all of the descriptive text); the title, author and location of all of the Worm Breeder's Gazette articles to date; and the complete L1 and adult parts lists (with updated information and a few corrections). Mark Edgley informs us that he has typed in the complete (but not proofread) gene list as it appears in Jonathan Hodgkin's handout from the last worm meeting.
If you send Mark a floppy diskette formatted for PC-DOS or MS-DOS, you can get these files from the CGC.
With the advent of the personal computer age, it seems a good time to consider the possibility of labs sharing strain and allele information. This will be particularly useful for allele lists: while particular strains can be constructed, alleles cannot be. It might be that an allele generated in one lab would be useful only to another lab, a fact that at present is very difficult to recognize. Our lists will be available to anyone who sends us a request and a floppy disk. It would gladden our hearts and bring a song to our lips if other labs undertook to type their strain and allele lists into a computer and made them generally available. Everyone in the field will benefit from such an undertaking.
We also have written a variety of programs that permit us to enter, maintain, and use our database of strains and alleles. These programs require the UNIX operating system (or equivalent; we use XENIX for the PC/AT) running on a suitable machine (a PC or Mac won't do). Anyone is welcome to these programs, but they will require some fiddling to adapt them to your bookkeeping system. Anyone who knows how to write shell programs in UNIX or wants to learn (this is much easier than learning a real computer language) should not have too much trouble, but be warned that the documentation is fair to poor.
The structure of our database is very simple. There are two basic files: strains and alleles. Each contains one line (of any length) per entry, and each line holds a variety of information organized into fields (a field in a database is the equivalent of a column in a table; it contains one type of information for each entry). Fields are separated by a special character that is not used anywhere else, in our case a tab character (the field separator allows the computer to keep track of the way information is organized in each entry).
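The one-line-per-entry, tab-separated layout can be illustrated with a minimal Python sketch. This is not the authors' code (their programs are UNIX shell scripts). The function names, the subset and order of fields, and the sample entry are all hypothetical and invented for illustration.

```python
FIELD_SEPARATOR = "\t"  # a tab, which must not appear inside any field


def parse_entry(line, field_names):
    """Split one database line into a dict keyed by field name."""
    values = line.rstrip("\n").split(FIELD_SEPARATOR)
    if len(values) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} fields, got {len(values)}"
        )
    return dict(zip(field_names, values))


def format_entry(record, field_names):
    """Join a record back into a single tab-separated line (no newline)."""
    values = [record[name] for name in field_names]
    for value in values:
        if FIELD_SEPARATOR in value or "\n" in value:
            raise ValueError("field values must not contain tabs or newlines")
    return FIELD_SEPARATOR.join(values)


# Hypothetical field subset and order; every value below is invented.
EXAMPLE_FIELDS = ["genotype", "source", "date", "comments"]
example_line = "abc-1(x999)\tLab A\t1986-01-01\tinvented entry\n"
example_record = parse_entry(example_line, EXAMPLE_FIELDS)
```

Rejecting tabs inside field values is what keeps the separator unambiguous, which is the property the format depends on.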
Each strain entry has fields containing genotype, source, date, freezer position, strains used in construction, and comments. Each allele entry contains information about source, mutagen, date, phenotype, gene, qualifiers (such as mat, ts, or sd), chromosome and comments. Dfs and Dps are treated as alleles. Each of these fields can be useful for searching the collection for specific sorts of information.
The simplest way to search the database is to use a flexible string search program (such as grep or egrep in XENIX). Early betting was that this would be too slow to be convenient. However, on our machine the search time on a sample database was about 7 seconds for 5,500 strains. We judge that the speed of cheap computers will increase faster than the size of our collection. String searches have two advantages over more complicated database searches (which are faster): they are intuitively simple to understand, and they are much more flexible.
We have also written (shell) programs that use sorted files to speed up the search. With our present strain collection size these searches are only slightly faster, but whereas the time required for a string search increases linearly with size, the time for these searches increases very slowly with size (in one trial, 8 seconds to search a database of 30,000 strains). Such speed may be important if one wants to search all the strains in many collections.
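The contrast between the two kinds of search can be sketched in Python. The article does not say how its shell programs use the sorted files, so the binary search below is only one plausible reading and is an assumption. The function names and the sample lines are likewise invented for illustration.

```python
import bisect
import re


def string_search(lines, pattern):
    """Scan every line, grep-style; time grows linearly with the file size."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def build_sorted_index(lines, field_number):
    """Sort entries by one tab-separated field (assumed indexing scheme)."""
    pairs = sorted((line.split("\t")[field_number], line) for line in lines)
    keys = [key for key, _ in pairs]
    rows = [row for _, row in pairs]
    return keys, rows


def sorted_lookup(keys, rows, key):
    """Binary search for an exact field value; time grows logarithmically."""
    start = bisect.bisect_left(keys, key)
    end = bisect.bisect_right(keys, key)
    return rows[start:end]


# Invented two-field entries: genotype, then source.
sample = [
    "abc-1(x1)\tLab A",
    "abc-2(x2)\tLab B",
    "abc-1(x3) abc-2(x2)\tLab A",
]
carrying_x2 = string_search(sample, r"\(x2\)")
keys, rows = build_sorted_index(sample, 1)
from_lab_a = sorted_lookup(keys, rows, "Lab A")
```

The sketch also shows the flexibility trade-off described above: the string search matches a pattern anywhere in a line, whereas the sorted lookup answers only exact-match questions about the one field the index was built on.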