Signature OligoDatabase |
OligoDBViewer download (12.8 Mb); last debugging: 2011.04.05. Database download:
|
||||
The program GUI was tested only under Windows. Errors in the GUI are very likely under other operating systems. Use the command line utilities instead, which provide all the functionality of the GUI. |
Installation, Toolbar and window functionality, Selection of signature words, Report panel, Word filtering, Word search and browsing, Database info and help, Command line utilities, Publications |
Oligonucleotides of 8 to 14 nucleotides (8 to 14mers) fall into a grey
area of genome linguistics. Shorter words are used intensively for
calculating oligonucleotide usage patterns that allow genome comparison.
Longer words are used as unique genetic markers for species identification.
Words of 8 to 14 nucleotides are ubiquitous in all genomic DNA and cannot be
used as unique markers. Approaches that compare oligonucleotide patterns
by analysing the frequencies of all possible permutations are
not applicable either, as the total number of permutations to be
considered is huge (4^word_length) and only a small portion of these words
is informative. Thus, calculating the frequencies of all 8 to
14mers in a given genome would be time and memory consuming, while the
signal-to-noise ratio would not be satisfactory.
Here we present a database of 172,636 selected 8 to 14mer oligonucleotides whose frequencies vary significantly between bacterial genomes. The database is supplied with a GUI that allows database browsing and selection of the most discriminative words for a given set of genomes or taxonomic units. A database file containing the frequencies of the signature words in 724 bacterial genomes is available for download. Users may add new genomes to the database or remove genomes from it to keep it updated and focused on the species of interest. The program was developed for comparative genomics and for binning unknown sequences or groups of sequences to bacterial taxonomic units.
|
|||||||||||||||||||||||||||||||||||
Installation |
The program OligoDBViewer is written in Python and needs
Python ver. 2.5.4 to be installed on the machine. The GUI is based on
Pmw megawidgets; this module is included in the download file.
Download the file OligoDBViewer.zip from the site http://www.bi.up.ac.za/SeqWord/downloads/ and unzip it to a selected directory. A folder OligoDBViewer will appear with several files and subfolders inside. Run Python OligoDBViewer.exe.py. The starting window of the program will appear.
Use the command File->Open to open a database file. The viewer cannot create a new empty database. A small example database, example.wdb (35.1 Mb, 174 bacterial genomes), can be found in the subfolder db. Select this file and click Open.
You may also download the database file bacteria.wdb. At the time of writing this document, the database size was 118 Mb and it contained 733 bacterial genomes. However, this database is regularly updated and the actual size of the database file may be bigger. |
|||||||||||||||||||||||||||||||||||
Toolbar and window functionality |
1. List of taxonomic units
▪ New taxon – first asks for a file in FASTA or GenBank format with the genome sequence (alternatively, choose Edit->Add). The program automatically
calculates the frequencies of the signature words in the given genome. Next, a
dialog pops up (shown right) where the class, genus, species, strain and accession
must be filled in. If a GenBank (.gbk) file was provided, the program tries to
identify the phylogeny of the organism based on the stored data.
Calculating word frequencies consumes considerable time and computer memory,
so it is better to run it on a remote server using the command line utility
dbupdate_cmd.py.
Users may create their own databases by saving the existing database under a new name and then adding and removing genomes. Two existing databases may then be merged with the command File->Merge database.
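The frequency calculation performed when a taxon is added can be illustrated with a minimal sketch. It is not the code of dbupdate_cmd.py: the function names are hypothetical, the input is assumed to be a single-record FASTA string, and a word and its reverse complement are treated as the same word, as the database does.

```python
from collections import Counter

# Translation table for complementing DNA; other characters (e.g. N) are unchanged.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(word):
    """Return the word or its reverse complement, whichever sorts first."""
    reverse_complement = word.translate(_COMPLEMENT)[::-1]
    return min(word, reverse_complement)


def read_fasta(text):
    """Join the sequence lines of a single-record FASTA string, skipping the header."""
    return "".join(
        line.strip().upper()
        for line in text.splitlines()
        if line and not line.startswith(">")
    )


def count_words(sequence, signature_words, min_len=8, max_len=14):
    """Count overlapping occurrences of the given signature words in a sequence."""
    wanted = {canonical(w) for w in signature_words}
    counts = Counter()
    for k in range(min_len, max_len + 1):
        for i in range(len(sequence) - k + 1):
            word = canonical(sequence[i:i + k])
            if word in wanted:
                counts[word] += 1
    return counts
```

Scanning every window for every word length from 8 to 14 is what makes this step slow on whole genomes, which is consistent with the advice to run it on a server.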
|
|||||||||||||||||||||||||||||||||||
Selection of signature words specific for a set of taxonomic units | The program provides several algorithms for selecting the
signature words specific for a given set of taxonomic units. Depending on
the number of selected taxa, a program run may take from several seconds
to hours. The progress bar is displayed in the Python Cmd window as shown
below. The progress bar Python module, developed by Nilton Volpato in 2005, is
freely distributed under the GNU Lesser General Public License as published
by the Free Software Foundation.
1. Selection of diverse words. A description of the command buttons on the report panel is given below.
2. Selection of common abundant words.
3. Selection of common rare words.
4. Comparison of different taxonomic units. Select a radio-button to display the list of classes, genera
or species. Then choose the taxonomic units to compare in the scrolled listbox.
The user may click the Show/Hide Filter
button to set the word filter (discussed below), but using the filter at this stage
is not recommended as it will significantly slow down
the program.
5. Confronted comparison of taxonomic units. Select a radio-button to display the list of classes,
genera, species or chromosomes. Choose the radio-button
+/-, +++
or -- to select divergent, abundant
or rare words, respectively. Then select the
sample taxonomic unit in the scrolled listbox and click the button Select.
The name of the taxon will appear in the area
Selected taxon, the button Select
will turn to Reset, and the listbox will
allow multiple selection of taxonomic units. Now select the taxonomic units
against which you want to compare the sample taxon and click the button
Add to
display these items in the list To compare with.
The user may click the Show/Hide Filter button
to set the word filter (discussed below), but using the filter at this stage is
not recommended as it will significantly slow down the
program.
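The scoring formulas behind these selection modes are not given in this document. As a rough illustration of what selecting diverse words could mean, the sketch below ranks words by the coefficient of variation of their frequencies across the selected taxa. The score and all names are assumptions made for illustration, not the program's algorithm.

```python
from statistics import mean, pstdev


def diversity_score(frequencies):
    """Coefficient of variation of one word's frequencies across taxa (assumed score)."""
    average = mean(frequencies)
    if average == 0:
        return 0.0
    return pstdev(frequencies) / average


def select_diverse(table, top=10):
    """Return the `top` words with the highest assumed diversity score.

    `table` maps each word to a list of its frequencies, one value per taxon.
    Ties are broken alphabetically so the result is deterministic.
    """
    scored = [(diversity_score(freqs), word) for word, freqs in table.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored[:top]]
```

A word with the same frequency in every taxon scores zero under this assumption, while a word abundant in one taxon and absent in the others scores high.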
|
|||||||||||||||||||||||||||||||||||
Report panel | The results of word selection by the different algorithms are
displayed on the report panel, as shown above. The report
panel contains a toolbar with several command buttons. The number of buttons may
vary depending on the type of report. It may comprise the following buttons:
▪ << – shift the result table leftward by one screen;
▪ >> – shift the result table rightward by one screen;
These two commands are used if the result table contains too many columns. Five columns are displayed at a time. If more genomes are selected, use the buttons << and >> to navigate the wider table.
▪ Remove – open the Remove genomes or words dialog. Remember that when genomes are removed using this dialog, the scores of the selected words are not updated! To select words for a smaller set of genomes, select the required genomes and recalculate the table.
▪ Export – export the result table or the list of top-scoring signature words to a text editor, from which the table can be saved as a text file. The saved report file may be imported back to the Report panel using the menu command File->Import report.
▪ Set Filter – open the Word filter dialog. This option is explained in the next section.
▪ Dist.Table – calculate a distance table for the selected genomes. This option is available only for the Diverse words report.
▪ Corr.Table – calculate a correlation table for the selected genomes. This option is available only for the Diverse words report.
▪ Binning – calculate distances between an unknown sequence, or a group of sequences stored in a FASTA file, and the selected genomes or taxonomic units, based on counting the signature words listed on the Report panel. Click the button Binning and select the input FASTA file in the Open file dialog. A distance value in the range from 0 (identity) to 10 (maximal distance) will be assigned to each selected genome or taxonomic unit.
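The document does not state which correlation measure Corr.Table uses. A minimal sketch of a pairwise correlation table, assuming the Pearson coefficient over the frequencies of the reported signature words, could look like this (all names are hypothetical):

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length frequency vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector; report 0
    return covariance / sqrt(var_x * var_y)


def correlation_table(frequencies):
    """Map every ordered pair of genome names to the correlation of their vectors.

    `frequencies` maps a genome name to its list of word frequencies, with the
    words in the same order for every genome.
    """
    names = sorted(frequencies)
    return {(a, b): pearson(frequencies[a], frequencies[b])
            for a in names for b in names}
```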
|
|||||||||||||||||||||||||||||||||||
Filtering of the words | The program allows setting score and word length thresholds, removing word
permutations with lower scores from the resulting list, and limiting the length of the list. The filter settings dialog
is shown below:
Filter settings may be set prior to running the word selection
algorithms described above, but we recommend not filtering word
permutations, wordshifts and constituent words at this stage, as it may make
the program too slow.
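The threshold part of the filter can be sketched in a few lines. This is an illustration only: the function and parameter names are hypothetical, and the filtering of word permutations, wordshifts and constituent words is left out because their exact definitions are not given here.

```python
def filter_words(scored_words, min_score=0.0, min_len=8, max_len=14, limit=None):
    """Keep words that pass the score and length thresholds, best score first.

    `scored_words` is an iterable of (word, score) pairs. `limit` caps the
    length of the returned list; None means no cap.
    """
    kept = [
        (word, score)
        for word, score in scored_words
        if score >= min_score and min_len <= len(word) <= max_len
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if limit is None else kept[:limit]
```

Applying cheap thresholds like these first and the more expensive permutation checks afterwards is in line with the recommendation above to postpone the latter.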
|
|||||||||||||||||||||||||||||||||||
Browsing words in the database | The current version of the database contains 172,636 selected
signature words whose frequencies vary significantly between bacterial
genomes. Among these words there are 21,155 8mers; 28,040 9mers; 24,214
10mers; 18,468 11mers; 15,326 12mers; 26,105 13mers and 39,328 14mers, which
is 0.096% of all possible 8 to 14mer oligonucleotides. (In this database an oligonucleotide and its reverse complement are considered
the same word.
Thus, the total number of words is Σ[4^i]/2
– Σ[2^i | when i%2==0], where i=[8,9,10,11,12,13,14].)
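The quoted percentage can be checked by evaluating the expression exactly as it is written in the text, reading the flattened exponents as powers of 4 and 2. The function name is hypothetical.

```python
def total_possible_words(min_len=8, max_len=14):
    """Evaluate sum(4^i)/2 - sum(2^i for even i) over the given word lengths."""
    lengths = range(min_len, max_len + 1)
    all_words = sum(4 ** i for i in lengths) // 2
    even_term = sum(2 ** i for i in lengths if i % 2 == 0)
    return all_words - even_term


SELECTED_WORDS = 172636  # number of signature words stated in the text

total = total_possible_words()
share_percent = 100.0 * SELECTED_WORDS / total
print(total, round(share_percent, 3))
```

The second term is tiny compared with the first, so the selected words make up about 0.096% of the total, in agreement with the figure above.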
To check which words are in the database or to display the frequencies of specific words in different genomes, use the command File->Search words. The dialog shown below will pop up:
Enter an oligonucleotide 8 to 14 bp long into the field
Search for and click the button
>> to add this oligonucleotide to the
list Selected words. Alternatively,
enter only the first few letters (nucleotides) of the word and click the button
Search. A list of all words in the
database starting with the entered letters will appear in the listbox
Available words. Click the word of
interest and then click the button >>
to add this oligonucleotide to the list Selected
words. |
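The prefix search behind the Search button can be pictured as a lookup in a sorted word list. The sketch below assumes the words are held in a sorted Python list, which says nothing about how the .wdb file actually stores them; the function name is hypothetical.

```python
from bisect import bisect_left


def words_with_prefix(sorted_words, prefix):
    """Return all words in an alphabetically sorted list that start with `prefix`."""
    start = bisect_left(sorted_words, prefix)  # first word that is >= prefix
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # sorted order guarantees no later word can match
        matches.append(word)
    return matches
```

For example, with the list `["ACGGACGT", "ACGTACGT", "ACGTTTTT", "TTTTACGT"]` the prefix `ACGT` yields the two middle words.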
|||||||||||||||||||||||||||||||||||
Database info and help | Use the command
Info->Database info
to display the number of signature words and genomes in the currently open
database:
To open the current help file in the browser on your computer, choose the command Help->Help.
|
|||||||||||||||||||||||||||||||||||
Command line utilities |
Extensive calculations of word scores for multiple genomes and
database updates may consume a lot of computer time and memory, so they
should be done on more powerful server machines. To facilitate such
calculations and database updates on remote servers, several command
line utilities are included in the download ZIP file. The command lines
with argument settings are shown below:
python oligodb_cmd.py -i input.txt -d bacteria -o output.out -w words.txt -p 0 -f 0,0,0,8,14,0,10000,10,10
This program is used to prepare an input file
for the program oligodb_cmd.py that
was described above.
Selection of high-scoring words and filtering
of the word list may be separated to achieve better performance. Use the
program filter_cmd.py to filter the
word list report written to the output file of the program oligodb_cmd.py.
The database may be updated with new genomes
represented by FASTA or GenBank files.
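When several runs have to be started on a server, the oligodb_cmd.py command line shown in this section can be assembled from a script. The sketch below only rebuilds that example invocation. The meaning of the individual flags is not explained here, so the default values are copied from the example unchanged, and the helper names are hypothetical.

```python
import subprocess


def build_oligodb_command(input_file, database, output_file, words_file,
                          p_value="0", f_value="0,0,0,8,14,0,10000,10,10",
                          python="python"):
    """Return the argument list for one oligodb_cmd.py run.

    Flag values default to those of the example command line; `python`
    should point to the Python 2.5.4 interpreter the program requires.
    """
    return [python, "oligodb_cmd.py",
            "-i", input_file, "-d", database, "-o", output_file,
            "-w", words_file, "-p", p_value, "-f", f_value]


def run_oligodb(command, program_dir):
    """Run the command inside the OligoDBViewer folder and return its exit code."""
    return subprocess.call(command, cwd=program_dir)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.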
|
|||||||||||||||||||||||||||||||||||
Publications | A manuscript has been submitted for publication. |