MetaLingvo 1.0

Program download

Databases: not yet available

The program's GUI was tested only under Windows. Errors in the GUI are very likely on other operating systems. There, use the command line utilities instead, which provide all the functionality of the GUI.

The program MetaLingvo calculates patterns of oligonucleotide frequencies. Oligonucleotide words (kmers) of a specified length are counted throughout the sequences of interest in order to analyze the oligonucleotide usage bias in the collection of sequences. Patterns of deviations of the observed oligonucleotide frequencies from the expected frequencies are indicative of phylogenetic links between sequences. The program counts kmers of 2 to 7 nucleotides in length and then generates a phylogenetic profile of the sequences based on the observed patterns. The profile can be visualized in a multidimensional plot and incorporated into the database for the specific kmer pattern type used. This database is then also used to identify unknown sequences.
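
The word counting described above can be illustrated with a short sketch. It is not MetaLingvo's own code: the function name kmer_frequencies is hypothetical, overlapping windows are assumed, and the example is written for current Python 3 rather than the Python 2.5 the program itself requires.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Return relative frequencies of all 4**k words of length k.

    Assumption: words are counted in overlapping windows, and windows
    containing characters other than A, C, G, T are ignored.
    """
    if not 2 <= k <= 7:
        raise ValueError("word length must be between 2 and 7")
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: counts[w] / total for w in words}


# "AC" occurs in 3 of the 9 overlapping dinucleotide windows.
freqs = kmer_frequencies("ACGTACGTAC", 2)
```

The result is a non-normalized pattern (type n0 in the nomenclature below); the normalized types would additionally divide by expected frequencies, which this sketch does not attempt.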

The MetaLingvo program was developed for:

Linguistic analysis and visualization of genome DNA sequences;
DNA sequence clustering by similarity of oligonucleotide frequency patterns;
Alignment-free phylogenomic study;
Binning of unknown DNA sequences and metagenomic DNA reads to taxonomic units;
Creation of user databases of genome-specific oligonucleotide usage patterns.

All of these are based on the analysis of oligonucleotide frequencies, i.e. oligonucleotide usage patterns (OUP).

Counts of words of length N from 2 to 7 nucleotides are analyzed by applying different normalization schemes. The types of OU patterns are abbreviated as type_N-mer, where the type is “n0” for non-normalized, “n1” for normalized by mononucleotide frequencies, “n2” for normalized by dinucleotide frequencies and so on. For example, the non-normalized tetranucleotide usage pattern is denoted n0_4mer, the trinucleotide usage pattern normalized by dinucleotides is n2_3mer, and the pentanucleotide usage pattern normalized by trinucleotides is n3_5mer. Each OU pattern is characterized by three statistical parameters: D, the distance between two patterns of the same type (distances between two sequences are always calculated in the 4 possible strand combinations: direct-direct, direct-reverse, reverse-direct and reverse-reverse, and the smallest value is returned); PS, the pattern skew, i.e. the distance between the patterns of the direct and reverse strands of the same DNA sequence; and OUV, the oligonucleotide usage variance. Correspondingly, the nomenclature is as follows: the distance between a local n0_4mer pattern and the corresponding global pattern is D:n0_4mer; the pattern skew of an n0_3mer pattern is PS:n0_3mer; the variance of an n3_7mer pattern is OUV:n3_7mer.
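
The nomenclature and the strand handling can be sketched as follows. This is only an illustration under stated assumptions: all function names are hypothetical, the patterns are non-normalized frequencies, and a plain Euclidean distance stands in for the program's distance measure, which this document does not specify.

```python
import math
import re
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_LABEL = re.compile(r"(?:(D|PS|OUV):)?n(\d)_([2-7])mer")


def parse_pattern_name(label):
    """Split 'D:n0_4mer' into (statistic, normalization, word length)."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError("not an OU pattern label: %r" % label)
    statistic, norm, length = match.groups()
    return statistic, int(norm), int(length)


def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]


def word_vector(seq, k):
    """Non-normalized (n0) frequency vector over all 4**k words."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words) or 1
    return [counts[w] / total for w in words]


def euclidean(a, b):
    # Assumption: stand-in metric, not necessarily the one MetaLingvo uses.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def strand_aware_distance(seq_a, seq_b, k, metric=euclidean):
    """D: smallest distance over the four strand combinations."""
    a_strands = [word_vector(s, k) for s in (seq_a.upper(), reverse_complement(seq_a))]
    b_strands = [word_vector(s, k) for s in (seq_b.upper(), reverse_complement(seq_b))]
    return min(metric(a, b) for a in a_strands for b in b_strands)


def pattern_skew(seq, k, metric=euclidean):
    """PS: distance between the direct and reverse strand patterns."""
    return metric(word_vector(seq.upper(), k), word_vector(reverse_complement(seq), k))
```

Taking the minimum over the four combinations makes the distance independent of which strand of each sequence happens to be stored in the file.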

Three different types of OU patterns may be calculated by this program:

n - the pattern of oligonucleotide frequencies throughout the DNA sequence is calculated;
s - the disparity of oligonucleotide frequencies between the two strands of the DNA molecule is measured;
d - both frequencies and disparities of oligonucleotides are calculated, and an attempt is made to calculate distances between patterns by taking both these characteristics into consideration.

Although all these pattern types may be calculated by the program, only n-patterns are used further for sequence binning and comparison.

Installation	The program MetaLingvo is written in Python and requires Python ver. 2.5.4 to be installed on the machine. The GUI is based on Pmw megawidgets; this module is included in the download file. Download the file MetaLingvo1.0.zip from the site http://www.bi.up.ac.za/SeqWord/downloads/. Unzip the file to a selected directory. A folder MetaLingvo will appear with several files and subfolders inside. Run MetaLingvo.exe.py with Python. The main program window will appear.

Quick start	To open MetaLingvo, double-click the MetaLingvo.exe.py icon in the MetaLingvo folder. The starting window of the program will appear. Use the command File->Open to open a database file. A small example database, example.wdb (35.1 Mb, 174 bacterial genomes), can be found in the subfolder db. Select this file and click Open. You may also download the database file bacteria.wdb. At the time of writing, this database was 118 Mb in size and contained 733 bacterial genomes. However, it is regularly updated, so the actual size of the database file may be larger.
Main Window	Interface
The main window displays files and folders in the current directory (folder names are shown in brackets).
Command buttons
• Process – calculate OUP for selected sequence files in FASTA, GBK or GBFF formats and open a new workspace window for them;
• Identify – calculate OUP for selected sequences and compare them with the standard patterns stored in the database.
Toolbar
• Open – open the selected item;
• Level Up – open the parent folder;
• Select All, Invert Selection and Deselect – select or deselect items in the list;
• Single/Multiple – toggle the item selection mode of the list between multiple and single;
• Checkbox Include subfolders – if set, process all sequence files in the selected folder and all its subfolders.
Menu
1. Menu File
• Open – open saved files of workspaces (wsp), datasets (wtw), projections (dvw), clusters (clu) and identification reports (rep);
• Convert to FASTA – save selected sequences to FASTA files, n sequences per file;
• Exit – close the program.
2. Menu Command
• Process dataset – calculate OUP for selected sequence files in FASTA, GBK or GBFF formats and open a new workspace window for them;
• Identify sequences – calculate OUP for selected sequences and compare them with the standard patterns stored in the database.
3. Menu Database
• Edit – open the Database Editor window;
• Add sequences – calculate OUP for selected sequences and add them to the database.
4. Menu Preferences
• Show Current Set – show the current preference settings;
• Pattern type – open the dialog Set Pattern Type (see the pattern type definition). Threshold values set the levels of hierarchical nodes;
• Start directory – open the Browse Folder dialog to select the folder that will be the current directory every time the program starts;
• Buffer Size – number of OUP objects held by the program in RAM (500 by default);
• Save Current Settings – save the current settings as the default options.
5. Menu Phylogeny
• Distance Matrix – build a distance matrix for the selected sequences by direct comparison of the calculated OUP (D-values are calculated as described above).
6. Menu Help
• Help – open this file in the browser;
• About – open the About dialog.
Main Window	Functionality
Select the sequence files to process. Sequences may be saved in FASTA, GBK or GBFF formats, with one or multiple sequences per file. Double-click a folder name, or select a folder and click the button Open, to go one level down. Double-click a sequence file with multiple sequences to open it and select only some of them for processing. Double-click the item [..] at the top of the list or click the button Level Up to return to the previous level. Set the checkbox Include subfolders if you want to process all the sequences in the selected folder and its subfolders. You may select multiple files and folders in the list.
To calculate OUP for the selected sequences, click the button Process or choose the menu command Command->Process Dataset. The dialog Set Pattern Type will pop up (see above). Make your settings and click OK. The program will calculate OUP for the selected sequences and organize them by OUP similarity in a tree-like structure, which will be displayed in a new workspace window (see below).
To identify sequences, click the Identify button or choose the menu command Command->Identify Sequences. Select the database name in the dialog that pops up. OUP will be calculated for the selected sequences and compared with the standard sequences in the database. The result of the identification will be reported on the right panel of the main window. (For more about database creation and sequence binning, see below.)
To build a distance matrix, choose the menu command Phylogeny->Distance Matrix. A distance matrix for the selected sequences will be calculated by direct comparison of OUP (D-values are calculated as described above). The distance matrix will be shown in the text editor window, from where it may be saved as a text file with the File->Save command. The distance matrix is formatted for use by the Phylip package programs (neighbor.exe, fitch.exe and kitsch.exe) to build a phylogenetic tree. The results of formal tests for additivity and ultrametricity of the distance matrix are included. Sequences in the distance matrix are sorted so that the outermost sequence (an outgroup) is the first one in the list. The sequence names in the matrix are abbreviated to 10 characters (a Phylip format requirement), but a list of the abbreviations is included at the end of the output file.
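
The output layout described above can be approximated with the following sketch of a square Phylip distance matrix writer. The function names, the way duplicate abbreviations are resolved and the layout of the trailing abbreviation list are assumptions made for illustration; the file that MetaLingvo actually writes may differ in detail.

```python
def abbreviate(names, width=10):
    """Return unique names of at most `width` characters and a lookup table.

    Assumption: clashes are resolved by replacing trailing characters with
    a running number; the program's own scheme is not documented here.
    """
    short, lookup = [], {}
    for name in names:
        base = name[:width]
        candidate, n = base, 1
        while candidate in lookup:
            suffix = str(n)
            candidate = base[:width - len(suffix)] + suffix
            n += 1
        lookup[candidate] = name
        short.append(candidate)
    return short, lookup


def format_phylip(names, matrix):
    """Format a square distance matrix with 10-character sequence names."""
    short, lookup = abbreviate(names)
    lines = ["%5d" % len(names)]
    for label, row in zip(short, matrix):
        lines.append(label.ljust(10) + " " + " ".join("%.4f" % d for d in row))
    # Hypothetical layout for the list of abbreviations at the end of the file.
    lines.append("")
    lines.extend("%s = %s" % (s, lookup[s]) for s in short)
    return "\n".join(lines)


names = ["Escherichia coli K-12", "Escherichia coli O157", "Bacillus subtilis"]
matrix = [[0.0, 0.1, 0.5], [0.1, 0.0, 0.6], [0.5, 0.6, 0.0]]
text = format_phylip(names, matrix)
```

Keeping the lookup table next to the matrix is what lets the truncated names in the resulting tree be mapped back to the original sequence names.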
Dataset Window	Interface
A workspace contains the following information:
• OUP calculated for the sequences;
• a tree-like dataset structure representing clusters of similar OUP and the outermost patterns inside every hierarchical node;
• multidimensional projections of the OUP;
• trees of clusters of OUP grouped by their coordinates in a multidimensional space;
• reports of identification of selected OUP against the standard ones saved in the database.
To display one of these types of information, double-click the corresponding header and then double-click an item in the subordinate list. (When a workspace is first created as described above, only the Dataset and PATTERNS headers are present, but other data may be calculated from the dataset and stored in the workspace file.) Menu options and the set of command buttons change depending on which type of workspace information is currently displayed. The workspace window in Dataset mode contains two views: Tree and Stat.
Tree view
The Tree view displays a tree-like graph representing compositional similarities between the studied sequences. It is not a phylogenetic tree but rather a rough cladogram whose major function is to analyze the set of sequence OUPs and to identify the outermost elements at the different levels of the hierarchy (two outermost patterns are identified for every node). The outermost elements of clusters joined above the threshold, which is indicated by its value and a red line, are shown in blue. To change the threshold, enter a new value and click Reset. The outermost elements are used in further calculations. Optimally, there should be 15-25 outermost elements. Increasing the threshold value decreases the number of outermost elements, and vice versa.
Command buttons
• Multi-D – create a multidimensional space for selected OUPs;
• Close – close the right panel;
• Threshold field and Reset – move the separation line (the red line in the tree picture) across the tree. The outermost patterns of nodes that lie to the left of the line are shown with blue pattern names.
Toolbar
• Open – open the selected item;
• Edit – edit the OUP name, which by default is the original sequence name;
• Delete – permanently delete selected OUPs from the workspace;
• Save – save the workspace to a WSP file.
Menu
1. Menu File
• Merge Files – merge the current workspace with an existing dataset file (WTW);
• Copy Source Files – save selected sequences to FASTA files. The command first pops up a dialog listing all OUPs in the workspace, where the user may select all or several items. If the source files still exist on the disk, the DNA sequences will be copied to the specified folder as new FASTA files (one sequence per file);
• Save Workspace and Save Workspace As – save changes in the workspace to the same or a new file;
• Save Dataset – save the workspace in a dataset file (WTW). Only the patterns and the pattern tree will be saved, while calculated multidimensional spaces, cluster trees and reports will be dropped;
• Save picture – save the pattern tree as an EPS graphics file;
• Exit – close the workspace window.
2. Menu Command
• Recalculate – recalculate a new workspace for selected sequences (the source sequence files must not have been removed from the disk). A dialog pops up that allows selecting the patterns to recalculate, either by name or only the outermost ones (see the Threshold field and Reset button described above);
• Multidimensional Projection – create a multidimensional space for selected OUPs;
• Identify Sequences – compare OUPs with the standard patterns stored in the database.
3. Menu View
• Outermost Elements – show the list of outermost elements, the same ones that are shown on the tree with blue pattern names (see the Threshold field and Reset button described above);
• Select – show the dialog Select Elements. Selected patterns will be highlighted in the tree with red pattern names. To remove the selection, choose the command again, click Unselect All and then OK.
4. Menu Phylogeny
• Distance Matrix – build a distance matrix for the selected sequences by direct comparison of the calculated OUP (D-values are calculated as described above).
5. Menu Database
• Edit – open the Database Editor window;
• Add sequences – add selected OUPs to the database.
Stat view
The Stat view displays the distribution of OUP statistical parameters for the dataset sequences. To superimpose distributions, select the corresponding parameters for the X and Y axes and click Draw:
• PS – pattern skew (disparity between the patterns calculated for the two strands of the same DNA);
• OUV – pattern variance;
• GC – G+C content;
• Length – length of the source sequence.
The accompanying picture shows the dependence of pattern OUV on the GC content of the sequence. To save the picture, use the button Save picture.
Pattern view
The pattern concepts and statistics: the program MetaLingvo uses short oligonucleotide words (kmers) to search through sequences of interest, identifying regions of high similarity that are indicative of conserved features. These areas are then plotted in a statistical output for ease of use.
Different ways of word ordering: two primary parameters govern the word order, the normalization and the word length. Once the user has set these parameters, the program searches for all words that match the preferences. E.g., a normalization of 1 and a word length of 3 yield an n1_3mer word, which is then searched for.
Dataset Window	Functionality
The main use of the dataset is to identify the most distant patterns among the selected sequences. This is a necessary step for projecting the patterns into a multidimensional space, which reduces the number of unnecessary variables and the noise they introduce while bringing out the principal components of the relationships between OUPs.
Once the sequences of interest have been processed and a new workspace has been created, a dataset of those sequences is created automatically. This dataset can be saved from the workspace by clicking File->Save Dataset. Similarly, a saved dataset can be opened from the main window by clicking File->Open->Dataset. The dataset stores the locations of the source sequence files, which makes it possible to recalculate the dataset. Each time the dataset is opened, it checks the source file paths. If the files have been moved, the program shows the Open folder dialog so that a new location can be chosen. Deletion of source files will damage the dataset!
To build a multidimensional space, click the button Multi-D or choose the menu command Command->Multidimensional Projection. The dialog Select Elements for Processing pops up, which allows users to select OUPs for further processing.
Multi-D projection	Interface
Menu: Command
Command->Recalculate: recalculates the Multi-D projection.
Command->Identify Sequences: to identify individual sequences, click Command->Identify Sequences, then select the sequence of interest from the list and click Identify. This will generate a report that gives all the relevant sequence information.
Command->Generate cluster tree: to generate a cluster tree, click the “Generate cluster tree” button in the Multi-D workspace or use this command. A tree is generated based on the pattern coordinates and the set cluster thresholds.
Menu: Edit
Edit->Set dimensions: when selecting the number of dimensions, keep in mind that there must always be one more outgroup sequence than the number of dimensions selected, e.g. for 5 dimensions, 6 sequences need to be set as outgroups.
Edit->Set outgroups: the order and number of the outgroups define the dimensions of the plot, e.g. changing the order of the outgroups will reshuffle the plot, but each point will still be in the same position relative to the outgroups.
Edit->Set geometry: transforms the Multi-D projection; it changes the view but not the Multi-D resolution. Available functions include zooming; rotation by degrees or with the mouse while the left or right button is pressed; and shifting by X, Y and Z oblique pole.
Edit->Set references: a window pops up that lists the current references as well as the sequences that could be used as references, so that references can be added or removed. Scaffold sequences are used to predict the coordinates of a new sequence in the Multi-D projection; the more references are set, the more accurate the prediction, at the expense of increased memory and time consumption. References can be hidden or stocked, but this will not affect the projections.
Edit->Delete patterns: deletes patterns.
Menu: View
View->Select: selects specific sequences (does not work).
View->Unselect all: unselects all selected sequences.
View->Export co-ordinates: exports the table and coordinates (does not work).
Command buttons
Select – the same as View->Select;
Unselect – the same as View->Unselect;
Set references – the same as Edit->Set references;
Show table – shows the resolution table;
Shift Down – does not work;
Shift Up – does not work.
Table
To generate the resolution table of the Multi-D projection, click the “Show table” button. The table can be edited with the “Show distances” and “Delete selected” buttons. It shows the coordinates of each sequence in each dimension as well as the convergence, which is a percentage of total similarity between the sequence and the Multi-D space.
Buttons
Show distances/Hide distances – show or hide the distances to the outgroup patterns;
Delete selected – deletes the selected sequences;
Close – closes the table.
Functionality
The Multi-D plot is a graphical representation of the OUPs in a multidimensional graph that shows similarities and differences between the OUPs based on their relative distances from the selected reference sequences. The number of dimensions of the Multi-D plot is variable and can be set, but the appropriate number of outlier groups must be selected so that there are always n+1 outliers for n dimensions. This provides appropriate reference planes for visualizing the OUPs with respect to one another. A resolution table is also generated so that the graphical representation can be examined in more detail.
Menu Commands
Renaming and deletion of Multi-Ds: in the workspace, the list “Items to show:” lists the multidimensional plots, and these can be edited using the rename or delete icons.
Selecting and deselecting sequences in Multi-D: when a new multidimensional projection is generated, it is possible to select specific sequences to be included in or left out of the plot. Sequences can also be selected on the plot itself by double-clicking the plot points.
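
The rule that n dimensions require n+1 outgroups can be expressed as a small check. The helper names below are hypothetical and the sketch covers only the counting rule, not the projection itself.

```python
def dimensions_for_outgroups(outgroups):
    """Number of dimensions that a list of outgroups can define (n+1 -> n)."""
    if len(set(outgroups)) != len(outgroups):
        raise ValueError("outgroups must be distinct")
    if len(outgroups) < 2:
        raise ValueError("at least two outgroups are needed for one dimension")
    return len(outgroups) - 1


def check_outgroups(n_dimensions, outgroups):
    """Raise ValueError unless exactly n_dimensions + 1 outgroups are given."""
    if dimensions_for_outgroups(outgroups) != n_dimensions:
        raise ValueError(
            "%d dimensions need %d outgroups, got %d"
            % (n_dimensions, n_dimensions + 1, len(outgroups))
        )


# Five dimensions need six outgroup sequences.
check_outgroups(5, ["og1", "og2", "og3", "og4", "og5", "og6"])
```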
Cluster trees	Functionality
A cluster tree plot shows, in a cladogram, the groupings of OUPs that are most similar to each other compared to more distant sequences. The clusters then form part of the database, which is searchable to identify unknown sequences. The clusters that are generated have a statistical cut-off value to ensure that the OUPs are sufficiently similar. The default cut-off value is 67.8%, but it can be changed to generate clusters of higher or lower similarity. After a cluster tree is generated, specific nodes can be selected from the View menu, which helps the user to visualize the results. Reference sequences can also be selected from here.
Generating a cluster tree: to generate a cluster tree, click the “Generate cluster tree” button in the Multi-D workspace. A tree is generated based on the pattern coordinates and the set cluster thresholds.
Renaming and deletion of cluster trees: in the workspace, the newly generated cluster tree can be selected from the list “Items to show”; it can then be renamed or deleted.
Overview of the cluster tree:
Cluster thresholds: these show the branches of the tree as multiples of 10.
Cluster numbering: the numbers in red show the designation of each cluster.
Color code for sequence resolutions: sequences marked in green show high similarity, those in yellow low similarity.
Supporting information: information about the pattern used as well as the resolution, given on the same line as the cluster threshold values.
Cluster consistency and setting the cut-off value: the number of dimensions as well as the cut-off value are set at the top of the cluster tree window.
Incrementing and decrementing the dimension: at the top of the cluster tree window, the <<DIM and DIM>> buttons can be used to decrease or increase the number of dimensions, respectively.
Menu Commands
Menu: View
View->Hide/Show Outliers: hides or shows outliers.
View->Select or unselect all: selects or deselects nodes.
Menu: Command
Command->Recalculate: recalculates the cluster tree.
Menu: Edit
Edit->Rearrange nodes: rearranges the cluster tree.
Edit->Delete clusters or nodes: deletes clusters and nodes.
Edit->Set references: sets references, the same as for Multi-D.
Menu: File
File->Export clusters to>: exports clusters of sequences to external sequence files; FASTA, GBK or GBFF format can be chosen.
File->Convert to text: exports the cluster layout to a text file.
File->Show end-node elements: lists the end-node (leaf) elements.
Menu: Database
Database->Import to database: imports the tree into the database.
Database concepts and Database Editor	Functionality
The database is a collection of all the tables that were generated using the same pattern parameters (e.g. n1_4mer). These tables are then combined into clusters against which an unknown sequence can be searched. The unknown sequence is inserted into the cluster that has the most similar OUP profile, and from this the identity of the sequence can be inferred.
Menu Commands
Menu: Database
Database->Edit database: calls the database editor. In the database editor, databases are listed according to the pattern used, e.g. n1_4mer is the database for all 4mer words used.
Menu: File
File->Organize: renames and hides tables using the Organize dialog.
File->Check sources: checks the availability of the source sequences.
File->Save as cluster tree file: saves a table as a separate cluster tree.
File->Import tables or Export selected tables: imports or exports database tables in the database editor.
File->Import sequence: adds sequences from a file.
Menu: Edit
Edit->Recalculate: recalculates the cluster tree.
Edit->Set references: sets references, the same as for Multi-D and the cluster tree.
Menu: View
View->Select/Unselect/Find: selects, unselects and finds node elements.
View->Search: searches for end-nodes in the selected database only.
View->Global search: searches all databases for the end-node.
Switching between Tree and Associations views
Overview of the Associations view:
hiding and showing of singleton patterns: “Show only Cluster”;
hiding and showing of subordinate levels: “Show only top level nodes”;
converting clusters to tables: Edit->Convert Clusters to table;
assigning subordinate tables: Add/Remove dialog;
meaning of the node /unidentified/: this node contains sequences that do not fall into the clusters, or unidentified sequences.
Adding Sequences to a database
To add sequences from a workspace, click Database->Add sequences. You will be prompted to select the sequences of interest and then the database to which they should be added. A window “changed tables” is then displayed, where the changes to the database can be reviewed and then saved.
Sequence identification	Functionality
The program MetaLingvo can be used to identify unknown sequences by comparing the OUP of an unknown sequence with those stored in the database. The comparison inserts the unknown sequence into the cluster with the best-matching OUPs and also shows the sequences with the highest similarity. The unknown sequence must be of sufficient length for an appropriate comparison to be made. The longer the sequences, the longer the computing time.
Identification of individual sequences: to identify individual sequences, click Command->Identify Sequences, then select the sequence of interest from the list and click Identify. This will generate a report that gives all the relevant sequence information. The report shows the best hits for the unknown sequence as well as their locations in the specific database that was searched. Statistical parameters such as deviation are also shown for the calculated OUP.