- 1 How to create a new Database?
- 2 How to create a new project?
- 3 How to set the genome fasta file?
- 4 How to set the genome NCBI taxonomy ID?
- 5 How to search for sequence similarities with merlin?
- 6 How to perform transport proteins identification?
- 7 How to generate transport reactions?
- 8 How to annotate transporters unavailable in merlin?
- 9 How to integrate the identified transport proteins and generated transported reactions into the metabolic model?
- 10 How to predict compartments?
- 11 How to load KEGG metabolic data?
- 12 How to build draft model?
- 13 How to curate the draft model?
- 14 How to build a SBML model?
- 15 How to format a Database?
- 16 How to delete a Database?
How to create a new Database?
To create a new database, click on Database -> New Database.
Since the creation of a new database does not involve creating a new project, the database login information is required.
How to create a new project?
To create a new project click on Project -> Create Project.
Database login information is required. The user should select one of the available databases. If none is available, a new database has to be created beforehand. The default login information gives access to merlin's database (if the downloaded version includes MySQL). Other databases (remote or local) can also be accessed by merlin. The default parameters are pre-filled in the input boxes (default password = password).
The *.faa or *.fna genome fasta files are required to perform remote and local similarity alignments with merlin. IMPORTANT: When setting the fasta file(s), ALL (fasta) files in the selected FOLDER are retrieved by merlin. Be careful when setting the files in a new project!
How to set the genome fasta file?
The fasta files should be set while creating a project or afterwards, using the Project->Set Genome Files feature.
On Windows use 7-zip or similar to uncompress the *.gz files.
The NCBI fasta files have the following configuration:
>NP_414542.1 thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
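The header layout shown above can be split into its parts with a few lines of Python. This is only a minimal sketch, assuming headers of the form `>accession description [organism]`; the function name `parse_ncbi_header` is hypothetical and is not part of merlin.

```python
import re

# Matches headers such as:
# >NP_414542.1 thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
HEADER_RE = re.compile(
    r"^>(?P<accession>\S+)\s+(?P<description>.*?)\s*\[(?P<organism>[^\]]+)\]\s*$"
)

def parse_ncbi_header(line):
    """Return (accession, description, organism), or None if the line does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return (match.group("accession"),
            match.group("description"),
            match.group("organism"))
```

Running such a check over every header line before loading the files is one way to spot entries that do not follow the NCBI layout.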
The non-NCBI fasta files should have the following configuration to be used within merlin:
To set the fasta file click on Project -> Set fasta Files.
IMPORTANT: When setting the fasta file(s), ALL (fasta) files in the selected FOLDER are retrieved by merlin. Be careful when setting the files in a new project!
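Because every fasta file in the selected folder is picked up, it can help to list the folder contents before setting it. The sketch below is an illustration only: it assumes that the relevant files are those ending in `.faa` or `.fna` (the extensions named in this guide), which may not match exactly how merlin recognises fasta files, and `list_fasta_files` is a hypothetical helper name.

```python
from pathlib import Path

FASTA_SUFFIXES = {".faa", ".fna"}  # extensions named in this guide (assumption)

def list_fasta_files(folder):
    """Return the sorted names of files in `folder` that have a fasta extension."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES
    )
```

If the list contains files from another genome, move them elsewhere before pointing merlin at the folder.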
How to set the genome NCBI taxonomy ID?
If the genome uploaded to merlin was not downloaded from the RefSeq or GenBank FTP site, it is mandatory to insert the NCBI Taxonomy ID. The user should retrieve the NCBI Taxonomy ID from http://www.ncbi.nlm.nih.gov/taxonomy.
For instance, for Escherichia coli str. K-12 substr. MG1655 the NCBI Taxonomy ID is 511145. If only the genus of the organism is known, the NCBI Taxonomy ID should be set to that of the genus (561 for Escherichia).
As a last resort, the Taxonomy ID should be set to 131567 (cellular organisms). The taxonomy ID can be set while creating a new project, or afterwards in Project -> Set fasta Files.
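The fallback order described in this section (most specific ID first, then the genus, then cellular organisms) can be written down as a small helper. This is a sketch for illustration; `choose_taxonomy_id` is a hypothetical name and merlin does not expose such a function.

```python
CELLULAR_ORGANISMS = 131567  # fallback ID named in this guide

def choose_taxonomy_id(strain_id=None, genus_id=None):
    """Return the most specific known ID, falling back to 'cellular organisms'."""
    for candidate in (strain_id, genus_id):
        if candidate is not None:
            return candidate
    return CELLULAR_ORGANISMS
```

For example, `choose_taxonomy_id(511145, 561)` picks the strain ID, while calling it with no arguments returns the generic fallback.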
How to search for sequence similarities with merlin?
The genome fasta files should already have been assigned to the project.
To perform BLAST sequence similarity searches with merlin, go to Enzymes -> BLAST. There are two servers available: NCBI annotation and EBI annotation.
To perform HMMER sequence similarity searches with merlin, go to Enzymes -> HMMER annotation.
How to perform transport proteins identification?
To perform the identification of transport proteins with merlin the user has to use the TransMembrane protein prediction with Hidden Markov Models (TMHMM) server to predict which genes will have transmembrane helices in their secondary structure. This action requires uploading to the TMHMM web-server at http://www.cbs.dtu.dk/services/TMHMM/ the same fasta files submitted to merlin. The files are submitted one at a time and the resulting html page should be saved with the extension '.tmhmm', so that merlin can recognize such files.
Afterwards, open the Transporters -> Transport Proteins Identification operation. In the dialog box set the path to the TMHMM files directory and configure the algorithm parameters, namely:
- The minimum number of helices a gene sequence must have to be compared to the TCDB sequences.
- The minimum similarity threshold between the gene sequences and the TCDB sequences.
- The alignment algorithm.
This method can take several hours to complete, depending on the processing power of the computer.
If an error occurs and the transport proteins identification cannot be performed, the user should edit the run.bat file located in the utilities folder in merlin (run.sh for Unix users). The MAXMEM parameter should be increased from "-Xmx1536M" to "-Xmx3064M" or more, so that merlin has enough memory to perform the similarity search alignments.
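The memory change can also be applied with a short script instead of a text editor. The sketch below only rewrites the `-Xmx` option inside a piece of text. The exact contents of run.bat and run.sh are not shown in this guide, so the sample line in the usage note is hypothetical, as is the helper name `raise_max_heap`.

```python
import re

def raise_max_heap(script_text, new_value="3064M"):
    """Replace every -Xmx<size> JVM option in the script text with -Xmx<new_value>."""
    return re.sub(r"-Xmx\d+[kKmMgG]?", "-Xmx" + new_value, script_text)
```

Applied to a hypothetical line such as `set MAXMEM="-Xmx1536M"`, the function returns the same line with `-Xmx3064M`. Keep a backup copy of the launch script before overwriting it.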
How to generate transport reactions?
To generate the transport reactions to be integrated into the model, the user should go to Transporters -> Transport Reactions generation.
Software parameters should be configured, namely:
- The alpha value, which is the weight of the frequency and taxonomy scores.
- The Minimum Frequency, which is the minimum number of times a metabolite has to be associated to a given gene, so that a transport reaction for such metabolite is generated and linked to the gene.
- The beta value, which is the penalty for metabolites with less frequency than the minimum required.
- The cut-off threshold for metabolites selection.
- Whether to verify the balance of the generated transport reactions. If the metabolite chemical formula is not available, the reaction will not be validated.
- Whether to keep reactions with metabolites not available on KEGG.
To access a list with these reactions the user can go to the /temp folder inside merlin and find a text file with all the generated transport reactions. A visualizer for this information is currently in development.
How to annotate transporters unavailable in merlin?
Although very small in bioinformatics terms, TCDB already has over 12000 annotated entities. However, the merlin developers have manually annotated only 3251 of them.
Thus, merlin users can contribute to this effort by annotating TCDB entities unannotated in merlin, but for which similarities were found when performing the transport proteins similarity search. For this, the users must generate the transport reactions with merlin. If new TCDB entities are found, the user should abort the operation and go to:
[local merlin folder]/temp/[project_name]/Th_[Cut_off_threshold]__al_[alpha]__be_[beta]/reactionValidation[true/false]/kegg_only[true/false]
And open the file [project_name]UnAnnotatedTransporters.out as a spreadsheet with tab separated columns.
The first line (after the headers) of this file contains an example annotation for a TCDB transporter. The other lines should be filled in by the user. The mandatory fields are: direction, metabolites and reversibility.
The metabolites field has some rules:
Each metabolite should be separated by semi-colons (e.g. glucose; sucrose).
If the metabolite(s) is(are) co-transported with another metabolite by symport the co-metabolites should be separated by a colon (e.g. glucose; sucrose : Na+).
If the metabolite(s) is(are) co-transported with another metabolite by antiport the co-metabolites should be separated by two slashes (e.g. glucose; sucrose // Na+).
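The three rules above can be checked mechanically before sending the file. The following sketch is not merlin's own parser: `parse_metabolites_field` is a hypothetical helper, and it assumes that metabolite names themselves never contain a colon, a semi-colon or two consecutive slashes.

```python
def parse_metabolites_field(field):
    """Split a metabolites field into transported metabolites, co-metabolites and mode."""
    if "//" in field:
        main, co = field.split("//", 1)
        mode = "antiport"
    elif ":" in field:
        main, co = field.split(":", 1)
        mode = "symport"
    else:
        # No co-transport separator found in the field.
        main, co, mode = field, "", None

    def split_names(text):
        # Metabolites are separated by semi-colons; drop surrounding spaces.
        return [name.strip() for name in text.split(";") if name.strip()]

    return {
        "metabolites": split_names(main),
        "co_metabolites": split_names(co),
        "mode": mode,
    }
```

For the examples in the rules, `glucose; sucrose : Na+` is read as a symport of glucose and sucrose with Na+, and `glucose; sucrose // Na+` as an antiport.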
After the annotation, the resulting file should be sent to email@example.com for confirmation and formatting. The user will receive a file which can be then uploaded to merlin using the operation available at Transporters -> New Transporters Loading.
How to integrate the identified transport proteins and generated transported reactions into the metabolic model?
To integrate the transport proteins and reactions in the model, the user should go to Transporters -> Transporters Integration.
This operation will associate the previously generated reactions (which can be visualized in a text file located at
[local merlin folder]/temp/[project_name]/Th_[Cut_off_threshold]__al_[alpha]__be_[beta]/reactionValidation[true/false]/kegg_only[true/false])
to the reactions in the model.
Also, new proteins are added to the Proteins tab providing TC family numbers and gene-proteins associations.
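The folder template given above can be assembled from the parameter values used when generating the reactions. This is a sketch under assumptions: `transport_output_dir` is a hypothetical helper, and the way merlin formats the numeric values and the true/false flags in folder names is assumed here, so compare the result with the folders actually present under temp.

```python
from pathlib import Path

def transport_output_dir(merlin_dir, project_name, threshold, alpha, beta,
                         validate_reactions, kegg_only):
    """Build the output folder path following the template in this guide."""
    # Assumption: numbers appear as Python prints them and flags as lowercase true/false.
    params = "Th_{}__al_{}__be_{}".format(threshold, alpha, beta)
    return (Path(merlin_dir) / "temp" / project_name / params
            / "reactionValidation{}".format(str(validate_reactions).lower())
            / "kegg_only{}".format(str(kegg_only).lower()))
```

The parameter values passed in should be the ones chosen in the Transport Reactions generation dialog.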
How to predict compartments?
The compartments prediction is handled differently for eukaryotes and prokaryotes in merlin.
For prokaryotes, the HTML files retrieved from the PSORTb web interface (Long Output Format) should be loaded using the operation Compartments -> Load PSORTb v3.0 Results. Then, the operation Compartments -> Perform Compartments Prediction should be performed.
For eukaryotes, the first step is skipped because the second operation retrieves the results from WoLF PSORT remotely.
The genes are automatically assigned with the main compartment predicted by these programs. Moreover, if alternative compartments have scores that differ by less than a user defined percentage (default value of 10%) from the main compartment, the gene will also be assigned to those compartments.
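The assignment rule can be illustrated with a short function. This is a sketch, not merlin's implementation: it assumes that the difference is measured as a percentage of the main compartment's score, and the function name and the example scores are hypothetical.

```python
def assign_compartments(scores, max_difference_percent=10.0):
    """Return the main compartment followed by alternatives close to it in score."""
    main = max(scores, key=scores.get)
    main_score = scores[main]
    assigned = [main]
    if main_score <= 0:
        # Relative differences are undefined without a positive main score.
        return assigned
    for name, score in scores.items():
        if name == main:
            continue
        # Assumption: difference is relative to the main compartment's score.
        difference = (main_score - score) / main_score * 100.0
        if difference < max_difference_percent:
            assigned.append(name)
    return assigned
```

With hypothetical scores of 9.0 for cytoplasm, 8.5 for periplasm and 2.0 for membrane, the gene would be assigned to cytoplasm and periplasm, because only the periplasm score lies within 10% of the main one.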
After the compartments prediction, the results should be integrated in the internal model using Compartments -> Compartments Integration, generating a fully compartmentalised draft model.
How to load KEGG metabolic data?
To retrieve metabolic data from KEGG, go to Database -> Load KEGG Data.
If KEGG has its own annotation for the target genome, such annotation can also be retrieved. In that case, the enzymatic (re-)annotation performed within merlin is integrated with KEGG's annotation and the internal model is assembled using both annotations.
Several panels were developed for the visualisation and editing of the KEGG data associated with a given metabolic model, namely, the Genes Viewer, the Proteins Viewer, the Metabolites Viewer, the Reactions Viewer and the Pathways Viewer. The Proteins Viewer includes a sub-viewer for the visualisation of information for enzymes, the Enzymes Viewer. Likewise, the Metabolites Viewer comprises two sub-viewers: the Reactants/Products Viewer and the Compounds/Reactions Viewer. The first sub-viewer is a fast and easy way to check if a metabolite is a reactant, a product or if it can have both roles in the network. The second sub-viewer is used to determine in which reactions a given metabolite participates.
How to build draft model?
Combining the output of the enzymes, transporters and compartments annotation with the data retrieved from KEGG generates a draft model with all the reactions.
How to curate the draft model?
The Reactions viewer allows the user to perform the curation of the model. The panel shows reactions grouped per pathway (thus the repetition of reactions is not uncommon) with different automatically sorted colours in each pathway.
This panel allows visualising reactions in all pathways or selecting just a specific pathway. The Draw in Browser feature opens the page of the selected KEGG Pathway map in the default internet browser and ‘paints’ all enzymes and reactions included in the internal model that belong to that pathway. This feature, together with the Model -> Find unconnected reactions in the network operation, which paints the names and descriptors of these reactions in red, makes it easy to find gaps in the model. The Model -> Find unbalanced reactions in the network operation is also very useful for finding and labelling stoichiometrically unbalanced reactions within this view. In this case, the reaction name is shown in bold italics.
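To make clear what a stoichiometrically unbalanced reaction is, the sketch below counts the atoms on each side of a reaction and compares the totals. It is an illustration of the general idea only, not merlin's implementation. It handles only simple formulas such as C6H12O6 (no brackets, charges or generic groups), and the function names are hypothetical.

```python
import re
from collections import Counter

def atom_counts(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def is_balanced(reactants, products):
    """Check element balance; each side is a list of (coefficient, formula) pairs."""
    def side_total(side):
        total = Counter()
        for coefficient, formula in side:
            for element, count in atom_counts(formula).items():
                total[element] += coefficient * count
        return total
    return side_total(reactants) == side_total(products)
```

For instance, C6H12O6 + 6 O2 -> 6 CO2 + 6 H2O passes the check, whereas H2 + O2 -> H2O does not because the oxygen atoms do not match.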
When the integration of the transporters annotation is performed, a surrogate pathway is created by merlin, the Transporters Pathway including all transport reactions that met the integration criteria for being inserted in the model. After performing the integration of the compartments data, spontaneous and other reactions not associated to genes are automatically assigned to the internal compartment (cytosol for Eukaryotes and cytoplasm for Prokaryotes).
How to build a SBML model?
The operation Model-> Export to SBML allows exporting the internal model in the SBML format with MIRIAM annotations.
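An exported file can be inspected outside merlin with standard XML tools. The sketch below uses only the Python standard library to list the reaction identifiers in an SBML document. It makes no assumption about the SBML level or version that merlin writes, and `list_reaction_ids` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def list_reaction_ids(sbml_text):
    """Return the id attribute of every <reaction> element in an SBML document."""
    root = ET.fromstring(sbml_text)
    # Tags carry the SBML namespace as '{uri}reaction'; compare the local name only.
    return [element.get("id") for element in root.iter()
            if element.tag.rsplit("}", 1)[-1] == "reaction"]
```

Comparing the number of identifiers returned with the number of reactions shown in the Reactions viewer is a quick sanity check on the export.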
How to format a Database?
To erase data click on Database-> Clean Database. Select the project and which information will be removed.
KEGG_INFORMATION - to remove the information loaded from KEGG.
HOMOLOGY_INFORMATION - to remove the information retrieved from homology searches.
TRANSPORT_PROTEINS_INFORMATION - to remove the information retrieved from the local similarity alignments.
TRANSPORTERS_INFORMATION - to remove the information of the transport systems and reactions provided by merlin.
COMPARTMENTS_INFORMATION - to remove the information of the compartments prediction loaded into merlin.
ALL_INFORMATION - to remove all information from the selected project database.
How to delete a Database?
To delete a database, click on Database -> DROP Database. Select the project to DROP and click Ok. This action cannot be undone! All the loaded information will be lost, and the database will be removed! The project will also be removed from merlin's "Clipboard".