Publication

  • Tu Q, Tang HX, Ding D. MedBlast: searching articles related to a biological sequence. Bioinformatics. 2004; 20(1):75-77. (PDF)

 

Help for MedBlast Online

  • FASTA format: the first line holds the name or description and must begin with a '>' symbol; the following lines contain only sequence data. An example:

    >gi|3023354|sp|P56381|ATPE_HUMAN ATP synthase epsilon chain, mitochondrial
    MVAYWRQAGLSYIRYSQICAKAVRDALKTEFKANAEKTSGSNVKIVKVKKE
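
    The format described above can be read with a few lines of code. The
    following is a minimal Python sketch, not MedBlast's own parser (MedBlast
    itself is a Perl program); the function name parse_fasta is hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>gi|3023354|sp|P56381|ATPE_HUMAN ATP synthase epsilon chain, mitochondrial
MVAYWRQAGLSYIRYSQICAKAVRDALKTEFKANAEKTSGSNVKIVKVKKE
"""
records = parse_fasta(sample)
```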


NAME

  MedBlast: searching articles related to a biological sequence


SYNOPSIS

  To search articles by sequences:
     ./medblast.pl -fasta sample.fas -out sample.htm
  To search articles by BLAST results:
     ./medblast.pl -blast sample.blast -out sample.htm
  To specify more options:
     ./medblast.pl -fasta sample.fas -out sample.htm -e 1e-100 -noindirect


DESCRIPTION

Introduction

  Two kinds of references are associated with a sequence:
  . Direct references:
       references cited by the sequence annotation;
       references citing the sequence in their text;
  . Indirect references:
       references containing gene symbols of the given sequence.
  In addition, gene symbols may have aliases, and sequences may have
  relatives, which include redundant sequences (e.g. protein and gene
  sequences) and close homologs. The articles related to the given
  sequence can be obtained by addressing all these issues.
  MedBlast takes a sequence in FASTA format as input. The program first
  uses BLAST to search the GenBank nucleic acid and protein non-redundant
  (nr) databases, extending the query to homologous sequences and to the
  corresponding nucleic acid and protein sequences. Users can also input
  BLAST results directly, in which case results from both the protein
  and the nucleic acid nr databases are recommended. Hits with low
  e-values are chosen as the relatives, because low-similarity hits often
  do not contain specific information. Very long sequences (e.g. 100k),
  which are usually genomic sequences, are also discarded, because they
  do not contain specific direct references. Users can adjust these
  parameters to meet their own needs.
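
  The selection of relatives described above can be sketched as a simple
  filter. This is an illustrative Python sketch only: the Hit record and
  the function select_relatives are hypothetical names, and using "less
  than or equal" for both limits is an assumption. The default values
  mirror the -e and -l options documented under Usage.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str   # hypothetical identifier field
    evalue: float    # best e-value reported by BLAST for this hit
    length: int      # length of the hit sequence


def select_relatives(hits, max_evalue=1e-20, max_length=100000):
    """Keep hits that are similar enough and not too long (assumed rule)."""
    return [h for h in hits
            if h.evalue <= max_evalue and h.length <= max_length]


hits = [
    Hit("near_identical", 1e-80, 1200),
    Hit("weak_match", 1e-5, 900),          # rejected: e-value too high
    Hit("genomic_contig", 0.0, 250000),    # rejected: sequence too long
]
relatives = select_relatives(hits)
```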
  Afterwards, MedBlast uses the E-utilities toolset
  (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) of
  NCBI to retrieve the articles corresponding to each sequence from PubMed.
  When one hit contains multiple redundant protein sequences, all of them
  are considered, because their annotation is sometimes not redundant.
  For direct references, MedBlast uses the ELink tool to retrieve the
  articles cited by the relatives and the ESearch tool to search for the
  articles citing the relatives.
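
  To make the two E-utilities calls concrete, the sketch below builds the
  request URLs in Python without sending them. It is not taken from the
  MedBlast source. The base URL is the current NCBI E-utilities address
  rather than the one in use in 2002, and the choice of search term for
  "articles citing the relatives" (here an accession number) is an
  assumption; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Current E-utilities base URL (assumption: differs from the 2002 address).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(seq_id, dbfrom="protein"):
    """URL asking ELink for PubMed records linked to one sequence record."""
    params = {"dbfrom": dbfrom, "db": "pubmed", "id": seq_id}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


def esearch_url(term, retmax=40):
    """URL asking ESearch for PubMed records matching a query term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


cited_by_sequence = elink_url("3023354")   # articles cited by the record
citing_sequence = esearch_url("P56381")    # articles mentioning the accession
```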
  For indirect references, MedBlast reads in the sequence and obtains the
  gene symbols from the 'gene' tags of the 'CDS' features of the sequence
  file. The 'CDS' feature must overlap with the HSP fragment of the hit,
  so that information from non-homologous fragments is ignored.
  Sometimes the value of a 'gene' tag is not a valid symbol, for
  example, A, E+E', 14-3-3, ORF12, gene 27. A set of rules is employed
  to filter out these words, based on word length, non-word characters,
  ORF names, numbers and so on. Users may also define a stop-word list to
  discard particular words. Some tags contain a symbol followed by
  other words, such as 'HSP60 gene'; these can be rescued automatically.
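
  The exact filtering rules are not listed here, so the following Python
  sketch only illustrates the kind of rules named above (word length,
  non-word characters, ORF names, numbers, stop words, rescue of values
  such as 'HSP60 gene'). Every threshold and pattern in it is an
  assumption, chosen so that the examples from the text behave as
  described; clean_symbol and GENERIC_WORDS are hypothetical names.

```python
import re

# Words stripped during "rescue" (assumed; the real list is not documented).
GENERIC_WORDS = frozenset({"gene"})


def clean_symbol(tag_value, stop_words=frozenset(), allow_hyphen=True):
    """Return a usable gene symbol from a 'gene' tag value, or None."""
    # Rescue: drop generic words, e.g. 'HSP60 gene' -> 'HSP60'.
    words = [w for w in tag_value.split() if w.lower() not in GENERIC_WORDS]
    if len(words) != 1:
        return None
    symbol = words[0]
    if symbol.lower() in stop_words:          # user-defined stop words
        return None
    if len(symbol) < 2:                       # too short, e.g. 'A'
        return None
    allowed = r"[A-Za-z0-9-]+" if allow_hyphen else r"[A-Za-z0-9]+"
    if not re.fullmatch(allowed, symbol):     # non-word characters, e.g. E+E'
        return None
    if not re.search(r"[A-Za-z]", symbol):    # no letters, e.g. 14-3-3 or 27
        return None
    if re.fullmatch(r"orf\d*", symbol, flags=re.IGNORECASE):  # e.g. ORF12
        return None
    return symbol


rejected = [clean_symbol(v) for v in ["A", "E+E'", "14-3-3", "ORF12", "gene 27"]]
rescued = clean_symbol("HSP60 gene")
```

  Passing allow_hyphen=False in this sketch corresponds in spirit to the
  -nohyphen option described under Usage.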
  Then the program looks up the symbols in the precompiled thesaurus of
  the organism to find other aliases. The corresponding genes/proteins
  in different organisms should be obtained by homology rather than by
  alias, because highly homologous sequences are generally the
  corresponding genes/proteins, while identical symbols in different
  organisms may not denote the same gene/protein. The program therefore
  also tries to find the binomial organism name of the sequence, and then
  gets the common name from the NCBI taxonomy database. With this
  information, MedBlast constructs the query for the ESearch tool, in
  which the symbol and the binomial name can be in any field and the
  common name should be in MeSH terms, for
  example, 'ATP5E AND ("Homo sapiens" OR human [mh])'.
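
  The query construction can be sketched as follows. This Python sketch
  reproduces the example query given above; how MedBlast combines several
  aliases is not stated, so joining them with OR inside parentheses is an
  assumption, and build_query is a hypothetical name.

```python
def build_query(symbols, binomial_name, common_name=None):
    """Build a PubMed query from gene symbols and organism names."""
    # Symbol and aliases may match in any field (OR grouping is assumed).
    if len(symbols) == 1:
        symbol_part = symbols[0]
    else:
        symbol_part = "(" + " OR ".join(symbols) + ")"
    # Binomial name in any field; common name restricted to MeSH terms.
    organism_terms = [f'"{binomial_name}"']
    if common_name:
        organism_terms.append(f"{common_name} [mh]")
    if len(organism_terms) == 1:
        organism_part = organism_terms[0]
    else:
        organism_part = "(" + " OR ".join(organism_terms) + ")"
    return f"{symbol_part} AND {organism_part}"


query = build_query(["ATP5E"], "Homo sapiens", "human")
# query == 'ATP5E AND ("Homo sapiens" OR human [mh])'
```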
  Finally, MedBlast outputs a report in HTML format, including tables
  summarizing all direct and indirect references. Redundant references
  are ignored.

Installation

  The program is freely available at http://medblast.sibsnet.org (we also
  maintain an online server there). The tarball includes the program, the
  configuration file, thesauruses and other data. Thesauruses are
  currently available for E. coli, S. cerevisiae, mouse and human.
  System requirements: Perl 5.6 or above, BioPerl 1.0.2 or above.
  Note that BioPerl 1.0.2 is required to get the multiple sequences in
  one hit. Both UNIX and Windows are supported.
  Some configuration is required: just edit medblast.conf.
  Then the program will run happily. :)

Usage

  Required parameters:
  -fasta      : FASTA sequence file as input, to be submitted to BLAST.
  -blast      : BLAST result as input, which can be analyzed directly
                without waiting for BLAST again. GI numbers are
                required. The corresponding translated BLAST result is
                recommended so that the program can get more information.
                Either a FASTA sequence or a BLAST result is required as
                input.
  -out        : output HTML file. Required.
  Optional parameters:
  -e          : upper limit on the e-value of hits and HSPs. The default is
                1e-20, which makes the program analyze only almost
                identical sequences. If you want more information from
                homologous sequences, increase the e-value limit.
  -l          : limit of the hit sequence length, default is 100000.
  -r          : max references retrieved each time, default is 40.
  -t          : limit of retry times if network error occurs. default is 3.
  -nodirect   : do not retrieve direct references.
  -noindirect : do not retrieve indirect references.
  -nohyphen   : do not allow symbols with hyphens. see "KNOWN BUG".
  -noseqcache : do not read/write sequence from local disk cache.
  -thesaurus  : user-defined thesaurus file. These aliases will be used
                regardless of organism. Each line lists all the aliases
                of one gene, separated by space characters. Remark lines
                must begin with '#'. You can use 'name:' and
                'version:' in separate remark lines to specify the
                thesaurus name and version. Here is a sample:
                # my thesaurus
                # name: user-defined
                # version: 1.0
                gene1 alias1 alias2
                gene2 alias
  -stopword   : user-defined stop-word list, one stop word per line.
                This disables the default stop-word list; alternatively,
                you can add your words to the default list directly.
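
  The thesaurus file format described under -thesaurus is simple enough
  to parse by hand. The Python sketch below reads the sample shown there;
  it is an illustration under the stated format only, and the names
  parse_thesaurus and alias_index are hypothetical.

```python
def parse_thesaurus(text):
    """Return (metadata, alias_groups) from thesaurus file contents."""
    meta, groups = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Remark line; 'name:' and 'version:' carry metadata.
            remark = line[1:].strip()
            for key in ("name", "version"):
                if remark.startswith(key + ":"):
                    meta[key] = remark[len(key) + 1:].strip()
            continue
        # Each remaining line holds all aliases of one gene.
        groups.append(line.split())
    return meta, groups


def alias_index(groups):
    """Map every alias to the full set of aliases of its gene."""
    index = {}
    for group in groups:
        for alias in group:
            index.setdefault(alias, set()).update(group)
    return index


sample = """# my thesaurus
# name: user-defined
# version: 1.0
gene1 alias1 alias2
gene2 alias
"""
meta, groups = parse_thesaurus(sample)
index = alias_index(groups)
```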
  MedBlast supports proxies, which are set through an environment
  variable. For example, in the Bash shell:
  export HTTP_PROXY="http://my.home.org:8080"
  Output:
  MedBlast outputs an HTML file that includes direct references (table
  and summary), indirect references (table and summary), errors, options
  and BLAST results. References in the tables are in bold font, while
  those already outputted are in normal font. The corresponding
  references are listed under each symbol, with these markers:
  (...)       : the symbol has already been outputted, so its references
                are not repeated.
  (x)         : something went wrong with the search.
  (-)         : no reference was found for the symbol.
  If a symbol has too many references, a MORE link follows the list.
  Each reference in the summaries has a link to its PubMed record.
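
  The markers used in the tables can be summarized as a small decision
  procedure. The Python sketch below produces plain text rather than
  HTML and is not MedBlast's actual rendering code. The function name
  reference_cell is hypothetical, and tying the MORE link to a limit of
  40 references (the -r default) is an assumption.

```python
def reference_cell(symbol, pmids, seen_symbols, failed=False, max_refs=40):
    """Return the text shown under one symbol in a reference table."""
    if symbol in seen_symbols:
        return "(...)"              # symbol already outputted earlier
    seen_symbols.add(symbol)
    if failed:
        return "(x)"                # the search went wrong
    if not pmids:
        return "(-)"                # no reference found
    text = " ".join(str(p) for p in pmids[:max_refs])
    if len(pmids) > max_refs:
        text += " MORE"             # stands in for the MORE link
    return text


seen = set()
first = reference_cell("ATP5E", [101, 102], seen)
repeat = reference_cell("ATP5E", [101, 102], seen)
```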


WARNING

  DO NOT OVERLOAD NCBI SERVERS PLEASE!
  There are some default limits in the program to avoid overloading
  NCBI servers:
  Besides, users should comply with the usage limits of the NCBI
  E-utilities. See:
  http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
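
  One way to stay gentle on the servers is to pause before every retry
  and to give up after a bounded number of attempts, in the spirit of the
  -t option. The Python sketch below is an assumption about how such a
  wrapper could look, not MedBlast's implementation; fetch_with_retry and
  the one-second delay are hypothetical, and the actual request limits
  are those published by NCBI at the address above.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url); on a network error wait, then retry a few times.

    `fetch` is any callable that raises OSError on network failure
    (urllib.error.URLError is a subclass of OSError).
    """
    attempt = 0
    while True:
        try:
            return fetch(url)
        except OSError:
            if attempt >= retries:
                raise               # give up after the last retry
            attempt += 1
            sleep(delay)            # pause so the server is not hammered


# Demonstration with a fake fetcher that fails twice, then succeeds.
calls, waits = [], []

def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise OSError("temporary network error")
    return "ok"

result = fetch_with_retry(flaky, "http://example.invalid/", sleep=waits.append)
```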


KNOWN BUG

  If a symbol contains a hyphen and is not in the Entrez index, PubMed
  will break the symbol apart and repeat the search automatically. This
  behavior returns unrelated data, but we cannot disable it at present.
  If this happens, try the '-nohyphen' option to ignore all symbols with
  hyphens, or use the stop-word list to ignore the offending symbol.
  If you use BioPerl 1.2, please manually fix the bug in
  Bio::Factory::FTLocationFactory.pm so that PDB sequences can be read.
  Change the line:
  } elsif(($op eq "join") || ($op eq "order")) {
  to:
  } elsif(($op eq "join") || ($op eq "order") || ($op eq "bond")) {


AUTHOR

  Qiang Tu   (Chinese Academy of Sciences)
  Haixu Tang (University of California, San Diego)
  Dafu  Ding (Chinese Academy of Sciences)
  Email: qtu@sibs.ac.cn
  Any comments and suggestions are welcome.
  We will be very happy if it is useful to you. :)
  November 2002.


SEE ALSO

  Our online server at http://medblast.sibsnet.org


COPYRIGHT

  Copyright 2002,  Chinese Academy of Sciences.   All Rights Reserved.
  This program is free software for academic users. You may copy or
  redistribute it under the same terms as Perl itself.
  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.