MedBlast: searching articles related to a biological sequence
To search for articles by sequence:
./medblast.pl -fasta sample.fas -out sample.htm
To search for articles from BLAST results:
./medblast.pl -blast sample.blast -out sample.htm
To specify more options:
./medblast.pl -fasta sample.fas -out sample.htm -e 1e-100 -noindirect
Two kinds of references are associated with a sequence:
. Direct references:
references cited by the sequence annotation;
references citing the sequence in their text;
. Indirect references:
references containing gene symbols of the given sequence.
In addition, gene symbols may have aliases, and sequences may have
relatives: redundant sequences (e.g. the protein and gene sequences of
the same gene) and close homologs. MedBlast finds the articles related
to a given sequence by addressing all of these issues.
MedBlast takes a sequence in FASTA format as input. The program first
uses BLAST to search the GenBank nucleic acid and protein non-redundant
(nr) databases, extending the search to homologous sequences and to the
corresponding nucleic acid and protein sequences. Users can also supply
BLAST results directly; in that case it is recommended to supply results
from both the protein and the nucleic acid nr databases. Only hits with
low e-values are chosen as relatives, because low-similarity hits often
do not contain specific information. Very long sequences (e.g. 100,000
bases), which are usually genomic sequences, are discarded as well,
because they do not contain specific direct references. Users can adjust
these parameters to meet their own needs.
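The selection of relatives described above can be sketched as a simple filter. This is only an illustration, not MedBlast's actual code: the `Hit` class and the `select_relatives` function are hypothetical names, the thresholds are the documented defaults of -e and -l, and treating both limits as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical, minimal view of one BLAST hit."""
    accession: str  # identifier of the hit sequence
    evalue: float   # best e-value reported for the hit
    length: int     # length of the hit sequence


def select_relatives(hits, max_evalue=1e-20, max_length=100000):
    """Keep hits that are similar enough (low e-value) and not too long."""
    return [h for h in hits
            if h.evalue <= max_evalue and h.length <= max_length]


hits = [
    Hit("A", 1e-150, 1200),   # close relative, kept
    Hit("B", 1e-5, 900),      # low similarity, dropped by default
    Hit("C", 0.0, 250000),    # very long, probably genomic, dropped
]
print([h.accession for h in select_relatives(hits)])
```

Raising `max_evalue` admits more distant homologs, at the cost of less specific references.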
Afterwards, MedBlast uses the E-utilities toolset
(http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) of
NCBI to retrieve the articles corresponding to each sequence from PubMed.
When one hit contains multiple redundant protein sequences, all of them
are considered, because their annotations are sometimes not redundant.
For direct references, MedBlast uses the ELink tool to retrieve the
articles cited by the relatives and the ESearch tool to search for
articles citing the relatives.
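As a rough illustration of these two kinds of requests, the sketch below only builds E-utilities URLs and does not contact NCBI. The function names are hypothetical, the base URL is the current public E-utilities address (not necessarily the one MedBlast used), and searching PubMed for an accession string is an assumption about how "articles citing the relatives" might be found.

```python
from urllib.parse import urlencode

# Current public E-utilities base URL (an assumption for this sketch).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(seq_id, dbfrom="protein"):
    """URL asking ELink for PubMed records linked to one sequence record."""
    params = {"dbfrom": dbfrom, "db": "pubmed", "id": seq_id}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def esearch_url(term, retmax=40):
    """URL asking ESearch for PubMed records matching a search term.

    retmax=40 mirrors the documented default of the -r option.
    """
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


print(elink_url("12345"))     # "12345" is a placeholder identifier
print(esearch_url("X12345"))  # "X12345" is a placeholder accession
```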
For indirect references, MedBlast reads in the sequence file and obtains
the gene symbols from the 'gene' tags of its 'CDS' features. A 'CDS'
feature is used only if it overlaps the HSP fragment of the hit, so that
information from non-homologous fragments is ignored.
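The overlap requirement amounts to an interval test. The sketch below assumes closed, 1-based coordinates on the hit sequence and represents each CDS feature as a hypothetical `(start, end, gene)` tuple; MedBlast itself works on BioPerl feature objects, which are not modeled here.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end]
    share at least one position."""
    return a_start <= b_end and b_start <= a_end


def genes_in_hsp(cds_features, hsp_start, hsp_end):
    """Collect 'gene' values of CDS features overlapping the HSP.

    Features without a 'gene' tag (gene is None) are skipped.
    """
    return [gene for (start, end, gene) in cds_features
            if gene and overlaps(start, end, hsp_start, hsp_end)]


cds = [(1, 300, "geneA"), (500, 900, "geneB"), (950, 1200, None)]
print(genes_in_hsp(cds, 450, 600))
```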
Sometimes the value of the 'gene' tag is not a valid symbol, for
example: A, E+E', 14-3-3, ORF12, gene 27. A set of rules is applied
to filter out such values, based on word length, non-word characters,
ORF names, numbers and so on. Users may also define a stop-word list to
discard particular words. Some tags contain a symbol followed by
other words, such as 'HSP60 gene'; these are rescued automatically.
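A minimal sketch of such a filter is shown below. The exact rules, thresholds and default stop words used by MedBlast are not given here, so everything in this example (the function name `clean_symbol`, the minimum length of 2, the regular expressions, the stop-word set) is an assumption chosen only to reproduce the examples in the text.

```python
import re

# Hypothetical stop-word list; the real default list ships with MedBlast.
DEFAULT_STOPWORDS = frozenset({"gene", "protein", "hypothetical"})


def clean_symbol(tag, stopwords=DEFAULT_STOPWORDS, min_length=2):
    """Return a usable gene symbol from a 'gene' tag value, or None."""
    words = tag.split()
    # Rescue: 'HSP60 gene' -> 'HSP60'
    if len(words) == 2 and words[1].lower() in ("gene", "protein"):
        words = words[:1]
    if len(words) != 1:
        return None                      # e.g. 'gene 27'
    symbol = words[0]
    if len(symbol) < min_length:
        return None                      # e.g. 'A'
    if symbol.lower() in stopwords:
        return None
    if not re.fullmatch(r"[A-Za-z0-9-]+", symbol):
        return None                      # non-word characters, e.g. "E+E'"
    if re.fullmatch(r"orf\d*", symbol, flags=re.IGNORECASE):
        return None                      # ORF names, e.g. 'ORF12'
    if re.fullmatch(r"[\d-]+", symbol):
        return None                      # pure numbers, e.g. '14-3-3'
    return symbol


for tag in ["A", "E+E'", "14-3-3", "ORF12", "gene 27", "HSP60 gene", "ATP5E"]:
    print(repr(tag), "->", clean_symbol(tag))
```

Returning `None` for rejected values keeps the caller simple: only non-empty results are looked up in the thesaurus.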
Then the program looks up the symbols in the precompiled thesaurus of
the organism to find other aliases. Corresponding genes/proteins
in different organisms should be identified by homology rather than by
alias, because highly homologous sequences are almost always the
corresponding genes/proteins, while identical symbols in different
organisms may not refer to the same gene/protein. The program therefore
also tries to find the binomial organism name of the sequence, and then
gets the common name from the NCBI taxonomy database. With this
information, MedBlast constructs the query for the ESearch tool, in
which the symbol and the binomial name can be in any field and the
common name should be in MeSH terms, for
example, 'ATP5E AND ("Homo sapiens" OR human [mh])'.
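The query construction can be sketched as plain string assembly. The function name `build_query` and its fallback behavior when an organism name is missing are assumptions; how aliases from the thesaurus are combined into the query is not described above, so the sketch handles a single symbol only.

```python
def build_query(symbol, binomial=None, common=None):
    """Build a PubMed query for one gene symbol.

    The symbol and the binomial name are unrestricted (any field);
    the common name is restricted to MeSH terms with the [mh] tag.
    """
    organism = []
    if binomial:
        organism.append('"%s"' % binomial)
    if common:
        organism.append("%s [mh]" % common)
    if not organism:
        return symbol  # assumption: search the bare symbol
    return "%s AND (%s)" % (symbol, " OR ".join(organism))


print(build_query("ATP5E", "Homo sapiens", "human"))
# prints: ATP5E AND ("Homo sapiens" OR human [mh])
```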
Finally, MedBlast outputs a report in HTML format, including tables
summarizing all direct and indirect references. Redundant references
are ignored.
The program is freely available at http://medblast.sibsnet.org (we also
maintain an online server there). The tarball includes the program, the
configuration file, thesauri and other data. Thesauri are currently
available for E. coli, S. cerevisiae, mouse and human.
System requirements: Perl 5.6 or above, BioPerl 1.0.2 or above.
Note that BioPerl 1.0.2 is required to get the multiple sequences in one
hit. Both UNIX and Windows are supported.
Some configuration is required: just edit medblast.conf.
Then the program will run happily. :)
Required parameters:
-fasta : FASTA sequence file as input, to be submitted to BLAST.
-blast : BLAST result file as input, which can be analyzed directly
without waiting for BLAST again. GI numbers are
required in the result. The corresponding translated BLAST
result is recommended so that the program can get more
information. Either -fasta or -blast is
required.
-out : output HTML file. Required.
Optional parameters:
-e : upper limit of the e-value of hits and HSPs. The default is
1e-20, which makes the program analyze only almost identical
sequences. If you want more information from homologous
sequences, increase the e-value limit.
-l : limit of the hit sequence length, default is 100000.
-r : max references retrieved each time, default is 40.
-t : maximum number of retries if a network error occurs, default is 3.
-nodirect : do not retrieve direct references.
-noindirect : do not retrieve indirect references.
-nohyphen : do not allow symbols with hyphens. see "KNOWN BUG".
-noseqcache : do not read/write sequence from local disk cache.
-thesaurus : user-defined thesaurus file. These aliases will be used
regardless of organism. Each line lists all the aliases
of one gene, separated by spaces. Remark lines
start with '#'. You can use 'name:' and
'version:' in separate remark lines to specify the
thesaurus name and version. Here is a sample:
# my thesaurus
# name: user-defined
# version: 1.0
gene1 alias1 alias2
gene2 alias
-stopword : user-defined stop-word list file. This disables the
default stop-word list; alternatively, add the words to
the default list directly. Each line is one stop word.
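To make the thesaurus file format concrete, here is a small parsing sketch. It is not MedBlast's own reader: the function names are hypothetical, and case-sensitive matching of symbols is an assumption.

```python
def parse_thesaurus(lines):
    """Parse thesaurus lines into (metadata, alias_groups).

    Remark lines start with '#'; 'name:' and 'version:' remarks are
    collected as metadata. Every other non-empty line is one gene:
    all of its aliases separated by whitespace.
    """
    meta = {}
    alias_groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            remark = line[1:].strip()
            for key in ("name", "version"):
                if remark.startswith(key + ":"):
                    meta[key] = remark[len(key) + 1:].strip()
            continue
        alias_groups.append(line.split())
    return meta, alias_groups


def aliases_of(symbol, alias_groups):
    """Return all aliases of a symbol, or just the symbol if unknown."""
    for group in alias_groups:
        if symbol in group:
            return group
    return [symbol]


sample = [
    "# my thesaurus",
    "# name: user-defined",
    "# version: 1.0",
    "gene1 alias1 alias2",
    "gene2 alias",
]
meta, groups = parse_thesaurus(sample)
print(meta)
print(aliases_of("alias1", groups))
```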
MedBlast supports HTTP proxies. Set the proxy with an environment
variable. For example, in the Bash shell:
export HTTP_PROXY="http://my.home.org:8080"
Output:
MedBlast outputs an HTML file that includes direct references (table and
summary), indirect references (table and summary), errors, options
and BLAST results. In the tables, references are shown in bold, while
references that have already been output are shown in normal font. The
corresponding references are listed under each symbol, with these
markers:
(...) the symbol has already been output, so its references are
not repeated;
(x) something went wrong with the search;
(-) no reference was found for the symbol.
If a symbol has too many references, a MORE link follows the list.
Each reference in the summaries has a link to its PubMed record.
DO NOT OVERLOAD NCBI SERVERS PLEASE!
There are some default limits in the program to avoid overloading
NCBI servers:
In addition, users should comply with the usage limits of the NCBI
E-utilities. See:
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
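One common way to stay polite to a remote server is to pause between attempts and to cap the number of retries. The sketch below is a generic illustration, not MedBlast's implementation: `fetch_with_retry` is a hypothetical helper, the 3-second delay is an arbitrary placeholder (check the NCBI guidelines for the actual limits), and reading the -t limit as "retries after the first attempt" is an assumption.

```python
import time


def fetch_with_retry(fetch, max_retries=3, delay=3.0, sleep=time.sleep):
    """Call fetch() and retry on network errors.

    OSError is caught because urllib.error.URLError and socket errors
    are subclasses of it. After max_retries failed retries the last
    error is re-raised. 'sleep' is injectable so the pause can be
    replaced in tests.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except OSError:
            if attempt >= max_retries:
                raise
            attempt += 1
            sleep(delay)  # wait before contacting the server again


# Demonstration with a simulated request that fails twice, then works.
state = {"calls": 0}


def flaky_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise OSError("simulated network error")
    return "ok"


pauses = []
result = fetch_with_retry(flaky_request, sleep=pauses.append)
print(result, state["calls"], pauses)
```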
If a symbol contains a hyphen and is not in the Entrez index, PubMed
breaks the symbol apart and repeats the search automatically. This
behavior can return unrelated articles, and it cannot currently be
disabled. If this happens, try the '-nohyphen' option to ignore all
symbols with hyphens, or use a stop-word list to ignore the particular
symbols you do not want.
If you use BioPerl 1.2, please manually fix the bug in
Bio::Factory::FTLocationFactory (FTLocationFactory.pm) so that PDB
sequences can be read. Change the line:
} elsif(($op eq "join") || ($op eq "order")) {
to:
} elsif(($op eq "join") || ($op eq "order") || ($op eq "bond")) {
Qiang Tu (Chinese Academy of Sciences)
Haixu Tang (University of California, San Diego)
Dafu Ding (Chinese Academy of Sciences)
Email: qtu@sibs.ac.cn
Any comments and suggestions are welcome.
We will be very happy if it is useful to you. :)
November 2002.
Our online server is at http://medblast.sibsnet.org
Copyright 2002, Chinese Academy of Sciences. All Rights Reserved.
This program is free software for academic users. You may copy or
redistribute it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.