A tool to find sequence pattern matches in the Protein Data Bank (PDB) database of protein structures.
Sequery is a tool to search the sequences of the protein structures in the Protein Data Bank (PDB) for a particular pattern of residues, which may include exact matches and acceptable substitutions based on a user-specified amino acid substitution matrix and/or a numerical threshold. Sequery was developed by Michael E. Pique, Michael A. Siani, Leslie A. Kuhn, Elizabeth D. Getzoff, and John A. Tainer, Department of Molecular Biology, The Scripps Research Institute and is distributed here with permission of the authors. Development of Sequery was supported in part by NSF grant BIR9631436.
A useful complement to Sequery is SSA (Superpositional Structure Assignment), which automates the assignment of secondary structure to tetrapeptides identified by a Sequery search. For more information and software availability, see SSA Information and Documentation.
For literature references related to Sequery, please see the section on Algorithmic Details.
Information on obtaining and installing Sequery can be found in Installing Sequery.
Sequery searches for matching sequences (strings) in a precomputed ASCII text file that contains the sequences of all PDB files (or of a specific subset of PDB files). Input is read from stdin, either by redirecting a file containing the sequence patterns to match (see the end of this section for an example) or by prompting the user for each pattern. Sequery is commonly run with tetrapeptide patterns, since sequences of this length or shorter often have close matches in the PDB, but patterns of any length are allowed. Execution is stopped with ctrl-D. Command line options are as follows:
-s SequenceFile | The SequenceFile contains the listing of PDB codes and corresponding amino acid sequences. If omitted, Sequery defaults to searching sequery/lib/pdbseq.asc. Sequery uses this list to search for sequence pattern matches. This list can be generated using either genpdbseq or genpdbselectseq. Using genpdbselectseq is the suggested method, including only sequences with less than 25% identity, as this eliminates statistical bias introduced by including structurally related proteins. For more information, see the Additional Scripts section below. Note that any sequence database formatted according to the example below may be used. |
Example portion of a sequence file (excerpt from pdbselectAug98.all.asc):
1hmc B 4 143 SEYCSHMIGSGHLQSLQRLIDSQMETSCQITFEFVDQEQLKDPVCYLKKA
FLLVQDIMEDTMRFRDNTPNAIAIVQLQELSLRLKSCFTKDYEEHDKACV
RTFYETPLQLLEKVKNVFNETKNLLDKDWNIFSKNCNNSFAEC
1myp A 1 104 CPEQDKYRTITGMCNNRRSPTLGASNRAFVRWLPAEYEDGFSLPYGWTPG
VKRNGFPVALARAVSNEIVRFPTDQLTPDQERSLMFMQWGQLLDHDLDFT
1grx _ 1 84 MQTVIFGRSGC(13)YSVRAKDLAEKLSNERDDFQYQYVDIRAEGITKED
LQQKAGKPVETVPQIFVDQQHIGGYTDFAAWVKENLDA
PEPA
The fields are the PDB code, the chain identifier (``_'' indicates that the structure has no chain IDs), the first residue number of the chain, the last residue number of the chain, and the sequence of the chain. In some structures, such as 1grx, the sequence field also contains residue numbers in parentheses. Each of these marks a break in sequential numbering, e.g. where a mobile loop in the protein lacks electron density, and is used to maintain correct residue numbering in Sequery output.
-d DefinitionFile | The DefinitionFile contains acceptable amino acid substitutions, with each line listing a set (equivalence class) of substitutable amino acids. If omitted, Sequery defaults to sequery/lib/sequery.defs, which was derived from the Dayhoff mutation data matrix, although any set of substitutions can be provided in this format. In a sequence pattern, an upper-case character requests an exact match, while a lower-case character allows any equivalent residue from this file as a substitute (e.g. ``A'' matches alanine only and ``a'' matches all residues equivalent to alanine). Further details can be found below in Sequence Query Patterns. |
-w WildcardFile | The WildcardFile contains user-defined acceptable amino acid substitutions (again, with each line containing a set of amino acids that can substitute for each other), e.g. from the variation observed in a sequence alignment or from mutagenesis studies. In a sequence pattern, the user enters the line number within the WildcardFile that lists the acceptable amino acids at that position. For example, with the example WildcardFile (sequery/lib/wilddef.dat), entering 2AAA finds all sequences starting with tyrosine, phenylalanine, or tryptophan (line 2 in the file), followed by three alanines. |
-o OutputFile | The OutputFile is the file where the user would like output to be placed. If omitted, output will be written to sequery.match, overwriting any previously existing sequery.match. |
-x NumberOfContextResidues | This is the number of residues printed (in lower-case) on either side of the matched sequence pattern (in upper-case). Default is 4. |
-v | verbose mode -- More output (mostly for debugging purposes) |
-q | quiet mode -- Suppresses all output to the screen except for error messages. Pattern matches will still be output to the output file. |
-? or -h | Gives version and help info |
Example:
sequery -s lib/pdbseq.asc -d lib/sequery.defs \
    -w wilddef.dat -q -o search.matches < search.patterns
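The sequence-file record layout described under the -s option (PDB code, chain ID, first and last residue numbers, sequence) can be sketched in Python as follows. This is an illustration only, not part of Sequery: the function names are hypothetical, and the reading of a parenthesized number such as (13) as "the next residue is numbered 13" is an assumption based on the 1grx example.

```python
import re

def parse_record(text):
    """Split one whitespace-separated record into its five fields."""
    code, chain, first, last, *chunks = text.split()
    return code, chain, int(first), int(last), "".join(chunks)

def number_residues(first_resnum, sequence):
    """Pair each residue letter with its residue number.

    Assumption: a parenthesized integer resets the counter so that the
    residue immediately after it carries that number.
    """
    numbered = []
    resnum = first_resnum
    # Each token is either a parenthesized number or a single residue letter.
    for jump, residue in re.findall(r"\((-?\d+)\)|([A-Za-z])", sequence):
        if jump:
            resnum = int(jump)
        else:
            numbered.append((resnum, residue))
            resnum += 1
    return numbered

record = "1grx _ 1 84 MQTVIFGRSGC(13)YSVRAKDLAEKLSNERDDFQYQYVDIRAEGITKED"
code, chain, first, last, seq = parse_record(record)
residues = number_residues(first, seq)
print(residues[10], residues[11])  # (11, 'C') (13, 'Y')
```

Under this reading, residue 12 of 1grx is simply absent from the numbered list, so a match spanning the gap would still report the numbers used in the PDB file.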
Note: The following assumes the user is running with the default definition and wildcard files.
Sequence query patterns follow the regular expression rules, as documented in the UNIX ``ed'' man page. Following is a brief overview of pattern rules followed by several examples:
../bin/sequery -d ../lib/sequery.defs -s ../lib/pdbseq.asc \
    -w example.wilddef.dat -o example.matches < example.patterns
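To make the three kinds of pattern characters concrete (upper-case for an exact match, lower-case for an equivalence class from the DefinitionFile, a digit for a line of the WildcardFile), here is a minimal Python sketch that expands such a pattern into an ordinary regular expression. It is not Sequery's implementation. The equivalence classes below are hypothetical and are not the contents of sequery/lib/sequery.defs; only wildcard line 2 (Y, F, W) follows the example given for the -w option.

```python
import re

# Hypothetical equivalence classes, one per DefinitionFile line.
DEFINITION_LINES = ["AG", "DE", "ILMV"]

# Wildcard sets, one per WildcardFile line. Line 1 is made up for
# illustration; line 2 follows the 2AAA example in the text.
WILDCARD_LINES = ["KR", "YFW"]

def pattern_to_regex(pattern, definitions, wildcards):
    """Expand a Sequery-style pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch in "123456789":
            # A digit selects the wildcard set on that (1-based) line.
            parts.append("[" + wildcards[int(ch) - 1] + "]")
        elif ch.islower():
            # A lower-case letter stands for its whole equivalence class;
            # a residue in no class falls back to an exact match.
            upper = ch.upper()
            members = next((d for d in definitions if upper in d), upper)
            parts.append("[" + members + "]")
        else:
            # Upper-case letters and regex syntax pass through unchanged.
            parts.append(ch)
    return re.compile("".join(parts))

regex = pattern_to_regex("2AAA", DEFINITION_LINES, WILDCARD_LINES)
print(regex.pattern)                   # [YFW]AAA
print(bool(regex.search("MKFAAAGT")))  # True
print(bool(regex.search("MKLAAAGT")))  # False
```

The sketch treats every digit as a single line number, which matches the 1234 example in the output description, so a WildcardFile with more than nine lines is outside its scope.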
Sequery will output the results of the pattern queries to the supplied output file (or to sequery.match if no output filename is given). Output will appear as follows:
1cem _ 153 to 156 -> aatdADEDiala matching ADED
1occ A 93 to 96 -> apdmAFPRmnnm matching 1234
Explanation:
Line 1 | PDB Code: 1cem; Chain ID: _ (the underscore indicates that no chain ID was specified in the PDB file); Matching Residues: 153-156; Matching & Flanking Sequence: aatdADEDiala (upper-case is matched sequence, lower-case is flanking sequence); Query Pattern: ADED |
Line 2 | PDB Code: 1occ; Chain ID: A; Matching Residues: 93-96; Matching & Flanking Sequence: apdmAFPRmnnm (upper-case is matched sequence, lower-case is flanking sequence); Query Pattern: 1234 (one character each from lines 1, 2, 3, and 4 of the wildcard file) |
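For downstream processing, the output lines shown above can be split into their fields with a short script. The following Python sketch assumes every match line has exactly the layout of the two sample lines; the `Match` type and `parse_match` function are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple

class Match(NamedTuple):
    pdb_code: str
    chain: str      # "_" when the PDB file has no chain ID
    start: int
    end: int
    context: str    # matched residues plus flanking residues
    pattern: str

    @property
    def matched(self):
        # The matched residues are the upper-case part of the context.
        return "".join(c for c in self.context if c.isupper())

LINE_RE = re.compile(
    r"^(\S+)\s+(\S+)\s+(-?\d+)\s+to\s+(-?\d+)\s+->\s+(\S+)\s+matching\s+(\S+)$"
)

def parse_match(line):
    """Return a Match for one output line, or None if it does not fit."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    code, chain, start, end, context, pattern = m.groups()
    return Match(code, chain, int(start), int(end), context, pattern)

hit = parse_match("1occ A 93 to 96 -> apdmAFPRmnnm matching 1234")
print(hit.pdb_code, hit.chain, hit.start, hit.end, hit.matched)
# prints: 1occ A 93 96 AFPR
```

Lines that do not fit the layout yield `None`, so any other text in an output file is skipped rather than misparsed.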
Most errors result from improper query patterns; such queries are simply not run. The current version of Sequery behaves unpredictably for proteins that contain negative residue numbers, and it occasionally produces segmentation faults when a sequence pattern yields a very large number of matches. (In this case, break the query into two or more subqueries and combine the results.)
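Combining the results of several subqueries, as suggested above, could be done along these lines. This Python sketch merges match lines from separate runs and keeps one copy of each hit. Identifying a hit by PDB code, chain ID, and first and last residue number is an assumption, and `combine_matches` is a hypothetical helper, not a script shipped with Sequery.

```python
def combine_matches(*line_groups):
    """Merge match lines from several runs, keeping the first copy of each hit.

    Assumption: a hit is identified by fields 1, 2, 3 and 5 of an output
    line (PDB code, chain ID, first and last matching residue number).
    """
    seen = set()
    combined = []
    for lines in line_groups:
        for line in lines:
            fields = line.split()
            if len(fields) < 5 or fields[3] != "to":
                continue  # skip anything that is not a match line
            key = (fields[0], fields[1], fields[2], fields[4])
            if key not in seen:
                seen.add(key)
                combined.append(line.rstrip("\n"))
    return combined

run_a = ["1cem _ 153 to 156 -> aatdADEDiala matching ADED"]
run_b = [
    "1cem _ 153 to 156 -> aatdADEDiala matching ADED",
    "1occ A 93 to 96 -> apdmAFPRmnnm matching 1234",
]
merged = combine_matches(run_a, run_b)
print(len(merged))  # 2
```

Because each argument only needs to be an iterable of lines, open file objects for the individual output files can be passed in directly.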
Syntax:
addname SequeryOutputFile [columns]
SequeryOutputFile: file generated by sequery
columns: total number of columns in each line of output (default 80). If not specified, output is truncated to 80 characters per line.
Syntax:
genpdbselectseq pdb-select-list-file output-sequence-file
Explanation of use: The PDB Select list is a list of all proteins/chains in the PDB whose sequence identity is less than a certain percentage. (There are different lists for different identity threshold levels.) Each chain in the list represents a set of related chains. By using the lowest-identity threshold (25%), any structural bias in Sequery analysis is minimized. (This bias arises from the fact that if a sequence query identifies sequences in a series of related proteins whose structures are known, any subsequent structural analysis will contain more bias towards these related structures.)
Syntax:
genpdbseq PDBFile > SequenceFile
Examples:
genpdbseq 2sod.pdb > 2sod.ascseq
genpdbseq *.pdb > pdb.ascseq
Syntax:
minipdbextract SequeryOutputFile [x]
SequeryOutputFile: file generated by sequery
x (optional): number of flanking residues to include (default is 0)
Sequery was developed as a successor to Searchwild, which is described in:
Collawn JF, Kuhn LA, Liu LF, Tainer JA, Trowbridge IS. Transplanted LDL and mannose-6-phosphate receptor internalization signals promote high-efficiency endocytosis of the transferrin receptor. EMBO J., 10(11) (Nov): 3247-3253 (1991)
Collawn JF, Stangel M, Kuhn LA, Esekogwu V, Jing SQ, Trowbridge IS, Tainer JA. Transferrin receptor internalization sequence YXRF implicates a tight turn as the structural recognition motif for endocytosis. Cell, 63(5) (Nov 30): 1061-1072 (1990)
Other references related to the use of Sequery include the following:
Craig L, Sanschagrin PC, Rozek A, Lackie S, Kuhn LA, Scott JK. The Role of Structure in Antibody Cross-Reactivity Between Peptides and Folded Proteins. J. Mol. Biol., 281(1) (Aug 7): 183-201 (1998)
Chang CP, Lazar CS, Walsh BJ, Komuro M, Collawn JF, Kuhn LA, Tainer JA, Trowbridge IS, Farquhar MG, Rosenfeld MG. Ligand-induced internalization of the epidermal growth factor receptor is mediated by multiple endocytic codes analogous to the tyrosine motif found in constitutively internalized receptors. J. Biol. Chem., 268(26) (Sep 15): 19312-19320 (1993)
Questions should be directed to Dr. Leslie Kuhn at:
kuhn@agua.bch.msu.edu
or to Michael Pique at:
mp@scripps.edu