Using Sequery

A tool to find sequence pattern matches in the Protein Data Bank (PDB) database of protein structures.


  1. Introduction
  2. Getting and Installing Sequery
  3. Library Files Included with Sequery
  4. Running Sequery
  5. Output Files Generated by Sequery
  6. Bugs and Error Messages
  7. Other Scripts and Tools Provided with Sequery
  8. Algorithmic Details of Sequery
  9. Contact Information

Introduction

Sequery is a tool to search the sequences of the protein structures in the Protein Data Bank (PDB) for a particular pattern of residues, which may include exact matches and acceptable substitutions based on a user-specified amino acid substitution matrix and/or a numerical threshold. Sequery was developed by Michael E. Pique, Michael A. Siani, Leslie A. Kuhn, Elizabeth D. Getzoff, and John A. Tainer, Department of Molecular Biology, The Scripps Research Institute and is distributed here with permission of the authors. Development of Sequery was supported in part by NSF grant BIR9631436.

A useful complement to Sequery is SSA (Superpositional Structure Assignment), which automates the assignment of secondary structure to tetrapeptides identified by a Sequery search. For more information and software availability, see SSA Information and Documentation.

For literature references related to Sequery, please see the section on Algorithmic Details.



Getting and Installing Sequery

Information on obtaining and installing Sequery can be found in Installing Sequery.



Library Files Included with Sequery

Several library files are included in the sequery/lib directory for use with Sequery. The use of these files is explained later in Running Sequery.



Running Sequery

Sequery searches for matching sequences (strings) in a precomputed ASCII text file that contains the sequences for each of the PDB files (or for a specific subset of PDB files). Sequery reads its input from stdin, either by redirection of a file containing the sequence patterns to match (see the end of this section for an example) or by prompting the user for each pattern. It is common to run Sequery with tetrapeptide patterns, since sequences of this length or shorter often have close matches in the PDB, but patterns of any length are possible. Execution is stopped with Ctrl-D. Command line options are as follows:
-s SequenceFile The SequenceFile contains the listing of PDB codes and corresponding amino acid sequences that Sequery searches for sequence pattern matches. If omitted, Sequery defaults to searching sequery/lib/pdbseq.asc. This list can be generated using either genpdbseq or genpdbselectseq. genpdbselectseq is the suggested method: it includes only sequences with less than 25% identity, which eliminates the statistical bias introduced by including structurally related proteins. For more information, see Other Scripts and Tools Provided with Sequery below. Note that any sequence database formatted according to the example below may be used.

Example portion of a sequence file (excerpt from pdbselectAug98.all.asc):
1hmc B    4    143 SEYCSHMIGSGHLQSLQRLIDSQMETSCQITFEFVDQEQLKDPVCYLKKA
                   FLLVQDIMEDTMRFRDNTPNAIAIVQLQELSLRLKSCFTKDYEEHDKACV
                   RTFYETPLQLLEKVKNVFNETKNLLDKDWNIFSKNCNNSFAEC
1myp A    1    104 CPEQDKYRTITGMCNNRRSPTLGASNRAFVRWLPAEYEDGFSLPYGWTPG
                   VKRNGFPVALARAVSNEIVRFPTDQLTPDQERSLMFMQWGQLLDHDLDFT
1grx _    1     84 MQTVIFGRSGC(13)YSVRAKDLAEKLSNERDDFQYQYVDIRAEGITKED
                   LQQKAGKPVETVPQIFVDQQHIGGYTDFAAWVKENLDA
                   PEPA
	  
The fields are the PDB code, the chain identifier ("_" indicates that the structure has no chain IDs), the first residue number of the chain, the last residue number of the chain, and the sequence of the chain. In some structures, such as 1grx, the sequence field also contains a residue number in parentheses. This marks a break in the sequential numbering, e.g. due to the lack of electron density for a mobile loop in the protein, and is used to maintain correct sequence numbering in Sequery output.
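
The record layout described above can be read with a few lines of Python. This is only a sketch, not part of Sequery: the function names are hypothetical, and it assumes that fields are separated by whitespace, that continuation lines of a sequence begin with whitespace, and that a parenthesized number such as (13) gives the residue number of the residue that follows it (an interpretation of the 1grx example, not a documented rule).

```python
import re

def parse_sequence_records(text):
    """Split sequence-file text into one dictionary per chain."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank or whitespace-only lines
        if line[0].isspace():
            # Continuation line: more sequence for the previous record.
            if not records:
                raise ValueError("continuation line before any record")
            records[-1]["sequence"] += line.strip()
        else:
            # Header line: code, chain, first residue, last residue, sequence.
            code, chain, first, last, sequence = line.split()
            records.append({"code": code, "chain": chain,
                            "first": int(first), "last": int(last),
                            "sequence": sequence})
    return records

def number_residues(record):
    """Return (residue_number, residue) pairs for one parsed record.

    Assumption: "(N)" resets the running residue number to N.
    """
    numbered = []
    next_number = record["first"]
    for jump, residue in re.findall(r"\((-?\d+)\)|([A-Za-z])",
                                    record["sequence"]):
        if jump:
            next_number = int(jump)
        else:
            numbered.append((next_number, residue))
            next_number += 1
    return numbered
```

Calling parse_sequence_records on the contents of a sequence file and then number_residues on each record yields the residue numbering that a match position would refer to under these assumptions.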

-d DefinitionFile The DefinitionFile is a file containing acceptable amino acid substitutions. If omitted, Sequery defaults to using sequery/lib/sequery.defs. In the supplied file, each line corresponds to a set (equivalence class) of substitutable amino acids; these sets were determined from the Dayhoff mutation data matrix, although any set of substitutions can be provided in this format. When entering the sequence pattern for a Sequery search, an upper-case character indicates a search for an exact match, while a lower-case character indicates that all equivalent residues from this file may be considered as substitutes (e.g. "A" matches alanine only, and "a" matches all residues equivalent to alanine). Further details can be found below in Sequence Query Patterns.
-w WildcardFile The WildcardFile contains a listing of user-defined acceptable amino acid substitutions (again, with each line containing a set of amino acids that can substitute for each other), e.g. from acceptable variation observed in a sequence alignment or from mutagenesis studies. When entering sequence patterns during Sequery execution, the user enters the line number within the WildcardFile corresponding to the acceptable amino acids at that position. For example, based on the example WildcardFile (sequery/lib/wilddef.dat), entering 2AAA would find all matches starting with tyrosine, phenylalanine, or tryptophan (line 2 in the file), followed by three alanines.
-o OutputFile The OutputFile is the file where the user would like output to be placed. If omitted, output will be written to sequery.match, overwriting any previously existing sequery.match.
Other command line options include the following:
-x NumberOfContextResidues This is the number of residues printed (in lower-case) on either side of the matched sequence pattern (in upper-case). Default is 4.
-v verbose mode -- More output (mostly for debugging purposes)
-q quiet mode -- Suppresses all output to the screen except for error messages. Pattern matches will still be output to the output file.
-? or -h Gives version and help info

Sequery can be run in batch mode via the following (where search.patterns contains one line for each pattern to search):
      sequery -s lib/pdbseq.asc -d lib/sequery.defs \
         -w wilddef.dat -q -o search.matches < search.patterns
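
When many batch runs are needed, the command line above can be assembled from a script. The following Python sketch uses only the options documented in this section (-s, -d, -w, -x, -q, -o). The function names are hypothetical, the executable is assumed to be reachable as "sequery", and the meaning of Sequery's exit status is not documented here, so the return code is passed back without interpretation.

```python
import subprocess

def build_sequery_command(sequence_file=None, definition_file=None,
                          wildcard_file=None, output_file=None,
                          context_residues=None, quiet=False,
                          executable="sequery"):
    """Assemble an argument list from the documented options.

    Options left as None are omitted, so Sequery falls back to its defaults.
    """
    command = [executable]
    for flag, value in (("-s", sequence_file), ("-d", definition_file),
                        ("-w", wildcard_file), ("-x", context_residues)):
        if value is not None:
            command += [flag, str(value)]
    if quiet:
        command.append("-q")
    if output_file is not None:
        command += ["-o", str(output_file)]
    return command

def run_batch(patterns_path, **options):
    """Run Sequery with the pattern file redirected to stdin."""
    with open(patterns_path) as patterns:
        completed = subprocess.run(build_sequery_command(**options),
                                   stdin=patterns)
    return completed.returncode
```

For example, run_batch("search.patterns", sequence_file="lib/pdbseq.asc", quiet=True, output_file="search.matches") would start a quiet batch search that uses the default definition file.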



Sequence Query Patterns

Note: The following assumes the user is running with the default definition and wildcard files.

Sequence query patterns follow the regular expression rules documented in the UNIX "ed" man page. Following is a brief overview of pattern rules followed by several examples:
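
To make the conventions from Running Sequery concrete (upper-case for an exact residue, lower-case for an equivalence class from the DefinitionFile, a digit for a line of the WildcardFile), here is a minimal Python sketch that expands a pattern into an ordinary regular expression. It is an illustration only and not Sequery's implementation. The equivalence classes and wildcard lines are invented placeholders, except that line 2 of the wildcard list mirrors the tyrosine/phenylalanine/tryptophan example given earlier; the function name is hypothetical, and only single-digit line numbers are handled.

```python
import re

# Hypothetical equivalence classes; the real ones are in sequery/lib/sequery.defs.
EQUIVALENCE_CLASSES = ["AG", "DE", "KR", "ILMV", "FWY"]

# Hypothetical wildcard lines (line 1 first); only line 2 follows the text.
WILDCARD_LINES = ["DE", "YFW", "KR", "ST"]

def pattern_to_regex(pattern, classes=EQUIVALENCE_CLASSES,
                     wildcards=WILDCARD_LINES):
    """Expand a Sequery-style pattern into a regular expression string."""
    parts = []
    for ch in pattern:
        if ch in "0123456789":
            # Digit: any residue listed on that line of the wildcard file.
            line_number = int(ch)
            if not 1 <= line_number <= len(wildcards):
                raise ValueError("no wildcard line %d" % line_number)
            parts.append("[" + wildcards[line_number - 1] + "]")
        elif ch.islower():
            # Lower-case: any residue in the same equivalence class.
            upper = ch.upper()
            members = next((c for c in classes if upper in c), upper)
            parts.append("[" + members + "]")
        else:
            # Upper-case residues and regular expression syntax pass through.
            parts.append(ch)
    return "".join(parts)

hit = re.search(pattern_to_regex("2AAA"), "MKFAAAG")
```

With these placeholder tables, 2AAA expands to [YFW]AAA, and the search above finds FAAA starting at the third residue of the test sequence.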



Example Files

There are example input, output, and wildcard files included in the sequery/examples directory. The example output was generated with the following command (run in the examples directory):
../bin/sequery -d ../lib/sequery.defs -s ../lib/pdbseq.asc \
     -w example.wilddef.dat -o example.matches < example.patterns



Output Files Generated by Sequery

Sequery will output the results of the pattern queries to the supplied output file (or to sequery.match if no output filename is given). Output will appear as follows:

1cem _  153 to  156 -> aatdADEDiala matching ADED
1occ A   93 to   96 -> apdmAFPRmnnm matching 1234
Explanation:
Line 1
  PDB Code: 1cem
  Chain ID: _ (the underscore indicates there was no chain ID specified in the PDB file)
  Matching Residues: 153-156
  Matching & Flanking Sequence: aatdADEDiala (upper-case is matched sequence, lower-case is flanking sequence)
  Query Pattern: ADED
Line 2
  PDB Code: 1occ
  Chain ID: A
  Matching Residues: 93-96
  Matching & Flanking Sequence: apdmAFPRmnnm (upper-case is matched sequence, lower-case is flanking sequence)
  Query Pattern: 1234 (one character each from lines 1, 2, 3, and 4 of the wildcard file)
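
For downstream processing, output lines in the format shown above can be split into their fields with a regular expression. The sketch below is based only on the two example lines in this section; the function and field names are hypothetical, and it assumes every match line follows the same "code chain start to end -> context matching pattern" layout.

```python
import re

MATCH_LINE = re.compile(
    r"^(?P<code>\S+)\s+(?P<chain>\S)\s+(?P<start>-?\d+)\s+to\s+(?P<end>-?\d+)"
    r"\s+->\s+(?P<context>\S+)\s+matching\s+(?P<pattern>.+)$"
)

def parse_match_line(line):
    """Return the fields of one Sequery output line as a dictionary."""
    found = MATCH_LINE.match(line.strip())
    if found is None:
        raise ValueError("unrecognized match line: %r" % line)
    context = found.group("context")
    return {
        "code": found.group("code"),
        "chain": found.group("chain"),
        "start": int(found.group("start")),
        "end": int(found.group("end")),
        "context": context,
        # The matched residues are the upper-case part of the context string.
        "matched": "".join(ch for ch in context if ch.isupper()),
        "pattern": found.group("pattern"),
    }
```

Applied to the second example line, this gives chain A, residues 93 to 96, matched sequence AFPR, and query pattern 1234.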



Bugs and Error Messages

Most errors result from improper query patterns; such queries are simply not run. The current version of Sequery behaves unpredictably for proteins that contain residues with negative residue numbers, and it will occasionally produce a segmentation fault if a sequence pattern would result in a very large number of matches. (In this case, break the query into two or more subqueries and combine the results.)
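
For the workaround above, combining results can be as simple as concatenating the output files of the subqueries. A minimal Python sketch, assuming each subquery was written to its own file with -o and that lines repeated verbatim should be kept only once (the function name and file names are hypothetical):

```python
def combine_match_files(paths, combined_path):
    """Concatenate match files, dropping blank and exactly repeated lines.

    Returns the number of lines written to combined_path.
    """
    seen = set()
    kept = 0
    with open(combined_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")
                        kept += 1
    return kept
```

Because each output line ends with the query pattern that produced it, the same residues found by two different subqueries appear as two distinct lines and are both kept.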



Other Scripts and Tools Provided with Sequery

Included in the sequery/share directory are a few additional tools that may be useful and are described below. These scripts must be modified upon installation to point to the correct directories. See Installing Sequery for more information.



Algorithmic Details of Sequery

Sequery was developed as a successor to Searchwild, which is described in:

Collawn JF, Kuhn LA, Liu LF, Tainer JA, Trowbridge IS
Transplanted LDL and mannose-6-phosphate receptor internalization signals promote high-efficiency endocytosis of the transferrin receptor
EMBO J., 10(11) (Nov): 3247-3253 (1991)

Collawn JF, Stangel M, Kuhn LA, Esekogwu V, Jing SQ, Trowbridge IS, Tainer JA
Transferrin receptor internalization sequence YXRF implicates a tight turn as the structural recognition motif for endocytosis
Cell, 63(5) (Nov 30): 1061-1072 (1990)

Other references related to the use of Sequery include the following:

Craig L, Sanschagrin PC, Rozek A, Lackie S, Kuhn LA, Scott JK
The Role of Structure in Antibody Cross-Reactivity Between Peptides and Folded Proteins
J. Mol. Biol., 281(1) (Aug 7): 183-201 (1998)

Chang CP, Lazar CS, Walsh BJ, Komuro M, Collawn JF, Kuhn LA, Tainer JA, Trowbridge IS, Farquhar MG, Rosenfeld MG
Ligand-induced internalization of the epidermal growth factor receptor is mediated by multiple endocytic codes analogous to the tyrosine motif found in constitutively internalized receptors
J. Biol. Chem., 268(26) (Sep 15): 19312-19320 (1993)



Contact Information

Questions should be directed to Dr. Leslie Kuhn at:

kuhn@agua.bch.msu.edu

or to Michael Pique at:

mp@scripps.edu