
Generating and Analyzing Distance Difference Matrices (DDMs)

Originally written by: Melanie Nelson (mnelson@scripps.edu)


This page describes how to run DISCOM to compare protein structures, how to format the output as text tables, and how to analyze DDMs using InsightII. For information about using Mathematica and GRASP to analyze DISCOM output, see Randy's DISCOM tips.

The instructions in this file are for DISCOM version 1.2.4. Earlier versions behave differently when comparing structures of different proteins. The latest version of DISCOM is available as part of Garry Gippert's suite of GAP programs for protein structure analysis, and can be downloaded from Garry Gippert's programs page. People at TSRI can also copy the executable directly from bianca. It is found at /usr/local/bin/discom. This version has been compiled for IRIX 6.2.

Generating the DDM
Preparing the PDB files
The .fam files
Running DISCOM
Analyzing the DDM
Generating a text table
Analyzing DDMs in Insight
Counting and Sorting DDs

DISCOM is a C program that generates distance difference matrices (DDMs) of two protein structures or two families of structures. A distance difference (DD) is defined for the same pair of atoms (i,j) in two structures by taking the distance (d) between them in one structure and subtracting the corresponding distance in the other structure. The DDs for two structures A and B are therefore defined as:
DD_ij = d_ij(A) - d_ij(B)
The DDM is a matrix containing values of these differences for all or any defined subset of atoms in the structures.
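
The definition above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not DISCOM's code: the function name distance_difference_matrix and the two three-atom coordinate lists are made up for the example, and the atoms are assumed to be listed in the same order in both structures.

```python
import math

def distance_difference_matrix(coords_a, coords_b):
    """Return DD[i][j] = d_ij(A) - d_ij(B) for two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must supply the same number of atoms")
    n = len(coords_a)
    ddm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d_a = math.dist(coords_a[i], coords_a[j])  # distance in structure A
            d_b = math.dist(coords_b[i], coords_b[j])  # same atom pair in structure B
            ddm[i][j] = d_a - d_b
    return ddm

# Hypothetical alpha-carbon positions (angstroms): A is extended, B is bent.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

ddm = distance_difference_matrix(structure_a, structure_b)
for row in ddm:
    print(" ".join("%6.2f" % value for value in row))
```

A positive entry means the two atoms are farther apart in structure A than in structure B. Note that no superposition of the structures is needed, because only internal distances are used.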

I recommend using an alpha-carbon DDM for most purposes. This DDM provides information about the relative disposition of the backbone in the two structures. DDMs can be generated to compare two structures of the same protein or two structures of homologous proteins.

I. Generating the DDM

A. Preparing the PDB files: The new version of DISCOM is not very particular about the PDB files used. DISCOM ignores the last four columns of the standard PDB format. There must be white space between the required columns, but the exact number of spaces is not important.

DISCOM works just as well on families of structures (such as NMR families) as on single structures. If one or both of the two structures you will be comparing are actually structure families, you must first break the family into individual PDB files. I have a script that does this. It assumes that the individual structures are separated by a line containing the word MODEL (as is the case for files downloaded from the PDB). The script is called splitpdb, and is located in ~mnelson/bin/pdb_tools. It is a shell script that calls gawk, and it requires either gawk or a new version of nawk. It generates a set of PDB files, all with the same name base. The first line of each output file is HEADER inputfile. The command line is:
splitpdb inputfile
As written, this script calls the new PDB files model_001.pdb, model_002.pdb, etc. Currently the only way to change this is to change the script. At the end of the script there is a function which opens each successive file. Change the line setting the filename. Put whatever filename base you want in the sprintf statement.
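
If gawk is not available, the same splitting can be sketched in Python. This is not the splitpdb script itself, only a rough equivalent based on the description above. The function name split_models and the name_base parameter are made up for the example, and the handling of ENDMDL lines (they close the current model and are not copied) is an assumption.

```python
import os
import tempfile

def split_models(input_path, output_dir, name_base="model"):
    """Write one PDB file per MODEL record; return the list of files written."""
    written = []
    out = None
    with open(input_path) as source:
        for line in source:
            if line.startswith("MODEL"):
                if out is not None:
                    out.close()
                name = "%s_%03d.pdb" % (name_base, len(written) + 1)
                path = os.path.join(output_dir, name)
                out = open(path, "w")
                # First line of each output file names the input file.
                out.write("HEADER %s\n" % os.path.basename(input_path))
                written.append(path)
            elif line.startswith("ENDMDL"):
                # Assumption: ENDMDL ends the model and is not copied.
                if out is not None:
                    out.close()
                    out = None
            elif out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written

# Tiny made-up two-model file, just to exercise the function.
demo_dir = tempfile.mkdtemp()
demo_input = os.path.join(demo_dir, "family.pdb")
with open(demo_input, "w") as handle:
    handle.write(
        "MODEL        1\n"
        "ATOM      1  CA  SER A   3       0.000   0.000   0.000\n"
        "ENDMDL\n"
        "MODEL        2\n"
        "ATOM      1  CA  SER A   3       1.000   0.000   0.000\n"
        "ENDMDL\n"
    )

model_files = split_models(demo_input, demo_dir)
print([os.path.basename(path) for path in model_files])
```

Passing a different name_base changes the output file names without editing the function.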

B. The .fam files: Since DISCOM works with NMR families, you will need two .fam files even if you are comparing two single structures. The .fam files are simply lists of the names of the pdb files to be compared. For example, if I want to compare a family of structures with four members named model_001.pdb, model_002.pdb, model_003.pdb, and model_004.pdb, with a single structure named xtal.pdb, I would have two .fam files. The .fam file for the family would contain the following four lines:
model_001.pdb
model_002.pdb
model_003.pdb
model_004.pdb
and the .fam file for the single structure would contain only the following line:
xtal.pdb
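
For larger families it may be easier to generate the .fam files than to type them. The sketch below writes the two files from the example above. The helper name write_fam is made up, the file names ex1.fam and ex2.fam are borrowed from the DISCOM command lines on this page, and a temporary directory stands in for your running directory.

```python
import os
import tempfile

def write_fam(fam_path, pdb_names):
    """Write one PDB file name per line, as in the .fam examples above."""
    with open(fam_path, "w") as handle:
        for name in pdb_names:
            handle.write(name + "\n")

work_dir = tempfile.mkdtemp()  # stand-in for your DISCOM running directory
family_fam = os.path.join(work_dir, "ex1.fam")
single_fam = os.path.join(work_dir, "ex2.fam")

# Four-member family: model_001.pdb through model_004.pdb.
write_fam(family_fam, ["model_%03d.pdb" % n for n in range(1, 5)])
# Single structure.
write_fam(single_fam, ["xtal.pdb"])

with open(family_fam) as handle:
    print(handle.read(), end="")
```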

C. Running DISCOM: If DISCOM is in your path or you have created a link from your running directory, you can run it as follows:
If you are comparing two structures of the same protein (and these two structures have the same numbering), type:
discom -sf1 ex1.fam -sf2 ex2.fam -sub CA > output
where ex1.fam is the .fam file for the first structure or family, and ex2.fam is the .fam file for the second structure or family. Of course, you can call these files anything you want, although they may need to end in the .fam extension.
If you are comparing two structures of different proteins (or if you are using two structures of the same protein with different numbering), type:
discom -sf1 ex1.fam -sr1 1-18,19-75 -sf2 ex2.fam -sr2 1-18,20-76 -sub CA > output
where the numbers after the -sr1 and -sr2 flags are the numbers of the residues in each family that will be compared. In this example, the second family has a one-residue insertion relative to the first family.
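
Before running DISCOM with residue ranges, it can help to check that the two range strings cover the same number of residues and to see which residues end up paired. The sketch below does that for the example above. It is not part of DISCOM: the function names are made up, it assumes residues are paired in the order listed, and it only understands the start-end form shown on this page.

```python
def expand_ranges(spec):
    """Turn a string such as '1-18,19-75' into a list of residue numbers."""
    residues = []
    for part in spec.split(","):
        start, end = part.split("-")
        first, last = int(start), int(end)
        if last < first:
            raise ValueError("range %r runs backwards" % part)
        residues.extend(range(first, last + 1))
    return residues

def pair_residues(spec1, spec2):
    """Pair residues from the two range strings in order (an assumption)."""
    first_family = expand_ranges(spec1)
    second_family = expand_ranges(spec2)
    if len(first_family) != len(second_family):
        raise ValueError(
            "ranges cover %d and %d residues"
            % (len(first_family), len(second_family))
        )
    return list(zip(first_family, second_family))

pairs = pair_residues("1-18,19-75", "1-18,20-76")
print(len(pairs), "residue pairs")
print("around the insertion:", pairs[17], pairs[18])
```

In this example, residue 19 of the second family is the inserted residue, so it is left out and residue 19 of the first family is paired with residue 20 of the second.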

The output file is just a file containing the parameters of the run. The actual DDM is contained in a file called dd.mtx. The program also creates a file called ad.mtx, which contains the absolute values of the entries in dd.mtx. It is possible to choose different names for these files using the -ddm and -adm flags. However, I usually just rename the files after the run.

As defined above, the DDM is a symmetrical matrix. Of course, you don't need the same distance differences twice, so actually the DDs are contained only in the half of the matrix above the diagonal. The half below the diagonal contains the rms distance difference. This is only meaningful if you are comparing two families of structures. In this case, a DD is only statistically significant if it is larger in magnitude than the rms distance difference. As currently implemented, there is no similar check for the statistical significance of DDs involving crystal structures. In this case, I currently consider only DDs greater in magnitude than 1 angstrom to be significant. Of course, this number could be changed if, for example, you know that you are using a relatively low resolution structure.
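
The two significance rules in the paragraph above can be sketched as follows. This works on a matrix already held in memory as a list of rows. It does not read dd.mtx, whose file layout is not described here. The function name significant_dds and the small example matrix are made up, and the indices are matrix positions rather than residue numbers.

```python
def significant_dds(matrix, fixed_cutoff=None):
    """Return (i, j, dd) for significant entries above the diagonal.

    matrix[i][j] with i < j holds the DD; matrix[j][i] holds the rms
    distance difference for the same pair. If fixed_cutoff is given
    (for example 1.0 angstrom for crystal structures), it is used
    instead of the rms value.
    """
    hits = []
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            dd = matrix[i][j]
            threshold = matrix[j][i] if fixed_cutoff is None else fixed_cutoff
            if abs(dd) > threshold:
                hits.append((i, j, dd))
    return hits

# Made-up 3 x 3 example: DDs above the diagonal, rms values below it.
example = [
    [0.0, 1.5, -0.4],
    [0.5, 0.0, 2.0],
    [0.6, 2.5, 0.0],
]

print("family rule (|DD| > rms):", significant_dds(example))
print("fixed 1 angstrom cutoff: ", significant_dds(example, fixed_cutoff=1.0))
```

In the example, the DD of 2.0 passes the fixed 1 angstrom cutoff but fails the family rule, because the rms distance difference for that pair is 2.5.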

II. Analyzing the DDM

The raw output is not in a format that is easy to analyze. I generally analyze the DDM in two ways: 1) using the matrix of numbers, reformatted for ease of analysis, and 2) using Insight to display the DDs as lines between two alpha-carbons in a structure.

A. Generating a Text Table: I have a rather clunky set of scripts that will reformat the raw matrix. There are a bunch of awk and sed scripts and a shell wrapper, all found in ~mnelson/bin/discom_stuff. For a DDM comparing two structures of the same protein you will need:

For a DDM comparing structures of two different proteins you will need:

All of these scripts must be in the same directory as the input file (the dd.mtx file). You choose a cutoff value for significant entries when you run the wrapper script. The command line is:
reformatter.same inputfile sig_value
or:
reformatter.diff inputfile sig_value pdbfile1 pdbfile2
pdbfile1 and pdbfile2 are PDB files containing the correct residue names and numbers for the proteins compared with DISCOM. The names from pdbfile1 end up as the row names in the table and the names from pdbfile2 end up as the column names. If one of your PDB files has residue numbers greater than 100, I suggest using this file as pdbfile1 (row names), because the output will look nicer this way. The inputfile should end in an extension identifying it as the matrix file; I use .mtx. As mentioned above, I usually use a sig_value of 1 angstrom. All entries smaller in magnitude than this cutoff will be replaced by *.** in the reformatted output. If you do not want this to happen, use a cutoff of 0. These scripts require gawk, and will not run correctly with awk. The output is a set of files, each one containing a page-width of the DDM. These files have the same root name as the inputfile, and end with "fnl0X", where X is a number identifying the order of the files. The files must be printed rotated on the page (i.e. in landscape orientation), and they should be printed using the Courier font. I use the enscript command to do this:
enscript -r files
Unfortunately, enscript is only found on some machines. I think that any pre-print filter that allows you to print in the landscape orientation should work, though.

The reformatter.same shell wrapper also creates a .stats file. This file contains some simple statistics on the DDM, such as the number of non-significant entries, and the number of entries between certain ranges of numbers.
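
The cutoff masking and the simplest of the statistics can be illustrated with a short Python sketch. This is not the reformatter scripts, only the masking rule they are described as applying. The function names and the example row are made up, and printing two decimal places is an assumption suggested by the *.** placeholder.

```python
def mask_row(values, sig_value):
    """Replace entries smaller in magnitude than sig_value with '*.**'."""
    return ["*.**" if abs(v) < sig_value else "%.2f" % v for v in values]

def count_nonsignificant(values, sig_value):
    """Count the entries that mask_row would replace."""
    return sum(1 for v in values if abs(v) < sig_value)

# Made-up row of distance differences (angstroms).
row = [0.25, -1.40, 1.00, -0.99]

masked = mask_row(row, 1.0)
print(" ".join("%6s" % cell for cell in masked))
print("non-significant entries:", count_nonsignificant(row, 1.0))
print("cutoff of 0 masks nothing:", mask_row(row, 0))
```

An entry exactly equal to the cutoff is kept, since only entries smaller in magnitude than the cutoff are replaced.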

B. Analyzing DDMs in Insight: Tools are also available for mapping distance differences onto a protein structure, using InsightII. DDs are displayed as lines between the two alpha carbons in one of the structures that was compared. The current implementation requires the NMR_Refine module of Insight.

Before the DDs can be displayed in Insight, they must be converted to the rstrnt format. Randy Ketchem wrote a perl script that does this. It is called ddmrstrnt, and can be found in ~mnelson/bin/discom_stuff. It takes the dd.mtx file and a .seq file as input. The .seq file format is:
3 SER 4 PRO
Spacing is not important. A .seq file can be generated from a PDB file using my alpha_only script.

Randy's perl script includes a minimum distance option. When this option is used, only DDs whose absolute values are greater than or equal to the user-supplied minimum distance are included in the output file. The command line is:
ddmrstrnt -r [minimum distance] -m dd.mtx -s sequence [> out.rstrnt]

Once the .rstrnt file is ready, follow these steps to display the DDs as restraints using the NMR_Refine module of Insight:

C. Counting and Sorting DDs: It is often useful to count the number of DDs above a given cutoff that involve residues in certain regions. For instance, in characterizing the calcium-induced conformational changes in calmodulin, it was useful to count the DDs in each interhelical interface. The results of this analysis are available online. I wrote a perl script called ddcounter to count the number of DDs above a user-specified cutoff involving one residue from region A and one residue from region B. It also reports the most positive and most negative DDs involving residues from these two regions. The ddcounter script is available in ~mnelson/bin/discom_stuff. The input file for ddcounter is the MSI rstrnt file generated by ddmrstrnt. The command line for ddcounter is:
ddcounter -c cutoff -f dd.rstrnt -a start-end -b start-end [>outfile]
All command line flags are required. All DDs involving one residue in range a and one residue in range b with an absolute value greater than cutoff will be counted. The cutoff selected will have no effect on the most positive and most negative DDs reported.
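
The counting rule can be sketched in Python as follows. This is not the ddcounter script and it does not read the MSI rstrnt format. Instead it assumes the DDs have already been loaded as (residue_i, residue_j, dd) tuples. The function name count_dds, the residue ranges, and the example values are all made up.

```python
def count_dds(dds, cutoff, range_a, range_b):
    """Count DDs between two residue ranges with |DD| > cutoff.

    dds is an iterable of (residue_i, residue_j, dd) tuples; range_a
    and range_b are inclusive (start, end) residue-number pairs.
    Returns (count, most_positive, most_negative). The two extremes
    are taken over every DD linking the ranges, whatever the cutoff.
    """
    a_lo, a_hi = range_a
    b_lo, b_hi = range_b
    count = 0
    linking = []
    for res_i, res_j, dd in dds:
        i_in_a_j_in_b = a_lo <= res_i <= a_hi and b_lo <= res_j <= b_hi
        j_in_a_i_in_b = a_lo <= res_j <= a_hi and b_lo <= res_i <= b_hi
        if i_in_a_j_in_b or j_in_a_i_in_b:
            linking.append(dd)
            if abs(dd) > cutoff:
                count += 1
    most_positive = max(linking) if linking else None
    most_negative = min(linking) if linking else None
    return count, most_positive, most_negative

# Made-up DDs between hypothetical residue pairs (angstroms).
dds = [(5, 30, 2.4), (6, 31, -3.1), (7, 29, 0.3), (5, 60, 4.0), (30, 8, -0.8)]

count, most_positive, most_negative = count_dds(dds, 1.0, (1, 10), (25, 35))
print("DDs above cutoff:", count)
print("most positive:", most_positive, "most negative:", most_negative)
```

The pair (5, 60) is ignored because residue 60 lies outside both ranges, and the pair (30, 8) is included even though its residues are listed in the opposite order.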

A perl script to sort DDs by size is also available. It is called ddmsorter, and is also available in ~mnelson/bin/discom_stuff. It also takes an MSI rstrnt file as input.
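
For DDs held in memory as (residue_i, residue_j, dd) tuples, sorting takes one line of Python. The sketch below is not ddmsorter. It assumes that "size" means magnitude, largest first, and the function name and example values are made up.

```python
def sort_by_magnitude(dds):
    """Return the DDs ordered from largest to smallest absolute value."""
    return sorted(dds, key=lambda entry: abs(entry[2]), reverse=True)

# Made-up DDs between hypothetical residue pairs (angstroms).
dds = [(5, 30, 2.4), (6, 31, -3.1), (7, 29, 0.3)]

ranked = sort_by_magnitude(dds)
for res_i, res_j, dd in ranked:
    print("%4d %4d %6.2f" % (res_i, res_j, dd))
```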


Last updated: 7/9/98, by Melanie Nelson