# $Revision: 11270 $
# $Author: saulius $
# $Date: 2024-05-07 12:01:40 +0000 (Tue, 07 May 2024) $

Computation of dihedral (torsion) angles
========================================

Saulius Gražulis
Vilnius, 2024

CONVENTIONS
===========

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].

PROGRAM
=======

Write a Perl*) program that calculates torsion angles for DNA and RNA
macromolecules. Compute the following angles: alpha, beta.

*) Note: Ada programs are acceptable; in that case all native Perl
features mentioned in this specification SHOULD be replaced by the
corresponding Ada features.

NB: for this particular assignment, you MUST NOT use external libraries
to read PDB files or calculate dihedral angles; these functions MUST
be implemented in the program.

Program name pattern:

    pdbx-MOL-torsion-XY

where the characters 'MOL' MUST be replaced by the name of the
processed molecule type ('pep' – peptides, 'na' – nucleic acids,
'pna' – proteo-nucleic acids, 'sac' – saccharides (sugars)), and the
letters 'X' and 'Y' MUST be replaced by Latin lowercase letters
designating the two processed angles (α – a, β – b etc.). Latin
equivalents of Greek letters MUST be taken from the IUPAC & IUB
recommendations [2].

Program invocation:

    pdbx-MOL-torsion-XY 1zyx.pdb 2zyx*.pdbx

The program MUST read all files provided on the command line and
compute the dihedral angles for all relevant residues. If no files
are provided, the program MUST read from STDIN.

NON-FUNCTIONAL REQUIREMENTS
===========================

The program MUST be able to process inputs of arbitrary size using a
fixed amount of RAM. In particular, it is not acceptable to read in
all atoms from the input stream and then analyse them in RAM.

DATA FORMATS
============

INPUT
-----

The input data can be provided in PDB and/or PDBx format.
The program MUST detect the format of the input stream automatically.
The program MAY use a line-oriented ATOM parser for PDBx files instead
of a full CIF parser.

The program SHOULD read the following records:

-- 'ATOM' and 'HETATM' records with the required atom coordinates
   (from PDB and PDBx data streams);
-- 'HEADER' records, to obtain the PDB ID (from a PDB stream, if such
   records are provided);
-- 'CRYST1' records, to determine a format switching point (from PDB
   streams);
-- 'data_' records, to determine a format switching point and to
   obtain the data block name (from a PDBx stream).

All other records, as well as ATOM and HETATM records that do not
describe the required atoms, MUST be ignored.

The program MUST accept a mixed PDB+PDBx data stream, i.e. a stream
that is obtained by concatenating PDB and PDBx files in any order.
The format switching points MUST be detected by analysing 'CRYST1',
'HEADER' and 'data_' records.

OUTPUT
------

The output data MUST be written to STDOUT, using CSV [3,4] or TSV [5];
the only exception to the format definition is the first line, as
described below.

The first line – program Id
...........................

Before the standard CSV or TSV stream begins, the program MUST output
a first line starting with the hash character ("#") in the first
column and containing the program Id generated by the version control
system. The line MUST NOT contain dollar signs ('$') or other version
control keyword delimiters, so that the program's Id is not confused
with similar-looking keywords in the output data files. Example:

    # Id: pdbx-MOL-torsion-XY 1238 2009-09-03 07:14:51Z author

In this way, the first line of the program's output will always
contain a time stamp in a standardised format, and the remaining
lines, starting with the second line (1-based), will form a standard
CSV or TSV data stream. This standard data stream can be obtained,
for example, using the 'tail -n +2' command on GNU/Linux or Unix.
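
The first-line rule above can be sketched as follows. The sketch is
written in Python purely for illustration (the assignment itself calls
for Perl or Ada). It assumes that the program source embeds an
expanded, Subversion-style Id keyword string; the helper name
`id_line` and the constant `KEYWORD` are hypothetical and not part of
this specification.

```python
def id_line(keyword: str) -> str:
    """Build the '# Id: ...' first output line from an expanded keyword.

    Assumption: `keyword` looks like '$Id: name rev date author $'.
    """
    # Remove every dollar sign so that the output line carries no
    # version control keyword delimiters, then trim the leftover blanks.
    text = keyword.replace("$", "").strip()
    # The hash character must sit in the first column of the line.
    return "# " + text


# Hypothetical keyword value, as a version control system might expand it.
KEYWORD = "$Id: pdbx-MOL-torsion-XY 1238 2009-09-03 07:14:51Z author $"

if __name__ == "__main__":
    print(id_line(KEYWORD))
```

Stripping the delimiters at output time, rather than hard-coding the
cleaned string, keeps the printed Id in step with whatever the version
control system writes into the source on each commit.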

The second line – column headers
................................

The second line MUST contain column names, using the same separators
as the data. The following columns MUST be output, in this order:

1. keyword – line keyword, indicating the dihedral angle type;
2. angle1 – the value of the first angle;
3. angle2 – the value of the second angle;
4. DATAID – a PDB ID from the HEADER record (for PDB files) or a data
   block name from the 'data_' header (for PDBx files);
5. chain – chain ID;
6. resname – residue name;
7. resnum – residue number (with the insertion code, if that code is
   present);
8. file – name of the processed file.

The remaining data stream
.........................

In the subsequent lines the program MUST output the information about
the computed dihedral angles. For each residue of the input molecule,
a single output line SHOULD be produced with the data for each of the
above-mentioned columns, separated by the relevant CSV or TSV
separators and quoted if necessary:

- keyword that reflects the angle type. The keywords MUST be composed
  of the program name's MOL fragment and the letters corresponding to
  the X and Y parts of the program name. All letters MUST be Latin
  upper-case letters. Greek letters used to name angles MUST be
  transliterated to Latin using the IUPAC recommendations [2].
  For example, the peptide φ and ψ angles (Ramachandran plot angles)
  will be marked using the keyword PEPFQ (PEP stands for 'peptide', F
  corresponds to φ and Q corresponds to ψ);
- value of the first angle in degrees (a floating point number);
- value of the second angle in degrees (a floating point number).
  If the value of the first or the second angle is undefined, an
  empty string is output instead of the missing angle;
- PDB identifier or the data block identifier (a lone dash, the "-"
  symbol (without the quotes), if the identifier is not known);
- the chain identifier (think about what to do when the chain ID is
  empty);
- the residue name;
- the residue number (with the insertion code if necessary);
- the input file name (a lone dash, the "-" symbol (without the
  quotes), if the input is STDIN; use "./-" for a file actually named
  "-").

Example:
========

NB: the numbers in the example are arbitrary, not necessarily
physically correct:

    # Id: pdbx-MOL-torsion-XY 1238 2009-09-03 07:14:51Z author
    keyword  angle1  angle2  DATAID  chain  resname  resnum  file
    PEPFQ            3.8     1xyz    B      ALA      2       /usr/data/PDB/1xyz.pdb
    PEPFQ    3.8     109.6   1xyz    B      ALA      3       /usr/data/PDB/1xyz.pdb
    PEPFQ    3.8     109.51  1xyz    B      GLY      3A      /usr/data/PDB/1xyz.pdb
    PEPFQ    3.8     109.51  1xyz    B      ASP      4       /usr/data/PDB/1xyz.pdb
    PEPFQ    3.8             1xyz    B      ASP      5       /usr/data/PDB/1xyz.pdb

When writing the values of the dihedral angles, print only the
significant decimal digits. The program MAY output one extra "guard"
digit to reduce rounding errors.

The residue order for the dihedral angle calculation MUST be
determined using the residue numbers; this means that if residue 203
follows residue 200, then 'angle1' for residue 203 and 'angle2' for
residue 200 are undefined and are output as empty strings.

ERROR DIAGNOSTICS
=================

Using native Perl diagnostics and the Perl warn() and die() functions
is permissible where appropriate.
The program MUST diagnose the following situations:

- incorrect floating point number format in the input file ("use
  warnings" diagnostics are OK);
- incorrect syntax in the input line (e.g. wrong number of fields);
- missing files;
- unreadable files;
- no atom records are found in the input stream (warning).

Questions to think about:
=========================

- How many decimal digits should you output from your program? What
  is the accuracy of your result? What floating point representation
  format (%f, %g, %e) is appropriate for your output, and what are
  the benefits and drawbacks of the different textual representations
  of floating point numbers?

References:
===========

1. S. Bradner (1997) Key words for use in RFCs to Indicate
   Requirement Levels. URL: https://tools.ietf.org/html/rfc2119

2. International Union of Pure and Applied Chemistry and
   International Union of Biochemistry (1974) Abbreviations and
   symbols for description of conformation of polypeptide chains.
   Pure and Applied Chemistry 40(3), 291-308. Walter de Gruyter GmbH.
   DOI: https://doi.org/10.1351/pac197440030291

3. RFC 4180. Common Format and MIME Type for Comma-Separated Values
   (CSV) Files. URL: https://www.ietf.org/rfc/rfc4180.txt
   [accessed: 2022-04-05T11:49+03:00]

4. Library of Congress. CSV, Comma Separated Values (RFC 4180).
   URL: https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml
   [accessed: 2022-04-05T11:50+03:00]

5. Library of Congress. TSV, Tab-Separated Values.
   URL: https://www.loc.gov/preservation/digital/formats/fdd/fdd000533.shtml
   [accessed: 2022-04-05T11:51+03:00]

Colophon
========

$Id: dihedral-angles.txt 11270 2024-05-07 12:01:40Z saulius $
$URL: file:///home/saulius/svn-repositories/paskaitos/VU/bioinformatika-III/u%C5%BEduotys-praktikai/dvisieni%C5%B3-kamp%C5%B3-u%C5%BEduotis/tasks/en/dihedral-angles.txt $