...ANNOTATIONS...
HEADER VIRAL PROTEIN 18-AUG-20 7JTL
TITLE STRUCTURE OF SARS-COV-2 ORF8 ACCESSORY PROTEIN
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ORF8 PROTEIN;
COMPND 3 CHAIN: A, B;
COMPND 4 SYNONYM: ORF8, NON-STRUCTURAL PROTEIN 8, NS8;
COMPND 5 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: SEVERE ACUTE RESPIRATORY SYNDROME CORONAVIRUS
SOURCE 3 2;
...ATOM RECORDS...
ATOM 1 N GLN A 18 40.917 33.173 126.601 1.00 73.92 N
ATOM 2 CA GLN A 18 42.025 33.162 125.645 1.00 73.89 C
ATOM 3 C GLN A 18 41.754 32.204 124.493 1.00 67.88 C
ATOM 4 O GLN A 18 42.414 31.177 124.374 1.00 66.48 O
ATOM 5 CB GLN A 18 43.339 32.782 126.337 1.00 73.82 C
ATOM 6 N GLU A 19 40.764 32.527 123.665 1.00 60.90 N
ATOM 7 CA GLU A 19 40.434 31.695 122.517 1.00 50.80 C
ATOM 8 C GLU A 19 40.928 32.273 121.201 1.00 50.81 C
ATOM 9 O GLU A 19 41.122 31.517 120.240 1.00 47.32 O
ATOM 10 CB GLU A 19 38.918 31.467 122.431 1.00 55.27 C
4 Working with PDB files
PDB format
Biomolecular structures can be saved in different formats such as PDB, mmCIF, XML, etc. The PDB format is the most common format and is supported by all visualization and docking programs. In this format the annotations for the structure are stored as different tags such as TITLE, COMPND, SOURCE, REMARK, etc. The atomic coordinates are saved under ATOM and HETATM tags. The PDB
class of Bio library has a PDBParser()
function to instantiate a PDB parser object which can be used to read and write structure files in pdb format. The get_structure
function takes two arguments — a name for the new structure object and a pdb file name.
from Bio.PDB import *
= PDBParser().get_structure("Protease", "7jtl.pdb") structure
C:\Users\bioinfo guru\.conda\envs\bio-env\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line 3855.
warnings.warn(
C:\Users\bioinfo guru\.conda\envs\bio-env\lib\site-packages\Bio\PDB\StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain B is discontinuous at line 3963.
warnings.warn(
The structure object has a attribute header
that stores the annotations for the structure as a dictionary.
print("The structure title is:",structure.header["name"])
print("The structure resolution is:",structure.header["resolution"])
print("The structure keywords are:",structure.header["keywords"])
The structure title is: structure of sars-cov-2 orf8 accessory protein
The structure resolution is: 2.04
The structure keywords are: sars-cov-2, sars2, covid-19, coronavirus, accessory protein, host- factor restriction, rna virus, immune evasion, mhc-i, open reading frame 8, viral protein
The coordinates in the structure object has a hierarchical organization as shown in the diagram below. The structure object is at the top of the hierarchy which has one or more models. Each model can have one or more chains; and so on. This organization, in programming terms, can be thought of as a nested list. To navigate through the structure object we can either use for
loop or some of the available in-built functions. The later being computationally efficient choice. The objects at each level in this organization has dedicated functions and attribute to get more information about the level.
Iterating the hierarchical organization using tandem for
loops. The code below creates a dictionary having chain names as keys and a list of residues in that chain as values.
import collections
= collections.defaultdict(list)
chain_residues for model in structure:
for chain in model:
for residue in chain:
if(residue.resname != 'HOH'):
id].append(residue.resname)
chain_residues[chain.print(chain_residues)
defaultdict(<class 'list'>, {'A': ['GLN', 'GLU', 'CYS', 'SER', 'LEU', 'GLN', 'SER', 'CYS', 'THR', 'GLN', 'HIS', 'GLN', 'PRO', 'TYR', 'VAL', 'VAL', 'ASP', 'ASP', 'PRO', 'CYS', 'PRO', 'ILE', 'HIS', 'PHE', 'TYR', 'SER', 'LYS', 'TRP', 'TYR', 'ILE', 'ARG', 'VAL', 'GLY', 'ALA', 'ARG', 'LYS', 'SER', 'ALA', 'PRO', 'LEU', 'ILE', 'GLU', 'LEU', 'CYS', 'VAL', 'ASP', 'GLU', 'SER', 'LYS', 'SER', 'PRO', 'ILE', 'GLN', 'TYR', 'ILE', 'ASP', 'ILE', 'GLY', 'ASN', 'TYR', 'THR', 'VAL', 'SER', 'CYS', 'LEU', 'PRO', 'PHE', 'THR', 'ILE', 'ASN', 'CYS', 'GLN', 'GLU', 'PRO', 'LYS', 'LEU', 'GLY', 'SER', 'LEU', 'VAL', 'VAL', 'ARG', 'CYS', 'SER', 'PHE', 'TYR', 'GLU', 'ASP', 'PHE', 'LEU', 'GLU', 'TYR', 'HIS', 'ASP', 'VAL', 'ARG', 'VAL', 'VAL', 'LEU', 'ASP', 'PHE', 'ILE', 'NA'], 'B': ['GLN', 'GLU', 'CYS', 'SER', 'LEU', 'GLN', 'SER', 'CYS', 'THR', 'GLN', 'HIS', 'GLN', 'PRO', 'TYR', 'VAL', 'VAL', 'ASP', 'ASP', 'PRO', 'CYS', 'PRO', 'ILE', 'HIS', 'PHE', 'TYR', 'SER', 'LYS', 'TRP', 'TYR', 'ILE', 'ARG', 'VAL', 'GLY', 'ALA', 'ARG', 'LYS', 'SER', 'ALA', 'PRO', 'LEU', 'ILE', 'GLU', 'LEU', 'CYS', 'VAL', 'ASP', 'GLU', 'ALA', 'SER', 'PRO', 'ILE', 'GLN', 'TYR', 'ILE', 'ASP', 'ILE', 'GLY', 'ASN', 'TYR', 'THR', 'VAL', 'SER', 'CYS', 'LEU', 'PRO', 'PHE', 'THR', 'ILE', 'ASN', 'CYS', 'GLN', 'GLU', 'PRO', 'LYS', 'LEU', 'GLY', 'SER', 'LEU', 'VAL', 'VAL', 'ARG', 'CYS', 'SER', 'PHE', 'TYR', 'GLU', 'ASP', 'PHE', 'LEU', 'GLU', 'TYR', 'HIS', 'ASP', 'VAL', 'ARG', 'VAL', 'VAL', 'LEU', 'ASP', 'PHE', 'ILE']})
The Selection
module in Bio.PDB
has unfold_entities
function that facilitates navigation through different levels in the structure object. This function takes two arguments – an entity list and the target level. The targen level is a single letter code via Atom, Residue, Chain, Model, or Structure. E.g. we can get all the atoms in a structure object using Selection.unfold_entities(structure, 'A')
. The diagram below shows the different arguments for the unfold_entities
function to unpack at different levels starting at the structure level. Note that structure
here refers to the name of the structure object instantiated using PDBParser()
.
Quiz: Write a code to get a dictionary having chain names as keys and a list of residues in that chain as values using unfold_entities().
Show answer
= collections.defaultdict(list)
chain_residues2 = Selection.unfold_entities(structure,"R")
residues_all id].append(r1.resname)\
[chain_residues2[r1.get_parent().for r1 in residues_all if r1.resname != 'HOH']
print(chain_residues2)
4.1 Sequence from pdb file
The parse
function in the Bio.SeqIO
module can be used to retrive the sequences of polymers in the pdb file. There are two different types of sequences that can be retrieved. 1) sequence from the ATOM record and 2) sequence from the SEQRES record (if present). The residues which are disordered in the structure would not appear in the sequence retrived using the pdb-atom
argument while those residues would be there in the output with pdb-seqres
argument.
from Bio import SeqIO
for record in SeqIO.parse("7jtl.pdb", "pdb-atom"):
print("Chain", record.annotations["chain"])
print(record.seq)
Chain A
QECSLQSCTQHQPYVVDDPCPIHFYSKWYIRVGARKSAPLIELCVDEXXSKSPIQYIDIGNYTVSCLPFTINCQEPKLGSLVVRCSFYEDFLEYHDVRVVLDFI
Chain B
QECSLQSCTQHQPYVVDDPCPIHFYSKWYIRVGARKSAPLIELCVDEAXXXSPIQYIDIGNYTVSCLPFTINCQEPKLGSLVVRCSFYEDFLEYHDVRVVLDFI
from Bio import SeqIO
for record in SeqIO.parse("7jtl.pdb", "pdb-seqres"):
print("Chain", record.annotations["chain"])
print(record.seq)
Chain A
SNAQECSLQSCTQHQPYVVDDPCPIHFYSKWYIRVGARKSAPLIELCVDEAGSKSPIQYIDIGNYTVSCLPFTINCQEPKLGSLVVRCSFYEDFLEYHDVRVVLDFI
Chain B
SNAQECSLQSCTQHQPYVVDDPCPIHFYSKWYIRVGARKSAPLIELCVDEAGSKSPIQYIDIGNYTVSCLPFTINCQEPKLGSLVVRCSFYEDFLEYHDVRVVLDFI
The residue object
To get a list of residues within an object at any level above residue, we can use the get_residue
function. To get the residue name and id for a particular residue, resname
and id
attributes for the residue object can be used. The get_parent
function returns the corresponding chain for a residue object. The get_list
function for the residue object return a list having all the atoms within that residue.
=0
ctr=""
prot_seqfor residue in model.get_residues():
if(ctr==5):
break
print(residue.resname, residue.id[1])
+=residue.resname
prot_seq+=1
ctrprint(prot_seq)
GLN 18
GLU 19
CYS 20
SER 21
LEU 22
GLNGLUCYSSERLEU
To convert the sequence in three letter code to single letter code, we can use SeqUtils
package available in biopython. The seq1()
function returns single letter sequence for a three letter sequence. Similarly, to convert single letter sequence to three letter we have seq3()
function.
from Bio.SeqUtils import seq1
seq1(prot_seq)
'QECSL'
In case the structure has some modified residues then custom mapping of three letter code to single letter code can be done using the custom_map
keyword argument to seq1()
function. E.g., let say a protein has a phosphorylated serine residue for which the three letter code is SEP and there is no standard single letter code for this. We can add custom mapping for SEP to S as follows.
+="SEP"
prot_seqprint(seq1(prot_seq)) #without custom map
print(seq1(prot_seq,custom_map={"SEP": "S"}))
QECSLX
QECSLS
4.2 Atom object
The information within an atom object can be accessed using different functions specific of a particular property. E.g., to get the name and coordinates for an atom, get_name
and get_coord
functions can be used. Similarly, to modify a value for any property, functions with set_
prefix can be used.
= list(structure[0]["A"].get_residues())
res_list print(res_list[3])
= res_list[3].get_list()
atoms_resi3 print(atoms_resi3)
print(atoms_resi3[1].get_coord())
<Residue SER het= resseq=21 icode= >
[<Atom N>, <Atom CA>, <Atom C>, <Atom O>, <Atom CB>, <Atom OG>]
[ 44.851 36.104 119.41 ]
4.3 Distance Calculations
The distance between any two atoms can be easily calculated using the -
operator.
= res_list[3].get_list()[1]
ca_3 = res_list[15].get_list()[1]
ca_15 = ca_3 - ca_15 # distance between two atoms
distance print(distance)
10.335114
Calulating Ca-Ca distance matrix
C-alpha distance matrix is a visual tool to analyze intra-molecular interactions in a protein structure. This matix has pairwise distances for all Ca atoms. The distance matrix can be converted to a contact map using a distance threshold.
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt
= Selection.unfold_entities(structure, "A")
atom_list = []
ca_ca_dist = [x for x in atom_list if "CA" in x.name]
ca_atom_list for x in range(len(ca_atom_list)):
for y in range(x + 1,len(ca_atom_list)):
-ca_atom_list[y])
ca_ca_dist.append(ca_atom_list[x]= squareform(ca_ca_dist)
dist_mat ## calculate contact map
= dist_mat.copy()
cont_map >7]=0 cont_map[cont_map
= plt.subplots(1,2, sharey=True)
fig, ax = ax[0].imshow(dist_mat, cmap="YlGn")
a 0].set_title("Distance Map")
ax[= ax[1].imshow(cont_map, cmap="YlGn")
b 1].set_title("Contact Map")
ax[=ax[0], shrink=0.5, label="Distance ($\mathrm{\AA}$)")
fig.colorbar(a, ax=ax[1], shrink=0.5, label="Distance ($\mathrm{\AA}$)")
fig.colorbar(b, ax plt.show()
4.4 Interface residues
The Bio.PDB
has NeighborSearch
function that is useful for calculating neighboring residues for a given selection of atoms. This function can be used to find interface residues in a multimeric complex. For this, we first need to define atoms selections having atoms for each of the monomeric subunits. Then we need to call this functions for each subunit to get the surrounding residues for both the subunits.
= Selection.unfold_entities(structure[0]["A"], "A")
A_atoms = Selection.unfold_entities(structure[0]["B"], "A") B_atoms
Search atoms around chain A.
= NeighborSearch(A_atoms)
ns = []
contacts_B for atom in B_atoms:
= ns.search(atom.coord, 3)
close_atoms if len(close_atoms)>0:
for a in close_atoms:
1])
contacts_B.append(a.get_parent().get_id()[print("Residues of chain B around chain A are:")
print(', '.join(map(str, contacts_B)))
Residues of chain B around chain A are:
314, 20, 339, 120, 120, 24, 118, 119, 307, 115, 115, 341, 51, 53, 119, 115, 310, 358, 349, 387, 387, 201, 403
Search atoms around chain B.
= NeighborSearch(B_atoms)
ns
= []
contacts_A for atom in A_atoms:
= ns.search(atom.coord, 3)
close_atoms if len(close_atoms)>0:
for a in close_atoms:
1])
contacts_A.append(a.get_parent().get_id()[print("Residues of chain A around chain B are:")
print(', '.join(map(str, contacts_A)))
Residues of chain A around chain B are:
20, 53, 120, 120, 118, 119, 214, 115, 115, 213, 51, 53, 287, 115, 262, 18, 50, 119, 275, 272, 285, 282, 290
4.5 Binana
BINANA (BINding ANAlyzer) is a python library that can be used for protein-ligand binding analysis. The tool is of particular use when analyzing protein-ligand docking results. The source code is available for download. After download the file, unzip the archive and add the path of the location where binana is there to sys.path
as shown below. You should now be able to import binana.
import sys
"C:/Users/bioinfo guru/Downloads/binana-2.1/python")
sys.path.append(import binana
The first thing that we need to do is to load the receptor and the ligand files. Once we have the these two files as binana objects we can proceed with interaction analysis. The interactions class has a set of functions for retriving different types of inter-molecular interactions. E.g., getting a list of hydrogen bonds between the receptor and the ligand using get_hydrogen_bonds()
. This function returns a dictionary having information about each hydrogen bond between the receptor and the ligand. Similarly, we can get information about all types of intermolecular interactions (including hydrogen bonding) using get_all_interactions
function.
= binana.load_ligand_receptor.from_files("C:/Users/bioinfo guru/Downloads/binana-2.1/python/example/ligand.pdbqt", \
ligand, receptor "C:/Users/bioinfo guru/Downloads/binana-2.1/python/example/receptor.pdbqt")
= binana.interactions.get_hydrogen_bonds(ligand, receptor)
hbond_inf for hbond_label in hbond_inf["labels"]:
print(hbond_label)
('A:CHT(1):N1(14)', 'A:CHT(1):H1(16)', 'A:ASP(157):OD2(285)', 'LIGAND', {'distance': 2.6500811308335455, 'angle': 16.087842801376098})
('A:CHT(1):O6(22)', 'A:ASN(156):2HD2(276)', 'A:ASN(156):ND2(274)', 'RECEPTOR', {'distance': 2.9006795755477723, 'angle': 35.51562311681741})
= binana.interactions.get_all_interactions(ligand, receptor)
all_inf print(all_inf.keys())
"hydrogen_bonds"] all_inf[
dict_keys(['closest', 'close', 'electrostatic_energies', 'active_site_flexibility', 'hydrophobics', 'hydrogen_bonds', 'halogen_bonds', 'ligand_atom_types', 'pi_pi', 'cat_pi', 'salt_bridges', 'metal_coordinations', 'ligand_rotatable_bonds'])
{'counts': {'HDONOR_LIGAND_SIDECHAIN_OTHER': 1,
'HDONOR_RECEPTOR_SIDECHAIN_OTHER': 1},
'mol': <binana._structure.mol.Mol at 0x1c1af8dc460>,
'labels': [('A:CHT(1):N1(14)',
'A:CHT(1):H1(16)',
'A:ASP(157):OD2(285)',
'LIGAND',
{'distance': 2.6500811308335455, 'angle': 16.087842801376098}),
('A:CHT(1):O6(22)',
'A:ASN(156):2HD2(276)',
'A:ASN(156):ND2(274)',
'RECEPTOR',
{'distance': 2.9006795755477723, 'angle': 35.51562311681741})]}
= binana.output.dictionary.collect_all(all_inf) all_data
print(len(all_data["hydrogenBonds"]))
for x in all_data["hydrogenBonds"]:
print(x)
2
{'ligandAtoms': [{'chain': 'A', 'resID': 1, 'resName': 'CHT', 'atomName': 'N1', 'atomIndex': 14}, {'chain': 'A', 'resID': 1, 'resName': 'CHT', 'atomName': 'H1', 'atomIndex': 16}], 'receptorAtoms': [{'chain': 'A', 'resID': 157, 'resName': 'ASP', 'atomName': 'OD2', 'atomIndex': 285}], 'metrics': {'distance': 2.6500811308335455, 'angle': 16.087842801376098}}
{'ligandAtoms': [{'chain': 'A', 'resID': 1, 'resName': 'CHT', 'atomName': 'O6', 'atomIndex': 22}], 'receptorAtoms': [{'chain': 'A', 'resID': 156, 'resName': 'ASN', 'atomName': 'ND2', 'atomIndex': 274}, {'chain': 'A', 'resID': 156, 'resName': 'ASN', 'atomName': '2HD2', 'atomIndex': 276}], 'metrics': {'distance': 2.9006795755477723, 'angle': 35.51562311681741}}