2  Motif analysis

A motif can be created from a set of sequences of equal length. The create() function in the Bio.motifs module takes a list of sequences as an argument and returns a motif object.

import Bio
from Bio import motifs
from Bio.Seq import Seq
instances = [Seq("TACAA"), Seq("TACGC"), Seq("TACAC"), Seq("TACCC"), Seq("AACCC"), Seq("AATGC"), Seq("AATGC")]
m1 = motifs.create(instances)
print(m1)
print("The length of motif is", len(m1))
TACAA
TACGC
TACAC
TACCC
AACCC
AATGC
AATGC

The length of motif is 5

The counts attribute holds the number of times each letter occurs at each position in the motif. The counts for a particular letter, either across all positions or at a single position, can be accessed by indexing. The consensus attribute returns the consensus sequence for the motif.

print(m1.counts)
print("The frequency of 'A' at different positions in the motif is:", m1.counts["A"])
print("The frequency of 'A' at the first position is:", m1.counts["A",0])
print("The consensus sequence for the motif is:", m1.consensus)
        0      1      2      3      4
A:   3.00   7.00   0.00   2.00   1.00
C:   0.00   0.00   5.00   2.00   6.00
G:   0.00   0.00   0.00   3.00   0.00
T:   4.00   0.00   2.00   0.00   0.00

The frequency of 'A' at different positions in the motif is: [3, 7, 0, 2, 1]
The frequency of 'A' at the first position is: 3
The consensus sequence for the motif is: TACGC
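
To make explicit how the count matrix and the consensus are derived, here is a minimal pure-Python sketch that recomputes them for the same seven instances. It is an illustration only, not how Biopython implements these attributes; in particular, breaking ties by alphabet order is an assumption of this sketch.

```python
from collections import Counter

# The same seven instances as above, written as plain strings
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
alphabet = "ACGT"

# zip(*instances) yields one tuple of letters per motif position (column)
columns = [Counter(column) for column in zip(*instances)]

# counts[letter][i] = number of instances with `letter` at position i
counts = {letter: [col[letter] for col in columns] for letter in alphabet}

# Consensus: the most frequent letter at each position.
# Assumption of this sketch: ties are broken by alphabet order.
consensus = "".join(
    max(alphabet, key=lambda letter: counts[letter][i])
    for i in range(len(columns))
)

print(counts["A"])
print(consensus)
```

The two printed values can be compared directly with the m1.counts["A"] and m1.consensus output shown above.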

To find exact occurrences of a motif's instances in a sequence, use the instances.search() method.

sequence_set = [Seq("AACCGGTT"),Seq("AACCCGTT"),Seq("CATTACAA")]
motif_p1 = motifs.create(sequence_set)
print(motif_p1.instances)
AACCGGTT
AACCCGTT
CATTACAA
test_seq=Seq("TACACTGCATTACAACCCAAGCATTA")
for pos, seq in motif_p1.instances.search(test_seq):
    print("%i %s" % (pos, seq))
7 CATTACAA
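
The search above reports every position where one of the motif's instances occurs exactly in the test sequence. The same idea can be sketched in plain Python with a sliding window. The function name search_instances is hypothetical and belongs to this sketch, not to Biopython.

```python
def search_instances(instances, sequence):
    """Yield (position, instance) for every exact occurrence in sequence."""
    length = len(instances[0])  # all instances share the same length
    known = set(instances)
    # Slide a window of the motif length along the sequence
    for pos in range(len(sequence) - length + 1):
        window = sequence[pos:pos + length]
        if window in known:
            yield pos, window


sequence_set = ["AACCGGTT", "AACCCGTT", "CATTACAA"]
test_seq = "TACACTGCATTACAACCCAAGCATTA"

hits = list(search_instances(sequence_set, test_seq))
for pos, seq in hits:
    print(pos, seq)
```

Because only exact matches are reported, a window that differs from every instance by even one letter is skipped.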

2.0.1 Motif for protein sequences

By default, the create() function treats the input sequences as DNA. To create a motif from protein sequences, set the alphabet keyword argument to a string containing all the amino acids in single-letter code. The alphabet used below also includes X, which occurs in the example instances.

instances = [Seq("LXXLL"),Seq("LXXLL")]
motif_prot = motifs.create(instances,alphabet="ACDEFGHIKLMNPQRSTVWXY")
motif_prot.counts
{'A': [0, 0, 0, 0, 0],
 'C': [0, 0, 0, 0, 0],
 'D': [0, 0, 0, 0, 0],
 'E': [0, 0, 0, 0, 0],
 'F': [0, 0, 0, 0, 0],
 'G': [0, 0, 0, 0, 0],
 'H': [0, 0, 0, 0, 0],
 'I': [0, 0, 0, 0, 0],
 'K': [0, 0, 0, 0, 0],
 'L': [2, 0, 0, 2, 2],
 'M': [0, 0, 0, 0, 0],
 'N': [0, 0, 0, 0, 0],
 'P': [0, 0, 0, 0, 0],
 'Q': [0, 0, 0, 0, 0],
 'R': [0, 0, 0, 0, 0],
 'S': [0, 0, 0, 0, 0],
 'T': [0, 0, 0, 0, 0],
 'V': [0, 0, 0, 0, 0],
 'W': [0, 0, 0, 0, 0],
 'X': [0, 2, 2, 0, 0],
 'Y': [0, 0, 0, 0, 0]}
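
Since every letter in the instances has to be covered by the chosen alphabet, it can help to check this before creating a motif. The following is a small sketch of such a check; letters_outside_alphabet is a hypothetical helper written for this example and is not part of Biopython.

```python
def letters_outside_alphabet(instances, alphabet):
    """Return the sorted letters used in instances but absent from alphabet."""
    used = set("".join(instances))
    return sorted(used - set(alphabet))


protein_instances = ["LXXLL", "LXXLL"]

# With a DNA alphabet, both L and X are uncovered
missing_dna = letters_outside_alphabet(protein_instances, "ACGT")
# With the protein alphabet used above (including X), nothing is missing
missing_protein = letters_outside_alphabet(
    protein_instances, "ACDEFGHIKLMNPQRSTVWXY"
)

print(missing_dna)
print(missing_protein)
```

An empty list means the alphabet covers all letters that appear in the instances.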

2.0.2 Motif from MSA

A motif can also be created from the columns of a multiple sequence alignment (MSA). The example below reads a Clustal alignment of proteins with the DEAD motif, slices out the columns that contain the motif, and creates a motif object from them.

from Bio import AlignIO
# Read the alignment of proteins with the DEAD motif
DEAD_align = AlignIO.read("DEAD2.aln", "clustal")
print(DEAD_align)
Alignment with 23 rows and 1199 columns
--------------------------------------------...--- sp|P17844|DDX5_HUMAN
--------------------------------------------...--- sp|Q92841|DDX17_HUMAN
MNWNKGGPGTKRGFGFGGFAISAGKKEEPKLPQQSHSAFGATSS...--- sp|Q86XP3|DDX42_HUMAN
--------------------------------------------...--- sp|O00571|DDX3X_HUMAN
--------------------------------------------...--- sp|O15523|DDX3Y_HUMAN
--------------------------------------------...--- sp|Q9UHI6|DDX20_HUMAN
--------------------------------------------...--- sp|Q13838|DX39B_HUMAN
--------------------------------------------...--- sp|Q9UMR2|DD19B_HUMAN
--------------------------------------------...--- sp|P38919|IF4A3_HUMAN
--------------------------------------------...--- sp|P26196|DDX6_HUMAN
------------------------MVLAQRRRGGCEKLRAGPQA...--- sp|Q96GQ7|DDX27_HUMAN
--------------------------------------------...--- sp|Q9H0S4|DDX47_HUMAN
--------------------------------------------...KRM sp|Q8TDD1|DDX54_HUMAN
--------------------------------------------...--- sp|Q9NY93|DDX56_HUMAN
--------------------------------------------...--- sp|Q8NHQ9|DDX55_HUMAN
--------------------------------------------...--- sp|Q9NVP1|DDX18_HUMAN
----------------------MAPDLASQRHSESFPSVNSRPN...--- sp|Q9H8H2|DDX31_HUMAN
--------------------------------------------...--- sp|Q9UJV9|DDX41_HUMAN
...
--------------------------------------------...--- sp|Q9H2U1|DHX36_HUMAN
print(DEAD_align[1:5,442:448])
Alignment with 4 rows and 6 columns
LDEADR sp|Q92841|DDX17_HUMAN
FDEADR sp|Q86XP3|DDX42_HUMAN
LDEADR sp|O00571|DDX3X_HUMAN
LDEADR sp|O15523|DDX3Y_HUMAN
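import pandas as pd
# Take the six alignment columns that contain the DEAD motif from every row
instances_DEAD = [s1.seq for s1 in DEAD_align[:, 442:448]]
motif_DEAD = motifs.create(instances_DEAD, alphabet="ACDEFGHIKLMNPQRSTVWY")
# Convert the counts to a data frame: one row per position, one column per letter
df_DEAD_motif = pd.DataFrame.from_dict(motif_DEAD.counts)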

2.1 Logomaker

The logomaker package offers a rich set of functionality for creating sequence logos from sequences and motifs. It can be installed via pip install logomaker. The library uses pandas and matplotlib to generate sequence logos. To draw a sequence logo, use the Logo() function, which takes the sequence motif as a pandas data frame. The savefig() function of the plt object can then be used to save an image of the logo, and the resolution of the resulting image can be adjusted with the dpi keyword argument.

import logomaker
import pandas as pd
import matplotlib.pyplot as plt
ss_logo = logomaker.Logo(df_DEAD_motif)
plt.savefig("fig1.png",dpi=300)

To plot relative frequencies instead of raw counts on the y-axis, normalize the counts with the normalize() function before creating the data frame.

motif_pwm = motif_DEAD.counts.normalize()
df_motif_pwm = pd.DataFrame.from_dict(motif_pwm)
ss_logo_pwm = logomaker.Logo(df_motif_pwm)
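
The idea behind the normalization step can be sketched in plain Python: each count is divided by the total of its column, so the values at every position sum to 1. This is an illustration of the arithmetic, not Biopython's implementation. The function name normalize_counts and its pseudocount parameter belong to this sketch, and it assumes every position has at least one count.

```python
def normalize_counts(counts, pseudocount=0.0):
    """Convert per-position counts into per-position frequencies."""
    letters = sorted(counts)
    length = len(counts[letters[0]])
    frequencies = {letter: [] for letter in letters}
    for i in range(length):
        # Column total, with the pseudocount added once per letter
        total = sum(counts[letter][i] + pseudocount for letter in letters)
        for letter in letters:
            frequencies[letter].append((counts[letter][i] + pseudocount) / total)
    return frequencies


# Counts of the DNA motif m1 from the beginning of this chapter
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}
freqs = normalize_counts(counts)
column_sums = [sum(freqs[letter][i] for letter in freqs) for i in range(5)]

print(freqs["A"])
print(column_sums)
```

A positive pseudocount avoids zero frequencies, which matters if the frequencies are later turned into log-odds scores.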

2.2 Decorating logos

The logo font can be changed using the font_name argument. Positions within the logo can be highlighted with a background color using the highlight_position() function, as shown below.

ss_logo_pwm = logomaker.Logo(df_motif_pwm,font_name='Franklin Gothic Book')
ss_logo_pwm.highlight_position(p=1, color='pink')
ss_logo_pwm.highlight_position(p=2, color='pink')
ss_logo_pwm.highlight_position(p=3, color='pink')
ss_logo_pwm.highlight_position(p=4, color='pink')

2.3 Reading motifs

Motif files, such as those available from the JASPAR database, can be read directly to create motif objects.

df_jaspar_motif = pd.DataFrame()
# The with block closes the file automatically once parsing is done
with open("MA0007.2.jaspar") as fh:
    for m in motifs.parse(fh, "jaspar"):
        print(m.counts)
        df_jaspar_motif = pd.DataFrame.from_dict(m.counts)
        0      1      2      3      4      5      6      7      8      9     10     11     12     13     14
A: 6277.00 6497.00   0.00 6462.00 11206.00  26.00 10426.00 1478.00 4353.00 3312.00 3241.00   0.00 2214.00 2656.00 1599.00
C: 1112.00   0.00   0.00 1304.00   0.00 11115.00   0.00 2976.00 2096.00 3249.00 142.00 525.00 1262.00 2434.00 7032.00
G: 3049.00 4709.00 11206.00 1254.00   0.00  65.00 107.00 4023.00 2151.00 3291.00 305.00 10681.00 671.00 2460.00 298.00
T: 768.00   0.00   0.00 2186.00   0.00   0.00 673.00 2729.00 2606.00 1354.00 7518.00   0.00 7059.00 3656.00 2277.00
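
To give a feel for what such a file contains, here is a rough pure-Python sketch that parses JASPAR-style text: a header line starting with ">" followed by one bracketed row of counts per letter. It is a simplified illustration rather than a replacement for motifs.parse(). The function parse_jaspar_text and the toy record TOY0001.1 are hypothetical and made up for this example.

```python
import re


def parse_jaspar_text(text):
    """Parse JASPAR-style text into {motif_id: {letter: [counts, ...]}}."""
    matrices = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line of the form ">ID NAME"; the ID is used as the key
            current = line[1:].split()[0]
            matrices[current] = {}
        elif current is not None:
            letter = line[0]
            # Collect the numbers that follow the letter, ignoring brackets
            values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", line[1:])]
            matrices[current][letter] = values
    return matrices


# Toy record (hypothetical), not taken from the JASPAR database
example = """>TOY0001.1 hypothetical
A  [ 3 7 0 2 1 ]
C  [ 0 0 5 2 6 ]
G  [ 0 0 0 3 0 ]
T  [ 4 0 2 0 0 ]
"""

parsed = parse_jaspar_text(example)
print(parsed["TOY0001.1"]["A"])
```

The resulting dictionary has the same letter-to-list layout as the counts shown above, so it could be passed to pd.DataFrame.from_dict() in the same way.
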
ss_logo = logomaker.Logo(df_jaspar_motif)

ss_logo = logomaker.Logo(df_jaspar_motif,font_name='Franklin Gothic Book')
ss_logo.highlight_position(p=3, color='magenta')
ss_logo.highlight_position(p=7, color='lightgreen')