I am trying to run a simple Markov model on the following data.
Say I have data with the following structure:
pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T
Block M represents data from one set of categories, and so does block S.
The data are strings made by joining the letters down the position column. So the string value for M1 is A-T-C-G, and likewise for every other column.
There is also one hybrid block that holds two strings, which are read in the same way. My question is: which string in the hybrid block most likely came from which block (M vs. S)?
I am trying to build a Markov model that can help me identify which string in the hybrid block came from which block. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.
I am breaking the problem into different parts to read and mine the data:
Problem level 01:
First I read the header line and create unique keys for all the columns. Then I read the next line (pos with value 1) and create another key. In the same line I read the value from hybrid_block and read the string values in it. The pipe | is just a separator, so the two strings are at index 0 and 2, as A and C. So all I want from this line is:
defaultdict(<class 'dict'>, {'M1': ['A'], 'M2': ['T'], 'M3': ['T']...., 'hybrid_block': ['A'], ['C']...}
As I progress through the lines, I want to append the string values from each column and finally create:
defaultdict(<class 'dict'>, {'M1': ['A', 'T', 'C', 'G'], 'M2': ['T', 'G', 'A', 'T'], 'M3': ['T', 'C', 'A', 'G']...., 'hybrid_block': ['A', 'T', 'C', 'G'], ['C', 'A', 'G', 'T']...}
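The reading step described above could be sketched in plain Python roughly as follows. This is only a minimal sketch: it assumes the data is available as a whitespace-delimited string, and it stores the two hybrid strings under the hypothetical keys H0 and H1, since a single 'hybrid_block' key cannot hold two separate lists the way the target output is written.

```python
from collections import defaultdict

# Sample data from the question, as a whitespace-delimited string.
txt = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T"""

lines = txt.splitlines()
header = lines[0].split()

columns = defaultdict(list)
for line in lines[1:]:
    for name, value in zip(header, line.split()):
        if name == 'pos':
            continue  # the position is only a row label
        if name == 'hybrid_block':
            # 'A|C' -> 'A' and 'C'; H0 and H1 are hypothetical key names
            left, right = value.split('|')
            columns['H0'].append(left)
            columns['H1'].append(right)
        else:
            columns[name].append(value)

print(columns['M1'])  # ['A', 'T', 'C', 'G']
print(columns['H1'])  # ['C', 'A', 'G', 'T']
```

Using defaultdict(list) means each key is created the first time it is seen, so no separate pass over the header is needed to initialise the lists.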
Problem level 02:
I read the data in hybrid_block for the first line, which are A and C.
Now I want to create keys, but unlike the fixed keys above, these keys will be generated while reading the data from hybrid_block. For the first line, since there is no preceding line, the keys will simply be AgA and CgC, which mean (A given A) and (C given C). For the values, I count the number of A (and likewise C) in block M and in block S. So the data will be stored as:
defaultdict(<class 'dict'>, {'M': {'AgA': [4], 'CgC': [1]}, 'S': {'AgA': 2, 'CgC': 2}}
As I read through the other lines, I want to create new keys based on the strings in the hybrid block, and count the number of times each string was present in the M vs. S block given the string in the preceding line. That means the keys while reading line 2 would be TgA, which means (T given A), and AgC. For the values inside these keys, I count the number of times I found T in this line after A in the previous line, and the same for AgC.
The defaultdict after reading 3 lines would be:
defaultdict(<class 'dict'>, {'M': {'AgA': 4, 'TgA':3, 'CgT':2}, {'CgC': [1], 'AgC':0, 'GgA':0}, 'S': {'AgA': 2, 'TgA':1, 'CgT':0}, {'CgC': 2, 'AgC':2, 'GgA':2}}
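A plain-Python sketch of this counting step might look like the following. It makes a few assumptions that are not in the description above: the first line is treated as its own predecessor (to get AgA and CgC), the two hybrid strings are indexed 0 and 1, and each entry is stored as a (key, count) tuple in line order rather than in a flat dict, because the same key could in principle appear on more than one line.

```python
from collections import defaultdict

# Sample data from the question, as a whitespace-delimited string.
txt = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T"""

header, *records = [line.split() for line in txt.splitlines()]
m_idx = [i for i, name in enumerate(header) if name.startswith('M')]
s_idx = [i for i, name in enumerate(header) if name.startswith('S')]
h_idx = header.index('hybrid_block')

# counts[block][hybrid string index] -> list of (key, count), one per line
counts = {'M': defaultdict(list), 'S': defaultdict(list)}

prev = records[0]  # assumption: line 1 is conditioned on itself
for rec in records:
    pairs = zip(rec[h_idx].split('|'), prev[h_idx].split('|'))
    for h, (cur_h, prev_h) in enumerate(pairs):
        key = f'{cur_h}g{prev_h}'  # e.g. 'TgA' means T given A
        for block, idx in (('M', m_idx), ('S', s_idx)):
            # count columns in this block that show the same transition
            n = sum(1 for i in idx if f'{rec[i]}g{prev[i]}' == key)
            counts[block][h].append((key, n))
    prev = rec

print(counts['M'][0])  # [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)]
print(counts['S'][1])  # [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]
```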
I understand this looks too complicated. I went through several dictionary and defaultdict tutorials but couldn't find a way of doing this.
A solution to any part, if not both, is highly appreciated.
pandas setup
from io import StringIO
import pandas as pd
import numpy as np
txt = """pos M1 M2 M3 M4 M5 M6 M7 M8 hybrid_block S1 S2 S3 S4 S5 S6 S7 S8
1 A T T A A G A C A|C C G C T T A G A
2 T G C T G T T G T|A A T A T C A A T
3 C A A C A G T C C|G G A C G C G C G
4 G T G T A T C T G|T C T T T A T C T """
# sep=r'\s+' splits on any whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv(StringIO(txt), sep=r'\s+', index_col='pos')
df
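pandas with some numpy
Build the 'AgA' type strings
# Split hybrid_block into two columns, H0 and H1, and line the blocks up side by side
d1 = pd.concat([
    df.filter(like='M'),
    df.hybrid_block.str.split('|', expand=True).rename(columns='H{}'.format),
    df.filter(like='S')
], axis=1)
# Duplicate the first row as row 0 so that row 1 is conditioned on itself
d1 = pd.concat([d1.loc[[1]].rename(index={1: 0}), d1])
# Join each value with the one above it, e.g. 'T' + 'g' + 'A' -> 'TgA'
d1 = d1.add('g').add(d1.shift()).dropna()
d1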
Assign convenient blocks to their own variable names
m = d1.filter(like='M')
s = d1.filter(like='S')
h = d1.filter(like='H')
Count how many are in each block and concatenate
# Compare every M (or S) column with each hybrid column via broadcasting,
# then sum over the column axis to get per-position match counts
mcounts = pd.DataFrame(
    (m.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)
scounts = pd.DataFrame(
    (s.values[:, :, None] == h.values[:, None, :]).sum(1),
    h.index, h.columns
)
counts = pd.concat([mcounts, scounts], axis=1, keys=['M', 'S'])
counts
If you really want a dictionary
from collections import defaultdict

d = defaultdict(lambda: defaultdict(list))
# One row per position, with the counts and the 'XgY' condition for H0 and H1
dict_df = counts.stack().join(h.stack().rename('condition')).unstack()
for pos, row in dict_df.iterrows():
    d['M']['H0'].append((row.loc[('condition', 'H0')], row.loc[('M', 'H0')]))
    d['S']['H0'].append((row.loc[('condition', 'H0')], row.loc[('S', 'H0')]))
    d['M']['H1'].append((row.loc[('condition', 'H1')], row.loc[('M', 'H1')]))
    d['S']['H1'].append((row.loc[('condition', 'H1')], row.loc[('S', 'H1')]))
dict(d)
{'M': defaultdict(list,
{'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]}),
'S': defaultdict(list,
{'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]})}
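As a rough illustration of how these counts could be used to answer the original question, the sketch below scores each hybrid string against each block by summing log frequencies. This is not part of the solution above: the add-one smoothing, the N_COLUMNS and PSEUDOCOUNT constants, and the log_score helper are all assumptions made for the example, and other scoring choices are equally plausible.

```python
import math

# Counts copied from the dictionary printed above.
counts = {
    'M': {'H0': [('AgA', 4), ('TgA', 3), ('CgT', 2), ('GgC', 1)],
          'H1': [('CgC', 1), ('AgC', 0), ('GgA', 0), ('TgG', 1)]},
    'S': {'H0': [('AgA', 2), ('TgA', 1), ('CgT', 0), ('GgC', 0)],
          'H1': [('CgC', 2), ('AgC', 2), ('GgA', 2), ('TgG', 3)]},
}

N_COLUMNS = 8    # columns per block in the sample data
PSEUDOCOUNT = 1  # assumption: add-one smoothing, so a zero count never gives log(0)


def log_score(pairs):
    """Sum of log smoothed frequencies over all positions (hypothetical helper)."""
    return sum(
        math.log((n + PSEUDOCOUNT) / (N_COLUMNS + 2 * PSEUDOCOUNT))
        for _key, n in pairs
    )


# Score each hybrid string under each block and keep the higher-scoring block.
scores = {h: {block: log_score(counts[block][h]) for block in counts}
          for h in ('H0', 'H1')}
best = {h: max(by_block, key=by_block.get) for h, by_block in scores.items()}
print(best)  # {'H0': 'M', 'H1': 'S'}
```

On the sample data this agrees with the expectation stated in the question: the first hybrid string (ATCG) looks more like block M and the second (CAGT) more like block S.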