Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read two lines from a file and create dynamics keys in a for-loop, a follow-up

This question follows the problem in question: How to read two lines from a file and create dynamics keys in a for-loop?

But, the nature of the problem has evolved to certain complexity that I want to address.

Below is the structure of my data separated by space.

chr pos         M1  M2  Mk  Mg1  F1_hybrid     F1_PG    F1_block    S1  Sk1   S2    Sj
2   16229767    T/T T/T T/T G/T C|T 1|0 726  .  T/C T/C T/C
2   16229783    C/C C/C C/C A/C G|C 0|1 726 G/C G/C G/C C|G
2   16229992    A/A A/A A/A G/A G|A 1|0 726 A/A A/A A/A A|G
2   16230007    T/T T/T T/T A/T A|T 1|0 726 A|T A|T A|T A|T
2   16230011    G/G G/G G/G G/G C|G 1|0 726 G/C C|G C|G G/C
2   16230049    A/A A/A A/A A/A T|A 1|0 726 A|T .   A/T A/T
2   16230174    .   .   .   C/C T|C 1|0 726 C|T T|C T|C C|T
2   16230190    A/A A/A A/A A/A T|A 1|0 726 T|G G|T T|G T|G
2   16230260    A/A A/A A/A A/A G|A 1|0 726 G/G G/G G/G G/G

Explanation:

  • there are two major categories of data in the above file. Data from Group M have sample name starting with M, and similarly group S that has several columns names starting with S.

  • And there is a hybrid column (represented by F1_hybrid).

  • the data is the string along the position line. The F1_hybrid is phased with pipe (|) distinguishing the two letters. So, the two strings values from F1 are C-G-G-A-C-T-T-T-G, while another string value is T-C-A-T-G-A-C-A-A. One of this string is from M-group while the other is from S-group but I need to do some statistical analyses to do so. However, I can tell that visually that T-C-A-T-G-A-C-A-A string most likely came from M-group.

Procedure:

  • I read the first line and create a unique keys using the column information.

  • Then I read the second and 3rd line and the values in F1_hybrid, which is C|T with G|C. Now, I need to calculate how many GgC (explained as G given C) vs. CgT (C given T) exist between M-group vs. S group.

  • Then read 3rd (G|C) with 4th (G|A) line in F1_hybrid. So, the states are GgG and AgC. Similarly, I now count have many GcG vs. AgC exist in M vs. S group.

Therefore, I am trying to build a Markov-model which counts the number of state for a phased string from F1 and taking the observed counts in group M vs group S.

I am now explaining, how to count the number of any XgY (X given Y) based on F1_hyrbid:

  • It important to note the conditions before doing the count. The existing condition may be phased (which is represented by having pipe) vs. unphased (if the if two line have at least one slash (/).

Condition 01:

The M1 sample has state as (T/T with C/C) for 2nd and 3rd line. since the separator is a slash (/) and not pipe (|) we cannot tell which exact state M1-sample is in. But, we can create combination matrix (for previous state with present state)

    T     T
C  CgT   CgT
C  CgT   CgT

Now, we can tell that there are 4 total CgT

and we keep doing the same matrix if this condition meets.

Condition 02

Same is the case for other samples from Group M, except for Mg1 where the G/T is preceeding A/C. So, the matrix is:

    G     T
A  AgG   AgT
C  CgG   CgT

So, here we observed 1 count of CgT.

Condition 03:

But, if the earlier state - present state are phased by pipe in both states (like A|T at position 16230007 with C|G at position 16230011 for sample Sk1) we can do a direct count of phase state of observed state at that position, that there are only CgA and GgT, so count of CgT is 0.

Condition 04: If one of the state has pipe (|) but other has slash (/), the condition will be same as both state having slash.

Condition 05: If any of the previous_state or present_state is period(.) the observation count is automatically zero (0) for the state expected from F1_hybrid.

So, the expected output should be something like this:

pos     M1  M2  Mk  Mg1 H0  H1  S1  Sk1 S2  Sj
16..9783    4-CgT   4-CgT   4-CgT   1-CgT   GgC CgT 0   1-CgT   1-CgT   1-CgT
16..9992    4-AgC   4-AgC   4-AgC   2-AgC   GgG AgC 1-AgC   1-AgC   1-AgC   1-AgC,1-GgG
16..0007    4-TgA   4-TgA   4-TgA   1-AgG,1-TgA AgG TgA 2-TgA   2-TgA   2-TgA1  1-TgA
..................contd

Or, the values in dictionary format for each column would equally work. Something like ['4-CgT','4-CgT','4-CgT','4-CgT'] for first M1 at position 16..9783 and same for other.

like image 264
everestial007 Avatar asked Feb 02 '17 21:02

everestial007


1 Answers

The question is a bit old, but interesting because you have a very clear specification and you need help to write the code. I will expose a solution following a top-down approach, which is a very well known method, using plain old python. It shouldn't be difficult to adapt to pandas.

The top-down approach means to me: if you don't know how to write it, just name it!

You have a file (or a string) as input, and you want to output a file (or a string). It seems quite simple, but you want to merge pairs of rows to build every new row. The idea is:

  1. get the rows of the input, as dictionaries
  2. take them by two
  3. build a new row for each pair
  4. output the result

You don't know for now how to write the generator of rows. You don't know either how to build a new row for each pair. Don't stay blocked by the difficulties, just name the solutions. Imagine you have a function get_rows and a function build_new_row. Let's write this:

def build_new_rows(f):
    """generate the new rows. Output may be redirected to a file"""
    rows = get_rows(f) # get a generator on rows = dictionaries.
    r1 = next(rows) # store the first row
    for r2 in rows: # for every following row
        yield build_new_row(r1, r2) # yield a new row built of the previous stored row and the current row.
        r1 = r2 # store the current row, which becomes the previous row

Now, examine the two "missing" functions: get_rows and build_new_row. The function get_rows is quite easy to write. Here's the main part:

header = process_line(next(f))
for line in f:
    yield {k:v for k,v in zip(header, process_line(line))}

where process_line just splits the line on space, e.g. with a re.split("\s+", line.strip()).

The second part is build_new_row. Still the top-down approach: you need to build H0 and H1 from your expected table, and then to build the count of H1 for every M and S according to the conditions you exposed. Pretend you have a pipe_compute function that compute H0 and H1, and a build_count function that builds the count of H1 for every M and S:

def build_new_row(r1, r2):
    """build a row"""
    h0, h1 = pipe_compute(r1["F1_hybrid"], r2["F1_hybrid"])

    # initialize the dict whith the pos, H0 and H1
    new_row = {"pos":r2["pos"], "H0":h0, "H1":h1}

    for key in r1.keys():
        if key[0] in ("M", "S"):
            new_row[key] = build_count(r1[key], r2[key], h1)

    return new_row

You have almost everything now. Take a look at pipe_compute: it's exactly what you have written in your condition 03.

def pipe_compute(v1, v2):
    """build H0 H1 according to condition 03"""
    xs = v1.split("|")
    ys = v2.split("|")
    return [ys[0]+"g"+xs[0], ys[1]+"g"+xs[1]]

And for buid_count, stick to the top-down approach:

def build_count(v1, v2, to_count):
    """nothing funny here: just follow the conditions"""
    if is_slash_count(v1, v2): # are conditions 01, 02, 04 true ?
        c = slash_count(v1, v2)[to_count] # count how many "to_count" we find in the 2 x 2 table of condtions 01 or 02.
    elif "|" in v1 and "|" in v2: # condition 03
        c = pipe_count(v1, v2)[to_count]
    elif "." in v1 or "." in v2: # condition 05
        return '0'
    else:
        raise Exception(v1, v2)

    return "{}-{}".format(c, to_count) # n-XgY

We are still going down. When do we have is_slash_count? Two slashes (conditions 01 and 02) or one slash and one pipe (condition 04):

def is_slash_count(v1, v2):
    """conditions 01, 02, 04"""
    return "/" in v1 and "/" in v2 or "/" in v1 and "|" in v2 or "|" in v1 and "/" in v2

The function slash_count is simply the 2 x 2 table of conditions 01 and 02:

def slash_count(v1, v2):
    """count according to conditions 01, 02, 04"""
    cnt = collections.Counter()
    for x in re.split("[|/]", v1): # cartesian product
        for y in re.split("[|/]", v2): # cartesian product
            cnt[y+"g"+x] += 1
    return cnt # a dictionary XgY -> count(XgY)

The function pipe_count is even simpler, because you just have to count the result of pipe_compute:

def pipe_count(v1, v2):
    """count according to condition 03"""
    return collections.Counter(pipe_compute(v1, v2))

Now you're done (and down). I get this result, which is slightly different from your expectation, but you certainly have already seen my mistake(s?):

pos M1  M2  Mk  Mg1 H0  H1  S1  Sk1 S2  Sj
16229783    4-CgT   4-CgT   4-CgT   1-CgT   GgC CgT 0   1-CgT   1-CgT   1-CgT
16229992    4-AgC   4-AgC   4-AgC   1-AgC   GgG AgC 2-AgC   2-AgC   2-AgC   1-AgC
16230007    4-TgA   4-TgA   4-TgA   1-TgA   AgG TgA 2-TgA   2-TgA   2-TgA   0-TgA
16230011    4-GgT   4-GgT   4-GgT   2-GgT   CgA GgT 1-GgT   1-GgT   1-GgT   1-GgT
16230049    4-AgG   4-AgG   4-AgG   4-AgG   TgC AgG 1-AgG   0   1-AgG   1-AgG
16230174    0   0   0   4-CgA   TgT CgA 1-CgA   0   1-CgA   1-CgA
16230190    0   0   0   4-AgC   TgT AgC 0-AgC   0-AgC   0-AgC   0-AgC
16230260    4-AgA   4-AgA   4-AgA   4-AgA   GgT AgA 0-AgA   0-AgA   0-AgA   0-AgA

Bonus: Try it online!

What is important is, beyond the solution to this specific problem, the method I used and which is widely used in software development. The code may be improved a lot.

like image 126
jferard Avatar answered Oct 19 '22 07:10

jferard