
Length cutting through file handling

I have two scripts that attempt the task I am asking about, but neither of them produces usable output for my data set. First, let me explain what I am doing. I have two text files, input_num.txt and input_data.txt. As the names suggest, input_num.txt holds numbers and input_data.txt holds data. Each file is 8 to 10 MB. Here is a sample of each. This is input_num.txt:

ASA5.txt DF4E6.txt DFS6Q7.txt

and this is input_data.txt:

>56|61|83|92|ASA5
Dogsarebarking

These are only excerpts of the two files. The last column of each header line in input_data.txt holds a name such as ASA5, and these names are the ones listed in input_num.txt. The program should first read the last column of >56|61|83|92|ASA5, which is ASA5, then look up ASA5 in input_num.txt to find the number stored for it (4, for example), and finally go back to input_data.txt and cut the words on the following line to that length.

I have two scripts for this. The first is:

import os
import re
file_c = open('num_data.txt')
file_c = file_c.read()
lines = re.findall(r'\w+\.txt \d+', file_c)
numbers = {}

for line in lines:
    line_split = line.split('.txt ')
    hash_name = line_split[0]
    count = line_split[1]
    numbers[hash_name] = count
file_i = open('input_data.txt')
file_i = file_i.read()

for hash_name, count in numbers.iteritems():
    regex = '(' + hash_name.strip() + ')'
    result = re.findall(r'>.*\|(' + regex + ')(.*?)>', file_i, re.S)

    if len(result) > 0:
        data_original = result[0][2]
        stripped_data = result[0][2][int(count):]
        file_i = file_i.replace(data_original, '\n' + stripped_data)
f = open('input_new.txt', 'wt')
f.write(file_i)
f.close()

and the second is:

import csv
output = open('output.txt' , 'wb')
def get_min(num):
    return int(open('%s.txt' % num, 'r+').readlines()[0])
last_line = ''
input_list = []

# iterate over input.txt and store the input in a list of (header, data) tuples
for i, line in enumerate(open('input.txt', 'r+').readlines()): 
    if i%2 == 0: 
        last_line = line
    else:
        input_list.append((last_line, line))
filtered = [(header, data[:get_min(header[-2])] + '\n' ) for (header, data) in input_list]
[output.write(''.join(data)) for data in filtered]
output.close()

Rocket asked Apr 12 '13 18:04


1 Answer

As far as I could understand from the description of your problem with the first code, you want the first N letters in the output while in fact you get everything except the first N letters. This can probably be fixed by changing

stripped_data = result[0][2][int(count):]

to

stripped_data = result[0][2][:int(count)]
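
To make the difference between the two slices concrete, here is a small sketch using the sample sequence from the question. The count of 4 is an assumed value for illustration, not one read from your files.

```python
seq = "Dogsarebarking"   # sample data line from the question
count = 4                # assumed cut length, for illustration only

first_part = seq[:count]   # keeps the first `count` characters
remainder = seq[count:]    # drops the first `count` characters

print(first_part)
print(remainder)
```

The `[count:]` form is what the original script used, which is why the output contained everything except the part you wanted.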

I also think the regular expressions used are not completely accurate. I suggest the following for the numbers:

import re

with open('num.txt') as nums:
    lines = re.findall(r'\w+\.txt\s+\d+', nums.read())

numbers = {}
for line in lines:
    # each match looks like "NAME.txt N"; split it into the name and the count
    line_split = re.split(r'\.txt\s+', line)
    numbers[line_split[0]] = int(line_split[1])

and the following for the data:

with open('input_data.txt') as file_i:
    data = file_i.read()

for name, count in numbers.items():
    # \s* keeps the trailing newline out of the captured sequence
    result = re.search(r'\|{}\n(.*?)\s*(>|$)'.format(re.escape(name)), data, re.S)
    if result:
        data_original = result.group(1)
        stripped_data = data_original[:count]
        data = data.replace(data_original, stripped_data)
with open('input_new.txt', 'w') as f:
    f.write(data)

But note that the idea is still flawed, because replace can accidentally change more than one sequence. This method is also memory-inefficient, because each file is read into memory as a single string. I suggest using an iterative parser for the data, like the ones I mention below.
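
To illustrate what an iterative approach looks like without any third-party library, here is a minimal sketch that walks the data line by line and trims each sequence as it goes. The function name `trim_records` and the sample contents are hypothetical. It assumes every header starts with `>` and ends with the name after the last `|`, and it leaves records untouched when their name has no entry in `numbers`.

```python
import io

def trim_records(lines, numbers):
    """Yield the input lines, cutting each sequence to the length listed for its header."""
    remaining = None  # None means no limit is known for the current record
    for line in lines:
        if line.startswith('>'):
            # the name is whatever follows the last '|' in the header
            name = line.strip().rsplit('|', 1)[-1]
            remaining = numbers.get(name)
            yield line
        elif remaining is None:
            yield line
        else:
            text = line.rstrip('\n')
            kept = text[:remaining]
            remaining -= len(kept)
            if kept:  # lines that end up empty are dropped
                yield kept + '\n'

# Hypothetical sample data standing in for the real files
numbers = {'ASA5': 4}
source = io.StringIO(">56|61|83|92|ASA5\nDogsarebarking\n>1|2|3|4|XYZ\nunchanged\n")
result = ''.join(trim_records(source, numbers))
print(result)
```

Because the function only ever holds one line at a time, it can be fed an open file object directly, for example with `dst.writelines(trim_records(src, numbers))` inside a `with` block that opens both files. Each sequence is trimmed in place, so one record can never affect another the way a global replace can.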


Anyway, if I had to solve this problem, I'd use pyteomics to read and write FASTA files (because I wrote it and always have it handy).

The format of input_num.txt is awful, so I think the code from your first example is the best one can do to extract the info. I made some fixes to it though:

import re
from pyteomics import fasta

with open('num.txt') as nums:
    lines = re.findall(r'\w+\.txt\s+\d+', nums.read())

numbers = {}
for line in lines:
    line_split = re.split(r'\.txt\s+', line)
    numbers[line_split[0]] = int(line_split[1])

with fasta.read('data.txt') as data:
    # the name after the last '|' in the header selects the cut length
    new_data = ((header, seq[:numbers.get(header.rsplit('|', 1)[-1])])
            for header, seq in data)
    fasta.write(new_data, 'new_data.txt')

On the other hand, since your data look more like DNA sequences and pyteomics is for proteomics, it may make more sense to use Biopython's Bio.SeqIO:

import re
from Bio import SeqIO

with open('num.txt') as nums:
    lines = re.findall(r'\w+\.txt\s+\d+', nums.read())

numbers = {}
for line in lines:
    line_split = re.split(r'\.txt\s+', line)
    numbers[line_split[0]] = int(line_split[1])
data = SeqIO.parse(open('data.txt'), 'fasta')

def new_records():
    for record in data:
        record.seq = record.seq[:numbers.get(record.description.rsplit('|', 1)[-1])]
        yield record

with open('new_data.txt', 'w') as new_data:
    SeqIO.write(new_records(), new_data, 'fasta')
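
Both library versions lean on the same small trick: `numbers.get(...)` returns `None` for a name that is not in the dictionary, and slicing with `[:None]` keeps the whole sequence. The sketch below shows that behaviour in isolation. The helper name `cut` and the dictionary contents are made up for illustration.

```python
numbers = {'ASA5': 4}    # hypothetical lookup table
seq = 'Dogsarebarking'

def cut(header, seq):
    name = header.rsplit('|', 1)[-1]    # text after the last '|'
    return seq[:numbers.get(name)]      # get() returns None for unknown names

known = cut('56|61|83|92|ASA5', seq)    # name is listed, so the sequence is cut
unknown = cut('1|2|3|4|XYZ', seq)       # name is not listed, so nothing is cut

print(known)
print(unknown)
```

So records whose names are missing from the numbers file pass through unchanged instead of raising an error.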

Lev Levitsky answered Oct 16 '22 22:10