
Split a file into chunks

I am trying to split a file formatted as:

@some 
@garbage
@lines
@target G0.S0
@type xy
 -0.108847E+02  0.489034E-04
 -0.108711E+02  0.491023E-04
 -0.108574E+02  0.493062E-04
 -0.108438E+02  0.495075E-04
 -0.108302E+02  0.497094E-04
 ....Unknown line numbers...
&
@target G0.S1
@type xy
 -0.108847E+02  0.315559E-04
 -0.108711E+02  0.316844E-04
 -0.108574E+02  0.318134E-04
 ....Unknown line numbers...
&
@target G1.S0
@type xy
 -0.108847E+02  0.350450E-04
 -0.108711E+02  0.351669E-04
 -0.108574E+02  0.352908E-04
&
@target G1.S1
@type xy
 -0.108847E+02  0.216396E-04
 -0.108711E+02  0.217122E-04
 -0.108574E+02  0.217843E-04
 -0.108438E+02  0.218622E-04

The @target Gx.Sy combination is unique and each set of data is always terminated by &.

I have managed to split the file into chunks as follows:

#!/usr/bin/env python3
import os
import sys
import itertools as it
import numpy as np
import matplotlib.pyplot as plt

try:
  filename = sys.argv[1]
  print(filename)
except IndexError:
  print("ERROR: Required filename not provided")

with open(filename, "r") as f:
  for line in f:
    if line.startswith("@target"):
      print(line.split()[-1].split("."))

x=[];y=[]
with open(filename, "r") as f:
  for key,group in it.groupby(f,lambda line: line.startswith('@target')):
    print(key)
    if not key:
        group = list(group)
        group.pop(0)
        # group.pop(-1)
        print(group)
        for i in range(len(group)):
          x.append(group[i].split()[0])
          y.append(group[i].split()[1])
        nx=np.array(x)
        ny=np.array(y)

I have two problems:

1) The preamble lines before the real data are also grouped, so the script does not work if there is any preamble. It is impossible to predict how many preamble lines there will be, but I am trying to group only what comes after the @target line.

2) I want to name the arrays as G0[S0,S0] and G1[S1,S2], but I can't do this.

Kindly help.

UPDATE: I am trying to store those data in a nested np array of G0[S0,S1,...], G1[S0,S1,..] so that I can use it in matplotlib.

BaRud asked Feb 22 '17 19:02





2 Answers

The functions below get the job done:

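import numpy as np
from collections import defaultdict

def read_without_preamble(filename):
    """Return the lines of the file starting at the first '@target' line."""
    with open(filename, 'r') as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith('@target'):
            return lines[i:]
    return []  # no '@target' line found

def split_into_chunks(lines):
    """Map G -> S -> (n_rows, 2) array for every '@target Gx.Sy' block."""
    chunks = defaultdict(dict)
    for line in lines:
        if line.startswith('@target'):
            # '@target G0.S1' -> ['G0', 'S1'] -> (0, 1)
            GS_str = line.strip().split()[-1].split('.')
            G, S = (int(x[1:]) for x in GS_str)
            chunks[G][S] = []
        elif line.startswith(('@type xy', '&')) or not line.strip():
            pass  # header, terminator and blank lines carry no data
        else:
            # list() is needed in Python 3, where map() returns an iterator
            chunks[G][S].append(list(map(float, line.split())))
    # convert at the end so a last chunk without a closing '&' is handled too
    return {G: {S: np.asarray(rows) for S, rows in d.items()}
            for G, d in chunks.items()}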

To split your file into chunks you just need to run this code:

import sys

try:
  filename = sys.argv[1]
except IndexError:
  # stop here; otherwise filename would be undefined below
  sys.exit("ERROR: Required filename not provided")
print(filename)

data = read_without_preamble(filename)
chunks = split_into_chunks(data)

Stepwise demo

chunks is a dictionary in which the key is G (either 0 or 1):

In [415]: type(chunks)
Out[415]: dict

In [416]: for k in chunks.keys(): print(k)
0
1

The value of dictionary chunks is another dictionary in which the key is S (0, 1, or 2 in this example) and the value is a NumPy array containing the numeric data for Gi.Sn. You can access this chunk of data like this: chunks[i][n], where indices i and n are the values of G and S, respectively.

In [417]: type(chunks[0])
Out[417]: dict

In [418]: for k in chunks[0].keys(): print(k)
0
1
2

In [419]: type(chunks[1][2])
Out[419]: numpy.ndarray

In [420]: chunks[1][2]
Out[420]: 
array([[ -1.08851000e+01,   2.53058000e-05],
       [ -1.08715000e+01,   2.55353000e-05],
       [ -1.08579000e+01,   2.57745000e-05],
       [ -1.08443000e+01,   2.60225000e-05],
       [ -1.08306000e+01,   2.62617000e-05],
       [ -1.08170000e+01,   2.65097000e-05],
       [ -1.08034000e+01,   2.67666000e-05]])

chunks[i][n].shape[1] is 2 for any i and n, but chunks[i][n].shape[0] can take any value, i.e. the number of rows of numeric data may vary from one chunk to another.
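Since the goal is to plot the data, here is a minimal sketch of how the nested dictionary could be walked and each chunk split into its x and y columns. The `chunks` literal is a hand-built stand-in that reuses a few rows from the sample file, and `xy_columns` is a hypothetical helper, not part of the code above.

```python
import numpy as np

# Hand-built stand-in for the result of split_into_chunks (assumed layout:
# chunks[G][S] is an array of shape (n_rows, 2)).
chunks = {
    0: {0: np.array([[-10.8851, 0.127435e-03],
                     [-10.8715, 0.127829e-03],
                     [-10.8579, 0.128191e-03]])},
    1: {2: np.array([[-10.8851, 0.253058e-04],
                     [-10.8715, 0.255353e-04]])},
}

def xy_columns(chunks):
    """Yield (label, x, y) for every chunk; label looks like 'G0.S0'."""
    for G in sorted(chunks):
        for S in sorted(chunks[G]):
            data = chunks[G][S]
            # column 0 holds x, column 1 holds y
            yield "G{}.S{}".format(G, S), data[:, 0], data[:, 1]

curves = list(xy_columns(chunks))
for label, x, y in curves:
    print(label, x.shape, y.shape)
```

Each (label, x, y) triple could then be handed to matplotlib, for example with plt.plot(x, y, label=label).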

formatted_file.txt

This is the file I used in the sample run. It consists of six chunks, namely G0.S0, G0.S1, G0.S2, G1.S0, G1.S1, and G1.S2.

@some 
@garbage
@lines
@target G0.S0
@type xy
 -0.108851E+02  0.127435E-03
 -0.108715E+02  0.127829E-03
 -0.108579E+02  0.128191E-03
 -0.108443E+02  0.128502E-03
 -0.108306E+02  0.128726E-03
 -0.108170E+02  0.128838E-03
 -0.108034E+02  0.128751E-03
&
@target G0.S1
@type xy
 -0.108851E+02  0.472694E-04
 -0.108715E+02  0.474233E-04
 -0.108579E+02  0.475837E-04
 -0.108443E+02  0.477448E-04
 -0.108306E+02  0.479052E-04
 -0.108170E+02  0.480669E-04
 -0.108034E+02  0.482279E-04
&
@target G0.S2
@type xy
 -0.108851E+02  0.253654E-04
 -0.108715E+02  0.255956E-04
 -0.108579E+02  0.258346E-04
 -0.108443E+02  0.260825E-04
 -0.108306E+02  0.263303E-04
 -0.108170E+02  0.265781E-04
 -0.108034E+02  0.268349E-04
&
@target G1.S0
@type xy
 -0.108851E+02  0.108786E-03
 -0.108715E+02  0.109216E-03
 -0.108579E+02  0.109651E-03
 -0.108443E+02  0.110116E-03
 -0.108306E+02  0.110552E-03
 -0.108170E+02  0.111011E-03
 -0.108034E+02  0.111489E-03
&
@target G1.S1
@type xy
 -0.108851E+02  0.278045E-04
 -0.108715E+02  0.278711E-04
 -0.108579E+02  0.279384E-04
 -0.108443E+02  0.280050E-04
 -0.108306E+02  0.280723E-04
 -0.108170E+02  0.281395E-04
 -0.108034E+02  0.282074E-04
&
@target G1.S2
@type xy
 -0.108851E+02  0.253058E-04
 -0.108715E+02  0.255353E-04
 -0.108579E+02  0.257745E-04
 -0.108443E+02  0.260225E-04
 -0.108306E+02  0.262617E-04
 -0.108170E+02  0.265097E-04
 -0.108034E+02  0.267666E-04
&
Tonechas answered Nov 15 '22 04:11



Here is an approach using a generator and np.genfromtxt. Its advantage is that it is light on memory: it filters the file on the fly, so the raw text never has to be loaded into memory as a whole before processing.
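As a rough illustration of what filtering on the fly means, independent of the NumPy-based code in this answer, the sketch below streams the file one line at a time and yields each chunk as soon as it is complete. It uses only the standard library; `iter_chunks` is a hypothetical name, and the inline sample is a shortened copy of the data from the question.

```python
import io

def iter_chunks(lines):
    """Yield ((G, S), rows) one chunk at a time from an iterable of lines."""
    key, rows = None, []
    for line in lines:
        line = line.strip()
        if line.startswith("@target"):
            if key is not None:
                yield key, rows  # previous chunk had no closing '&'
            # '@target G0.S1' -> (0, 1)
            g, s = line.split()[-1].split(".")
            key, rows = (int(g[1:]), int(s[1:])), []
        elif line == "&":
            if key is not None:
                yield key, rows
            key, rows = None, []
        elif key is not None and line and not line.startswith("@"):
            # a data row inside a chunk; preamble lines are skipped
            # because no '@target' has been seen yet (key is None)
            rows.append(tuple(float(v) for v in line.split()))
    if key is not None:
        yield key, rows  # the last chunk may lack the final '&'

sample = io.StringIO("""@some
@garbage
@target G0.S0
@type xy
 -0.108847E+02  0.489034E-04
 -0.108711E+02  0.491023E-04
&
@target G0.S1
@type xy
 -0.108847E+02  0.315559E-04
""")

result = list(iter_chunks(sample))
for key, rows in result:
    print(key, len(rows))
```

Only the rows of the chunk currently being read are held in memory until it is yielded; an open file object can be passed in place of the `io.StringIO` sample.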

UPDATE:

I streamlined the code and changed the output format to an array of arrays. If, for example, G ranges over 0...3 and S ranges over 0...5, it creates a 4x6 object array whose cells contain the data arrays.

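import numpy as np
from itertools import dropwhile, groupby
from operator import itemgetter

def load_chunks(f):
    f = open(f, 'rt') if isinstance(f, str) else f
    # drop blank lines and the '&' terminators
    f = filter(lambda l: l.strip() not in ("", "&"), f)
    tok = "@target", "@type"
    # group consecutive lines by "is not a header line"; a leading group
    # of non-header lines (the preamble) is dropped
    fg = dropwhile(itemgetter(0), groupby(f, lambda l: l.split()[0] not in tok))
    I, D = [], []
    for k, g in fg:
        # the first line of a header group is '@target Gx.Sy'
        info = next(l.split() for l in g)[1]
        I.append([int(key[1:]) for key in info.split('.')])
        # the following group holds the numeric rows of this chunk
        D.append(np.genfromtxt((l.encode() for l in next(fg)[1])))
    G, S = np.array(I).T
    res = np.empty((np.max(G)+1, np.max(S)+1), dtype=object)
    # assign cell by cell; a bulk res[G, S] = D can fail when chunks share a shape
    for g, s, d in zip(G, S, D):
        res[g, s] = d
    return res

fn = "your_file_name"  # placeholder: replace with the path to your data file

ara = load_chunks(fn)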
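An object array of this kind can be used much like a 2-D table of arrays. The sketch below builds a small one by hand (the 2x3 shape and the values are made up for illustration, and `present` is a hypothetical name) and shows one way to skip G/S combinations that do not occur in the file, which np.empty with dtype=object leaves as None.

```python
import numpy as np

# Stand-in for the output of load_chunks: a 2x3 object array whose cells
# hold (n_rows, 2) arrays. Shape and values are made up for illustration.
ara = np.empty((2, 3), dtype=object)   # object cells start out as None
ara[0, 0] = np.array([[-10.8851, 1.27435e-04],
                      [-10.8715, 1.27829e-04]])
ara[1, 2] = np.array([[-10.8851, 2.53058e-05],
                      [-10.8715, 2.55353e-05],
                      [-10.8579, 2.57745e-05]])

# Collect the chunks that are present, keyed by their (G, S) indices.
present = {}
for g in range(ara.shape[0]):
    for s in range(ara.shape[1]):
        data = ara[g, s]
        if data is None:
            continue  # this G/S combination does not occur in the file
        present[(g, s)] = data

for (g, s), data in present.items():
    print("G{}.S{}: {} rows".format(g, s, data.shape[0]))
```

Here ara[1, 2][:, 0] gives the x column and ara[1, 2][:, 1] the y column of G1.S2, ready to be passed to a plotting call.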
Paul Panzer answered Nov 15 '22 05:11
