I am trying to split a file formatted as:
@some
@garbage
@lines
@target G0.S0
@type xy
-0.108847E+02 0.489034E-04
-0.108711E+02 0.491023E-04
-0.108574E+02 0.493062E-04
-0.108438E+02 0.495075E-04
-0.108302E+02 0.497094E-04
....Unknown line numbers...
&
@target G0.S1
@type xy
-0.108847E+02 0.315559E-04
-0.108711E+02 0.316844E-04
-0.108574E+02 0.318134E-04
....Unknown line numbers...
&
@target G1.S0
@type xy
-0.108847E+02 0.350450E-04
-0.108711E+02 0.351669E-04
-0.108574E+02 0.352908E-04
&
@target G1.S1
@type xy
-0.108847E+02 0.216396E-04
-0.108711E+02 0.217122E-04
-0.108574E+02 0.217843E-04
-0.108438E+02 0.218622E-04
The @target Gx.Sy combination is unique, and each set of data is always terminated by &.
I have managed to split the file into chunks as:
#!/usr/bin/env python3
import os
import sys
import itertools as it
import numpy as np
import matplotlib.pyplot as plt

try:
    filename = sys.argv[1]
    print(filename)
except IndexError:
    print("ERROR: Required filename not provided")

with open(filename, "r") as f:
    for line in f:
        if line.startswith("@target"):
            print(line.split()[-1].split("."))

x = []; y = []
with open(filename, "r") as f:
    for key, group in it.groupby(f, lambda line: line.startswith('@target')):
        print(key)
        if not key:
            group = list(group)
            group.pop(0)
            # group.pop(-1)
            print(group)
            for i in range(len(group)):
                x.append(group[i].split()[0])
                y.append(group[i].split()[1])
nx = np.array(x)
ny = np.array(y)
I have two problems:
1) The preamble lines before the real data are also grouped, so the script does not work if there is any preamble, and it is impossible to predict how many preamble lines there will be. I would like to start grouping only after the first @target line.
2) I want to name the arrays as G0[S0, S1] and G1[S0, S1], but I can't work out how to do this.
Kindly help.
UPDATE: I am trying to store the data in nested NumPy arrays of the form G0[S0, S1, ...], G1[S0, S1, ...] so that I can use them in matplotlib, roughly as in the sketch below.
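This is only an illustration of the intended usage, not working code; the nested container chunks here is hypothetical, and all I need is that chunks[G][S] gives me a two-column array:

import matplotlib.pyplot as plt

# hypothetical nested structure: chunks = [G0, G1, ...], G0 = [S0, S1, ...]
for G, sets in enumerate(chunks):
    for S, xy in enumerate(sets):
        plt.plot(xy[:, 0], xy[:, 1], label="G{}.S{}".format(G, S))
plt.legend()
plt.show()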
The functions below get the job done:
import numpy as np
from collections import defaultdict

def read_without_preamble(filename):
    # Skip everything before the first @target line
    with open(filename, 'r') as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith('@target'):
            return lines[i:]

def split_into_chunks(lines):
    chunks = defaultdict(dict)
    for line in lines:
        if line.startswith('@target'):
            # e.g. "@target G0.S1" -> G = 0, S = 1
            GS_str = line.strip().split()[-1].split('.')
            G, S = map(lambda x: int(x[1:]), GS_str)
            chunks[G][S] = []
        elif line.startswith('@type xy'):
            pass
        elif line.startswith('&'):
            # End of chunk: convert the collected rows to a NumPy array
            chunks[G][S] = np.asarray(chunks[G][S])
        else:
            xy_str = line.strip().split()
            chunks[G][S].append([float(value) for value in xy_str])
    return chunks
To split your file into chunks you just need to run this code:
import sys

try:
    filename = sys.argv[1]
    print(filename)
except IndexError:
    sys.exit("ERROR: Required filename not provided")

data = read_without_preamble(filename)
chunks = split_into_chunks(data)
chunks is a dictionary in which the key is G (either 0 or 1):
In [415]: type(chunks)
Out[415]: dict
In [416]: for k in chunks.keys(): print(k)
0
1
The value of dictionary chunks is another dictionary in which the key is S (0, 1, or 2 in this example) and the value is a NumPy array containing the numeric data for Gi.Sn. You can access this chunk of data as chunks[i][n], where indices i and n are the values of G and S, respectively.
In [417]: type(chunks[0])
Out[417]: dict
In [418]: for k in chunks[0].keys(): print(k)
0
1
2
In [419]: type(chunks[1][2])
Out[419]: numpy.ndarray
In [420]: chunks[1][2]
Out[420]:
array([[ -1.08851000e+01, 2.53058000e-05],
[ -1.08715000e+01, 2.55353000e-05],
[ -1.08579000e+01, 2.57745000e-05],
[ -1.08443000e+01, 2.60225000e-05],
[ -1.08306000e+01, 2.62617000e-05],
[ -1.08170000e+01, 2.65097000e-05],
[ -1.08034000e+01, 2.67666000e-05]])
chunks[i][n].shape[1] is 2 for any i and n, but chunks[i][n].shape[0] can take any value, i.e. the number of rows of numeric data may vary from one chunk to another.
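Since the goal is to feed these chunks to matplotlib, here is a minimal sketch of how the dictionary returned by split_into_chunks could be plotted (the axis labels are placeholders, not something the parsing code produces):

import matplotlib.pyplot as plt

# plot every G.S chunk as a separate labelled line
for G, series in chunks.items():
    for S, xy in series.items():
        plt.plot(xy[:, 0], xy[:, 1], label="G{}.S{}".format(G, S))
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()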
This is the file I used in the sample run. It consists of six chunks, namely G0.S0, G0.S1, G0.S2, G1.S0, G1.S1, and G1.S2.
@some
@garbage
@lines
@target G0.S0
@type xy
-0.108851E+02 0.127435E-03
-0.108715E+02 0.127829E-03
-0.108579E+02 0.128191E-03
-0.108443E+02 0.128502E-03
-0.108306E+02 0.128726E-03
-0.108170E+02 0.128838E-03
-0.108034E+02 0.128751E-03
&
@target G0.S1
@type xy
-0.108851E+02 0.472694E-04
-0.108715E+02 0.474233E-04
-0.108579E+02 0.475837E-04
-0.108443E+02 0.477448E-04
-0.108306E+02 0.479052E-04
-0.108170E+02 0.480669E-04
-0.108034E+02 0.482279E-04
&
@target G0.S2
@type xy
-0.108851E+02 0.253654E-04
-0.108715E+02 0.255956E-04
-0.108579E+02 0.258346E-04
-0.108443E+02 0.260825E-04
-0.108306E+02 0.263303E-04
-0.108170E+02 0.265781E-04
-0.108034E+02 0.268349E-04
&
@target G1.S0
@type xy
-0.108851E+02 0.108786E-03
-0.108715E+02 0.109216E-03
-0.108579E+02 0.109651E-03
-0.108443E+02 0.110116E-03
-0.108306E+02 0.110552E-03
-0.108170E+02 0.111011E-03
-0.108034E+02 0.111489E-03
&
@target G1.S1
@type xy
-0.108851E+02 0.278045E-04
-0.108715E+02 0.278711E-04
-0.108579E+02 0.279384E-04
-0.108443E+02 0.280050E-04
-0.108306E+02 0.280723E-04
-0.108170E+02 0.281395E-04
-0.108034E+02 0.282074E-04
&
@target G1.S2
@type xy
-0.108851E+02 0.253058E-04
-0.108715E+02 0.255353E-04
-0.108579E+02 0.257745E-04
-0.108443E+02 0.260225E-04
-0.108306E+02 0.262617E-04
-0.108170E+02 0.265097E-04
-0.108034E+02 0.267666E-04
&
Here is an approach using a generator and np.genfromtxt. Advantage: light on memory. It filters the file on the fly, hence it does not require loading the entire thing into memory for processing.
UPDATE: I streamlined the code and changed the output format to an array of arrays. If, for example, G ranges over 0...3 and S ranges over 0...5, then it creates a 4x6 array containing arrays.
import numpy as np
from itertools import dropwhile, groupby
from operator import itemgetter

def load_chunks(f):
    f = open(f, 'rt') if isinstance(f, str) else f
    # drop blank lines and the '&' terminators
    f = filter(lambda l: l.strip() not in ("", "&"), f)
    tok = "@target", "@type"
    # skip the preamble, then group lines into alternating header / data blocks
    fg = dropwhile(itemgetter(0), groupby(f, lambda l: l.split()[0] not in tok))
    I, D = [], []
    for k, g in fg:
        # header block: extract G and S from "@target Gx.Sy"
        info = next(l.split() for l in g)[1]
        I.append([int(key[1:]) for key in info.split('.')])
        # data block: parse the numeric rows that follow
        D.append(np.genfromtxt((l.encode() for l in next(fg)[1])))
    G, S = np.array(I).T
    # object array indexed as res[G, S], each cell holding an N x 2 array
    res = np.empty((np.max(G)+1, np.max(S)+1), dtype=object)
    res[G, S] = D
    return res
fn = <your_file_name>
ara = load_chunks(fn)
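For reference, a quick inspection of the result for the sample file shown above (these shapes are specific to that file and will differ for other inputs):

# for the sample file above: G in 0..1, S in 0..2, 7 rows per chunk
print(ara.shape)         # (2, 3)
print(ara[1, 2].shape)   # (7, 2)
print(ara[1, 2][0])      # first (x, y) row of chunk G1.S2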