I am trying to split a file formatted as:
@some
@garbage
@lines
@target G0.S0
@type xy
-0.108847E+02 0.489034E-04
-0.108711E+02 0.491023E-04
-0.108574E+02 0.493062E-04
-0.108438E+02 0.495075E-04
-0.108302E+02 0.497094E-04
....Unknown line numbers...
&
@target G0.S1
@type xy
-0.108847E+02 0.315559E-04
-0.108711E+02 0.316844E-04
-0.108574E+02 0.318134E-04
....Unknown line numbers...
&
@target G1.S0
@type xy
-0.108847E+02 0.350450E-04
-0.108711E+02 0.351669E-04
-0.108574E+02 0.352908E-04
&
@target G1.S1
@type xy
-0.108847E+02 0.216396E-04
-0.108711E+02 0.217122E-04
-0.108574E+02 0.217843E-04
-0.108438E+02 0.218622E-04
The @target Gx.Sy combination is unique, and each set of data is always terminated by &.
I have managed to split the file into chunks as:
#!/usr/bin/env python3
import os
import sys
import itertools as it
import numpy as np
import matplotlib.pyplot as plt

try:
    filename = sys.argv[1]
    print(filename)
except IndexError:
    print("ERROR: Required filename not provided")

with open(filename, "r") as f:
    for line in f:
        if line.startswith("@target"):
            print(line.split()[-1].split("."))

x = []; y = []
with open(filename, "r") as f:
    for key, group in it.groupby(f, lambda line: line.startswith('@target')):
        print(key)
        if not key:
            group = list(group)
            group.pop(0)
            # group.pop(-1)
            print(group)
            for i in range(len(group)):
                x.append(group[i].split()[0])
                y.append(group[i].split()[1])
nx = np.array(x)
ny = np.array(y)
I have two problems:
1) The preamble lines before the real data are also grouped, so the script does not work if there is any preamble, and it is impossible to predict how many preamble lines there will be. I would like to start grouping only after the first @target line.
2) I want to name the arrays as G0[S0, S1] and G1[S0, S1], but I can't work out how to do this.
Kindly help.
UPDATE: I am trying to store the data in nested NumPy arrays of the form G0[S0, S1, ...], G1[S0, S1, ...] so that I can use them in matplotlib, roughly as in the sketch below.
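This is only an illustration of the intended usage, not working code; the nested container chunks here is hypothetical, and all I need is that chunks[G][S] gives me a two-column array:

import matplotlib.pyplot as plt

# hypothetical nested structure: chunks = [G0, G1, ...], G0 = [S0, S1, ...]
for G, sets in enumerate(chunks):
    for S, xy in enumerate(sets):
        plt.plot(xy[:, 0], xy[:, 1], label="G{}.S{}".format(G, S))
plt.legend()
plt.show()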
The functions below get the job done:
import numpy as np
from collections import defaultdict

def read_without_preamble(filename):
    # Skip everything before the first @target line
    with open(filename, 'r') as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith('@target'):
            return lines[i:]

def split_into_chunks(lines):
    chunks = defaultdict(dict)
    for line in lines:
        if line.startswith('@target'):
            # e.g. "@target G0.S1" -> G = 0, S = 1
            GS_str = line.strip().split()[-1].split('.')
            G, S = map(lambda x: int(x[1:]), GS_str)
            chunks[G][S] = []
        elif line.startswith('@type xy'):
            pass
        elif line.startswith('&'):
            # End of chunk: convert the collected rows to a NumPy array
            chunks[G][S] = np.asarray(chunks[G][S])
        else:
            xy_str = line.strip().split()
            chunks[G][S].append([float(value) for value in xy_str])
    return chunks
To split your file into chunks you just need to run this code:
import sys

try:
    filename = sys.argv[1]
    print(filename)
except IndexError:
    sys.exit("ERROR: Required filename not provided")

data = read_without_preamble(filename)
chunks = split_into_chunks(data)
chunks is a dictionary in which the key is G (either 0 or 1):
In [415]: type(chunks)
Out[415]: dict
In [416]: for k in chunks.keys(): print(k)
0
1
The value of dictionary chunks is another dictionary in which the key is S (0, 1, or 2 in this example) and the value is a NumPy array containing the numeric data for Gi.Sn. You can access this chunk of data as chunks[i][n], where indices i and n are the values of G and S, respectively.
In [417]: type(chunks[0])
Out[417]: dict
In [418]: for k in chunks[0].keys(): print(k)
0
1
2
In [419]: type(chunks[1][2])
Out[419]: numpy.ndarray
In [420]: chunks[1][2]
Out[420]:
array([[ -1.08851000e+01, 2.53058000e-05],
[ -1.08715000e+01, 2.55353000e-05],
[ -1.08579000e+01, 2.57745000e-05],
[ -1.08443000e+01, 2.60225000e-05],
[ -1.08306000e+01, 2.62617000e-05],
[ -1.08170000e+01, 2.65097000e-05],
[ -1.08034000e+01, 2.67666000e-05]])
chunks[i][n].shape[1] is 2 for any i and n, but chunks[i][n].shape[0] can take any value, i.e. the number of rows of numeric data may vary from one chunk to another.
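Since the goal is to feed these chunks to matplotlib, here is a minimal sketch of how the dictionary returned by split_into_chunks could be plotted (the axis labels are placeholders, not something the parsing code produces):

import matplotlib.pyplot as plt

# plot every G.S chunk as a separate labelled line
for G, series in chunks.items():
    for S, xy in series.items():
        plt.plot(xy[:, 0], xy[:, 1], label="G{}.S{}".format(G, S))
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()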
This is the file I used in the sample run. It consists of six chunks, namely G0.S0, G0.S1, G0.S2, G1.S0, G1.S1, and G1.S2.
@some
@garbage
@lines
@target G0.S0
@type xy
-0.108851E+02 0.127435E-03
-0.108715E+02 0.127829E-03
-0.108579E+02 0.128191E-03
-0.108443E+02 0.128502E-03
-0.108306E+02 0.128726E-03
-0.108170E+02 0.128838E-03
-0.108034E+02 0.128751E-03
&
@target G0.S1
@type xy
-0.108851E+02 0.472694E-04
-0.108715E+02 0.474233E-04
-0.108579E+02 0.475837E-04
-0.108443E+02 0.477448E-04
-0.108306E+02 0.479052E-04
-0.108170E+02 0.480669E-04
-0.108034E+02 0.482279E-04
&
@target G0.S2
@type xy
-0.108851E+02 0.253654E-04
-0.108715E+02 0.255956E-04
-0.108579E+02 0.258346E-04
-0.108443E+02 0.260825E-04
-0.108306E+02 0.263303E-04
-0.108170E+02 0.265781E-04
-0.108034E+02 0.268349E-04
&
@target G1.S0
@type xy
-0.108851E+02 0.108786E-03
-0.108715E+02 0.109216E-03
-0.108579E+02 0.109651E-03
-0.108443E+02 0.110116E-03
-0.108306E+02 0.110552E-03
-0.108170E+02 0.111011E-03
-0.108034E+02 0.111489E-03
&
@target G1.S1
@type xy
-0.108851E+02 0.278045E-04
-0.108715E+02 0.278711E-04
-0.108579E+02 0.279384E-04
-0.108443E+02 0.280050E-04
-0.108306E+02 0.280723E-04
-0.108170E+02 0.281395E-04
-0.108034E+02 0.282074E-04
&
@target G1.S2
@type xy
-0.108851E+02 0.253058E-04
-0.108715E+02 0.255353E-04
-0.108579E+02 0.257745E-04
-0.108443E+02 0.260225E-04
-0.108306E+02 0.262617E-04
-0.108170E+02 0.265097E-04
-0.108034E+02 0.267666E-04
&
Here is an approach using a generator and np.genfromtxt. Advantage: light on memory. It filters the file on the fly, hence it does not require loading the entire thing into memory for processing.
UPDATE: I streamlined the code and changed the output format to an array of arrays. If, for example, G ranges over 0...3 and S ranges over 0...5, then it creates a 4x6 array containing arrays.
import numpy as np
from itertools import dropwhile, groupby
from operator import itemgetter

def load_chunks(f):
    f = open(f, 'rt') if isinstance(f, str) else f
    # drop blank lines and the '&' terminators
    f = filter(lambda l: l.strip() not in ("", "&"), f)
    tok = "@target", "@type"
    # skip the preamble, then group lines into alternating header / data blocks
    fg = dropwhile(itemgetter(0), groupby(f, lambda l: l.split()[0] not in tok))
    I, D = [], []
    for k, g in fg:
        # header block: extract G and S from "@target Gx.Sy"
        info = next(l.split() for l in g)[1]
        I.append([int(key[1:]) for key in info.split('.')])
        # data block: parse the numeric rows that follow
        D.append(np.genfromtxt((l.encode() for l in next(fg)[1])))
    G, S = np.array(I).T
    # object array indexed as res[G, S], each cell holding an N x 2 array
    res = np.empty((np.max(G)+1, np.max(S)+1), dtype=object)
    res[G, S] = D
    return res
fn = <your_file_name>
ara = load_chunks(fn)
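For reference, a quick inspection of the result for the sample file shown above (these shapes are specific to that file and will differ for other inputs):

# for the sample file above: G in 0..1, S in 0..2, 7 rows per chunk
print(ara.shape)         # (2, 3)
print(ara[1, 2].shape)   # (7, 2)
print(ara[1, 2][0])      # first (x, y) row of chunk G1.S2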