
Reading large formatted text file with NumPy

I've volunteered to help someone convert a finite element mesh from one format to another (i-deas *.unv to Alberta). I've used NumPy to do some additional shaping of the mesh, but I'm having problems reading the raw text file data into NumPy arrays. I've tried genfromtxt and loadtxt with no success so far.

Some details:

1) All groups are delimited by the header and footer flag " -1" on its own line.

2) The NODE group has a header " 2411" on its own line. I only want to read alternate lines from this group: skipping each line with 4 integers, but reading each line with 3 Fortran double-precision numbers.

3) The ELEMENT connectivity group has a header " 2412" on its own line. All data are integers, and only the first 4 columns need to be read. There will be some empty slots in the NumPy array due to missing values for 2- and 3-node elements.

4) The " 2477" node groups I think I can deal with myself using regular expressions that find which lines to read.

5) The real data file will have about 1 million lines of text, so I'm very keen for it to be vectorized if possible (or whatever NumPy does to read stuff quickly).

Sorry if I've given too much information, and thanks.

The lines below are a sample of parts of the *.unv text file format.

    -1
  2411
  146303         1         1        11
  6.9849462399269246D-001  8.0008842847097805D-002  6.6360238055630028D-001
  146304         1         1        11
  4.1854795755893875D-001  9.1256034628308313D-001  3.5725496189239300D-002
  146305         1         1        11
  7.5541258490349616D-001  3.7870257739063029D-001  2.0504544370783115D-001
  146306         1         1        11
  2.7637569971086767D-001  9.2829777518336010D-001  1.3757239038663285D-001
   -1
   -1
 2412
     9        21         1         0         7         2
     0         0         0
     1         9
    10        21         1         0         7         2
     0         0         0
     9        10
  1550        91         6         0         7         3
   761      3685      2027
  1551        91         6         0         7         3
   761      2380      2067
 39720       111         1         0         7         4
 71854     59536     40323     73014
 39721       111         1         0         7         4
 45520     48908    133818    145014
   -1
   -1
   2477
     1         0         0         0         0         0         0      3022
PERMANENT GROUP1
     7         2         0         0         7         3         0         0
     7         8         0         0         7         7         0         0
     7       147         0         0         7       148         0         0
     2         0         0         0         0         0         0      2915
PERMANENT GROUP2
     7         1         0         0         7         5         0         0
     7         4         0         0         7         6         0         0
     7         9         0         0         7        11         0         0
   -1
asked Feb 18 '13 by Tim

1 Answer

The NumPy methods genfromtxt and loadtxt would be rather difficult to apply to the whole file, as your data has quite a special structure (which changes depending on which group you are in). Therefore, I'd suggest the following strategy:

  • Read the file line by line and try to determine which group you are in by analysing the line.

  • If you are in a group which contains only a little data (and where, for example, you have to read alternating lines, so you can't read continuously), read it line by line and process the lines.

  • When you reach a section with a lot of data (like the one with the "real data"), use NumPy's fromfile method to read it in, like this:

    mydata = np.fromfile(fp, sep=" ", dtype=int, count=number_of_elements)
    mydata.shape = (100000, 3)    # Reshape it to the desired shape as fromfile
                                  # returns a 1D array.
    

This way, you combine the flexibility of line-by-line processing with the ability to quickly read and convert large chunks of data.
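
For illustration, here is a minimal sketch of such a dispatch loop for the " 2411" node group from the question (the file name test.dat is my assumption, not part of the answer; note that the Fortran "D" exponents have to be turned into "E" before Python can parse them):

import numpy as np

coords = []
with open("test.dat") as fp:
    line = fp.readline()
    while line:
        if line.strip() == "2411":              # entered the node group
            rec = fp.readline()                 # node record (4 integers)
            while rec and rec.strip() != "-1":  # until the group footer
                xyz = fp.readline()             # coordinate line follows
                coords.append([float(v.replace("D", "E"))  # Fortran D -> E
                               for v in xyz.split()])
                rec = fp.readline()
        line = fp.readline()

nodes = np.array(coords)                        # shape (n_nodes, 3)

The same skeleton can branch on "2412" or "2477" and hand the big chunks over to fromfile as shown above.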

UPDATE: The point is that you open the file, read it line by line, and when you arrive at a place with a big chunk of data, you pass the open file object to fromfile.

Below is a simplified example:

import numpy as np

fp = open("test.dat", "r")
line = fp.readline()                                     # first line gives the number of values
ndata = int(line.strip())
data = np.fromfile(fp, count=ndata, sep=" ", dtype=int)  # read that many integers
fp.close()

That would read the data from a file test.dat with content like:

10
1 2 3 4 5
6 7 8 9 10

The first line is read explicitly with fp.readline() and processed (the number of integers to be read is determined), and then np.fromfile() reads the appropriate chunk of data and stores it in the 1D array data.
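
For the ragged " 2412" connectivity data from the question (2-, 3- and 4-node elements), the rows collected line by line can be padded into one rectangular array. A small sketch; the helper name and the fill value -1 are my choices, not from the answer:

import numpy as np

def pad_connectivity(rows, fill=-1):
    # pad ragged connectivity lists into a rectangular int array
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), fill, dtype=int)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

# e.g. pad_connectivity([[1, 9], [761, 3685, 2027], [71854, 59536, 40323, 73014]])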

UPDATE 2: Alternatively, you could read the entire text into a buffer, then determine the start and end positions of the large chunk of data and convert it directly via np.fromstring:

fp = open("test.dat", "r")
txt = fp.read()
fp.close()
# Now determine starting and end positions (startpos, endpos)
# ..
# then pass that portion of the text to the fromstring function.
data = np.fromstring(txt[startpos:endpos], dtype=int, sep=" ")
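
One illustrative way to find those positions for, say, the " 2412" group is a plain string search (my assumption, not part of the original recipe; the exact leading whitespace of the header line depends on the file):

header = "\n 2412\n"
startpos = txt.index(header) + len(header)  # just past the group header line
endpos = txt.index("-1", startpos)          # first "-1" after the header is the footer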

Or, if it is easy to formulate as one regular expression, you could use np.fromregex() directly on the file; a rough sketch follows.
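
A minimal fromregex sketch (my illustration, not part of the original answer): it captures the coordinate lines of the " 2411" block as strings, since the Fortran "D" exponents cannot be parsed as floats directly, and converts them afterwards. The pattern assumes unsigned coordinates, as in the sample; a real mesh may need an optional sign.

import re
import numpy as np

# match lines of exactly three Fortran doubles like "6.98...D-001"
pat = re.compile(r"^\s*([\d.]+D[+-]\d+)\s+([\d.]+D[+-]\d+)\s+([\d.]+D[+-]\d+)\s*$",
                 re.MULTILINE)
with open("test.dat") as fh:
    raw = np.fromregex(fh, pat, dtype=[("x", "U24"), ("y", "U24"), ("z", "U24")])
# convert the captured strings: Fortran "D" exponent -> Python "E"
coords = np.array([[float(v.replace("D", "E")) for v in row]
                   for row in raw.tolist()])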

answered Sep 27 '22 by Bálint Aradi