Increase speed numpy.loadtxt?

I have hundreds of thousands of data text files to read. Right now I import the data from the text files every time I run the code. Perhaps the easy solution would be to reformat the data into a file that is faster to read. In any case, every text file I have looks like:

User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data     
Wavelength  OE1_CHANNEL1    
185.000000  27.291955

186.000000  27.000877

187.000000  25.792290

188.000000  25.205620

189.000000  24.711882

.
.
.

The code where I read and import the txt files is:

# IMPORT DATA
import os
import sys

import numpy as np

path = 'T2'
if len(sys.argv) == 2:
    path = sys.argv[1]

files = os.listdir(path)
trans_import = []
for index, item in enumerate(files):
    trans_import.append(np.loadtxt(os.path.join(path, item),
                                   dtype=float, skiprows=4, usecols=(0, 1)))

In the variable explorer, the resulting array looks like: {ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]

I'm wondering how I could speed up this part? It takes a little too long as of now just to import ~4k text files. There are 841 lines inside every text file (spectrum), but the output I get with this code has 841 * 2 = 1682 rows. Apparently it counts each blank \n as a line...

Asked by Mooder on Feb 25 '26 19:02

1 Answer

It would probably be much faster if you had one large file instead of many small ones, since opening and parsing each file carries per-file overhead that adds up over thousands of files. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file in instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
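A minimal sketch of the caching idea: parse every text file once, stack the spectra into a single array, and save it as one binary .npy file that later runs can load almost instantly. The function name `cache_spectra` and the demo files are illustrative, not from the question; the demo assumes the four-line header shown above.

```python
import os
import tempfile

import numpy as np

def cache_spectra(path, cache_file):
    """Parse every text file once and save one stacked array as .npy."""
    files = sorted(os.listdir(path))
    spectra = [np.loadtxt(os.path.join(path, f), skiprows=4, usecols=(0, 1))
               for f in files]
    data = np.stack(spectra)            # shape: (n_files, n_rows, 2)
    np.save(cache_file, data)           # binary cache, fast to reload
    return data

# Demo with two synthetic files mimicking the question's layout:
tmp = tempfile.mkdtemp()
header = "User: unknown\nTitle : demo\nSample data\nWavelength  CH1\n"
for name in ("a.txt", "b.txt"):
    with open(os.path.join(tmp, name), "w") as f:
        f.write(header + "185.0  27.29\n186.0  27.00\n")

cache = os.path.join(tmp, "spectra.npy")
data = cache_spectra(tmp, cache)
reloaded = np.load(cache)               # subsequent runs skip text parsing
```

On later runs you would only call `np.load`, which memory-maps or reads raw binary rather than parsing text.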

If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
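A sketch of the parallel-read idea, assuming the same file layout. For a self-contained demo this uses `multiprocessing.pool.ThreadPool`, which has the same `map` API as `multiprocessing.Pool`; for CPU-bound parsing you would swap in a process pool. The helper names and demo files are hypothetical.

```python
import os
import tempfile
from multiprocessing.pool import ThreadPool  # same .map API as multiprocessing.Pool

import numpy as np

def load_one(fname):
    # Each worker parses a single spectrum file.
    return np.loadtxt(fname, skiprows=4, usecols=(0, 1))

def load_all_parallel(path, workers=4):
    files = [os.path.join(path, f) for f in sorted(os.listdir(path))]
    with ThreadPool(workers) as pool:
        spectra = pool.map(load_one, files)
    return np.stack(spectra)            # one (n_files, n_rows, 2) array

# Demo with three synthetic files in the question's layout:
tmp = tempfile.mkdtemp()
header = "User: unknown\nTitle : demo\nSample data\nWavelength  CH1\n"
for name in ("a.txt", "b.txt", "c.txt"):
    with open(os.path.join(tmp, name), "w") as f:
        f.write(header + "185.0  27.29\n186.0  27.00\n")

data = load_all_parallel(tmp, workers=2)
```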


Not your primary question but since it seems to be an issue - you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.values.
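A sketch of the pandas route, using an in-memory string standing in for one of the files (`skip_blank_lines=True` is actually the default, so the stray newlines are simply not counted as rows). The column names are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for one text file, blank lines included:
text = """User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength  OE1_CHANNEL1
185.000000  27.291955

186.000000  27.000877

187.000000  25.792290
"""

df = pd.read_csv(io.StringIO(text), sep=r"\s+", skiprows=4, header=None,
                 names=["wavelength", "value"], skip_blank_lines=True)
arr = df.values  # plain numpy.ndarray, blank lines already dropped
```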

Answered by Nathan on Feb 27 '26 08:02