I have a number of text files, say 50, that I need to read into a massive dataframe. At the moment, I am using the following steps.
This works pretty well for files around 100 KB in size - a few minutes - but at 50 MB it takes hours, which is not practical.
How can I optimise my code? In particular -
Here is some example code. My own code is a little more complex, as the text files are more complex, such that I have to use about 10 regular expressions and multiple while loops to read the data in and allocate it to the right location in the right array. To keep the MWE simple, I haven't used repeating labels in the input files either, so it looks like I'm reading each file twice for no reason. I hope that makes sense!
import re
import pandas as pd

df = pd.DataFrame()
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
reg_ex = re.compile('^(.+) (.+)\n')

# read all files to determine what indices are available
for path in paths:
    file_obj = open(path, 'r')
    print file_obj.readlines()

['a 1\n', 'b 2\n', 'end']
['c 3\n', 'd 4\n', 'end']

indices = []
for path in paths:
    index = []
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                index += match.group(1)
            except AttributeError:
                pass
    indices.append(index)

# read files again and put data into a master dataframe
for path, index in zip(paths, indices):
    subset_df = pd.DataFrame(index=index, columns=["Number"])
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                subset_df.loc[[match.group(1)]] = match.group(2)
            except AttributeError:
                pass
    df = pd.concat([df, subset_df]).sort_index()

print df

  Number
a      1
b      2
c      3
d      4
My input files:
test1.txt
a 1
b 2
end
test2.txt
c 3
d 4
end
NumPy supports many operations directly on arrays: to add 1 to every element of an array, we can simply write col = col + 1. Once again, this turns out to be faster than doing the same operation on a pandas dataframe column.
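For illustration, a minimal sketch of that idea (the column name and data here are made up for this example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"col": np.arange(1000000)})  # made-up example data

# operate on the underlying numpy array directly
col = df["col"].values
col = col + 1              # vectorised add on the ndarray

# versus the equivalent operation on the pandas column
df["col"] = df["col"] + 1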
I've used this many times, as it's a particularly easy implementation of multiprocessing.
import pandas as pd
from multiprocessing import Pool

def reader(filename):
    return pd.read_excel(filename)

def main():
    pool = Pool(4)  # number of cores you want to use
    file_list = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx', ...]
    df_list = pool.map(reader, file_list)  # creates a list of the loaded df's
    df = pd.concat(df_list)                # concatenates all the df's into a single df

if __name__ == '__main__':
    main()
Using this you should be able to substantially increase the speed of your program without much work at all. If you don't know how many processors you have, you can check by pulling up your shell and typing (on Windows):
echo %NUMBER_OF_PROCESSORS%
EDIT: To make this run even faster, consider converting your files to CSVs and using the pandas function pandas.read_csv.
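If you go that route, the reader above might look roughly like this (just a sketch; the CSV filenames are placeholders, and I've used cpu_count() to pick the pool size automatically):

import pandas as pd
from multiprocessing import Pool, cpu_count

def reader(filename):
    # read_csv is generally much faster than read_excel on the same data
    return pd.read_csv(filename)

def main():
    pool = Pool(cpu_count())                              # one worker per core
    file_list = ['file1.csv', 'file2.csv', 'file3.csv']   # placeholder names
    df_list = pool.map(reader, file_list)
    df = pd.concat(df_list)
    print(df)

if __name__ == '__main__':
    main()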
Before pulling out the multiprocessing hammer, your first step should be to do some profiling. Use cProfile for a quick pass to identify which functions are taking a long time. Unfortunately, if your lines are all in a single function call, they'll just show up as library calls. line_profiler is better, but takes a little more setup time.
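As a sketch, assuming your parsing lives in a top-level function called parse_files() (a hypothetical name), cProfile can be driven like this; you can also run it from the shell with python -m cProfile -s cumtime your_script.py without touching the code:

import cProfile
import pstats

def parse_files():
    # hypothetical entry point wrapping the file-parsing loop
    pass

# dump the profile to a file, then print the 10 slowest calls by cumulative time
cProfile.run('parse_files()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)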
NOTE: If you're using IPython, you can use %timeit (the magic command for the timeit module) and %prun (the magic command for the profile module) both to time your statements and to profile functions. A Google search will turn up some guides.
Pandas is a wonderful library, but I've been an occasional victim of using it poorly with atrocious results. In particular, be wary of append()/concat() operations. That might be your bottleneck but you should profile to be sure. Usually, the numpy.vstack() and numpy.hstack() operations are faster if you don't need to perform index/column alignment. In your case it looks like you might be able to get by with Series or 1-D numpy ndarrays which can save time.
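One low-effort way to sidestep the repeated concat is to collect the parsed values in plain Python lists and build the DataFrame once at the end. A sketch against the MWE above, keeping the same regex (untested against your real files, but the idea should carry over):

import re
import pandas as pd

reg_ex = re.compile('^(.+) (.+)\n')
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]

index, numbers = [], []
for path in paths:
    with open(path, 'r') as file_obj:
        for line in file_obj:
            match = reg_ex.match(line)
            if match:
                index.append(match.group(1))
                numbers.append(match.group(2))

# a single DataFrame construction instead of one concat per file
df = pd.DataFrame({"Number": numbers}, index=index).sort_index()
print(df)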
BTW, a try/except block in Python is much slower, often 10x or more, than checking for an invalid condition up front, so be sure you absolutely need it before sticking it into a loop that runs for every single line. This is probably the other hog of time; I imagine you stuck the try block in to catch the AttributeError when match.group(1) fails on a non-matching line. I would check for a valid match first.
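A minimal, self-contained sketch of that check (the sample lines mirror test1.txt):

import re

reg_ex = re.compile('^(.+) (.+)\n')
index = []

for line in ['a 1\n', 'b 2\n', 'end']:   # sample lines from test1.txt
    match = reg_ex.match(line)
    if match:                             # a failed match returns None
        index.append(match.group(1))
    # no try/except needed: nothing is raised for non-matching lines

print(index)   # ['a', 'b']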
Even these small modifications should be enough for your program to run significantly faster before trying anything drastic like multiprocessing. Those Python libraries are awesome but bring a fresh set of challenges to deal with.