For my application, I need to read multiple files with 15 M lines each, store them in a DataFrame, and save the DataFrame in HDF5 format.
I've already tried different approaches, notably pandas.read_csv with chunksize and dtype specifications, and dask.dataframe. Both take around 90 seconds to process one file, so I'd like to know whether there is a more efficient way to handle these files. Below is some of the code from the tests I've done.
import pandas as pd
import dask.dataframe as dd
import numpy as np
import re

# First approach
store = pd.HDFStore('files_DFs.h5')
chunk_size = 1e6
df_chunk = pd.read_csv(file,
                       sep="\t",
                       chunksize=chunk_size,
                       usecols=['a', 'b'],
                       converters={"a": lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                                   "b": lambda x: np.float32(re.sub(r"[^\d.]", "", x))},
                       skiprows=15)
chunk_list = []
for chunk in df_chunk:
    chunk_list.append(chunk)
df = pd.concat(chunk_list, ignore_index=True)
store[dfname] = df
store.close()
# Second approach
store = pd.HDFStore('files_DFs.h5')  # reopen the store for this alternative
df = dd.read_csv(file,
                 sep="\t",
                 usecols=['a', 'b'],
                 converters={"a": lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                             "b": lambda x: np.float32(re.sub(r"[^\d.]", "", x))},
                 skiprows=15)
store.put(dfname, df.compute())
store.close()
Here is what the files look like (whitespace consists of a literal tab):
a b
599.998413 14.142895
599.998413 20.105534
599.998413 6.553850
599.998474 27.116098
599.998474 13.060312
599.998474 13.766775
599.998596 1.826706
599.998596 18.275938
599.998718 20.797491
599.998718 6.132450)
599.998718 41.646194
599.998779 19.145775
Let's look over the importing options and compare the time taken to read a CSV into memory. The pandas library provides the read_csv() function to import a CSV as a DataFrame that is easy to compute on or analyze; its chunksize parameter lets you import a gigantic file in pieces, which is the best option when you have limited RAM. Alternatively, the Dask library can be used: while reading a large CSV you may hit an out-of-memory error if the file doesn't fit in RAM, and that is where Dask comes into the picture. Note that csv.DictReader is by far the fastest, but it imports everything as strings, while the other methods try to guess the data type of each column and may perform additional validation on import.
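For comparison, here is a minimal sketch of the csv.DictReader route (the file name and the explicit float() casts are my own assumptions; it hands back strings, so every conversion is up to you):
import csv

# Minimal sketch of the csv.DictReader route ('sample.tsv' is a placeholder):
# every field comes back as a string, so numeric conversion is done by hand.
with open('sample.tsv', newline='') as f:
    reader = csv.DictReader(f, delimiter='\t')
    rows = [(float(row['a']), float(row['b'])) for row in reader]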
First, let's answer the question in the title
I suggest you use modin:
import modin.pandas as mpd
import pandas as pd
import numpy as np
frame_data = np.random.randint(0, 10_000_000, size=(15_000_000, 2))
pd.DataFrame(frame_data*0.0001).to_csv('15mil.csv', header=False)
!wc 15mil*.csv ; du -h 15mil*.csv
15000000 15000000 480696661 15mil.csv
459M 15mil.csv
%%timeit -r 3 -n 1 -t
global df1
df1 = pd.read_csv('15mil.csv', header=None)
9.7 s ± 95.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
%%timeit -r 3 -n 1 -t
global df2
df2 = mpd.read_csv('15mil.csv', header=None)
3.07 s ± 685 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
(df2.values == df1.values).all()
True
So, as we can see, modin was approximately 3 times faster on my setup.
Now to answer your specific problem
As people have noted, your bottleneck is probably the converters: you are calling those lambdas 30 million times, and even the function-call overhead becomes non-trivial at that scale.
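As a rough illustration (my own sketch, not part of the original answer), you can time a single converter call and extrapolate to the 30 million calls:
import re
import timeit
import numpy as np

# Rough sketch: measure one converter call, then extrapolate to 30 million calls.
conv = lambda x: np.float32(re.sub(r"[^\d.]", "", x))
per_call = timeit.timeit(lambda: conv("599.998413)"), number=100_000) / 100_000
print(f"roughly {per_call * 30_000_000:.0f} s spent in the converters alone")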
Let's attack this problem.
!sed 's/.\{4\}/&)/g' 15mil.csv > 15mil_dirty.csv
First, I tried using modin with the converters argument; then I tried a different approach that calls the regexp fewer times.
For that, I created a file-like object that filters everything through your regexp:
class FilterFile():
    def __init__(self, file):
        self.file = file
    def read(self, n):
        return re.sub(r"[^\d.,\n]", "", self.file.read(n))
    def write(self, *a): return self.file.write(*a)          # needed to trick pandas
    def __iter__(self, *a): return self.file.__iter__(*a)    # needed
Then we pass it to pandas as the first argument in read_csv:
with open('15mil_dirty.csv') as file:
    df2 = pd.read_csv(FilterFile(file))
%%timeit -r 1 -n 1 -t
global df1
df1 = pd.read_csv('15mil_dirty.csv', header=None,
                  converters={0: lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                              1: lambda x: np.float32(re.sub(r"[^\d.]", "", x))})
2min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -r 1 -n 1 -t
global df2
df2 = mpd.read_csv('15mil_dirty.csv', header=None,
                   converters={0: lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                               1: lambda x: np.float32(re.sub(r"[^\d.]", "", x))})
38.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -r 1 -n 1 -t
global df3
df3 = pd.read_csv(FilterFile(open('15mil_dirty.csv')), header=None)
1min ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Seems like modin wins again! Unfortunately, modin has not implemented reading from buffers yet, so I devised the ultimate approach.
%%timeit -r 1 -n 1 -t
with open('15mil_dirty.csv') as f, open('/dev/shm/tmp_file', 'w') as tmp:
    tmp.write(f.read().translate({ord(i): None for i in '()'}))
df4 = mpd.read_csv('/dev/shm/tmp_file', header=None)
5.68 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
This uses translate, which is considerably faster than re.sub, and it writes to /dev/shm, an in-memory filesystem that Ubuntu (and other Linux distributions) usually provide. Any file written there never goes to disk, so it is fast.
Finally, it uses modin to read the file, working around modin's buffer limitation.
This approach is about 30 times faster than yours, and it is also pretty simple.
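If you want to check the translate-vs-re.sub claim yourself, here is a quick sketch of mine (synthetic dirty string, not the original benchmark):
import re
import timeit

# Quick comparison on a synthetic dirty string (not the original benchmark data).
s = "599.998413)\t14.142895)\n" * 1_000_000
table = {ord(c): None for c in '()'}
print(timeit.timeit(lambda: s.translate(table), number=1))      # str.translate
print(timeit.timeit(lambda: re.sub(r"[()]", "", s), number=1))  # re.sub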
Well, my findings are not so much about pandas as about some common pitfalls.
Your code:
(genel_deneme) ➜ derp time python a.py
python a.py 38.62s user 0.69s system 100% cpu 39.008 total
Replace re.sub(r"[^\d.]", "", x) with a precompiled pattern and use it in your lambdas (see the sketch below).
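A sketch of what that might look like (the column names are taken from the code in the question):
import re
import numpy as np

# Compile the pattern once instead of re-parsing it on every converter call.
clean = re.compile(r"[^\d.]")
converters = {"a": lambda x: np.float32(clean.sub("", x)),
              "b": lambda x: np.float32(clean.sub("", x))}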
Result:
(genel_deneme) ➜ derp time python a.py
python a.py 26.42s user 0.69s system 100% cpu 26.843 total
Replace np.float32 with float and run your code.
My Result:
(genel_deneme) ➜ derp time python a.py
python a.py 14.79s user 0.60s system 102% cpu 15.066 total
Find another way to achieve the result with the floats. More on this issue: https://stackoverflow.com/a/6053175/37491
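One possibility (my own sketch, not spelled out in the linked answer) is to let the converters return plain Python floats and downcast the columns once after reading:
import re
import numpy as np
import pandas as pd

# Sketch: keep the cleanup, return plain floats, then cast once at the end.
# 'file' is the same input-path placeholder used in the question.
clean = re.compile(r"[^\d.]")
df = pd.read_csv(file, sep="\t", skiprows=15, usecols=['a', 'b'],
                 converters={"a": lambda x: float(clean.sub("", x)),
                             "b": lambda x: float(clean.sub("", x))})
df = df.astype(np.float32)  # one vectorized cast instead of 30M np.float32() calls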