Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Interview Question - Compare Pandas-Joins and Ideally Provide the Fastest Method

While ago I was interviewed for a Data Scientist role. Strangely, without asking about Machine Learning or Data Science or even Statistics, I was given a small task to join two pandas dataframes, and compare various methods for doing so. I was not given a criteria that what the expectation was. I've provided multiple solutions. Surprisingly I was told afterwards none of my solutions meet the performance benchmarks of other solutions they have for this task! Obviously I asked for a feedback or what other method(s) they use for this task that outperform my provided solutions, but so answer; not just yet. I consider myself an intermediate Python programmer and certainly I am not aware of many tricks or best practices, and I have not much paid attention to the performance so far, unless it was super obvious slow. That is why, since the interview, it got me thinking what other ways is to accomplish this task in the fastest way possible.

The problem:

# Randomly generated historical data about how many megabytes were downloaded from the Internet."HoD" is the Hour of the Day!
hist_df = pd.DataFrame(columns=['HoD', 'Volume'])
hist_df['HoD'] = np.random.randint(0, 24, 365 * 24)
hist_df['Volume'] = np.random.uniform(1, 1000, 365 * 24)

# Tariffs based on the hour of the day
tariffs_df = pd.DataFrame({
    'Time range': ['00:00 to 09:00', '09:00 to 18:00', '18:00 to 00:00'],
    'cost': [10, 14, 22]
})

Task: Return the historical dataframe with an additional column “cost” that will show how much money was spent for every hour in the historical data. Basically tariff dataframe need to be merged to the historical data.

MY Solutions: Here is the gist containing the four methods I have provided. I have provided merging based on (slowest to fastest) i) simple iterrows (slowest), ii) pandas_apply, iii) and numpy vectorize, and iv) pandas binning (fastest). The results are (in seconds):

{'naive_iterrows': 5.810565948486328,
 'pandas_apply': 0.6743350028991699,
 'numpy_vec': 0.6381142139434814,
 'pandas_bin-merge': 0.009788990020751953}

enter image description here

Question: What faster methods are out there to achieve this? As I mentioned in the first paragraph, they were not happy with the performance of my solutions.

P.S.: Although this matter is very subjective, but I find it bizarre that they post a Data Scientist role (with many machine learning blah blah requirement), and yet to reject an applicant because of this. I am super glad I came to know what their job requirements and expectations beforehand. Still love learning more about pandas, python best practices, for this particular case if there are any others!

like image 645
TwinPenguins Avatar asked Oct 28 '22 15:10

TwinPenguins


1 Answers

You can change the last method to join as was suggested by Siddharth. Suppose your DataFrame is much larger:

hist_df = pd.DataFrame(columns=['HoD', 'Volume'])
hist_df['HoD'] = np.random.randint(0, 10000, 365 * 10000)
hist_df['Volume'] = np.random.uniform(1, 10000, 365 * 10000)

Let's compare:

%timeit merged_df = pd.merge(hist_df, tariffs_df, how='left', left_on='bin', right_on='Time range')

1 loop, best of 3: 740 ms per loop


%timeit hist = hist_df.set_index('bin')
%timeit tariffs = tariffs_df.set_index('Time range')
%timeit merged_df = hist.join(tariffs)

10 loops, best of 3: 20.1 ms per loop
1000 loops, best of 3: 449 µs per loop
100 loops, best of 3: 3.59 ms per loop

If other packages are allowed, you can also use dask for parallelization.

like image 156
DmytroSytro Avatar answered Nov 15 '22 06:11

DmytroSytro