I have a pandas DataFrame of durations in seconds. The durations range from seconds to months, so plotting a histogram of the log of the values is convenient because it covers the range better. Here is some sample code:
%matplotlib inline
import numpy as np
import pandas as pd
x=np.random.lognormal(mean=10, sigma=1, size=10000)
df=pd.DataFrame(x, range(10000), columns=['timeLength'])
np.log10(df.timeLength).hist()
However, the x-axis labels are then the log values. Is there a way to show them as 10^1 and so on? Even better, could I label them as 1 second, 10 seconds, 1 minute, 10 minutes, 1 hour, 1 day, and so on?
Instead of taking the log of the values (np.log10(df.timeLength)), try creating non-uniform bins when computing the histogram. This can be done with the bins argument of np.histogram.
Based on the request for labels such as 1 second, 10 seconds, 1 minute, 10 minutes, 1 hour, 1 day and so on, the following bin array could be created:
# Bin locations (time in seconds)
bins = np.array([0, 1, 10, 60, 60*10, 60*60, 24*60*60])
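To make the effect of these edges concrete, here is a minimal sketch on a small hand-made sample (the `sample` values are made up for illustration and are not from the question). It shows the raw count per bin and how `density=True` divides each count by the bin width, which matters when the bins are as uneven as these.

```python
import numpy as np

# Bin edges in seconds, as above
bins = np.array([0, 1, 10, 60, 60*10, 60*60, 24*60*60])

# Hypothetical durations in seconds, chosen so every bin gets at least one value
sample = np.array([0.5, 3, 7, 45, 90, 300, 1200, 5000])

# Raw counts per bin; every bin is half-open [lo, hi) except the last, which is closed
counts, edges = np.histogram(sample, bins=bins)
print(counts)

# Fraction of observations in each bin (sums to 1)
fractions = counts / counts.sum()
print(fractions)

# density=True returns counts / (n * bin_width), so wide bins get very small values
density, _ = np.histogram(sample, bins=bins, density=True)
print(density)
```

If the goal is "share of observations per bin" rather than a probability density, `fractions` is probably the quantity to plot.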
The original dataset was changed so that it fills more of the bins (mean=5, sigma=2 instead of mean=10, sigma=1); this is for illustration only. Below, the non-uniform bins are defined, the histogram is computed, and the plot is drawn. The bins are only an example and may be altered.
# Create random data in DataFrame
x = np.random.lognormal(mean=5, sigma=2, size=10000)
df = pd.DataFrame(x, columns=['timeLength'])
print(df.describe())
print()
# Create non-uniform bins. Unit in seconds.
bins = np.array([0, 1, 10, 60, 60*10, 60*60, 24*60*60])
print('histogram bins:', bins)
# Get histogram of random data (density=True replaces the removed normed=True)
y, x = np.histogram(df['timeLength'], bins=bins, density=True)
# Drop the first edge so each bar is indexed by the upper edge of its bin
x = x[1:]
# Turn into pandas Series
hist = pd.Series(y, x)
# Plot
ax = hist.plot(kind='bar', width=1, alpha=0.5, align='center')
ax.set_title('Non-Uniform Bin Histogram')
ax.set_xlabel('Time Length')
# One label per bin, named after the upper edge of that bin
ax.set_xticklabels(['1 s', '10 s', '1 min', '10 min', '1 hr', '1 day'], rotation='horizontal')
timeLength
count 10000.000000
mean 1014.865417
std 4751.820312
min 0.062893
25% 36.941388
50% 144.081235
75% 556.223797
max 237838.467337
histogram bins: [ 0 1 10 60 600 3600 86400]
Please advise if this is not the intended result.
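For the other half of the question (keeping a logarithmic axis but with readable labels), a possible sketch is to bin on log-spaced edges, switch the x-axis to a log scale, and place ticks at round durations. The number of bins (40), the tick positions, and the `Agg` backend are assumptions made for this example, not something taken from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"timeLength": rng.lognormal(mean=10, sigma=1, size=10000)})

# Log-spaced bin edges covering the whole data range (whole decades)
lo = np.floor(np.log10(df["timeLength"].min()))
hi = np.ceil(np.log10(df["timeLength"].max()))
edges = np.logspace(lo, hi, 40)

fig, ax = plt.subplots()
counts, _, _ = ax.hist(df["timeLength"], bins=edges)

# Set the scale first: changing the scale resets tick locators and formatters
ax.set_xscale("log")

# Tick positions in seconds mapped to human-readable labels
ticks = {1: "1 s", 10: "10 s", 60: "1 min", 600: "10 min",
         3600: "1 hr", 86400: "1 day", 604800: "1 week"}
ax.set_xticks(list(ticks.keys()))
ax.set_xticklabels(list(ticks.values()), rotation=45)
ax.minorticks_off()

# Ticks outside the data range are simply not drawn
ax.set_xlim(edges[0], edges[-1])
ax.set_xlabel("Time Length")
fig.savefig("time_length_hist.png")
```

With this approach the bars keep equal visual width on the log axis, while the labels read as durations instead of exponents.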
If you want to use custom bins, you may want to combine pd.cut with .groupby().count() and use a bar chart:
x=np.random.lognormal(mean=10, sigma=1, size=10000)
df=pd.DataFrame(x, range(10000), columns=['timeLength'])
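# np.inf as the last edge keeps the bins increasing even if no value exceeds one day
df['bin'] = pd.cut(df.timeLength, include_lowest=True, bins=[0, 1, 10, 60, 60**2, 60**2*24, np.inf], labels=['1s', '10s', '1min', '1hr', '1d', '>1d'])
# observed=False keeps empty bins in the chart
df.groupby('bin', observed=False).count().plot.bar()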
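To check which label each duration receives, a small sketch on hand-picked values may help (the six `durations` are made up for illustration, and the open-ended `np.inf` last edge is an assumption of this example). Each label names the upper edge of its interval, so a value of 120 seconds falls under '1hr' because the bin spans 60 to 3600 seconds.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in seconds, one per bin
durations = pd.Series([0.5, 5, 30, 120, 7200, 200000], name="timeLength")

edges = [0, 1, 10, 60, 60**2, 60**2 * 24, np.inf]
labels = ["1s", "10s", "1min", "1hr", "1d", ">1d"]

# Intervals are right-closed: (0, 1], (1, 10], (10, 60], ...
binned = pd.cut(durations, bins=edges, labels=labels, include_lowest=True)
print(pd.DataFrame({"timeLength": durations, "bin": binned}))

# Count per bin, kept in bin order rather than sorted by frequency
counts = binned.value_counts(sort=False)
print(counts)
```

If a '10min' bin is wanted as well, adding 600 to `edges` and a matching entry to `labels` should be enough.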