
Want to plot Pandas Dataframe as Multiple Histograms with log10 scale x-axis

I have floating point data in a Pandas dataframe. Each column represents a variable (they have string names) and each row a set of values (the rows have integer names which are not important).

>>> print data
0      kppawr23    kppaspyd
1      3.312387   13.266040
2      2.775202    0.100000
3    100.000000  100.000000
4    100.000000   39.437420
5     17.017150   33.019040
...

I want to plot a histogram for each column. The best result I have achieved is with the hist method of dataframe:

data.hist(bins=20)

but I want the x-axis of each histogram to be on a log10 scale. The bins should be on a log10 scale too, but that is easy enough with bins=np.logspace(-2,2,20).

A workaround might be to log10 transform the data before plotting, but the approaches I have tried,

data.apply(math.log10)

and

data.apply(lambda x: math.log10(x))

fail with a TypeError:

    "cannot convert the series to {0}".format(str(converter)))
TypeError: ("cannot convert the series to <type 'float'>", u'occurred at index kppawr23')
Simon Woodward asked Apr 28 '15 21:04


1 Answer

You could use

ax.set_xscale('log')

data.hist() returns an array of axes. You'll need to call ax.set_xscale('log') on each axes object, ax, to make each of them logarithmically scaled.


For example,

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(2015)

N = 100
arr = np.random.random((N,2)) * np.logspace(-2,2,N)[:, np.newaxis]
data = pd.DataFrame(arr, columns=['kppawr23', 'kppaspyd'])

bins = np.logspace(-2,2,20)
axs = data.hist(bins=bins)
for ax in axs.ravel():
    ax.set_xscale('log')

plt.gcf().tight_layout()
plt.show()

yields a figure with one histogram per column, each with a log-scaled x-axis.
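To make the bin layout concrete, here is a small sketch that uses NumPy alone. It assumes the sample values from the question and uses np.histogram in place of plotting; the variable names are illustrative.

```python
import numpy as np

# 20 edges from 10**-2 to 10**2 give 19 bins of equal width on a log10 axis.
bins = np.logspace(-2, 2, 20)

# Sample values copied from the kppaspyd column shown in the question.
values = np.array([13.266040, 0.100000, 100.000000, 39.437420, 33.019040])

# np.histogram counts the values that fall into each bin.
counts, edges = np.histogram(values, bins=bins)

# Consecutive edges differ by a constant factor, which is why the bars look
# equally wide once the x-axis is switched to a log scale.
ratios = edges[1:] / edges[:-1]
print(len(edges), len(counts))
print(ratios.min(), ratios.max())
```

On a linear x-axis the same bins would look increasingly wide toward the right, which is why the bins and the axis scale are usually changed together.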


By the way, to take the log of every value in the DataFrame, data, you could use

logdata = np.log10(data)

This works because NumPy ufuncs (such as np.log10) can be applied to pandas DataFrames; they operate elementwise on all the values in the DataFrame.
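As a quick check of that behaviour, the sketch below builds a small DataFrame from the rows shown in the question (the integer index is an assumption made for illustration) and applies np.log10 to it directly.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; the integer index is illustrative.
data = pd.DataFrame(
    {"kppawr23": [3.312387, 2.775202, 100.0, 100.0, 17.017150],
     "kppaspyd": [13.266040, 0.1, 100.0, 39.437420, 33.019040]},
    index=[1, 2, 3, 4, 5],
)

# The ufunc is applied to every element; the result is still a DataFrame
# with the same index and column labels.
logdata = np.log10(data)
print(type(logdata).__name__)
print(logdata.loc[3, "kppawr23"])   # log10 of 100.0
```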

data.apply(math.log10) did not work because apply tries to pass an entire column (a Series) of values to math.log10. math.log10 expects a scalar value only.

data.apply(lambda x: math.log10(x)) fails for the same reason that data.apply(math.log10) does. Moreover, if data.apply(func) and data.apply(lambda x: func(x)) were both viable options, the first should be preferred since the lambda function would just make the call a tad slower.
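The failure can be reproduced in a few lines. This is a minimal sketch with a made-up three-row DataFrame; record_type is a hypothetical helper that only logs what apply passes in.

```python
import math
import pandas as pd

data = pd.DataFrame({"kppawr23": [3.312387, 2.775202, 100.0],
                     "kppaspyd": [13.266040, 0.1, 100.0]})

# Record what apply actually hands to the function: one Series per column.
seen = []
def record_type(col):
    seen.append(type(col).__name__)
    return col

data.apply(record_type)
print(seen)

# math.log10 needs a single float, so converting a multi-row Series fails.
try:
    data.apply(math.log10)
    failed = False
except TypeError as exc:
    failed = True
    print("TypeError:", exc)
```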

You could use data.apply(np.log10), again since the NumPy ufunc np.log10 can be applied to Series, but there is no reason to bother doing this when np.log10(data) works.

You could also use data.applymap(math.log10), since applymap calls math.log10 on each value in data one at a time. But this would be far slower than calling the equivalent NumPy function, np.log10, on the entire DataFrame. Still, it is worth knowing about applymap in case you need to call some custom function which is not a ufunc. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)
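A sketch of the element-wise route, for the case where the function really is scalar-only. clipped_log10 is a hypothetical custom function invented for this example, and the hasattr check is an assumption that lets the snippet run on both older pandas (applymap) and pandas 2.1 or later (DataFrame.map).

```python
import math
import numpy as np
import pandas as pd

data = pd.DataFrame({"kppawr23": [3.312387, 2.775202, 100.0],
                     "kppaspyd": [13.266040, 0.1, 100.0]})

# Assumption: pick whichever element-wise method this pandas version offers.
# DataFrame.map replaces applymap in pandas 2.1 and later.
elementwise = data.map if hasattr(data, "map") else data.applymap

def clipped_log10(x, floor=0.01):
    """Hypothetical custom function: log10 with values clamped at `floor`."""
    return math.log10(max(x, floor))

slow = elementwise(clipped_log10)          # one Python call per value
fast = np.log10(data.clip(lower=0.01))     # vectorized version, for comparison
print(np.allclose(slow, fast))
```

The vectorized line is there only to verify the result; when such an equivalent exists, it is the one to prefer.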

unutbu answered Oct 08 '22 01:10