I have a list of integers and want to get the frequency of each integer. This was discussed here.
The problem is that the approach I'm using gives me frequencies for floating-point values even though my data set consists of integers only. Why does that happen, and how can I get the frequency of each integer in my data?
I'm using pyplot.hist to plot a histogram of the frequency of occurrences:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('data.txt', dtype=int, usecols=(4,))  # load the 5th column of the file into an array named data
plt.hist(data)  # plot the column as a histogram
I get the histogram, but I've noticed that if I print the result of np.histogram(data):
hist = np.histogram(data)
print(hist)
I get this:
(array([ 2323, 16338, 1587, 212, 26, 14, 3, 2, 2, 2]),
array([ 1. , 2.8, 4.6, 6.4, 8.2, 10. , 11.8, 13.6, 15.4,
17.2, 19. ]))
where the second array holds the values and the first array holds the number of occurrences.
All values in my data set are integers, so why does the second array contain floating-point numbers, and how do I get the frequency of each integer?
UPDATE:
This solves the problem, thank you Lev for the reply.
plt.hist(data, bins=np.arange(data.min(), data.max()+2))  # +2: arange excludes its stop value, so the largest integer gets its own bin
To avoid creating a new question: how can I plot the columns "in the middle" for each integer? Say, I want the column for integer 3 to span 2.5 to 3.5 rather than 3 to 4.
The easiest way to count the occurrences of a given item in a Python list is the list's .count() method. It takes a single argument, the item to look for, and returns how many times that item appears in the list.
Python code:
def word_count(str):
    counts = dict()
    words = str.split()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

print(word_count('the quick brown fox jumps over the lazy dog. '))
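The same counting idea applies directly to a list of integers. Below is a minimal sketch on a made-up list (the values are illustrative only), comparing list.count() for one item with collections.Counter from the standard library for all items at once:

```python
from collections import Counter

numbers = [3, 5, 3, 7, 5, 3]  # made-up sample, not the asker's data

# list.count() scans the whole list once per call and counts one item.
threes = numbers.count(3)

# Counter tallies every distinct item in a single pass over the list.
freq = Counter(numbers)

print(threes)
print(freq)
```

Counter is usually the better fit when the frequency of every value is wanted, since calling .count() once per distinct value rescans the list each time.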
If you don't specify which bins to use, np.histogram and pyplot.hist fall back to a default of 10 equal-width bins. The left border of the first bin is the smallest value and the right border of the last bin is the largest.
This is why the bin borders are floating-point numbers. You can use the bins keyword argument to enforce another choice of bins, e.g.:
plt.hist(data, bins=np.arange(data.min(), data.max()+2))  # +2: arange excludes its stop value, so the largest integer gets its own bin
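To see where the floating-point borders come from, here is a small sketch on a made-up integer array (chosen only to span 1 to 19, like the output in the question). With no bins argument, the edges should coincide with 11 evenly spaced points between the minimum and the maximum:

```python
import numpy as np

# Made-up integer sample spanning 1..19; not the asker's data.
data = np.array([1, 2, 2, 3, 3, 3, 5, 8, 13, 19])

# Default behaviour: 10 equal-width bins between data.min() and data.max().
counts, edges = np.histogram(data)

# 10 bins need 11 edges, evenly spaced over the data range.
expected_edges = np.linspace(data.min(), data.max(), 11)

print(edges)
print(np.allclose(edges, expected_edges))
```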
Edit: the easiest way to shift all bins to the left is probably just to subtract 0.5 from all bin borders:
plt.hist(data, bins=np.arange(data.min(), data.max()+2)-0.5)  # +2 so the last edge is max+0.5 and the largest value is not dropped
Another way to achieve the same effect (not equivalent if non-integers are present):
plt.hist(data, bins=np.arange(data.min(), data.max()+2), align='left')  # +2 so the largest integer gets its own bin
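One way to check that such bins really count each integer separately is to compare np.histogram with a direct tally from np.unique. This is a sketch on a made-up array. Note the upper limit data.max() + 2: np.arange excludes its stop value, so this puts one edge past the largest integer.

```python
import numpy as np

data = np.array([1, 1, 2, 2, 2, 4, 5, 5])  # made-up sample

# One unit-wide bin per integer k, spanning k - 0.5 to k + 0.5.
edges = np.arange(data.min(), data.max() + 2) - 0.5
counts, _ = np.histogram(data, bins=edges)

# Map each integer in the range to the count of its bin.
hist_freq = dict(zip(range(int(data.min()), int(data.max()) + 1), counts.tolist()))

# Direct tally of the distinct values for comparison.
values, freq = np.unique(data, return_counts=True)

print(hist_freq)
print(dict(zip(values.tolist(), freq.tolist())))
```

Integers that never occur (3 in this sample) show up in the histogram with a count of zero, while np.unique simply omits them.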
(Late to the party, just thought I would add a seaborn implementation.)
seaborn.__version__ was 0.9.0 at the time of writing.
Load the libraries and set up mock data.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = np.array([3]*10 + [5]*20 + [7]*5 + [9]*27 + [11]*2)
seaborn.distplot: using specified bins, calculated as per the above question.
sns.distplot(data, bins=np.arange(data.min(), data.max()+2), kde=False, hist_kws={"align": "left"})  # +2 so the largest integer gets its own bin
plt.show()
numpy built-in binning methods: I used the doane binning method below, which produced integer bins. It might be worth trying out the standard binning methods from numpy.histogram_bin_edges, as this is how matplotlib.pyplot.hist() bins the data.
sns.distplot(data,bins="doane",kde=False,hist_kws={"align" : "left"})
plt.show()
Produces the below Histogram:
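To inspect which edges a named rule picks before plotting, the edges can be computed directly. This is a minimal sketch using the same mock data as above and np.histogram_bin_edges (available in NumPy 1.15 and later); whether the edges land on integers depends on the data, so it is worth printing them rather than assuming:

```python
import numpy as np

# Same mock data as in the seaborn example above.
data = np.array([3]*10 + [5]*20 + [7]*5 + [9]*27 + [11]*2)

# Ask NumPy which bin edges the "doane" rule would choose for this data.
edges = np.histogram_bin_edges(data, bins="doane")

# Count the data with exactly those edges.
counts, _ = np.histogram(data, bins=edges)

print(edges)
print(counts)
```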