I have a dataframe with numerous columns (≈30) from an external source (a CSV file), but several of them have no value or always the same one. I would therefore like to quickly see the value_counts for each column. How can I do that?
For example, given

   id  temp  name
1  34  null  mark
2  22  null  mark
3  34  null  mark

it would return an object with the counts of each value in each column, so I would know that temp is irrelevant (always null) and name is not interesting (always the same).
For the dataframe,

import pandas as pd

df = pd.DataFrame(data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']], columns=['id', 'temp', 'name'], index=[1, 2, 3])

the following code

for c in df.columns:
    print("---- %s ---" % c)
    print(df[c].value_counts())
will produce the following result:
---- id ---
34 2
22 1
dtype: int64
---- temp ---
null 3
dtype: int64
---- name ---
mark 3
dtype: int64
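If you prefer an object you can inspect programmatically rather than printed output, a minimal variant of the loop above (the dictionary name `counts` is my own choice) collects the per-column counts into a dict keyed by column name:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']],
    columns=['id', 'temp', 'name'],
    index=[1, 2, 3],
)

# One value_counts Series per column, keyed by column name
counts = {c: df[c].value_counts() for c in df.columns}

# A column is "uninteresting" when it has only one distinct value
boring = [c for c, s in counts.items() if len(s) == 1]
print(boring)
```

Columns whose counts Series has length 1 (here temp and name) are exactly the ones the question wants to flag.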
A nice way to do this and return a nicely formatted Series is to combine pandas.Series.value_counts and pandas.DataFrame.stack.
For the DataFrame

import pandas

df = pandas.DataFrame(data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']], columns=['id', 'temp', 'name'], index=[1, 2, 3])
You can do something like
df.apply(lambda x: x.value_counts()).T.stack()
In this code, df.apply(lambda x: x.value_counts()) applies value_counts to every column and assembles the results into a DataFrame, so you end up with a DataFrame with the same columns and one row per distinct value found in any column (and a lot of NaN entries for each value that doesn't appear in a given column). After that, T transposes the DataFrame (so you end up with a DataFrame whose index is the original columns and whose columns are the possible values), and stack turns the columns of that DataFrame into a new level of the MultiIndex and drops all the NaN values, making the whole thing a Series.
The result of this is
id 22 1
34 2
temp null 3
name mark 3
dtype: float64
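Because the apply step introduces NaN for values missing from a column, the stacked counts come back as floats. Since stack drops every NaN, a cast back to integers is safe afterwards; a small sketch (the variable name `counts` is mine):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']],
    columns=['id', 'temp', 'name'],
    index=[1, 2, 3],
)

# Per-column value counts -> transpose -> stack into a MultiIndex Series,
# then cast the NaN-free result back to integer counts
counts = df.apply(lambda x: x.value_counts()).T.stack().astype(int)
print(counts)
```

The resulting Series is indexed by (column, value) pairs, so e.g. `counts.loc[('temp', 'null')]` looks up a single count directly.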