Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get maximum length of each column in the data frame using pandas python

I have a data frame where most of the columns are varchar/object type. Length of the column varies a lot and could be anything within the range of 3 - 1000+ . Now, for each column, I want to measure maximum length.

I know how to calculate maximum length for a col. If its varchar then:

max(df.char_col.apply(len))

and if its number (float8 or int64) then:

max(df.num_col.map(str).apply(len))

But my dataframe has hundreds of column and I want to calculate maximum length for all columns at the same time. The problem for that is, there are different data types, and I dont know how to do all at once.

So Question 1: How to get maximum column length for each columns in the data frame

Now I am trying to do that only for varchar/object type columns using following code:

xx = df.select_dtypes(include = ['object'])
for col in [xx.columns.values]:
   maxlength = [max(xx.col.apply(len))]

I selected only object type columns and tried to write a for loop. But its not working. probably using apply() within for loop is not a good idea.

Question 2: How to get maximum length of each column for only object type columns

Sample data frame:

d1 = {'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA'], 'comment1': ['Very good performance', 'N/A', 'Need to work hard', 'No Comment', 'Not satisfactory'], 'comment2': ['good', 'Meets Expectation', 'N', 'N/A', 'Incompetence']}
df1 = pd.DataFrame(data = d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year
like image 882
singularity2047 Avatar asked May 14 '18 21:05

singularity2047


People also ask

How do you find the maximum length of a column in a data frame?

To find the maximum string length by column in the given dataframe, first, nchar() function is called to get the length of all the string present in the particular column of the dataframe, and then the max() function must be called to get the maximum value of the length of the string generated by the nchar() function.

How can we get the length of each entry of a column Column_name of a DataFrame DF?

Get the number of columns: len(df. columns) The number of columns of pandas. DataFrame can be obtained by applying len() to the columns attribute.

How do I count the length of a string in a column in Pandas?

To find the length of strings in a data frame you have the len method on the dataframes str property. But to do this you need to call this method on the column that contains the string data.


1 Answers

One solution is to use numpy.vectorize. This may be more efficient than pandas-based solutions.

You can use pd.DataFrame.select_dtypes to select object columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                   'B': ['a', 'abcde', 'abc'],
                   'C': [1, 2.5, 1.5]})

measurer = np.vectorize(len)

Max length for all columns

res1 = measurer(df.values.astype(str)).max(axis=0)

array([4, 5, 3])

Max length for object columns

res2 = measurer(df.select_dtypes(include=[object]).values.astype(str)).max(axis=0)

array([4, 5])

Or if you need output as a dictionary:

res1 = dict(zip(df, measurer(df.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5, 'C': 3}

df_object = df.select_dtypes(include=[object])
res2 = dict(zip(df_object, measurer(df_object.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5}
like image 69
jpp Avatar answered Sep 26 '22 09:09

jpp