I have a data frame where most of the columns are varchar/object type. Length of the column varies a lot and could be anything within the range of 3 - 1000+ . Now, for each column, I want to measure maximum length.
I know how to calculate maximum length for a col. If its varchar then:
max(df.char_col.apply(len))
and if its number (float8 or int64) then:
max(df.num_col.map(str).apply(len))
But my dataframe has hundreds of column and I want to calculate maximum length for all columns at the same time. The problem for that is, there are different data types, and I dont know how to do all at once.
So Question 1: How to get maximum column length for each columns in the data frame
Now I am trying to do that only for varchar/object type columns using following code:
xx = df.select_dtypes(include = ['object'])
for col in [xx.columns.values]:
maxlength = [max(xx.col.apply(len))]
I selected only object type columns and tried to write a for loop. But its not working. probably using apply() within for loop is not a good idea.
Question 2: How to get maximum length of each column for only object type columns
Sample data frame:
d1 = {'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA'], 'comment1': ['Very good performance', 'N/A', 'Need to work hard', 'No Comment', 'Not satisfactory'], 'comment2': ['good', 'Meets Expectation', 'N', 'N/A', 'Incompetence']}
df1 = pd.DataFrame(data = d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year
To find the maximum string length by column in the given dataframe, first, nchar() function is called to get the length of all the string present in the particular column of the dataframe, and then the max() function must be called to get the maximum value of the length of the string generated by the nchar() function.
Get the number of columns: len(df. columns) The number of columns of pandas. DataFrame can be obtained by applying len() to the columns attribute.
To find the length of strings in a data frame you have the len method on the dataframes str property. But to do this you need to call this method on the column that contains the string data.
One solution is to use numpy.vectorize
. This may be more efficient than pandas
-based solutions.
You can use pd.DataFrame.select_dtypes
to select object
columns.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
'B': ['a', 'abcde', 'abc'],
'C': [1, 2.5, 1.5]})
measurer = np.vectorize(len)
Max length for all columns
res1 = measurer(df.values.astype(str)).max(axis=0)
array([4, 5, 3])
Max length for object columns
res2 = measurer(df.select_dtypes(include=[object]).values.astype(str)).max(axis=0)
array([4, 5])
Or if you need output as a dictionary:
res1 = dict(zip(df, measurer(df.values.astype(str)).max(axis=0)))
{'A': 4, 'B': 5, 'C': 3}
df_object = df.select_dtypes(include=[object])
res2 = dict(zip(df_object, measurer(df_object.values.astype(str)).max(axis=0)))
{'A': 4, 'B': 5}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With