
How to create a DataFrame from dict of unequal length lists, and truncating to a specific length?

I have a dict of lists of varying lengths, and I am looking for an efficient way to create a DataFrame from it.

Assume I know the minimum list length, so I can truncate the longer lists while creating the DataFrame.

Here is my dummy code:

data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
min_length = 3

I can have a dictionary of 10k or 20k keys, so I am looking for an efficient way to create a DataFrame like the one below:

>>> df
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
asked May 09 '17 09:05 by John



1 Answer

You can slice each list to min_length in a dict comprehension. The DataFrame constructor then works, because every list has the same length:

import pandas as pd

print ({k:v[:min_length] for k,v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}


df = pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()})
print (df)
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
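
As a self-contained sketch of the same idea, the slicing can be wrapped in a small helper. The name frame_from_truncated and the up-front length check are my own additions for illustration, not part of pandas:

```python
import pandas as pd

def frame_from_truncated(lists_by_key, length):
    """Keep the first `length` items of every list and build a DataFrame.

    Hypothetical helper: raises early if any list is shorter than `length`,
    because the plain DataFrame constructor rejects lists of unequal length.
    """
    too_short = [k for k, v in lists_by_key.items() if len(v) < length]
    if too_short:
        raise ValueError(f"lists shorter than {length}: {too_short}")
    return pd.DataFrame({k: v[:length] for k, v in lists_by_key.items()})

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2, 3], 'c': [2, 45, 67, 93, 82, 92]}
df = frame_from_truncated(data_dict, 3)
print(df)
```

Slicing with v[:length] never raises on a short list, so the explicit check is what turns a silent mismatch into a clear error message.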

If some lists may be shorter than min_length, wrap each slice in a Series so that the missing positions are filled with NaN:

data_dict = {'a': [1,2,3,4], 'b': [1,2], 'c': [2,45,67,93,82,92]}
min_length = 3

df = pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()})
print (df)
   a    b   c
0  1  1.0   2
1  2  2.0  45
2  3  NaN  67
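
Column b shows 1.0 and 2.0 because NaN is a float, so the whole column is upcast to float64. If integer values matter, one option is the nullable integer dtype. This is a sketch that assumes a pandas version that supports "Int64" (0.24 or later, as far as I know):

```python
import pandas as pd

data_dict = {'a': [1, 2, 3, 4], 'b': [1, 2], 'c': [2, 45, 67, 93, 82, 92]}
min_length = 3

# "Int64" (capital I) is the nullable integer dtype; missing slots become <NA>
# instead of NaN, so the column is not converted to float.
df = pd.DataFrame({k: pd.Series(v[:min_length], dtype="Int64")
                   for k, v in data_dict.items()})
print(df)
print(df.dtypes)
```

The extra dtype conversion adds some overhead per key, so it is worth timing on real data before using it with thousands of keys.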

Timings:

In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop

In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop

#Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop

Code for timings:

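import numpy as np
import pandas as pd

np.random.seed(123)
L = list('ABCDEFGH')   # 8 column names
N = 500000             # upper bound for the random array lengths
min_length = 10000

# one random-length array of digits 0-9 per key
data_dict = {k:np.random.randint(10, size=np.random.randint(N)) for k in L}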
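
The %timeit lines above need IPython. A rough equivalent with the standard timeit module could look like the sketch below. The sizes are scaled down so it finishes quickly, and the array lengths are drawn from [min_length, N) rather than [0, N) so that the plain slicing variant cannot hit an array that is too short. Both changes are my assumptions and not part of the original benchmark:

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
keys = list('ABCDEFGH')
N = 5000           # scaled down from 500000 (assumption, for a quick run)
min_length = 100   # scaled down from 10000 (assumption)

# Every array has at least min_length items, unlike the original setup.
data_dict = {k: np.random.randint(10, size=np.random.randint(min_length, N))
             for k in keys}

def slice_only():
    # Plain slicing: requires every array to have at least min_length items.
    return pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})

def with_series():
    # Series wrapping: tolerates shorter arrays at the cost of index alignment.
    return pd.DataFrame({k: pd.Series(v[:min_length]) for k, v in data_dict.items()})

runs = 100
for fn in (slice_only, with_series):
    seconds = timeit.timeit(fn, number=runs) / runs
    print(f"{fn.__name__}: {seconds * 1e6:.0f} µs per call")
```

Absolute numbers will differ by machine and pandas version, so only the relative gap between the two variants is meaningful.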
answered Oct 17 '22 01:10 by jezrael