I have a dict of lists (of variable lengths), and I am looking for an efficient way to create a DataFrame from it.
Assume I know the minimum list length, so I can truncate the longer lists while creating the DataFrame.
Here is my dummy code:
data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
min_length = 3
I can have a dictionary with 10k or 20k keys, so I'm looking for an efficient way to create a DataFrame like the one below:
>>> df
a b c
0 1 1 2
1 2 2 45
2 3 3 67
You can filter the values of the dict in a dict comprehension, and then DataFrame works perfectly:
print ({k:v[:min_length] for k,v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}
df = pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()})
print (df)
a b c
0 1 1 2
1 2 2 45
2 3 3 67
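If min_length isn't known up front, it can also be derived from the dict itself; a minimal sketch, assuming every value is a list:
# truncate all columns to the length of the shortest list
min_length = min(len(v) for v in data_dict.values())
df = pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})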
If it's possible that some lists are shorter than min_length, wrap each value in a Series:
data_dict = {'a': [1,2,3,4], 'b': [1,2], 'c': [2,45,67,93,82,92]}
min_length = 3
df = pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()})
print (df)
a b c
0 1 1.0 2
1 2 2.0 45
2 3 NaN 67
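Note that column b is upcast to float64 because NaN is a float. If you need to keep integers, here is a minimal sketch, assuming a pandas version that supports the nullable Int64 dtype:
# missing positions become <NA> instead of forcing the column to float
df = pd.DataFrame({k: pd.Series(v[:min_length], dtype='Int64') for k, v in data_dict.items()})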
Timings:
In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop
In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop
#Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop
Code for timings:
import numpy as np
import pandas as pd

np.random.seed(123)
L = list('ABCDEFGH')
N = 500000
min_length = 10000
# 8 keys, each mapped to an integer array of random length up to N
data_dict = {k:np.random.randint(10, size=np.random.randint(N)) for k in L}
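For context on the last timing: from_dict(orient='index') first builds a wide intermediate frame (one row per key, padded with NaN out to the longest list, here potentially ~500k columns), then transposes it and drops every row containing NaN, which is why it's orders of magnitude slower. A small sketch on the dummy dict:
data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
wide = pd.DataFrame.from_dict(data_dict, orient='index')  # 3 rows x 6 columns, shorter lists padded with NaN
df = wide.T.dropna()  # transpose, then keep only the rows without NaN (rows 0-2)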