I have a dict of lists (of variable lengths), and I am looking for an efficient way to create a DataFrame from it.
Assume I know the minimum list length, so I can truncate the longer lists while creating the DataFrame.
Here is my dummy code:
data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
min_length = 3
I can have a dictionary with 10k or 20k keys, so I'm looking for an efficient way to create a DataFrame like the one below:
>>> df
a b c
0 1 1 2
1 2 2 45
2 3 3 67
You can filter the values of the dict in a dict comprehension, and then DataFrame works perfectly:
print ({k:v[:min_length] for k,v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}
df = pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()})
print (df)
a b c
0 1 1 2
1 2 2 45
2 3 3 67
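If min_length isn't known up front, it can also be derived from the dict itself; a minimal sketch, assuming every value is a list:
# truncate all columns to the length of the shortest list
min_length = min(len(v) for v in data_dict.values())
df = pd.DataFrame({k: v[:min_length] for k, v in data_dict.items()})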
If it's possible that some lists are shorter than min_length, wrap each value in a Series:
data_dict = {'a': [1,2,3,4], 'b': [1,2], 'c': [2,45,67,93,82,92]}
min_length = 3
df = pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()})
print (df)
a b c
0 1 1.0 2
1 2 2.0 45
2 3 NaN 67
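Note that column b is upcast to float64 because NaN is a float. If you need to keep integers, here is a minimal sketch, assuming a pandas version that supports the nullable Int64 dtype:
# missing positions become <NA> instead of forcing the column to float
df = pd.DataFrame({k: pd.Series(v[:min_length], dtype='Int64') for k, v in data_dict.items()})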
Timings:
In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop
In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop
#Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop
Code for timings:
import numpy as np
import pandas as pd

np.random.seed(123)
L = list('ABCDEFGH')
N = 500000
min_length = 10000
# 8 keys, each mapped to an integer array of random length up to N
data_dict = {k:np.random.randint(10, size=np.random.randint(N)) for k in L}
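For context on the last timing: from_dict(orient='index') first builds a wide intermediate frame (one row per key, padded with NaN out to the longest list, here potentially ~500k columns), then transposes it and drops every row containing NaN, which is why it's orders of magnitude slower. A small sketch on the dummy dict:
data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
wide = pd.DataFrame.from_dict(data_dict, orient='index')  # 3 rows x 6 columns, shorter lists padded with NaN
df = wide.T.dropna()  # transpose, then keep only the rows without NaN (rows 0-2)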