We can use the list() function to convert tuple to list in Python. After writing the above code, Ones you will print ” my_tuple ” then the output will appear as a “ [10, 20, 30, 40, 50] ”. Here, the list() function will convert the tuple to the list.
Use DataFrame. To convert pandas DataFrame to Dictionary object, use to_dict() method, this takes orient as dict by default which returns the DataFrame in format {column -> {index -> value}} . When no orient is specified, to_dict() returns in this format.
How about:
subset = data_set[['data_date', 'data_1', 'data_2']]
tuples = [tuple(x) for x in subset.to_numpy()]
for pandas < 0.24 use
tuples = [tuple(x) for x in subset.values]
list(data_set.itertuples(index=False))
As of 17.1, the above will return a list of namedtuples.
If you want a list of ordinary tuples, pass name=None
as an argument:
list(data_set.itertuples(index=False, name=None))
A generic way:
[tuple(x) for x in data_set.to_records(index=False)]
Motivation
Many data sets are large enough that we need to concern ourselves with speed/efficiency. So I offer this solution in that spirit. It happens to also be succinct.
For the sake of comparison, let's drop the index
column
df = data_set.drop('index', 1)
Solution
I'll propose the use of zip
and map
list(zip(*map(df.get, df)))
[('2012-02-17', 24.75, 25.03),
('2012-02-16', 25.0, 25.07),
('2012-02-15', 24.99, 25.15),
('2012-02-14', 24.68, 25.05),
('2012-02-13', 24.62, 24.77),
('2012-02-10', 24.38, 24.61)]
It happens to also be flexible if we wanted to deal with a specific subset of columns. We'll assume the columns we've already displayed are the subset we want.
list(zip(*map(df.get, ['data_date', 'data_1', 'data_2'])))
[('2012-02-17', 24.75, 25.03),
('2012-02-16', 25.0, 25.07),
('2012-02-15', 24.99, 25.15),
('2012-02-14', 24.68, 25.05),
('2012-02-13', 24.62, 24.77),
('2012-02-10', 24.38, 24.61)]
Turn's out records
is quickest followed by asymptotically converging zipmap
and iter_tuples
I'll use a library simple_benchmarks
that I got from this post
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()
import pandas as pd
import numpy as np
def tuple_comp(df): return [tuple(x) for x in df.to_numpy()]
def iter_namedtuples(df): return list(df.itertuples(index=False))
def iter_tuples(df): return list(df.itertuples(index=False, name=None))
def records(df): return df.to_records(index=False).tolist()
def zipmap(df): return list(zip(*map(df.get, df)))
funcs = [tuple_comp, iter_namedtuples, iter_tuples, records, zipmap]
for func in funcs:
b.add_function()(func)
def creator(n):
return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})
@b.add_arguments('Rows in DataFrame')
def argument_provider():
for n in (10 ** (np.arange(4, 11) / 2)).astype(int):
yield n, creator(n)
r = b.run()
Check the results
r.to_pandas_dataframe().pipe(lambda d: d.div(d.min(1), 0))
tuple_comp iter_namedtuples iter_tuples records zipmap
100 2.905662 6.626308 3.450741 1.469471 1.000000
316 4.612692 4.814433 2.375874 1.096352 1.000000
1000 6.513121 4.106426 1.958293 1.000000 1.316303
3162 8.446138 4.082161 1.808339 1.000000 1.533605
10000 8.424483 3.621461 1.651831 1.000000 1.558592
31622 7.813803 3.386592 1.586483 1.000000 1.515478
100000 7.050572 3.162426 1.499977 1.000000 1.480131
r.plot()
Here's a vectorized approach (assuming the dataframe, data_set
to be defined as df
instead) that returns a list
of tuples
as shown:
>>> df.set_index(['data_date'])[['data_1', 'data_2']].to_records().tolist()
produces:
[(datetime.datetime(2012, 2, 17, 0, 0), 24.75, 25.03),
(datetime.datetime(2012, 2, 16, 0, 0), 25.0, 25.07),
(datetime.datetime(2012, 2, 15, 0, 0), 24.99, 25.15),
(datetime.datetime(2012, 2, 14, 0, 0), 24.68, 25.05),
(datetime.datetime(2012, 2, 13, 0, 0), 24.62, 24.77),
(datetime.datetime(2012, 2, 10, 0, 0), 24.38, 24.61)]
The idea of setting datetime column as the index axis is to aid in the conversion of the Timestamp
value to it's corresponding datetime.datetime
format equivalent by making use of the convert_datetime64
argument in DF.to_records
which does so for a DateTimeIndex
dataframe.
This returns a recarray
which could be then made to return a list
using .tolist
More generalized solution depending on the use case would be:
df.to_records().tolist() # Supply index=False to exclude index
The most efficient and easy way:
list(data_set.to_records())
You can filter the columns you need before this call.
This answer doesn't add any answers that aren't already discussed, but here are some speed results. I think this should resolve questions that came up in the comments. All of these look like they are O(n), based on these three values.
TL;DR: tuples = list(df.itertuples(index=False, name=None))
and tuples = list(zip(*[df[c].values.tolist() for c in df]))
are tied for the fastest.
I did a quick speed test on results for three suggestions here:
tuples = list(zip(*[df[c].values.tolist() for c in df]))
tuples = [tuple(x) for x in df.values]
name=None
suggestion from @Axel: tuples = list(df.itertuples(index=False, name=None))
from numpy import random
import pandas as pd
def create_random_df(n):
return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})
Small size:
df = create_random_df(10000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
1.66 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
15.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.74 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Larger:
df = create_random_df(1000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
202 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.52 s ± 98.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
209 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As much patience as I have:
df = create_random_df(10000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))
Gives:
1.78 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.68 s ± 96.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The zip version and the itertuples version are within the confidence intervals each other. I suspect that they are doing the same thing under the hood.
These speed tests are probably irrelevant though. Pushing the limits of my computer's memory doesn't take a huge amount of time, and you really shouldn't be doing this on a large data set. Working with those tuples after doing this will end up being really inefficient. It's unlikely to be a major bottleneck in your code, so just stick with the version you think is most readable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With