
Python: Create pandas dataframe with columns based on unique values in nested list

I have a nested list containing the regions for each sample. I would like to build a dataframe in which each row (sample) marks the presence or absence of each region (column). For example, the data might look like this:

region_list = [['North America'], ['North America', 'South America'], ['Asia'], ['North America', 'Asia', 'Australia']]

And the end dataframe would look something like this:

North America    South America     Asia     Australia
1                0                 0        0
1                1                 0        0
0                0                 1        0
1                0                 1        1

I could probably work out a way using nested loops and appends, but is there a more Pythonic way to do this? Perhaps with numpy.where?

Flow Nuwen asked Dec 13 '22 22:12


1 Answer

pandas
str.get_dummies

pd.Series(region_list).str.join('|').str.get_dummies()

   Asia  Australia  North America  South America
0     0          0              1              0
1     0          0              1              1
2     1          0              0              0
3     1          1              1              0
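
For reference, here is a self-contained sketch of the same one-liner. It relies on '|' being the default separator of str.get_dummies, so it assumes no region name contains that character. The columns come back sorted alphabetically; the reindexing step at the end is an addition of this sketch (the names `order` and `result` are illustrative) for when the first-appearance order from the question is wanted.

```python
import pandas as pd

region_list = [
    ['North America'],
    ['North America', 'South America'],
    ['Asia'],
    ['North America', 'Asia', 'Australia'],
]

# Join each sample's regions into one '|'-delimited string, then let
# str.get_dummies split on '|' (its default separator) into indicator columns.
dummies = pd.Series(region_list).str.join('|').str.get_dummies()

# Columns are sorted alphabetically. To recover first-appearance order,
# collect the regions in the order they occur (dict keys preserve insertion order).
order = list(dict.fromkeys(r for sample in region_list for r in sample))
result = dummies[order]

print(result)
```

If a region name could contain '|', a different delimiter can be passed to both str.join and str.get_dummies.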

numpy
np.bincount with pd.factorize

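import numpy as np
import pandas as pd

n = len(region_list)
# row index of every region once the nested list is flattened
i = np.arange(n).repeat([len(x) for x in region_list])
# f: integer code per region, u: unique regions in order of first appearance
f, u = pd.factorize(np.concatenate(region_list))
m = u.size

# i * m + f is the position of each (row, column) pair in a flattened n-by-m matrix
pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m),
    columns=u
)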

   North America  South America  Asia  Australia
0              1              0     0          0
1              1              1     0          0
2              0              0     1          0
3              1              0     1          1
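
To see why this works, it may help to print the intermediate arrays. The sketch below repeats the same steps on the question's data; the names `flat`, `counts` and `presence` are introduced here for illustration only. Note that np.bincount counts occurrences, so a region listed twice in one sample would yield 2 rather than 1; the final comparison is an optional extra step, not part of the answer above, that forces strict presence/absence.

```python
import numpy as np
import pandas as pd

region_list = [
    ['North America'],
    ['North America', 'South America'],
    ['Asia'],
    ['North America', 'Asia', 'Australia'],
]

n = len(region_list)

# Row index of each region after flattening: one entry per region.
i = np.arange(n).repeat([len(x) for x in region_list])
print(i)     # [0 1 1 2 3 3 3]

# f holds an integer code per flattened region, u the unique regions
# in order of first appearance.
f, u = pd.factorize(np.concatenate(region_list))
m = u.size
print(f)     # [0 0 1 2 0 2 3]

# Cell (row, col) of an n-by-m matrix sits at flat position row * m + col.
flat = i * m + f
print(flat)  # [ 0  4  5 10 12 14 15]

# Count how often each flat position occurs, then fold back into n rows.
counts = np.bincount(flat, minlength=n * m).reshape(n, m)

# Optional: clip to 0/1 in case a sample lists the same region more than once.
presence = (counts > 0).astype(int)

print(pd.DataFrame(presence, columns=u))
```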

Timing

%timeit pd.Series(region_list).str.join('|').str.get_dummies()
1000 loops, best of 3: 1.42 ms per loop

%%timeit
n = len(region_list)
i = np.arange(n).repeat([len(x) for x in region_list])
f, u = pd.factorize(np.concatenate(region_list))
m = u.size

pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m),
    columns=u
)
1000 loops, best of 3: 204 µs per loop
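
The %timeit magics only work inside IPython or Jupyter. Outside of those, a rough equivalent can be sketched with the standard-library timeit module. The wrapper functions `with_get_dummies` and `with_bincount` and the 1,000-sample input `big` are assumptions made for this sketch, not part of the measurements above; actual numbers will depend on the machine, the library versions, and the input size.

```python
import timeit

import numpy as np
import pandas as pd

region_list = [
    ['North America'],
    ['North America', 'South America'],
    ['Asia'],
    ['North America', 'Asia', 'Australia'],
]


def with_get_dummies(regions):
    # pandas approach: join on '|' and split into indicator columns
    return pd.Series(regions).str.join('|').str.get_dummies()


def with_bincount(regions):
    # numpy approach: count flat (row, column) positions
    n = len(regions)
    i = np.arange(n).repeat([len(x) for x in regions])
    f, u = pd.factorize(np.concatenate(regions))
    m = u.size
    return pd.DataFrame(
        np.bincount(i * m + f, minlength=n * m).reshape(n, m),
        columns=u,
    )


# Repeat the sample data to get a larger input; the size is arbitrary.
big = region_list * 250

runs = 10
t_dummies = timeit.timeit(lambda: with_get_dummies(big), number=runs)
t_bincount = timeit.timeit(lambda: with_bincount(big), number=runs)

print(f"str.get_dummies: {t_dummies / runs * 1e3:.3f} ms per call")
print(f"np.bincount:     {t_bincount / runs * 1e3:.3f} ms per call")
```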
piRSquared answered Dec 16 '22 11:12