Pandas substring DataFrame column

I have a pandas DataFrame, with a column called positions, that includes string values with the syntax of the following examples:

[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]
[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]
[{'y': 54, 'x': 9}, {'y': 78, 'x': 3}]

I want to create four new columns in my pandas DataFrame, y_start, x_start, y_end, x_end, that are extractions of only the numbers.

E.g. for the example of the first row, my new columns would have the following values:

y_start = 49
x_start = 44
y_end = 78
x_end = 31

To summarise, I am looking to extract just the first, second, third, and fourth occurrences of numbers in each string and save them to individual columns.

Edd Webster asked Jan 25 '23 18:01


1 Answer

  • The first step is to convert the strings back to lists of dicts, which can be done with ast.literal_eval.
  • Split the lists into separate columns with the pandas.DataFrame constructor, because it's faster than using .apply(pd.Series).
    • Pandas split column of lists into multiple columns
  • Convert the dicts in each column to separate columns per key using pandas.json_normalize, .rename the columns, and join them with pandas.concat.
  • Splitting dictionary/list inside a Pandas Column into Separate Columns doesn't quite answer the question, but it's similar.
  • If the data is being loaded from a csv, use the converters parameter of pandas.read_csv so the strings are parsed while the file is read.
    • df = pd.read_csv('data.csv', converters={'str_column': literal_eval})
import pandas as pd
from ast import literal_eval

# dataframe
data = {'data': ["[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]", "[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]", "[{'y': 54, 'x': 9}, {'y': 78, 'x': 3}]"]}

df = pd.DataFrame(data)

# convert the strings in the data column to lists of dicts
df['data'] = df['data'].apply(literal_eval)

# split each list into separate columns, one dict per column
df[['start', 'end']] = pd.DataFrame(df['data'].tolist(), index=df.index)

# use json_normalize to expand the dicts into one column per key,
# add a suffix to the column names, and join the dataframes with concat
cleaned = pd.concat(
    [
        pd.json_normalize(df['start']).rename(lambda x: f'{x}_start', axis=1),
        pd.json_normalize(df['end']).rename(lambda x: f'{x}_end', axis=1),
    ],
    axis=1,
)

# display(cleaned)
   y_start  x_start  y_end  x_end
0       49       44     78     31
1        1       63      0     23
2       54        9     78      3
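
The csv tip from the list above can be sketched roughly as follows. This is a minimal illustration, not a tested recipe for any particular file: the in-memory csv_text stands in for a real file (an assumption made so the example is self-contained), and the column name positions is taken from the question.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical csv content standing in for a real file on disk;
# the 'positions' column name comes from the question.
csv_text = '''positions
"[{'y': 49, 'x': 44}, {'y': 78, 'x': 31}]"
"[{'y': 1, 'x': 63}, {'y': 0, 'x': 23}]"
'''

# converters maps a column name to a function applied to each raw string,
# so every cell is parsed into a list of dicts while the file is read
df_csv = pd.read_csv(io.StringIO(csv_text), converters={'positions': literal_eval})

# each cell is now a Python list, not a string
first = df_csv.loc[0, 'positions']
```

With the column parsed at load time, the separate .apply(literal_eval) step should no longer be needed, and the rest of the approach can be applied to df_csv['positions'] unchanged.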
Trenton McKinney answered Feb 04 '23 19:02