
How to split a dataframe column into multiple columns with a Pandas converter

Tags:

python

pandas

I have a file with rows like this:

blablabla (CODE1513A15), 9.20, 9.70, 0

I want pandas to read each column, but from the first column I am interested only in the data between brackets, and I want to extract it into additional columns. Therefore, I tried using a Pandas converter:

import pandas as pd
from datetime import datetime
import string

code = 'CODE'
code_parser = lambda x: {
    'date': datetime(int(x.split('(', 1)[1].split(')')[0][len(code):len(code)+2]), string.uppercase.index(x.split('(', 1)[1].split(')')[0][len(code)+4:len(code)+5])+1, int(x.split('(', 1)[1].split(')')[0][len(code)+2:len(code)+4])), 
    'value': float(x.split('(', 1)[1].split(')')[0].split('-')[0][len(code)+5:])
}
column_names = ['first_column', 'second_column', 'third_column', 'fourth_column']
pd.read_csv('myfile.csv', usecols=[0,1,2,3], names=column_names, converters={'first_column': code_parser})

With this code, I can convert the text between brackets to a dict containing a datetime object and a value.

If the code is CODE1513A15 as in the sample, it will be built from:

  • a known code (in this example, 'CODE')
  • two digits for the year
  • two digits for the day of month
  • a letter from A to L for the month (A for January, B for February, ...)
  • a float value
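A minimal sketch of that layout as a standalone parser may make the fields easier to see. The names `CODE_RE` and `parse_code` are hypothetical, and mapping two-digit years to the 2000s is an assumption rather than something stated above:

```python
import re
from datetime import datetime

# Hypothetical pattern for the layout above:
# known prefix, 2-digit year, 2-digit day, month letter A-L, numeric value.
CODE_RE = re.compile(r'\(CODE(\d{2})(\d{2})([A-L])(\d+(?:\.\d+)?)')

def parse_code(text):
    """Return the date and value encoded between the brackets of `text`."""
    match = CODE_RE.search(text)
    if match is None:
        raise ValueError('no code found in %r' % text)
    year, day, month_letter, value = match.groups()
    month = ord(month_letter) - ord('A') + 1  # A -> 1, ..., L -> 12
    # Assumption: two-digit years belong to the 2000s.
    return {'date': datetime(2000 + int(year), month, int(day)),
            'value': float(value)}

print(parse_code('blablabla (CODE1513A15)'))
```

A named function like this is easier to test in isolation than a long lambda, but it still returns a single dict per row, so on its own it does not change how many columns `read_csv` produces.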

I tested the lambda function and it correctly extracts the information I want, and its output is a dict {'date': datetime(15, 1, 13), 'value': 15}. Nevertheless, if I print the result of the pd.read_csv method, the 'first_column' is a dict, while I was expecting it to be replaced by two columns called 'date' and 'value':

                         first_column  second_column  third_column  fourth_column
0   {u'date':13-01-2015, u'value':15}           9.20          9.70              0
1   {u'date':14-01-2015, u'value':16}           9.30          9.80              0
2   {u'date':15-01-2015, u'value':12}           9.40          9.90              0

What I want to get is:

               date  value  second_column  third_column  fourth_column
0        13-01-2015     15           9.20          9.70              0
1        14-01-2015     16           9.30          9.80              0
2        15-01-2015     12           9.40          9.90              0

Note: I don't care how the date is formatted, this is only a representation of what I expect to get.

Any idea?

Roman Rdgz asked Dec 01 '25 15:12


1 Answer

I think it's better to do things step by step.

import pandas as pd

# read the data into a data frame
column_names = ['first_column', 'second_column', 'third_column', 'fourth_column']
df = pd.read_csv('myfile.csv', names=column_names)

# extract the fields with a regular expression, which is much more robust
# than string splitting
tmp = df.first_column.str.extract(r'CODE(\d{2})(\d{2})([A-L])(\d+)')
tmp.columns = ['year', 'day', 'month', 'value']
# map the month letter to its number: 'A' (ord 65) -> '1', ..., 'L' -> '12'
tmp['month'] = tmp['month'].apply(lambda m: str(ord(m) - 64))

Sample output:

print(tmp)
  year day month value
0   15  13     1    15
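The month step relies on `ord('A')` being 65, so subtracting 64 turns 'A' into 1 and 'L' into 12. If that arithmetic feels too implicit, an explicit lookup table is one possible alternative. This is only a sketch, and the name `month_of` is hypothetical:

```python
import string

# Map 'A'..'L' to '1'..'12'; equivalent to str(ord(m) - 64) for those letters.
month_of = {letter: str(number)
            for number, letter in enumerate(string.ascii_uppercase[:12], start=1)}

# Sanity check that the table agrees with the ord() arithmetic.
assert all(month_of[m] == str(ord(m) - 64) for m in month_of)

print(month_of['A'], month_of['L'])
```

With such a table, `tmp['month'].map(month_of)` would do the same job as the `apply` call.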

Then transform your original data frame into the format that you want:

from datetime import datetime

df['date'] = (tmp['year'] + tmp['day'] + tmp['month']).apply(lambda d: datetime.strptime(d, '%y%d%m'))
df['value'] = tmp['value']
del df['first_column']
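Putting the steps together, here is a self-contained sketch that can be run without a file. It reads the three rows implied by the question's expected output from an in-memory buffer, and it swaps in `pd.to_datetime` for the per-row `strptime` call, which is my substitution rather than part of the steps above. The name `parts` and the zero-padding of the month are also assumptions made for illustration:

```python
import io
import pandas as pd

# In-memory stand-in for the file, rebuilt from the question's expected output.
raw = io.StringIO(
    u"blablabla (CODE1513A15), 9.20, 9.70, 0\n"
    u"blablabla (CODE1514A16), 9.30, 9.80, 0\n"
    u"blablabla (CODE1515A12), 9.40, 9.90, 0\n"
)

column_names = ['first_column', 'second_column', 'third_column', 'fourth_column']
df = pd.read_csv(raw, names=column_names, skipinitialspace=True)

# Split the code into its fields: year, day, month letter, value.
parts = df['first_column'].str.extract(r'CODE(\d{2})(\d{2})([A-L])(\d+)')
parts.columns = ['year', 'day', 'month', 'value']
# Zero-pad the month so every date string has the same fixed width.
parts['month'] = parts['month'].map(lambda m: '%02d' % (ord(m) - 64))

# Build the new columns and put them where first_column used to be.
dates = pd.to_datetime(parts['year'] + parts['day'] + parts['month'], format='%y%d%m')
df.insert(0, 'date', dates)
df.insert(1, 'value', parts['value'].astype(float))
df = df.drop(columns='first_column')

print(df)
```

Using `df.insert` keeps `date` and `value` at the front, matching the column order asked for in the question.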
Lim H. answered Dec 04 '25 04:12


