
Changing the dtype for specific columns in a pandas dataframe

Tags: python, pandas

I have a pandas dataframe which I have created from data stored in an XML file.

Initially the XML file is opened and parsed:

xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")

I created a dictionary which has all the data names (which are used as column names) as keys and gives the position of each piece of data in the XML file:

Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
          "Modality":("Worklist/AdminData/AdminValues/Modality"),
          "Energy":("Worklist/AdminData/AdminValues/Energy"),
          "FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
          "SDD":("Worklist/AdminData/AdminValues/SDD"),
          "Gantry":("Worklist/AdminData/AdminValues/Gantry"),
          "Wedge":("Worklist/AdminData/AdminValues/Wedge"),
          "MU":("Worklist/AdminData/AdminValues/MU"),
          "My":("Worklist/AdminData/AdminValues/My"),
          "AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
          "AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
          "AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
          "AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}

This is just a small part of the dictionary; the actual one lists over 80 parameters. The dictionary keys are then sorted:

sortedKeys = list(sorted(Parameters.keys()))

A header is created for the pandas dataframe:

dateList=[]
dateList.append('date')
headers = dateList+sortedKeys

I then create an empty pandas dataframe with one row per record in trendData and the column headers set to 'headers', and loop through the records to fill the dataframe:

df = pd.DataFrame(index=np.arange(0,len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for j in Parameters:
        result[j] = b.findtext(Parameters[j])
    # assign the row once, after all parameters have been read
    df.loc[a] = result
df = df.set_index('date')

This seems to work fine, but the problem is that the dtype of every column is set to 'object', whereas most should be integers. It's possible to use:

df.convert_objects(convert_numeric=True)

and it works fine, but it is now deprecated. I can also use, for example:

df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)

to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create a list of the columns which should be integers using the following:

int64list=[]
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)

but I can't find a way of passing this list to the function.

Trigfa asked Dec 24 '22 05:12


2 Answers

You can explicitly replace a column in a DataFrame with the same column cast to another dtype. Try this:

import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')

Calling data.dtypes should now return the following:

date     int64
type    object
dtype: object

For multiple columns, use a for loop to run through the int64list you mentioned in your question.
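A minimal sketch of such a loop, using pd.to_numeric as in the question. The small frame below is a made-up stand-in for the real data (the values are strings in object columns, as they would be after reading text from the XML), so treat the column contents as assumptions:

```python
import pandas as pd

# Hypothetical stand-in for the frame built from the XML file:
# every value is a string and every column has dtype object.
df = pd.DataFrame(
    {
        "AnalyzeParametersCAXMin": ["1", "2", "3"],
        "AnalyzeParametersCAXMax": ["10", "20", "30"],
        "Modality": ["X", "X", "E"],
    },
    dtype=object,
)

# Same selection rule as in the question.
int64list = [c for c in df.columns if c.startswith("AnalyzeParameters")]

# Convert one column at a time; columns not in the list are left untouched.
for col in int64list:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

Only the columns named in int64list change dtype; the rest of the frame stays as it was.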

Neill Herbst answered Dec 26 '22 19:12


For multiple columns you can do it this way:

cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)
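
Note that astype(np.int64) raises an error if any selected cell is missing or not a valid integer string. As a hedged alternative, the sketch below applies pd.to_numeric to all selected columns in one step. The sample frame is hypothetical, and errors="coerce" is an assumption about how bad values should be handled (they become NaN, which makes that column float64 rather than int64):

```python
import pandas as pd

# Hypothetical stand-in for the frame built from the XML file;
# one cell is None to mimic a value missing from the file.
df = pd.DataFrame(
    {
        "AnalyzeParametersCAXMin": ["1", "2", None],
        "AnalyzeParametersCAXMax": ["10", "20", "30"],
        "Modality": ["X", "X", "E"],
    },
    dtype=object,
)

# Select every column whose name contains 'AnalyzeParameters'.
cols = df.filter(like="AnalyzeParameters").columns.tolist()

# apply() calls pd.to_numeric on each selected column and forwards
# errors="coerce", so unparseable or missing values become NaN.
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

Dropping errors="coerce" keeps the default behaviour, where a value that cannot be parsed raises instead of being silently replaced.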

MaxU - stop WAR against UA answered Dec 26 '22 20:12