Pandas DataFrame: Cannot convert string into a float

I have a column, Column1, in a pandas DataFrame whose values are of type str and look like this:

import pandas as pd
df = pd.read_table("filename.dat")
type(df["Column1"].iloc[0])   # outputs 'str' (.ix was removed in pandas 1.0; .iloc selects by position)
print(df["Column1"].iloc[0])

which outputs '1/350'. So, this is currently a string. I would like to convert it into a float.

I tried this:

df["Column1"] = df["Column1"].astype('float64', raise_on_error = False)

But this didn't change the values into floats.

This also failed:

df["Column1"] = df["Column1"].convert_objects(convert_numeric=True)

And this failed:

df["Column1"] = df["Column1"].apply(pd.to_numeric, args=('coerce',))

How do I convert all the values of column "Column1" into floats? Could I somehow use regex to remove the parentheses?

EDIT:

The line

df["Meth"] = df["Meth"].apply(eval)

works, but only if I use it twice, i.e.

df["Meth"] = df["Meth"].apply(eval)
df["Meth"] = df["Meth"].apply(eval)

Why would this be?

ShanZhengYang asked Jan 06 '23 15:01


1 Answer

You need to evaluate the expression (e.g. '1/350') in order to get the result, for which you can use Python's eval() function.

By passing it to pandas' apply() method, you can execute eval() on every value in your column. Example:

df["Column1"].apply(eval)
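For a quick check of what this does, here is a minimal sketch under Python 3. The sample values are made up and stand in for the contents of the question's file.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's file
df = pd.DataFrame({"Column1": ["1/350", "1/4", "3/8"]})

# eval() runs each string as a Python expression; only use it on trusted input
df["Column1"] = df["Column1"].apply(eval)

print(df["Column1"].dtype)
```

Note that apply() returns a new Series, so the result has to be assigned back to the column for the change to stick.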

As you're interpreting literals, it is tempting to use the safer ast.literal_eval function noted in the docs instead. Update: This won't work. literal_eval() does not evaluate division; the only operators it accepts are additions and subtractions (source), so '1/350' is rejected.
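The limitation is easy to confirm. The small helper below (try_literal_eval is a name made up for this illustration) returns None whenever literal_eval() rejects its input.

```python
import ast

def try_literal_eval(text):
    """Return the parsed literal, or None if literal_eval rejects the text."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None

print(try_literal_eval("350"))    # a plain integer literal is accepted
print(try_literal_eval("1/350"))  # division is not a literal, so this is rejected
```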

Remark: as mentioned in other answers and comments on this question, the use of eval() is not without risks, as you're basically executing whatever input is passed in. In other words, if your input contains malicious code, you're giving it a free pass.
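Regarding the edit in the question (eval only working when applied twice): one possible explanation, and this is an assumption because the raw file isn't shown, is that the stored strings contain the quote characters themselves. In that case the first eval() only strips the quotes and the second one performs the division. A minimal sketch:

```python
# Assumed cell content: the quote characters are part of the string itself
raw = "'1/350'"

once = eval(raw)    # evaluates a string literal, so the result is still a str
twice = eval(once)  # evaluates the division, so the result is a float

print(type(once).__name__, type(twice).__name__)
```

If that is what is happening, stripping the quotes first (for example with str.strip("'")) would avoid the second eval().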

Alternative option:

# Define a custom div function
# (in Python 3, / is true division, so the result is a float)
def div(a, b):
    return int(a) / int(b)

# Split each string on '/' and pass the two parts to div
df_floats = df["Column1"].apply(lambda x: div(*x.split('/')))
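A variant of the same idea, swapping the per-row function for pandas' vectorized string methods, might look like the sketch below. This is not the approach timed in this answer, and the sample values are made up.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's file
df = pd.DataFrame({"Column1": ["1/350", "1/4", "3/8"]})

# Split "numerator/denominator" into two columns (labelled 0 and 1)
# and convert both to float
parts = df["Column1"].str.split("/", expand=True).astype(float)

# Element-wise division of the two resulting columns
df_floats = parts[0] / parts[1]

print(df_floats)
```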

Second alternative in case of unclean data:

By using regular expressions, we can ignore any non-digit characters that appear before the numerator or after the denominator.

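# Define a custom div function (unchanged)
def div(a, b):
    return int(a) / int(b)

# We'll import the re module and define a precompiled pattern
# (a raw string keeps \D and \d from being treated as string escapes)
import re
regex = re.compile(r'\D*(\d+)/(\d+)\D*')

# findall() returns a list of (numerator, denominator) tuples; use the first match
df_floats = df["Column1"].apply(lambda x: div(*regex.findall(x)[0]))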

We'll lose a bit of performance, but the upside is that even with input like '!erefdfs?^dfsdf1/350dqsd qsd qs d', we still end up with the value of 1/350.
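To try the pattern on that exact input without a dataframe, here is a small standalone sketch. The helper name parse_fraction is made up for this example, and returning None when nothing matches is an added assumption (findall(x)[0] would raise an IndexError in that case).

```python
import re

pattern = re.compile(r"\D*(\d+)/(\d+)\D*")

def parse_fraction(text):
    """Return numerator/denominator as a float, or None if no fraction is found."""
    match = pattern.search(text)
    if match is None:
        return None
    numerator, denominator = match.groups()
    return int(numerator) / int(denominator)

messy = "!erefdfs?^dfsdf1/350dqsd qsd qs d"
print(parse_fraction(messy))
print(parse_fraction("no fraction here"))
```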

Performance:

When timing the three options on a dataframe with 100,000 rows, the second option (using the user-defined div function) clearly wins:

  • using eval: 1 loop, best of 3: 1.41 s per loop
  • using div: 10 loops, best of 3: 159 ms per loop
  • using re: 1 loop, best of 3: 275 ms per loop
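To reproduce a comparison like this on your own machine, a rough sketch with the standard timeit module could look as follows. The test data (the same string repeated 100,000 times) is an assumption, and the absolute numbers will differ from the ones above.

```python
import timeit
import pandas as pd

def div(a, b):
    return int(a) / int(b)

# Assumed test data: one fraction string repeated n_rows times
n_rows = 100_000
df = pd.DataFrame({"Column1": ["1/350"] * n_rows})

# Time a single pass of each approach over the whole column
t_eval = timeit.timeit(lambda: df["Column1"].apply(eval), number=1)
t_div = timeit.timeit(
    lambda: df["Column1"].apply(lambda x: div(*x.split("/"))), number=1
)

print(f"eval: {t_eval:.3f} s, div: {t_div:.3f} s")
```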
DocZerø answered Jan 08 '23 04:01