I am generating a pivot table report using the pandas Python module. The source data includes a lot of readings measured in milliseconds. If the number of milliseconds exceeds 999, the value in the CSV file includes a comma as a thousands separator (e.g. 1,234 = 1.234 seconds).
Here's how I'm trying to run the report:
import pandas as pd
import numpy as np
pool_usage = pd.read_csv("c:/foo/ds-dump.csv")
# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"
report = pool_usage.pivot_table(values=['Average Pool Size', 'Average Usage Time (ms)'], index=['Source File'], aggfunc=np.max)
print(report)
The problem is that the dtype of the Average Usage Time (ms) column is object, so the np.max aggregation effectively treats it as NaN. I therefore never see any values greater than 999.
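Here is a minimal reproduction of the inference problem, using a hypothetical two-row inline sample in place of ds-dump.csv:

from io import StringIO

import pandas as pd

# Hypothetical stand-in for ds-dump.csv; the quoted "1,234" mimics a
# reading above 999 ms.
sample = StringIO(
    'Average Pool Size,Average Usage Time (ms)\n'
    '10,567\n'
    '12,"1,234"\n'
)

df = pd.read_csv(sample)
# The comma value cannot be parsed as a number, so the whole
# column is left as strings.
print(df["Average Usage Time (ms)"].dtype)  # object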
I tried fixing the issue like this:
import pandas as pd
import numpy as np
pool_usage = pd.read_csv("c:/foo/ds-dump.csv")
# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"
# Convert strings to numbers if possible
pool_usage = pool_usage.convert_objects(convert_numeric=True)
report = pool_usage.pivot_table(values=['Average Pool Size', 'Average Usage Time (ms)'], index=['Source File'], aggfunc=np.max)
print(report)
This did actually change the dtype of the Average Usage Time column to a float, but all of the values greater than 999 are still treated as NaNs.
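The same coercion can be seen with pd.to_numeric, the modern replacement for the deprecated convert_objects; with errors="coerce" it turns any string it cannot parse into NaN:

import pandas as pd

# "1,234" is not a plain number, so coercion discards it as NaN.
s = pd.Series(["567", "1,234"])
print(pd.to_numeric(s, errors="coerce"))  # 567.0, NaN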
How can I convert the Average Usage Time column to a float even though some of the values may contain commas?
The read_csv function takes an optional thousands argument. Its default is None, so you can set it to "," to have read_csv recognise 1,234 as 1234 when it reads the file:
pd.read_csv("c:/foo/ds-dump.csv", thousands=",")
The column holding the millisecond values should then have the int64 datatype once the file has been read into memory.
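Putting it together, here is a sketch of the full report with thousands="," applied, again using a hypothetical inline sample instead of the real c:/foo/ds-dump.csv:

from io import StringIO

import pandas as pd

# Hypothetical stand-in for c:/foo/ds-dump.csv.
sample = StringIO(
    'Average Pool Size,Average Usage Time (ms)\n'
    '10,567\n'
    '12,"1,234"\n'
)

pool_usage = pd.read_csv(sample, thousands=",")
print(pool_usage["Average Usage Time (ms)"].dtype)  # int64

pool_usage["Source File"] = "ds-dump.csv"
report = pool_usage.pivot_table(
    values=["Average Pool Size", "Average Usage Time (ms)"],
    index=["Source File"],
    aggfunc="max",
)
print(report)  # the max is now 1234, not 999

Note that if any rows are missing a reading, pandas will promote the column to float64 instead, since NaN cannot be stored in an int64 column.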