I have a 200,000 x 500 dataframe
loaded into Pandas. Is there a function that can automatically tell me which columns are missing data? Or do I have to iterate over each column and check element by element?
Once I've found a missing element, how do I define a custom function (based on both the column name and some other data in the same row) to do automatic replacements? I see the fillna() method, but I don't think it takes a (lambda) function as an input.
Thanks!
One way of handling missing values is to delete the rows or columns that contain nulls. If a column has more than half of its values missing, you can drop the entire column; in the same way, rows can be dropped if one or more of their values are null.
This approach is known as complete-case analysis, or listwise (case) deletion. Listwise deletion is the most frequently used method for handling missing data, and it is the default option in most statistical software packages.
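For example, a minimal sketch of both deletions, assuming a DataFrame named df (the example frame below is made up):

import numpy as np
import pandas as pd

# hypothetical example frame with some missing values
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# drop columns in which more than half of the values are null
df_cols = df.loc[:, df.isnull().mean() <= 0.5]

# listwise deletion: drop every row that contains at least one null
df_rows = df.dropna(axis=0, how="any")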
When dealing with missing data, there are two primary approaches: imputation or removal. Imputation fills in reasonable guesses for the missing values, and it's most useful when the percentage of missing data is low.
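A minimal imputation sketch in the same spirit (the column-mean fill is just one common choice, not the only one):

# impute: replace each missing numeric value with its column's mean
df_imputed = df.fillna(df.mean(numeric_only=True))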
something like:
import pandas as pd
pd.isnull(frame).any()
is probably what you're looking for: it returns a boolean Series with one entry per column, True wherever that column has missing data.
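If you want the actual column names rather than a boolean Series, a small follow-up along the same lines (assuming your DataFrame is called frame):

# boolean Series: True for every column that has at least one missing value
has_missing = frame.isnull().any()

# just the names of those columns
missing_cols = frame.columns[has_missing]

# count of missing values per column
missing_counts = frame.isnull().sum()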
fillna currently does not take lambda functions, though that's in the works as an open issue on GitHub.
You can use DataFrame.apply to do custom filling for now, as sketched below. Though can you be a little more specific about what you need to do to fill the data? Just curious what the use case is.
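As a rough sketch of the apply approach, row by row; the column names and the fill rule here are hypothetical, just to show the shape:

import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "price": [10.0, np.nan, 7.5],
    "quantity": [1, 4, 2],
})

def fill_row(row):
    # hypothetical rule: derive a missing 'price' from 'quantity' in the same row
    if pd.isnull(row["price"]):
        row = row.copy()
        row["price"] = row["quantity"] * 2.5
    return row

# axis=1 passes one row (as a Series) at a time to fill_row
frame = frame.apply(fill_row, axis=1)

Keep in mind that on a 200,000 x 500 frame a row-wise apply can be slow; if the fill rule can be expressed column by column, boolean masking (frame.loc[mask, col] = ...) will usually be much faster.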