I have a SparkSQL DataFrame.
Some entries in this data are empty but they don't behave like NULL or NA. How could I remove them? Any ideas?
In R I can easily remove them, but in SparkR it says that there is a problem with the S4 system/methods.
Thanks.
SparkR Column provides a long list of useful methods, including isNull and isNotNull:
> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)
Id Age
1 1 21
2 2 18
3 3 30
4 4 NA
> filter(people, isNotNull(people$Age)) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
> filter(people, isNull(people$Age)) %>% head()
Id Age
1 4 NA
Please keep in mind that there is no distinction between NA and NaN in SparkR.
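The question mentions entries that are empty rather than NULL or NA, and isNull does not match those. The following is a minimal sketch, assuming the empty entries are empty strings in a string column (the question does not say what they are). The data frame people_with_blanks and its contents are made up for illustration, and sqlContext is assumed to be the same one used in the other examples.

```r
library(SparkR)  # assumes a SparkR 1.x session with sqlContext already created

# Hypothetical data: row 2 holds an empty string, row 4 a real NA.
people_with_blanks_local <- data.frame(
  Id = 1:4, Name = c("Alice", "", "Bob", NA), stringsAsFactors = FALSE)
people_with_blanks <- createDataFrame(sqlContext, people_with_blanks_local)

# Keep only rows whose Name is not the empty string. In Spark SQL a
# comparison against NULL is never true, so NULL names are dropped as well.
non_empty <- filter(people_with_blanks, people_with_blanks$Name != "")

# The same condition written as a SQL expression string.
non_empty_sql <- filter(people_with_blanks, "Name != ''")
```

If the empty entries turn out to be something else, for example strings made of spaces, the condition would need to be adjusted accordingly.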
If you prefer operations on a whole data frame, there is a set of NA functions, including fillna and dropna:
> fillna(people, 99) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
4 4 99
> dropna(people) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
Both can be adjusted to consider only a subset of columns (cols), and dropna has some additional useful parameters. For example, you can specify the minimum number of non-null columns:
> people_with_names_local <- data.frame(
Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
Id Age Name
1 1 21 Alice
2 2 18 <NA>
3 3 30 Bob
4 4 NA <NA>
> dropna(people_with_names, minNonNulls=2) %>% head()
Id Age Name
1 1 21 Alice
2 2 18 <NA>
3 3 30 Bob
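To see why row 4 is the only one dropped, the rule can be reproduced on the local data frame in plain R. This is only a local sketch of the rule (a row is kept when it has at least two non-NA values), not a description of how SparkR implements dropna. The names non_null_counts and kept are made up for illustration.

```r
people_with_names_local <- data.frame(
  Id = 1:4, Age = c(21, 18, 30, NA), Name = c("Alice", NA, "Bob", NA))

# Count the non-NA values in each row.
non_null_counts <- rowSums(!is.na(people_with_names_local))

# Keep rows with at least two non-NA values, mirroring minNonNulls=2.
kept <- people_with_names_local[non_null_counts >= 2, ]
```

Row 2 has two non-NA values (Id and Age) and is kept, while row 4 has only its Id and is removed.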