Filter dataframe on non-empty WrappedArray

My problem is that I have to find, in a list column, the entries that are not empty. When I filter with "is not null", I still get every row, because the empty arrays are not null.

My program code looks like this:

...
// Cassandra-backed SQL context on the existing SparkContext
val csc = new CassandraSQLContext(sc)
val df = csc.sql("SELECT * FROM test").toDF()

// The column to analyze contains arrays of longs
val wrapped = df.select("fahrspur_liste")
wrapped.printSchema

The column fahrspur_liste contains the wrapped arrays, and this is the column I have to analyze. When I run the code, I get this schema and these entries:

root
 |-- fahrspur_liste: array (nullable = true)
 |    |-- element: long (containsNull = true)

+--------------+
|fahrspur_liste|
+--------------+
|            []|
|            []|
|          [56]|
|            []|
|          [36]|
|            []|
|            []|
|          [34]|
|            []|
|            []|
|            []|
|            []|
|            []|
|            []|
|            []|
|         [103]|
|            []|
|         [136]|
|            []|
|          [77]|
+--------------+
only showing top 20 rows
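
(For testing without a Cassandra cluster, an equivalent DataFrame can be built locally; a minimal sketch, assuming Spark 2.x and a local SparkSession:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in data; the real rows come from the Cassandra table above
val df = Seq(Seq.empty[Long], Seq(56L), Seq.empty[Long], Seq(36L)).toDF("fahrspur_liste")
df.show()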

Now I want to filter these rows so that only the entries [56], [36], [34], [103], ... remain.

How can I write a filter so that I get only the rows that contain a number?

asked Sep 15 '17 by DaShI

1 Answer

I don't think you need to use a UDF here.

You can just use the size function and filter out all rows where the array size is 0:

df.filter("size(fahrspur_liste) != 0")
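
Equivalently, the same filter can be written with the typed Column API instead of a SQL expression string; a minimal sketch, assuming df is the DataFrame from the question:

import org.apache.spark.sql.functions.{col, size}

// size(...) returns the number of elements in the array column;
// keeping rows where it is greater than zero drops the empty arrays
val nonEmpty = df.filter(size(col("fahrspur_liste")) > 0)
nonEmpty.show()

Both forms do the same thing; the Column API just avoids embedding the expression in a SQL string.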
answered Nov 15 '22 by philantrovert