Select columns which contains a string in pyspark

I have a PySpark DataFrame with a lot of columns, and I want to select the ones whose names contain a certain string, plus a few others. For example:

df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']

I want to select the ones that contain 'hello' and also the column named 'index', so the result will be:

['hello_world','hello_country','hello_everyone','index']

I want something like df.select('hello*','index')

Thanks in advance :)

EDIT:

I found a quick way to solve it, so I answered it myself, Q&A style. If someone sees my solution and can provide a better one, I would appreciate it.

asked Dec 04 '22 by Manrique

2 Answers

I've found a quick and elegant way:

# Keep every column whose name contains 'hello', then append 'index'
selected = [s for s in df.columns if 'hello' in s] + ['index']
df.select(selected)

With this solution I can add more columns without editing the for loop that Ali AzG suggested.
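For reference, here is a minimal self-contained sketch of this approach; the sample data and the SparkSession setup are my own illustration, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample row matching the column names from the question
df = spark.createDataFrame(
    [(1, 2, 3, 4, 5, 0)],
    ['hello_world', 'hello_country', 'hello_everyone', 'byebye', 'ciao', 'index'],
)

# Keep every column whose name contains 'hello', then append 'index'
selected = [s for s in df.columns if 'hello' in s] + ['index']
df.select(selected).show()
# Displays only hello_world, hello_country, hello_everyone and index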

answered Dec 31 '22 by Manrique


You can also use the colRegex function, introduced in Spark 2.3, which lets you specify the column name as a regular expression.
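As a sketch (the regex below is my own illustration for this question's columns, not from the answer), the pattern is passed wrapped in backticks and can match several columns at once:

# Select every column matching 'hello.*' plus 'index' in one call.
# Note the backticks around the regex, as required by colRegex.
df.select(df.colRegex("`(hello.*|index)`")).show()

This reuses the same df as above; each column name is matched against the full pattern, so hello_world, hello_country, hello_everyone and index are selected in a single expression.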

answered Dec 31 '22 by Neeraj Bhadani