I have a PySpark DataFrame with many columns, and I want to select the ones whose names contain a certain string, plus some others. For example:
df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']
I want to select the ones that contain 'hello', and also the column named 'index', so the result will be:
['hello_world','hello_country','hello_everyone','index']
I want something like df.select('hello*', 'index')
Thanks in advance :)
EDIT:
I found a quick way to solve it, so I answered it myself, Q&A style. If someone sees my solution and can provide a better one, I would appreciate it.
I've found a quick and elegant way:
# keep every column whose name contains 'hello', then append 'index'
selected = [s for s in df.columns if 'hello' in s] + ['index']
df.select(selected)
With this solution I can add any columns I want without editing the for loop that Ali AzG suggested.
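For reference, here is a minimal, self-contained version of the same approach; the local SparkSession and the single row of dummy data are hypothetical, just to make it runnable:

from pyspark.sql import SparkSession

# Hypothetical local session and dummy data, purely for demonstration
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("a", "b", "c", "d", "e", 1)],
    ["hello_world", "hello_country", "hello_everyone", "byebye", "ciao", "index"],
)

# Keep every column whose name contains 'hello', then append 'index'
selected = [s for s in df.columns if "hello" in s] + ["index"]
df.select(selected).show()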
You can also try the colRegex function, introduced in Spark 2.3, which lets you specify the column name as a regular expression.
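Assuming the same df as above, a sketch of that approach (the pattern is a Java regular expression wrapped in backticks):

# Select all columns whose names start with 'hello'
df.select(df.colRegex("`hello.*`")).show()

# A single pattern can also cover the extra 'index' column
df.select(df.colRegex("`hello.*|index`")).show()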