I have a PySpark DataFrame with many columns, and I want to select the ones whose names contain a certain string, plus some others. For example:
df.columns = ['hello_world','hello_country','hello_everyone','byebye','ciao','index']
I want to select the ones that contain 'hello', and also the column named 'index', so the result will be:
['hello_world','hello_country','hello_everyone','index']
I want something like df.select('hello*', 'index')
Thanks in advance :)
EDIT:
I found a quick way to solve it, so I answered it myself, Q&A style. If someone sees my solution and can provide a better one, I would appreciate it.
I've found a quick and elegant way:
# keep every column whose name contains 'hello', then append 'index'
selected = [s for s in df.columns if 'hello' in s] + ['index']
df.select(selected)
With this solution I can add any columns I want without editing the for loop that Ali AzG suggested.
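For reference, here is a minimal, self-contained version of the same approach; the local SparkSession and the single row of dummy data are hypothetical, just to make it runnable:

from pyspark.sql import SparkSession

# Hypothetical local session and dummy data, purely for demonstration
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("a", "b", "c", "d", "e", 1)],
    ["hello_world", "hello_country", "hello_everyone", "byebye", "ciao", "index"],
)

# Keep every column whose name contains 'hello', then append 'index'
selected = [s for s in df.columns if "hello" in s] + ["index"]
df.select(selected).show()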
You can also try the colRegex function, introduced in Spark 2.3, which lets you specify the column name as a regular expression.
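Assuming the same df as above, a sketch of that approach (the pattern is a Java regular expression wrapped in backticks):

# Select all columns whose names start with 'hello'
df.select(df.colRegex("`hello.*`")).show()

# A single pattern can also cover the extra 'index' column
df.select(df.colRegex("`hello.*|index`")).show()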