Python spark extract characters from dataframe

Tags:

python-2.7

apache-spark

pyspark

I have a dataframe in spark, something like this:

ID     | Column
------ | ----
1      | STRINGOFLETTERS
2      | SOMEOTHERCHARACTERS
3      | ANOTHERSTRING
4      | EXAMPLEEXAMPLE

What I would like to do is extract the first 5 characters from the column plus the 8th character and create a new column, something like this:

ID     | New Column
------ | ------
1      | STRIN_F
2      | SOMEO_E
3      | ANOTH_S
4      | EXAMP_E

I can't use the following code, because the values in the column differ and I don't want to split on a specific character, but at a fixed position (the 6th character):

import pyspark
split_col = pyspark.sql.functions.split(DF['column'], ' ')
newDF = DF.withColumn('new_column', split_col.getItem(0))

Thanks all!

963

asked Dec 01 '16 17:12

Amanda C

1 Answer

Use something like this:

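from pyspark.sql.functions import concat, lit
# substr(start, length) is 1-based: first 5 characters, '_', then the 8th character
new_df = df.withColumn('new_column', concat(df.Column.substr(1, 5),
                                            lit('_'),
                                            df.Column.substr(8, 1)))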

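This uses the substr and concat functions: substr takes a 1-based start position and a length, and concat joins the pieces (lit wraps the literal '_' so it can be used as a column).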
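
To sanity-check the positions without a Spark session, the same slicing can be sketched in plain Python. This only illustrates the 1-based start/length arithmetic; the helper names spark_substr and make_new_column are hypothetical and not part of PySpark.

```python
def spark_substr(value, start_pos, length):
    """Hypothetical helper: imitate a 1-based (start, length) substring."""
    begin = start_pos - 1  # convert the 1-based start to Python's 0-based index
    return value[begin:begin + length]


def make_new_column(value):
    """First 5 characters, an underscore, then the 8th character."""
    return spark_substr(value, 1, 5) + "_" + spark_substr(value, 8, 1)


# Sample rows taken from the question's table
rows = [
    (1, "STRINGOFLETTERS"),
    (2, "SOMEOTHERCHARACTERS"),
    (3, "ANOTHERSTRING"),
    (4, "EXAMPLEEXAMPLE"),
]

for row_id, value in rows:
    print(row_id, make_new_column(value))
```

Running this on the sample rows gives the values from the question's expected table, which suggests that (1, 5) and (8, 1) are the right arguments for substr.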

103

answered Oct 27 '22 14:10

Thiago Baldim
