 

Explode in PySpark

I would like to transform from a DataFrame that contains lists of words into a DataFrame with each word in its own row.

How do I do explode on a column in a DataFrame?

Here is an example with some of my attempts where you can uncomment each code line and get the error listed in the following comment. I use PySpark in Python 2.7 with Spark 1.6.1.

from pyspark.sql.functions import split, explode

DF = sqlContext.createDataFrame(
    [('cat \n\n elephant rat \n rat cat', )], ['word']
)

print 'Dataset:'
DF.show()

print '\n\n Trying to do explode: \n'

DFsplit_explode = (
    DF
    .select(split(DF['word'], ' '))
#   .select(explode(DF['word']))
#   # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch:
#   # input to function explode should be array or map type, not StringType;"
#   .map(explode)
#   # AttributeError: 'PipelinedRDD' object has no attribute 'show'
#   .explode()
#   # AttributeError: 'DataFrame' object has no attribute 'explode'
).show()

# Trying without split
print '\n\n Only explode: \n'

DFsplit_explode = (
    DF
    .select(explode(DF['word']))
    # AnalysisException: u"cannot resolve 'explode(word)' due to data type mismatch:
    # input to function explode should be array or map type, not StringType;"
).show()

Please advise.

asked Jul 05 '16 by user1982118



1 Answer

explode and split are SQL functions. Both operate on a SQL Column. split takes a Java regular expression as its second argument. If you want to separate data on arbitrary whitespace, you'll need something like this:
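To see why splitting on a single literal space is not enough for this input, here is a quick pure-Python check (a sketch using the standard re module, not Spark itself) comparing the two approaches on the same string:

```python
import re

s = 'cat \n\n elephant rat \n rat cat'

# Splitting on a literal single space keeps newline runs as bogus "words"
print(s.split(' '))
# -> ['cat', '\n\n', 'elephant', 'rat', '\n', 'rat', 'cat']

# Splitting on the regex \s+ treats any run of whitespace as one separator
print(re.split(r'\s+', s))
# -> ['cat', 'elephant', 'rat', 'rat', 'cat']
```

The same distinction applies to Spark's split, except that there the pattern is a Java regular expression evaluated by the JVM.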

from pyspark.sql.functions import col, explode, split

df = sqlContext.createDataFrame(
    [('cat \n\n elephant rat \n rat cat', )], ['word']
)

df.select(explode(split(col("word"), "\s+")).alias("word")).show()

## +--------+
## |    word|
## +--------+
## |     cat|
## |elephant|
## |     rat|
## |     rat|
## |     cat|
## +--------+
answered Sep 20 '22 by zero323