How to select multiple non-contiguous columns from a list into another dataframe in Python

Question

I am working with IPython and Spark. I have an RDD from which I build a list, and from this list I want to build a dataframe that contains several columns of the parent list. These columns are not contiguous. I wrote the following, but it does not work as intended:

list1 = rdd.collect()
columns_num = [1,8,11,17,21,24]
df2 = [list1[i] for i in columns_num]

The code above selects only 6 rows from the parent list, containing only column 1 data, and builds the new dataframe from those 6 rows instead of from the 6 columns I asked for.

How can I form a new dataframe from multiple non-contiguous columns of another list?

zero323 · Accepted Answer

For example like this:

rdd = sc.parallelize([("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")])
columns_num = [0, 3]

df = rdd.toDF()
df2 = df.select(*(df.columns[i] for i in columns_num))
df2.show()

##  +---+---+
##  | _1| _4|
##  +---+---+
##  |  a|4.0|
##  |  b|5.0|
##  +---+---+

or like this:

df = rdd.map(lambda row: [row[i] for i in columns_num]).toDF()
df.show()

##  +---+---+
##  | _1| _4|
##  +---+---+
##  |  a|4.0|
##  |  b|5.0|
##  +---+---+
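The per-row selection used in the map call can be checked locally without a Spark cluster. The following is a minimal sketch, not part of the original answer: it uses plain tuples in place of the RDD and `operator.itemgetter` from the standard library as an alternative to the list comprehension. The names `rows`, `pick` and `selected` are made up for illustration.

```python
from operator import itemgetter

# Sample rows mirroring the RDD in the answer (plain tuples, no Spark needed).
rows = [("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")]
columns_num = [0, 3]

# itemgetter(0, 3) builds a callable that extracts those positions from a row.
# Caveat: with a single index it returns a bare value, not a 1-tuple.
pick = itemgetter(*columns_num)

# Apply the selection to every row, which is what rdd.map does in parallel.
selected = [pick(row) for row in rows]
print(selected)
```

In principle the same callable could be passed to `rdd.map` in place of the lambda, but that is an assumption that has not been verified here against a particular Spark version.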

On a side note, you should never collect data just to reshape it, because collect() brings every row to the driver. In the best case it will be slow; in the worst case it will simply crash.
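As for why the attempt in the question returned rows rather than columns: a collected list has one entry per row, so indexing the outer list with `columns_num` picks rows. A small sketch in plain Python shows the difference. The sample data and the names `rows_picked` and `cols_picked` are illustrative assumptions, and this only demonstrates the indexing, not a recommended way to reshape Spark data.

```python
# Stand-in for the list returned by rdd.collect(): one tuple per row.
list1 = [
    ("a", 1, 2, 4.0, "foo"),
    ("b", 3, 4, 5.0, "bar"),
    ("c", 5, 6, 7.0, "baz"),
]
columns_num = [0, 2]

# Indexing the outer list picks whole ROWS 0 and 2.
rows_picked = [list1[i] for i in columns_num]

# Indexing inside each row picks COLUMNS 0 and 2 from every row.
cols_picked = [[row[i] for i in columns_num] for row in list1]

print(rows_picked)
print(cols_picked)
```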

Tags: python, apache-spark, apache-spark-sql, pyspark

Jason Donnald
