I am working with IPython and Spark. I have an RDD from which I form a list. From this list I want to form a dataframe that has multiple columns from the parent list, but these columns are not contiguous. I wrote this, but it does not work as intended:
list1 = rdd.collect()
columns_num = [1,8,11,17,21,24]
df2 = [list[i] for i in columns_num]
The above code selects only 6 rows from the parent list, with only column 1 data, and forms the new dataframe from those 6 rows of column 1 data.
How can I form a new dataframe with multiple non-contiguous columns from another list?
For example like this:
rdd = sc.parallelize([("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")])
columns_num = [0, 3]
df = rdd.toDF()
df2 = df.select(*(df.columns[i] for i in columns_num))
df2.show()
## +---+---+
## | _1| _4|
## +---+---+
## | a|4.0|
## | b|5.0|
## +---+---+
or like this:
df = rdd.map(lambda row: [row[i] for i in columns_num]).toDF()
df.show()
## +---+---+
## | _1| _4|
## +---+---+
## | a|4.0|
## | b|5.0|
## +---+---+
On a side note, you should never collect data just to reshape it. In the best case it will be slow; in the worst case it will simply crash.
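As for why the comprehension in the question returned rows instead of columns, here is a minimal plain-Python sketch. It assumes the collected list is a list of row tuples (which is what rdd.collect() gives for the example RDD above); the names rows, picked_rows and picked_columns are made up for illustration and no Spark is needed to run it.

```python
# Stand-in for the result of rdd.collect(): a list of row tuples (assumption).
rows = [("a", 1, 2, 4.0, "foo"), ("b", 3, 4, 5.0, "bar")]
columns_num = [0, 3]

# Same pattern as in the question: the index is applied to the outer list,
# so each i selects a whole row, not a column.
picked_rows = [rows[i] for i in (0, 1)]

# Applying the index inside each row selects the wanted columns instead.
picked_columns = [[row[i] for i in columns_num] for row in rows]

print(picked_rows)
print(picked_columns)
```

The second comprehension is the same per-row logic that the rdd.map version applies, only there it runs on the executors rather than on a collected list.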