I have a dataframe df that has the following structure:
+-----+-----+-----+-------+
| s |col_1|col_2|col_...|
+-----+-----+-----+-------+
| f1 | 0.0| 0.6| ... |
| f2 | 0.6| 0.7| ... |
| f3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
And I want to calculate the transpose of this dataframe so it will look like:
+-------+-----+-----+-------+------+
| s | f1 | f2 | f3 | ...|
+-------+-----+-----+-------+------+
|col_1 | 0.0| 0.6| 0.5 | ...|
|col_2 | 0.6| 0.7| 0.9 | ...|
|col_...| ...| ...| ... | ...|
I tried these two solutions, but each returns an error saying the DataFrame has no such method:
method 1:
for x in df.columns:
    df = df.pivot(x)
method 2:
df = sc.parallelize([(k,) + tuple(v[0:]) for k, v in df.items()]).toDF()
How can I fix this?
If the data is small enough to be transposed (rather than pivoted with aggregation), you can just convert it to a Pandas DataFrame:
df = sc.parallelize([
    ("f1", 0.0, 0.6, 0.5),
    ("f2", 0.6, 0.7, 0.9)
]).toDF(["s", "col_1", "col_2", "col_3"])
df.toPandas().set_index("s").transpose()
s f1 f2
col_1 0.0 0.6
col_2 0.6 0.7
col_3 0.5 0.9
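If you need the result back as a Spark DataFrame afterwards, here is a minimal sketch (assuming an active SparkSession named spark; moving the former column names into a regular "s" column via reset_index/rename is just one choice, not part of the original answer):
# Transpose in Pandas, then hand the result back to Spark.
pdf_t = df.toPandas().set_index("s").transpose()
df_t = spark.createDataFrame(pdf_t.reset_index().rename(columns={"index": "s"}))
df_t.show()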
If it is too large for this, Spark won't help: a Spark DataFrame distributes data by row (although it uses columnar storage locally), so the size of an individual row is limited to local memory.
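That said, if the frame has many columns but only a few rows (so the transposed result stays narrow), one workaround that keeps everything in Spark is to melt the frame into (s, column, value) triples and pivot back on s. This technique is not from the original answer; a sketch:
from pyspark.sql import functions as F

# Melt df into (s, column, value) rows, then pivot on the row key
# "s" so the original column names become rows. Only viable when df
# has few rows, since every row of df becomes a column of the result.
value_cols = [c for c in df.columns if c != "s"]
melted = df.select(
    "s",
    F.explode(
        F.create_map(*[x for c in value_cols for x in (F.lit(c), F.col(c))])
    ).alias("column", "value"),
)
df_t = melted.groupBy("column").pivot("s").agg(F.first("value"))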
You can try Koalas by Databricks. Koalas is similar to Pandas but made for distributed processing, and it is available with PySpark (at least from 3.0.0).
import databricks.koalas as ks

kdf = df.to_koalas()     # Spark DataFrame -> Koalas DataFrame
kdf_t = kdf.transpose()  # distributed transpose
df_T = kdf_t.to_spark()  # back to a Spark DataFrame
Edit: to use Koalas efficiently you need to define partitions; otherwise there can be serious performance degradation.
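The note doesn't spell out how; one hedged sketch (both the distributed default index and the partition count 200 are assumptions, not part of the original answer):
import databricks.koalas as ks

# Avoid the single-node "sequence" default index and spread the data
# before converting; 200 is an arbitrary example partition count.
ks.set_option("compute.default_index_type", "distributed")
kdf = df.repartition(200).to_koalas()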