When I use df.show() to view a PySpark DataFrame in a Jupyter notebook, it shows me this:
+---+-------+-------+-------+------+-----------+-----+-------------+-----+---------+----------+-----+-----------+-----------+--------+---------+-------+------------+---------+------------+---------+---------------+------------+---------------+---------+------------+
| Id|groupId|matchId|assists|boosts|damageDealt|DBNOs|headshotKills|heals|killPlace|killPoints|kills|killStreaks|longestKill|maxPlace|numGroups|revives|rideDistance|roadKills|swimDistance|teamKills|vehicleDestroys|walkDistance|weaponsAcquired|winPoints|winPlacePerc|
+---+-------+-------+-------+------+-----------+-----+-------------+-----+---------+----------+-----+-----------+-----------+--------+---------+-------+------------+---------+------------+---------+---------------+------------+---------------+---------+------------+
| 0| 24| 0| 0| 5| 247.3000| 2| 0| 4| 17| 1050| 2| 1| 65.3200| 29| 28| 1| 591.3000| 0| 0.0000| 0| 0| 782.4000| 4| 1458| 0.8571|
| 1| 440875| 1| 1| 0| 37.6500| 1| 1| 0| 45| 1072| 1| 1| 13.5500| 26| 23| 0| 0.0000| 0| 0.0000| 0| 0| 119.6000| 3| 1511| 0.0400|
| 2| 878242| 2| 0| 1| 93.7300| 1| 0| 2| 54| 1404| 0| 0| 0.0000| 28| 28| 1| 0.0000| 0| 0.0000| 0| 0| 3248.0000| 5| 1583| 0.7407|
| 3|1319841| 3| 0| 0| 95.8800| 0| 0| 0| 86| 1069| 0| 0| 0.0000| 97| 94| 0| 0.0000| 0| 0.0000| 0| 0| 21.4900| 1| 1489| 0.1146|
| 4|1757883| 4| 0| 1| 0.0000| 0| 0| 1| 58| 1034| 0| 0| 0.0000| 47|
How can I get a formatted DataFrame, like a pandas DataFrame, so I can view the data more easily?
Convert PySpark DataFrame to Pandas DataFrame

PySpark DataFrame provides a toPandas() method to convert it to a Python pandas DataFrame. toPandas() collects all records of the PySpark DataFrame into the driver program, so it should only be called on a small subset of the data.
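For example, a minimal sketch (assuming a PySpark DataFrame named df, as in the question; the variable name pdf is just a placeholder):

# Collect only a small subset of rows to the driver, then convert to pandas
pdf = df.limit(5).toPandas()
# Evaluating pdf as the last expression in a Jupyter cell renders it as a formatted HTML table
pdf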
You can then use print() to display the DataFrame in a table format: convert the pandas DataFrame to a string with its to_string() method and pass the result to print().
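For example (again assuming the DataFrame df from the question):

# Convert a small subset and print it as plain text
pdf = df.limit(5).toPandas()
# to_string() renders the full pandas DataFrame without column truncation
print(pdf.to_string())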
In very simple words, pandas runs operations on a single machine whereas PySpark runs on multiple machines. If you are working on a machine-learning application that deals with larger datasets, PySpark is a better fit, as it can process operations many times (even 100x) faster than pandas.
You can convert a PySpark DataFrame directly to a pandas DataFrame. The command would be:
df.limit(10).toPandas()
This directly yields the result as a pandas DataFrame, which Jupyter renders as a formatted table; you just need to have the pandas package installed.