The code below works and creates a Spark DataFrame from a text file. However, I'm trying to use the header option to treat the first line of the file as the header, and for some reason it doesn't seem to be happening. I can't understand why! It must be something silly, but I can't solve it.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local").appName("Word Count") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
>>> df = spark.read.option("header", "true") \
...     .option("delimiter", ",") \
...     .option("inferSchema", "true") \
...     .text("StockData/ETFs/aadr.us.txt")
>>> df.take(3)
Returns the following:
[Row(value=u'Date,Open,High,Low,Close,Volume,OpenInt'), Row(value=u'2010-07-21,24.333,24.333,23.946,23.946,43321,0'), Row(value=u'2010-07-22,24.644,24.644,24.362,24.487,18031,0')]
>>> df.columns
Returns the following:
['value']
text("file_name") to read a file or directory of text files into a Spark DataFrame, and dataframe. write(). text("path") to write to a text file. When reading a text file, each line becomes each row that has string “value” column by default.
Issue
The issue is that you are using the .text API call instead of .csv or .load. If you read the .text API documentation, it says:
def text(self, paths):
    """Loads text files and returns a :class:`DataFrame` whose schema starts
    with a string column named "value", and followed by partitioned columns
    if there are any.

    Each line in the text file is a new row in the resulting DataFrame.

    :param paths: string, or list of strings, for input path(s).

    >>> df = spark.read.text('python/test_support/sql/text-test.txt')
    >>> df.collect()
    [Row(value=u'hello'), Row(value=u'this')]
    """
Solution using .csv
Change the .text function call to .csv and you should be fine:
df = spark.read.option("header", "true") \
.option("delimiter", ",") \
.option("inferSchema", "true") \
.csv("StockData/ETFs/aadr.us.txt")
df.show(2, truncate=False)
which should give you
+-------------------+------+------+------+------+------+-------+
|Date |Open |High |Low |Close |Volume|OpenInt|
+-------------------+------+------+------+------+------+-------+
|2010-07-21 00:00:00|24.333|24.333|23.946|23.946|43321 |0 |
|2010-07-22 00:00:00|24.644|24.644|24.362|24.487|18031 |0 |
+-------------------+------+------+------+------+------+-------+
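Since inferSchema is on, Spark parses the Date column as a timestamp, which is why it shows as 2010-07-21 00:00:00 above. To check the inferred types you can print the schema; the output below is what I would expect given the sample data, not something taken from the original answer:

df.printSchema()

root
 |-- Date: timestamp (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- OpenInt: integer (nullable = true)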
Solution using .load
.load assumes the file is in Parquet format if no format is defined, so you need to define the format option as well:
df = spark.read\
.format("com.databricks.spark.csv")\
.option("header", "true") \
.option("delimiter", ",") \
.option("inferSchema", "true") \
.load("StockData/ETFs/aadr.us.txt")
df.show(2, truncate=False)
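As a side note, com.databricks.spark.csv is the name of the old external CSV package; since Spark 2.0 the CSV reader is built in, so the short format name works as well (a sketch with the same options as above):

df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("inferSchema", "true") \
    .load("StockData/ETFs/aadr.us.txt")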
I hope the answer is helpful.