
How to load CSV file with records on multiple lines?

I use Spark 2.3.0.

For an Apache Spark project I am working with this data set. When reading the CSV with Spark, the rows in the Spark DataFrame do not correspond to the correct rows in the CSV file (see the sample CSV below). The code looks like the following:

answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True)
answer_df.show(2)

Output

+--------------------+-------------+--------------------+--------+-----+--------------------+
|                  Id|  OwnerUserId|        CreationDate|ParentId|Score|                Body|
+--------------------+-------------+--------------------+--------+-----+--------------------+
|                  92|           61|2008-08-01T14:45:37Z|      90|   13|"<p><a href=""htt...|
|<p>A very good re...| though.</p>"|                null|    null| null|                null|
+--------------------+-------------+--------------------+--------+-----+--------------------+
only showing top 2 rows

However, when I used pandas, it worked like a charm.

df = pd.read_csv('./stacksample/Answers_sample.csv')
df.head(3) 

Output

Index Id    OwnerUserId CreationDate    ParentId    Score   Body
0   92  61  2008-08-01T14:45:37Z    90  13  <p><a href="http://svnbook.red-bean.com/">Vers...
1   124 26  2008-08-01T16:09:47Z    80  12  <p>I wound up using this. It is a kind of a ha...

My observation: Apache Spark is treating every physical line in the CSV file as a DataFrame record (which is reasonable for simple files), while pandas, on the other hand, intelligently figures out where each record actually ends (I am not sure based on which parameters).

Question: how can I instruct Spark to load the DataFrame properly?

The data to be loaded is as follows, with the lines starting with 92 and 124 beginning the two records.

Id,OwnerUserId,CreationDate,ParentId,Score,Body
92,61,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Version Control with Subversion</a></p>

<p>A very good resource for source control in general. Not really TortoiseSVN specific, though.</p>"
124,26,2008-08-01T16:09:47Z,80,12,"<p>I wound up using this. It is a kind of a hack, but it actually works pretty well. The only thing is you have to be very careful with your semicolons. : D</p>

<pre><code>var strSql:String = stream.readUTFBytes(stream.bytesAvailable);      
var i:Number = 0;
var strSqlSplit:Array = strSql.split("";"");
for (i = 0; i &lt; strSqlSplit.length; i++){
    NonQuery(strSqlSplit[i].toString());
}
</code></pre>
"
asked May 05 '18 by Gaurav Gupta


2 Answers

I think you should use option("escape", "\"") as it seems that " is used as the quote escape character, i.e. a quote is escaped by another quote.

val q = spark.read
  .option("multiLine", true)
  .option("header", true)
  .option("escape", "\"")
  .csv("input.csv")
scala> q.show
+---+-----------+--------------------+--------+-----+--------------------+
| Id|OwnerUserId|        CreationDate|ParentId|Score|                Body|
+---+-----------+--------------------+--------+-----+--------------------+
| 92|         61|2008-08-01T14:45:37Z|      90|   13|<p><a href="http:...|
|124|         26|2008-08-01T16:09:47Z|      80|   12|<p>I wound up usi...|
+---+-----------+--------------------+--------+-----+--------------------+
answered Oct 08 '22 by Jacek Laskowski


After a few hours of struggle I was able to figure out the solution.

Analysis: In the data dump provided by Stack Overflow, a quote (") is escaped by another quote ("). Since Spark uses backslash (\) as the default escape character, and I was not overriding it, the file ended up being parsed into meaningless output.
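The difference between the two escaping conventions can be illustrated with Python's standard `csv` module (this is only an illustration of the concept, not Spark's parser). Assuming the same field written both ways:

```python
import csv
import io

# The same logical field ('say "hi"') written under two conventions:
doubled     = 'Id,Body\n1,"say ""hi"""\n'    # quote doubled (RFC 4180 style)
backslashed = 'Id,Body\n1,"say \\"hi\\""\n'  # quote prefixed by a backslash

# The default dialect expects doubling, matching the Stack Overflow dump:
print(list(csv.reader(io.StringIO(doubled))))
# [['Id', 'Body'], ['1', 'say "hi"']]

# With escapechar='\\' the backslash convention is understood instead:
print(list(csv.reader(io.StringIO(backslashed), escapechar='\\')))
# [['Id', 'Body'], ['1', 'say "hi"']]
```

Spark's CSV source defaults to the backslash convention, which is why passing escape='"' makes it parse this dump correctly.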

Updated code

answer_df = sparkSession.read.csv(
    './stacksample/Answers_sample.csv',
    inferSchema=True, header=True, multiLine=True, escape='"')

answer_df.show(2)

Note the use of the escape parameter in csv().

Output

+---+-----------+-------------------+--------+-----+--------------------+
| Id|OwnerUserId|       CreationDate|ParentId|Score|                Body|
+---+-----------+-------------------+--------+-----+--------------------+
| 92|         61|2008-08-01 20:15:37|      90|   13|<p><a href="http:...|
|124|         26|2008-08-01 21:39:47|      80|   12|<p>I wound up usi...|
+---+-----------+-------------------+--------+-----+--------------------+

Hope it helps others and saves them some time.

answered Oct 08 '22 by Gaurav Gupta