
How to load CSV file with records on multiple lines?

I use Spark 2.3.0.

For an Apache Spark project I am working with this data set. When reading the CSV with Spark, the rows in the Spark DataFrame do not correspond to the correct rows in the CSV file (see the sample CSV below). The code looks like the following:

answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True)
answer_df.show(2)

Output

+--------------------+-------------+--------------------+--------+-----+--------------------+
|                  Id|  OwnerUserId|        CreationDate|ParentId|Score|                Body|
+--------------------+-------------+--------------------+--------+-----+--------------------+
|                  92|           61|2008-08-01T14:45:37Z|      90|   13|"<p><a href=""htt...|
|<p>A very good re...| though.</p>"|                null|    null| null|                null|
+--------------------+-------------+--------------------+--------+-----+--------------------+
only showing top 2 rows

However, when I used pandas, it worked like a charm.

df = pd.read_csv('./stacksample/Answers_sample.csv')
df.head(3) 

Output

Index Id    OwnerUserId CreationDate    ParentId    Score   Body
0   92  61  2008-08-01T14:45:37Z    90  13  <p><a href="http://svnbook.red-bean.com/">Vers...
1   124 26  2008-08-01T16:09:47Z    80  12  <p>I wound up using this. It is a kind of a ha...

My observation: Apache Spark is treating every physical line in the CSV file as a DataFrame record (which is reasonable for simple files), while pandas, on the other hand, intelligently figures out where each record actually ends (I am not sure based on which parameters).

Question: how can I instruct Spark to load the DataFrame properly?

The data to be loaded is as follows, with the lines starting with 92 and 124 beginning the two records.

Id,OwnerUserId,CreationDate,ParentId,Score,Body
92,61,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Version Control with Subversion</a></p>

<p>A very good resource for source control in general. Not really TortoiseSVN specific, though.</p>"
124,26,2008-08-01T16:09:47Z,80,12,"<p>I wound up using this. It is a kind of a hack, but it actually works pretty well. The only thing is you have to be very careful with your semicolons. : D</p>

<pre><code>var strSql:String = stream.readUTFBytes(stream.bytesAvailable);      
var i:Number = 0;
var strSqlSplit:Array = strSql.split("";"");
for (i = 0; i &lt; strSqlSplit.length; i++){
    NonQuery(strSqlSplit[i].toString());
}
</code></pre>
"
asked May 05 '18 by Gaurav Gupta


2 Answers

I think you should use option("escape", "\"") as it seems that " is used as the quote escape character, i.e. a quote is escaped by another quote.

val q = spark.read
  .option("multiLine", true)
  .option("header", true)
  .option("escape", "\"")
  .csv("input.csv")
scala> q.show
+---+-----------+--------------------+--------+-----+--------------------+
| Id|OwnerUserId|        CreationDate|ParentId|Score|                Body|
+---+-----------+--------------------+--------+-----+--------------------+
| 92|         61|2008-08-01T14:45:37Z|      90|   13|<p><a href="http:...|
|124|         26|2008-08-01T16:09:47Z|      80|   12|<p>I wound up usi...|
+---+-----------+--------------------+--------+-----+--------------------+
answered Oct 08 '22 by Jacek Laskowski


After a few hours of struggle I was able to figure out the solution.

Analysis: In the data dump provided by Stack Overflow, a quote (") is escaped by another quote ("). Since Spark uses backslash (\) as the default escape character, and I was not overriding it, the file ended up being parsed into meaningless output.
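The difference between the two escaping conventions can be illustrated with Python's standard `csv` module (this is only an illustration of the concept, not Spark's parser). Assuming the same field written both ways:

```python
import csv
import io

# The same logical field ('say "hi"') written under two conventions:
doubled     = 'Id,Body\n1,"say ""hi"""\n'    # quote doubled (RFC 4180 style)
backslashed = 'Id,Body\n1,"say \\"hi\\""\n'  # quote prefixed by a backslash

# The default dialect expects doubling, matching the Stack Overflow dump:
print(list(csv.reader(io.StringIO(doubled))))
# [['Id', 'Body'], ['1', 'say "hi"']]

# With escapechar='\\' the backslash convention is understood instead:
print(list(csv.reader(io.StringIO(backslashed), escapechar='\\')))
# [['Id', 'Body'], ['1', 'say "hi"']]
```

Spark's CSV source defaults to the backslash convention, which is why passing escape='"' makes it parse this dump correctly.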

Updated code

answer_df = sparkSession.read.csv(
    './stacksample/Answers_sample.csv',
    inferSchema=True, header=True, multiLine=True, escape='"')

answer_df.show(2)

Note the use of the escape parameter in csv().

Output

+---+-----------+-------------------+--------+-----+--------------------+
| Id|OwnerUserId|       CreationDate|ParentId|Score|                Body|
+---+-----------+-------------------+--------+-----+--------------------+
| 92|         61|2008-08-01 20:15:37|      90|   13|<p><a href="http:...|
|124|         26|2008-08-01 21:39:47|      80|   12|<p>I wound up usi...|
+---+-----------+-------------------+--------+-----+--------------------+

Hope it helps others and saves them some time.

answered Oct 08 '22 by Gaurav Gupta