I use Spark 2.3.0.
For an Apache Spark project, I am working with this data set. When trying to read the CSV using Spark, the rows in the Spark DataFrame do not correspond to the correct rows in the CSV file (see the sample CSV here). The code looks like the following:
answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv', header=True, inferSchema=True, multiLine=True)
answer_df.show(2)
Output
+--------------------+-------------+--------------------+--------+-----+--------------------+
| Id| OwnerUserId| CreationDate|ParentId|Score| Body|
+--------------------+-------------+--------------------+--------+-----+--------------------+
| 92| 61|2008-08-01T14:45:37Z| 90| 13|"<p><a href=""htt...|
|<p>A very good re...| though.</p>"| null| null| null| null|
+--------------------+-------------+--------------------+--------+-----+--------------------+
only showing top 2 rows
However, when I used pandas, it worked like a charm.
df = pd.read_csv('./stacksample/Answers_sample.csv')
df.head(3)
Output
Index Id OwnerUserId CreationDate ParentId Score Body
0 92 61 2008-08-01T14:45:37Z 90 13 <p><a href="http://svnbook.red-bean.com/">Vers...
1 124 26 2008-08-01T16:09:47Z 80 12 <p>I wound up using this. It is a kind of a ha...
My observation: Apache Spark treats every line in the CSV file as a record for the DataFrame (which is reasonable), whereas pandas intelligently (I am not sure based on which parameters) figures out where a record actually ends (see the sketch after the sample data below).
Question: I would like to know how I can instruct Spark to load the DataFrame properly.
The data to be loaded is as follows, with the lines starting with 92 and 124 being two records.
Id,OwnerUserId,CreationDate,ParentId,Score,Body
92,61,2008-08-01T14:45:37Z,90,13,"<p><a href=""http://svnbook.red-bean.com/"">Version Control with Subversion</a></p>
<p>A very good resource for source control in general. Not really TortoiseSVN specific, though.</p>"
124,26,2008-08-01T16:09:47Z,80,12,"<p>I wound up using this. It is a kind of a hack, but it actually works pretty well. The only thing is you have to be very careful with your semicolons. : D</p>
<pre><code>var strSql:String = stream.readUTFBytes(stream.bytesAvailable);
var i:Number = 0;
var strSqlSplit:Array = strSql.split("";"");
for (i = 0; i < strSqlSplit.length; i++){
NonQuery(strSqlSplit[i].toString());
}
</code></pre>
"
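For reference, here is a minimal sketch of what pandas relies on: Python's standard csv module follows the RFC 4180 convention that a doubled quote ("") inside a quoted field is a literal quote, and that newlines inside a quoted field do not end the record. (This assumes the sample above is saved at the same path as in my code.)
import csv

# Parse the sample with Python's stdlib csv reader (RFC 4180 semantics:
# "" inside a quoted field is a literal quote; newlines inside quotes
# do not terminate the record).
with open('./stacksample/Answers_sample.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

print(len(rows))               # 2 -- exactly the records with Id 92 and 124
print(rows[0][0], rows[1][0])  # 92 124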
I think you should use option("escape", "\"") as it seems that " is used as the so-called quote escape character.
val q = spark.read
  .option("multiLine", true)  // records may span multiple physical lines
  .option("header", true)
  .option("escape", "\"")     // "" inside a quoted field is a literal quote
  .csv("input.csv")
scala> q.show
+---+-----------+--------------------+--------+-----+--------------------+
| Id|OwnerUserId| CreationDate|ParentId|Score| Body|
+---+-----------+--------------------+--------+-----+--------------------+
| 92| 61|2008-08-01T14:45:37Z| 90| 13|<p><a href="http:...|
|124| 26|2008-08-01T16:09:47Z| 80| 12|<p>I wound up usi...|
+---+-----------+--------------------+--------+-----+--------------------+
After a few hours of struggle, I was able to figure out the solution.
Analysis:
The data dump provided by Stack Overflow escapes a quote (") with another quote ("). Since Spark uses backslash (\) as the default value for the escape character, and I was not passing escape, the read ended up giving meaningless output.
Updated code
answer_df = sparkSession.read.csv('./stacksample/Answers_sample.csv',
                                  inferSchema=True, header=True,
                                  multiLine=True,
                                  escape='"')  # "" in a quoted field is a literal quote
answer_df.show(2)
Note the use of the escape parameter in csv(). Also note that with inferSchema=True, CreationDate is now parsed as a timestamp and rendered in the session's local time zone, which is why it differs from the raw ISO-8601 string in the file.
Output
+---+-----------+-------------------+--------+-----+--------------------+
| Id|OwnerUserId| CreationDate|ParentId|Score| Body|
+---+-----------+-------------------+--------+-----+--------------------+
| 92| 61|2008-08-01 20:15:37| 90| 13|<p><a href="http:...|
|124| 26|2008-08-01 21:39:47| 80| 12|<p>I wound up usi...|
+---+-----------+-------------------+--------+-----+--------------------+
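As a quick sanity check (a sketch, assuming the same sample file as above), you can confirm that the escape option is exactly what makes the difference:
# Without escape='"', the embedded "" sequences break the parse even with
# multiLine=True; with it, exactly the two records come back.
broken_df = sparkSession.read.csv('./stacksample/Answers_sample.csv',
                                  header=True, multiLine=True)
fixed_df = sparkSession.read.csv('./stacksample/Answers_sample.csv',
                                 header=True, multiLine=True, escape='"')
print(broken_df.count())  # typically more than 2 rows due to the mis-parse
print(fixed_df.count())   # 2
print([r.Id for r in fixed_df.select('Id').collect()])  # ['92', '124']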
Hope this will help others and save them some time.