So I was trying to load a CSV file with a custom schema, but every time I end up with the following error:
Project_Bank.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [110, 111, 13, 10]
This is what my program and my CSV file entries look like:
age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
$spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val bankSchema = StructType(Array(
StructField("age", IntegerType, true),
StructField("job", StringType, true),
StructField("marital", StringType, true),
StructField("education", StringType, true),
StructField("default", StringType, true),
StructField("balance", IntegerType, true),
StructField("housing", StringType, true),
StructField("loan", StringType, true),
StructField("contact", StringType, true),
StructField("day", IntegerType, true),
StructField("month", StringType, true),
StructField("duration", IntegerType, true),
StructField("campaign", IntegerType, true),
StructField("pdays", IntegerType, true),
StructField("previous", IntegerType, true),
StructField("poutcome", StringType, true),
StructField("y", StringType, true)))
val df = sqlContext.
read.
schema(bankSchema).
option("header", "true").
option("delimiter", ";").
load("/user/amit.kudnaver_gmail/hadoop/project_bank/Project_Bank.csv").toDF()
df.registerTempTable("people")
df.printSchema()
val distinctage = sqlContext.sql("select distinct age from people")
Any suggestions as to why I am not able to work with the CSV file here, even after supplying the correct schema? Thanks in advance for your advice.
Thanks, Amit K
The Parquet file format contains a 4-byte magic number (PAR1) in the header and at the end of the footer. This magic number indicates that the file is in Parquet format. All the file metadata is stored in the footer section.
Parquet files are composed of a header, row groups, and a footer. Within each row group, values from the same column are stored together. This structure is well optimized both for fast query performance and for low I/O (minimizing the amount of data scanned).
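Incidentally, the tail bytes [110, 111, 13, 10] reported in the error decode to the ASCII characters "no" followed by CRLF, i.e. the end of the last CSV row, which confirms the reader was handed a CSV file while expecting Parquet. As a quick illustration (my own sketch using plain JVM I/O, no Spark required; the helper name isParquet is made up), this is how the magic-number check can be reproduced for a local file:
import java.io.RandomAccessFile

// Returns true when both the first and last 4 bytes of the file are the
// Parquet magic number "PAR1"; a CSV file fails this check immediately.
def isParquet(path: String): Boolean = {
  val magic = "PAR1".getBytes("US-ASCII")
  val file = new RandomAccessFile(path, "r")
  try {
    if (file.length < 8) false
    else {
      val head = new Array[Byte](4)
      val tail = new Array[Byte](4)
      file.readFully(head)
      file.seek(file.length - 4)
      file.readFully(tail)
      head.sameElements(magic) && tail.sameElements(magic)
    }
  } finally file.close()
}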
The problem here is that the DataFrame reader expects a Parquet file by default while processing the input. In order to handle data in CSV, here is what you can do.
First of all, remove the header row from the data.
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
Next, we write the following code to read the data.
Create a case class:
case class BankSchema(age: Int, job: String, marital:String, education:String, default:String, balance:Int, housing:String, loan:String, contact:String, day:Int, month:String, duration:Int, campaign:Int, pdays:Int, previous:Int, poutcome:String, y:String)
Read the data from HDFS and parse it:
val bankData = sc.textFile("/user/myuser/Project_Bank.csv")
  .map(_.split(";"))
  .map(p => BankSchema(p(0).toInt, p(1), p(2), p(3), p(4), p(5).toInt, p(6),
    p(7), p(8), p(9).toInt, p(10), p(11).toInt, p(12).toInt, p(13).toInt,
    p(14).toInt, p(15), p(16)))
  .toDF()
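One caveat about the code above (my own note, not part of the original answer): p(n).toInt throws a NumberFormatException if a header row or a malformed record is still present, which fails the whole job. A hedged variant that silently drops unparseable rows using scala.util.Try:
import scala.util.Try

// flatMap + Try(...).toOption keeps only the rows that parse cleanly, so a
// leftover header row or a truncated record is skipped instead of being fatal.
val bankDataSafe = sc.textFile("/user/myuser/Project_Bank.csv")
  .map(_.split(";"))
  .flatMap(p => Try(BankSchema(p(0).toInt, p(1), p(2), p(3), p(4), p(5).toInt,
    p(6), p(7), p(8), p(9).toInt, p(10), p(11).toInt, p(12).toInt,
    p(13).toInt, p(14).toInt, p(15), p(16))).toOption)
  .toDF()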
Then register the table and execute queries:
bankData.registerTempTable("bankData")
val distinctage = sqlContext.sql("select distinct age from bankData")
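The SQL call returns a new DataFrame; to print it to the console, call show() on the result:
distinctage.show()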
Here is what the output would look like:
+---+
|age|
+---+
| 33|
| 44|
| 58|
+---+
Here the expected file format is CSV, but as per the error it is looking for the Parquet file format. This can be overcome by explicitly specifying the file format as below (which was missing in the code shared in the question), because if we don't specify the file format, it expects the Parquet format by default.
Here is a sample in the Java version:
Dataset<Row> resultData = session.read().format("csv") // explicitly select the CSV source
        .option("sep", ",")              // use ";" for the semicolon-delimited file in the question
        .option("header", true)
        .option("mode", "DROPMALFORMED") // drop rows that fail to parse against the schema
        .schema(definedSchema)
        .load(inputPath);
Here, the schema can be defined either by using a Java class (i.e. a POJO class) or by using StructType, as already mentioned, and inputPath is the path of the input CSV file.
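For the Spark 1.x shell and the spark-csv package used in the question, the equivalent Scala fix would look like this (a sketch reusing the question's bankSchema and path; with spark-csv 1.5.0 the source name is com.databricks.spark.csv):
val df = sqlContext.read
  .format("com.databricks.spark.csv") // explicitly select the CSV source instead of the Parquet default
  .schema(bankSchema)
  .option("header", "true")
  .option("delimiter", ";")
  .load("/user/amit.kudnaver_gmail/hadoop/project_bank/Project_Bank.csv")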