 

How to read a CSV without a header and name the columns while reading in PySpark?

100000,20160214,93374987
100000,20160214,1925301
100000,20160216,1896542
100000,20160216,84167419
100000,20160216,77273616
100000,20160507,1303015

I want to read a CSV file that has no column names in its first row. How can I read it and name the columns with my specified names at the same time? For now, I just rename the original columns like this:

df = spark.read.csv("user_click_seq.csv",header=False)
df = df.withColumnRenamed("_c0", "member_srl")
df = df.withColumnRenamed("_c1", "click_day")
df = df.withColumnRenamed("_c2", "productid")

Any better way ?

asked Jun 15 '17 by yanachen


3 Answers

You can import the CSV file into a DataFrame with a predefined schema. You define a schema using the StructType and StructField objects. Assuming all of your columns hold integers:

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("member_srl", IntegerType(), True),
    StructField("click_day", IntegerType(), True),
    StructField("productid", IntegerType(), True)
])

df = spark.read.csv("user_click_seq.csv", header=False, schema=schema)

should work.

answered Oct 04 '22 by DavidWayne

For those who would like to do this in Scala and may not want to specify types:

val df = spark.read.format("csv")
                   .option("header","false")
                   .load("hdfs_filepath")
                   .toDF("var0","var1","var2","var3")
answered Oct 04 '22 by Climbs_lika_Spyder


You can read the data with header=False and then assign the column names with toDF, as below:

data = spark.read.csv('data.csv', header=False)
data = data.toDF('name1', 'name2', 'name3')
answered Oct 04 '22 by Mohammad Reza Malekpour