Add column names to data read from csv file without column names

I am using Apache Spark with Scala.

I have a csv file that does not have column names in the first row. It's like this:

28,Martok,49,476
29,Nog,48,364
30,Keiko,50,175
31,Miles,39,161

The columns represent ID, name, age, numOfFriends.

In my Scala object, I am creating a dataset from the csv file using SparkSession as follows:

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val df = spark.read.option("inferSchema","true").csv("../myfile.csv")
df.printSchema()

When I run the program, the result is:

|-- _c0: integer (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)

How can I add names to the columns in my dataset?

Placid asked Nov 05 '17 11:11

2 Answers

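The toDF method can be used to pass in the column names when using Spark from Java.

Example (this one reads a tab-delimited file that has a header row and two columns; for a file with no header row, omit option("header", "true") so the first data row is not consumed as a header):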

Dataset<Row> rowsWithTitle = sparkSession.read().option("header", "true").option("delimiter", "\t").csv("file").toDF("h1", "h2");

padmaja ramesh answered Nov 10 '22 20:11

You can use toDF to specify column names when reading the CSV file:

val df = spark.read.option("inferSchema","true").csv("../myfile.csv").toDF(
  "ID", "name", "age", "numOfFriends"
)

Or, if you already have the DataFrame created, you can rename its columns as follows:

val newColNames = Seq("ID", "name", "age", "numOfFriends")
val df2 = df.toDF(newColNames: _*)
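
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the toDF approach. It assumes Spark SQL is on the classpath. The object name NameCsvColumns, the helper readWithNames, and the temporary file are made up for illustration and are not part of the original question. It writes the sample rows from the question to a temporary header-less CSV file, reads it back, and renames all columns in one call.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object NameCsvColumns {

  // Hypothetical helper: read a header-less CSV and name its columns.
  // toDF replaces every column name at once, so `names` must contain
  // exactly as many entries as the file has columns.
  def readWithNames(spark: SparkSession, path: String, names: Seq[String]): DataFrame =
    spark.read
      .option("inferSchema", "true")
      .csv(path)
      .toDF(names: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("NameCsvColumns")
      .getOrCreate()

    // Sample rows taken from the question, written to a temporary file
    // (an assumption made so the sketch does not depend on ../myfile.csv).
    val path = Files.createTempFile("friends", ".csv")
    val rows = "28,Martok,49,476\n29,Nog,48,364\n30,Keiko,50,175\n31,Miles,39,161\n"
    Files.write(path, rows.getBytes("UTF-8"))

    val df = readWithNames(spark, path.toString, Seq("ID", "name", "age", "numOfFriends"))

    // The schema now shows the given names instead of _c0, _c1, ...
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

If the number of names passed to toDF does not match the number of columns, Spark rejects the call with an IllegalArgumentException, so the list has to cover every column even when only one of them needs a new name.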

Leo C answered Nov 10 '22 20:11