I am using Apache Spark with Scala.
I have a CSV file that does not have column names in the first row. It looks like this:
28,Martok,49,476
29,Nog,48,364
30,Keiko,50,175
31,Miles,39,161
The columns represent ID, name, age, numOfFriends.
In my Scala object, I create a Dataset from the CSV file using SparkSession as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local[*]").getOrCreate()
val df = spark.read.option("inferSchema","true").csv("../myfile.csv")
df.printSchema()
When I run the program, the result is:
|-- _c0: integer (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)
How can I add names to the columns in my dataset?
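The toDF method can be used to pass in the column names. The following example is for Spark with Java and reads a tab-delimited file with two columns. Note that with the "header" option set to "true" the first row is treated as a header rather than data, so leave that option out for a headerless file like the one in the question.
Example: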
Dataset<Row> rowsWithTitle = sparkSession.read().option("header", "true").option("delimiter", "\t").csv("file").toDF("h1", "h2");
You can use toDF to specify column names when reading the CSV file:
val df = spark.read.option("inferSchema","true").csv("../myfile.csv").toDF(
"ID", "name", "age", "numOfFriends"
)
Or, if you already have the DataFrame created, you can rename its columns as follows:
val newColNames = Seq("ID", "name", "age", "numOfFriends")
val df2 = df.toDF(newColNames: _*)
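Another option, not mentioned in the answers above, is to pass an explicit schema to the reader instead of relying on inferSchema, which sets the column names and the types in one step. The following is a minimal sketch under assumptions: the object name NamedColumnsWithSchema is hypothetical, and the column types are guessed from the sample rows in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical object name, used only for this illustration.
object NamedColumnsWithSchema {
  // Column names come from the question; the types are an assumption
  // based on the sample rows (three integer columns and one string column).
  val schema: StructType = StructType(Seq(
    StructField("ID", IntegerType, nullable = true),
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("numOfFriends", IntegerType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()

    // With an explicit schema, inferSchema is not needed.
    val df = spark.read.schema(schema).csv("../myfile.csv")
    df.printSchema()

    spark.stop()
  }
}
```

Compared with toDF, this avoids the extra pass over the file that inferSchema makes, at the cost of having to state each column type up front.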