
Find column index by searching column header of a Dataset in Apache Spark Java

I have a Spark Dataset similar to the example below:

       0         1                  2          3
    +------+------------+--------------------+---+
    |ItemID|Manufacturer|       Category     |UPC|
    +------+------------+--------------------+---+
    |   804|         ael|Brush & Broom Han...|123|
    |   805|         ael|Wheel Brush Parts...|124|
    +------+------------+--------------------+---+

I need to find the position of a column by searching the column header.

For example:

int position=getColumnPosition("Category");

This should return 2.

Is there a Spark function on Dataset<Row> that returns a column's index, or a Java function that can do this on a Spark Dataset?

asked Apr 13 '17 12:04 by Shreeharsha



1 Answer

You need to access the schema and read the field index as follows:

int position = df.schema().fieldIndex("Category");
answered Sep 18 '22 12:09 by shants
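
A note on the missing-column case: StructType.fieldIndex throws an IllegalArgumentException when the name is not in the schema. If returning -1 is preferable, one option is to search the String[] returned by df.columns() instead. The sketch below is only an illustration, not part of the answer above: the ColumnIndex class and the getColumnPosition helper are hypothetical names, and the helper takes a plain String[] standing in for df.columns() so that it runs without Spark on the classpath.

```java
import java.util.Arrays;

class ColumnIndex {
    // Hypothetical helper: returns the zero-based position of `name`
    // in `columns`, or -1 if no column has that exact name.
    public static int getColumnPosition(String[] columns, String name) {
        return Arrays.asList(columns).indexOf(name);
    }

    public static void main(String[] args) {
        // Stand-in for df.columns() on the Dataset from the question.
        String[] columns = {"ItemID", "Manufacturer", "Category", "UPC"};

        System.out.println(getColumnPosition(columns, "Category")); // prints 2
        System.out.println(getColumnPosition(columns, "Price"));    // prints -1
    }
}
```

With a real Dataset<Row>, the call would be getColumnPosition(df.columns(), "Category"). The match is an exact, case-sensitive string comparison against top-level column names.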