
How do I attach a schema to a Row in Spark?

Tags:

apache-spark

The Java Row API has row.schema(), but there is no row.set(StructType schema).

I also tried RowFactory.create(objects), but I don't know how to proceed from there.

UPDATE:

The problem is how to generate a new DataFrame when I modify the row structure in the workers. Here is an example:

DataFrame sentenceData = jsql.createDataFrame(jrdd, schema);
List<Row> resultRows = sentenceData.toJavaRDD()
        .map(new MyFunction<Row, Row>(parameters) {
            /* my map function */
            public Row call(Row row) {
                // I want to change the Row definition by adding new columns
                Row newRow = functionAddnewNewColumns(row);
                StructType newSchema = functionGetNewSchema(row.schema());

                // Here I want to attach the new schema to the row

                return newRow;
            }
        }).collect();

JavaRDD<Row> newJrdd = jsc.parallelize(resultRows);

// Here is the problem: I don't know how to get the new schema here
// to create the new, modified DataFrame

DataFrame newDataframe = jsql.createDataFrame(newJrdd, newSchema);
Rafael del Hoyo asked Nov 26 '15 09:11


2 Answers

You can create a Row that carries its schema by using:

Row newRow = new GenericRowWithSchema(values, newSchema);
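A minimal sketch of that call with its imports, assuming a Spark 1.x classpath. GenericRowWithSchema lives in the catalyst package, which is internal to Spark, so it may change between versions. The class name RowWithSchemaSketch, the helper withSchema and the columns "sentence" and "length" are hypothetical and only serve the illustration.

```java
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

class RowWithSchemaSketch {
    // Wraps the values in a Row whose schema() returns the given StructType.
    static Row withSchema(Object[] values, StructType schema) {
        return new GenericRowWithSchema(values, schema);
    }

    public static void main(String[] args) {
        // Hypothetical two-column schema, used only for this example.
        StructType newSchema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("sentence", DataTypes.StringType, true),
            DataTypes.createStructField("length", DataTypes.IntegerType, false)
        });

        Row newRow = withSchema(new Object[] {"hello spark", 11}, newSchema);

        // The row now reports the schema it was built with.
        System.out.println(newRow.schema().simpleString());
    }
}
```

The values must be given in the same order as the fields of the schema; the constructor itself does not check that they match.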
Christian answered Sep 28 '22 05:09


You do not set a schema on a row; that makes no sense. You can, however, create a DataFrame (or, before Spark 1.3, a JavaSchemaRDD) with a given schema using the sqlContext:

DataFrame dataFrame = sqlContext.createDataFrame(rowRDD, schema);

The DataFrame will have the schema you provided.

For further information, please consult the documentation at http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema

EDIT: In response to the updated question

You can generate the new rows in your map function, which gives you a new RDD of type JavaRDD<Row>:

DataFrame sentenceData = jsql.createDataFrame(jrdd, schema);
// Assuming functionAddnewNewColumns returns a Row
JavaRDD<Row> newRowRDD = sentenceData
   .toJavaRDD()
   .map(row -> functionAddnewNewColumns(row));

You then define the new schema:

StructField[] fields = new StructField[] {
   new StructField("column1",...),
   new StructField("column2",...),
   ...
};
StructType newSchema = new StructType(fields);

Create a new DataFrame from newRowRDD, using newSchema as the schema:

DataFrame newDataframe = jsql.createDataFrame(newRowRDD, newSchema);
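The steps above can be sketched with two small helpers, assuming the new columns are simply appended to the existing ones. The class SchemaExtensionSketch and the methods appendColumn and appendValue are hypothetical names, not part of Spark; they stand in for the question's functionGetNewSchema and functionAddnewNewColumns. The point of the sketch is that the new schema is derived once on the driver from the old schema, while the workers only produce plain rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

class SchemaExtensionSketch {
    // Driver side: copy the existing fields and append one nullable column.
    static StructType appendColumn(StructType oldSchema, String name, DataType type) {
        List<StructField> fields = new ArrayList<>(Arrays.asList(oldSchema.fields()));
        fields.add(DataTypes.createStructField(name, type, true));
        return DataTypes.createStructType(fields);
    }

    // Worker side: copy the existing values and append the new one,
    // keeping the same order as the fields of the extended schema.
    static Row appendValue(Row row, Object value) {
        Object[] values = new Object[row.length() + 1];
        for (int i = 0; i < row.length(); i++) {
            values[i] = row.get(i);
        }
        values[row.length()] = value;
        return RowFactory.create(values);
    }
}
```

On the driver, newSchema would then come from a call such as appendColumn(sentenceData.schema(), ...), and the map function would call appendValue for each row. Nothing about the schema needs to travel back from the workers, and no collect() followed by parallelize() is required.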
Glennie Helles Sindholt answered Sep 28 '22 05:09