I have a List<String> data. Something like: 
[[dev, engg, 10000], [karthik, engg, 20000]..]
I know schema for this data.
name (String)
degree (String)
salary (Integer)
I tried:
JavaRDD<String> data = new JavaSparkContext(sc).parallelize(datas);
DataFrame df = sqlContext.read().json(data);
df.printSchema();
df.show(false);
Output:
root
 |-- _corrupt_record: string (nullable = true)
+-----------------------------+
|_corrupt_record              |
+-----------------------------+
|[dev, engg, 10000]           |
|[karthik, engg, 20000]       |
+-----------------------------+
Because List<String> is not a proper JSON.
Do I need to create a proper JSON or is there any other way to do this?
You can create DataFrame from  List<String> and then use  selectExpr and split to get desired DataFrame.
public class SparkSample{
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SparkSample").setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    SQLContext sqc = new SQLContext(jsc);
    // sample data
    List<String> data = new ArrayList<String>();
    data.add("dev, engg, 10000");
    data.add("karthik, engg, 20000");
    // DataFrame
    DataFrame df = sqc.createDataset(data, Encoders.STRING()).toDF();
    df.printSchema();
    df.show();
    // Convert
    DataFrame df1 = df.selectExpr("split(value, ',')[0] as name", "split(value, ',')[1] as degree","split(value, ',')[2] as salary");
    df1.printSchema();
    df1.show(); 
   }
}
You will get below output.
root
 |-- value: string (nullable = true)
+--------------------+
|               value|
+--------------------+
|    dev, engg, 10000|
|karthik, engg, 20000|
+--------------------+
root
 |-- name: string (nullable = true)
 |-- degree: string (nullable = true)
 |-- salary: string (nullable = true)
+-------+------+------+
|   name|degree|salary|
+-------+------+------+
|    dev|  engg| 10000|
|karthik|  engg| 20000|
+-------+------+------+
The sample data you have provided has empty spaces. If you want to remove space and have the salary type as "integer" then you can use trim and cast function like below. 
df1 = df1.select(trim(col("name")).as("name"),trim(col("degree")).as("degree"),trim(col("salary")).cast("integer").as("salary")); 
                        DataFrame createNGramDataFrame(JavaRDD<String> lines) {
 JavaRDD<Row> rows = lines.map(new Function<String, Row>(){
    private static final long serialVersionUID = -4332903997027358601L;
    @Override
    public Row call(String line) throws Exception {
        return RowFactory.create(line.split("\\s+"));
    }
 });
 StructType schema = new StructType(new StructField[] {
        new StructField("words",
                DataTypes.createArrayType(DataTypes.StringType), false,
                Metadata.empty()) });
 DataFrame wordDF = new SQLContext(jsc).createDataFrame(rows, schema);
 // build a bigram language model
 NGram transformer = new NGram().setInputCol("words")
        .setOutputCol("ngrams").setN(2);
 DataFrame ngramDF = transformer.transform(wordDF);
 ngramDF.show(10, false);
 return ngramDF;
}
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With