
Load a file from an SFTP server into a Spark RDD

How can I load a file from an SFTP server into a Spark RDD? After loading the file I need to perform some filtering on the data. The file is a CSV file, so could you also help me decide whether I should use DataFrames or RDDs?

vindev asked Apr 14 '17 06:04




1 Answer

You can use the spark-sftp library in your program in the following ways:

For Spark 2.x

Maven Dependency

<dependency>
    <groupId>com.springml</groupId>
    <artifactId>spark-sftp_2.11</artifactId>
    <version>1.1.0</version>
</dependency>

SBT Dependency

libraryDependencies += "com.springml" % "spark-sftp_2.11" % "1.1.0"

Using with Spark shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

$ bin/spark-shell --packages com.springml:spark-sftp_2.11:1.1.0

Scala API

// Construct a Spark DataFrame from a file on the SFTP server
val df = spark.read.
            format("com.springml.spark.sftp").
            option("host", "SFTP_HOST").
            option("username", "SFTP_USER").
            option("password", "****").
            option("fileType", "csv").
            option("inferSchema", "true").
            load("/ftp/files/sample.csv")

// Write the DataFrame as a CSV file to the SFTP server
df.write.
      format("com.springml.spark.sftp").
      option("host", "SFTP_HOST").
      option("username", "SFTP_USER").
      option("password", "****").
      option("fileType", "csv").
      save("/ftp/files/sample.csv")
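
For the filtering part of the question, here is a minimal sketch for Spark 2.x, assuming the file loads as shown above. The column name "amount" and the environment variable SFTP_PASSWORD are hypothetical and only there for illustration; check the real column names with printSchema() first. Because the file is a CSV, it is usually simpler to filter on the DataFrame and convert to an RDD only if a later step needs one.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("sftp-csv-filter").getOrCreate()

// Same read as in the answer; host, user, and path are placeholders.
val df: DataFrame = spark.read.
  format("com.springml.spark.sftp").
  option("host", "SFTP_HOST").
  option("username", "SFTP_USER").
  option("password", sys.env("SFTP_PASSWORD")). // hypothetical environment variable
  option("fileType", "csv").
  option("inferSchema", "true").
  load("/ftp/files/sample.csv")

// Inspect the column names and inferred types before filtering on them.
df.printSchema()

// Hypothetical filter: "amount" is an assumed numeric column, not one from the question.
val filtered: DataFrame = df.filter(col("amount") > 100)

// Convert to an RDD only if RDD-specific operations are needed afterwards.
val rows: RDD[Row] = filtered.rdd
```

Start the shell with the --packages option shown above so that the com.springml.spark.sftp format can be resolved.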

For Spark 1.x (1.5+)

Maven Dependency

<dependency>
    <groupId>com.springml</groupId>
    <artifactId>spark-sftp_2.10</artifactId>
    <version>1.0.2</version>
</dependency>

SBT Dependency

libraryDependencies += "com.springml" % "spark-sftp_2.10" % "1.0.2"

Using with Spark shell

This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell:

$ bin/spark-shell --packages com.springml:spark-sftp_2.10:1.0.2

Scala API

import org.apache.spark.sql.SQLContext

// Construct a Spark DataFrame from a file on the SFTP server
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.
                    format("com.springml.spark.sftp").
                    option("host", "SFTP_HOST").
                    option("username", "SFTP_USER").
                    option("password", "****").
                    option("fileType", "csv").
                    option("inferSchema", "true").
                    load("/ftp/files/sample.csv")

// Write the DataFrame as a CSV file to the SFTP server
df.write.
      format("com.springml.spark.sftp").
      option("host", "SFTP_HOST").
      option("username", "SFTP_USER").
      option("password", "****").
      option("fileType", "csv").
      save("/ftp/files/sample.csv")

For more information on spark-sftp, you can visit their GitHub page: springml/spark-sftp

himanshuIIITian answered Sep 26 '22 11:09