After reading several pages of Apache Flink documentation (the official documentation and data Artisans) as well as the examples in the official repository, I keep seeing examples that use an already-downloaded file as the streaming data source, always connecting to localhost.
I am trying to use Apache Flink to download JSON files that contain dynamic data. I would like to set the URL of the JSON file as the input source for Apache Flink, instead of downloading the file with another system and then processing the downloaded file with Apache Flink.
Is it possible to establish this network connection with Apache Flink?
You can define the URLs you want to download as your input DataStream and then download the documents from within a MapFunction. The following code demonstrates this:
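import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Each element of the input stream is a URL to download.
DataStream<String> inputURLs = env.fromElements("http://www.json.org/index.html");
inputURLs.map(new MapFunction<String, String>() {
    @Override
    public String map(String s) throws Exception {
        URL url = new URL(s);
        StringBuilder builder = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) even if reading fails.
        // Assumes the document is UTF-8 encoded.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        // An IOException propagates to Flink instead of silently returning a partial document.
        return builder.toString();
    }
}).print();
env.execute("URL download job");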
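If you want to check the download logic without starting a Flink job, it can be pulled out of the map function into a small plain-Java helper. The following is only a sketch under some assumptions: the class name UrlFetcher, its method names, the timeout values and the UTF-8 encoding are all made up for illustration and are not part of Flink.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class (not part of Flink): downloads a document as a String.
final class UrlFetcher {

    private UrlFetcher() {
    }

    // Reads the reader line by line and ends every line with "\n".
    // The reader is closed when this method returns or throws.
    static String readAll(Reader reader) throws IOException {
        StringBuilder builder = new StringBuilder();
        try (BufferedReader bufferedReader = new BufferedReader(reader)) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }

    // Opens the given absolute URL and returns its content.
    // Timeout values are assumptions; the content is assumed to be UTF-8.
    static String fetch(String address) throws IOException {
        URLConnection connection = URI.create(address).toURL().openConnection();
        connection.setConnectTimeout(5_000);  // milliseconds
        connection.setReadTimeout(10_000);    // milliseconds
        return readAll(new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8));
    }
}
```

With such a helper, the body of map in the job above could shrink to a single line, return UrlFetcher.fetch(s); and the timeouts keep one unresponsive server from blocking the operator indefinitely.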