Hi, I am trying to read specific lines from a text file using Spark.
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");
String firstLine = lines.first();
I can use the first() method to fetch the first line of data.txt. How can I access the Nth line of the document? I need a Java solution.
Apache Spark RDDs are not meant to be used for lookups. The most "efficient" way to get the nth line (counting from zero) would be lines.take(n + 1).get(n). Every time you do this, it will read the first n + 1 lines of the file. You could run lines.cache() to avoid that, but it will still move those lines over the network in a very inefficient dance.
If the data can fit on one machine, just collect it all once and access it locally: List<String> local = lines.collect(); local.get(n);
If the data does not fit on one machine, you need a distributed system which supports efficient lookups. Popular examples are HBase and Cassandra.
It is also possible that your problem can be solved efficiently with Spark, but not via lookups. If you explain the larger problem in a separate question, you may get a solution like that. (Lookups are very common in single-machine applications, but distributed algorithms have to think differently.)
I think this is as fast as it gets (Scala):
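// n is zero-based; first() returns the (line, index) pair, so ._1 extracts the line itself
def getNthLine(n: Long): String =
  lines.zipWithIndex().filter(_._2 == n).first()._1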
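Since the question asks for Java, the zipWithIndex approach above could be written roughly as follows. This is only a sketch: the class name NthLine and the Optional return type are my own choices and not part of any Spark API, and n is assumed to be zero-based. Note that it still scans the whole file on every call, so it does not turn the RDD into a lookup structure.

```java
import java.util.List;
import java.util.Optional;

import org.apache.spark.api.java.JavaRDD;

// Hypothetical helper class; the name is chosen for illustration only.
public final class NthLine {
    private NthLine() {}

    /**
     * Returns the line at zero-based index n, or Optional.empty()
     * if the RDD contains fewer than n + 1 lines.
     */
    public static Optional<String> getNthLine(JavaRDD<String> lines, long n) {
        List<String> hit = lines
                .zipWithIndex()                  // pairs of (line, zero-based index)
                .filter(pair -> pair._2() == n)  // keep only the pair whose index is n
                .keys()                          // drop the index, keep the line
                .take(1);                        // at most one element is sent to the driver
        return hit.isEmpty() ? Optional.empty() : Optional.of(hit.get(0));
    }
}
```

With the lines RDD from the question, NthLine.getNthLine(lines, 4) would return the fifth line wrapped in an Optional. Using take(1) instead of first() avoids an exception when n is past the end of the file.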