Spark LuceneRDD - how does it work

Could you please help me figure out what happens when a Spark LuceneRDD is initialized?

There is an official example here:

import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._

val capitals = spark.read.parquet("capitals.parquet").select("name", "country")
val luceneRDD = LuceneRDD(capitals)
val result = luceneRDD.termQuery("name", "ottawa", 10)

But I'm not familiar with Scala and have trouble reading the source code. Could you please answer the following questions:

  1. How does spark-lucenerdd index capitals.parquet? How can I index all values, i.e. every row of every column?
  2. Can I set number of partitions for luceneRDD?
asked by VB_

1 Answer

(disclaimer: I am the author of LuceneRDD)

Take a look at the slides that I have prepared:

https://www.slideshare.net/zouzias/lucenerdd-for-geospatial-search-and-entity-linkage

In a nutshell, LuceneRDD instantiates an inverted index on each Spark executor and collects and aggregates search results from the executors to the Spark driver. The main motivation behind LuceneRDD is to natively extend Spark's capabilities with full-text search, geospatial search and entity linkage without requiring an external SolrCloud or Elasticsearch cluster as a dependency.
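To make that fan-out/aggregate pattern concrete, here is a minimal, self-contained sketch in plain Spark. This is not the library's actual code: a simple per-partition Map stands in for the real Lucene inverted index, and the names (FanOutSketch, perPartitionIndex, query, topK) are purely illustrative.

// Conceptual sketch of LuceneRDD's fan-out/aggregate pattern, NOT the
// library's implementation. A plain Map per partition stands in for a
// real Lucene inverted index.
import org.apache.spark.sql.SparkSession

object FanOutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fanout-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy documents spread over 2 partitions, one "index" per partition.
    val docs = sc.parallelize(Seq("ottawa", "oslo", "madrid", "ottawa valley"), numSlices = 2)
    val perPartitionIndex = docs.mapPartitions { iter =>
      val partitionDocs = iter.toVector
      // Build a term -> matching-documents map for this partition only.
      val index = partitionDocs
        .flatMap(doc => doc.split("\\s+").map(term => term -> doc))
        .groupBy(_._1)
        .map { case (term, pairs) => term -> pairs.map(_._2) }
      Iterator(index)
    }.cache()

    // A "termQuery": every executor answers from its local index, and the
    // driver collects/aggregates the per-partition results.
    val query = "ottawa"
    val topK = 10
    val results = perPartitionIndex
      .flatMap(index => index.getOrElse(query, Vector.empty))
      .take(topK) // driver-side aggregation
    results.foreach(println)

    spark.stop()
  }
}

The key point the sketch illustrates is that each index lives inside its partition, so queries run in parallel on the executors and only the small top-k result set travels back to the driver.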

To answer your questions:

  1. All columns of your DataFrame are indexed by default.
  2. You can set the number of partitions by repartitioning your input DataFrame before indexing, i.e.,
LuceneRDD(capitals.repartition(numPartitions = 10))
(A short verification sketch follows below.)
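For instance, a minimal sketch assuming the same capitals DataFrame as above; since LuceneRDD extends Spark's RDD, the standard getNumPartitions accessor should reflect the repartitioning:

val repartitioned = capitals.repartition(numPartitions = 10)
val luceneRDD = LuceneRDD(repartitioned)
// LuceneRDD extends RDD, so the usual partition accessor applies
println(luceneRDD.getNumPartitions) // expected: 10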
answered by Zouzias


