Spark LuceneRDD - how does it work

Could you please help me figure out what happens when a Spark LuceneRDD is initialized?

There is an official example here:

import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._

val capitals = spark.read.parquet("capitals.parquet").select("name", "country")
val luceneRDD = LuceneRDD(capitals)
val result = luceneRDD.termQuery("name", "ottawa", 10)

But I'm not familiar with Scala and have trouble reading the source code. Could you please answer the following questions:

  1. How does spark-lucenerdd index capitals.parquet? How can I index all values, i.e. every row of every column?
  2. Can I set number of partitions for luceneRDD?
asked by VB_

1 Answer

(disclaimer: I am the author of LuceneRDD)

Take a look at the slides that I have prepared:

https://www.slideshare.net/zouzias/lucenerdd-for-geospatial-search-and-entity-linkage

In a nutshell, LuceneRDD instantiates an inverted index on each Spark executor and collects and aggregates search results from the executors to the Spark driver. The main motivation behind LuceneRDD is to natively extend Spark's capabilities with full-text search, geospatial search and entity linkage without requiring an external SolrCloud or Elasticsearch cluster as a dependency.
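To make that fan-out/aggregate pattern concrete, here is a minimal, self-contained sketch in plain Spark. This is not the library's actual code: a simple per-partition Map stands in for the real Lucene inverted index, and the names (FanOutSketch, perPartitionIndex, query, topK) are purely illustrative.

// Conceptual sketch of LuceneRDD's fan-out/aggregate pattern, NOT the
// library's implementation. A plain Map per partition stands in for a
// real Lucene inverted index.
import org.apache.spark.sql.SparkSession

object FanOutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fanout-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy documents spread over 2 partitions, one "index" per partition.
    val docs = sc.parallelize(Seq("ottawa", "oslo", "madrid", "ottawa valley"), numSlices = 2)
    val perPartitionIndex = docs.mapPartitions { iter =>
      val partitionDocs = iter.toVector
      // Build a term -> matching-documents map for this partition only.
      val index = partitionDocs
        .flatMap(doc => doc.split("\\s+").map(term => term -> doc))
        .groupBy(_._1)
        .map { case (term, pairs) => term -> pairs.map(_._2) }
      Iterator(index)
    }.cache()

    // A "termQuery": every executor answers from its local index, and the
    // driver collects/aggregates the per-partition results.
    val query = "ottawa"
    val topK = 10
    val results = perPartitionIndex
      .flatMap(index => index.getOrElse(query, Vector.empty))
      .take(topK) // driver-side aggregation
    results.foreach(println)

    spark.stop()
  }
}

The key point the sketch illustrates is that each index lives inside its partition, so queries run in parallel on the executors and only the small top-k result set travels back to the driver.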

To answer your questions:

  1. All columns of your DataFrame are indexed by default.
  2. You can set the number of partitions by repartitioning your input DataFrame before indexing, i.e.,
LuceneRDD(capitals.repartition(numPartitions = 10))
(A short verification sketch follows below.)
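For instance, a minimal sketch assuming the same capitals DataFrame as above; since LuceneRDD extends Spark's RDD, the standard getNumPartitions accessor should reflect the repartitioning:

val repartitioned = capitals.repartition(numPartitions = 10)
val luceneRDD = LuceneRDD(repartitioned)
// LuceneRDD extends RDD, so the usual partition accessor applies
println(luceneRDD.getNumPartitions) // expected: 10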
answered by Zouzias


