I have a requirement where I have to index HDFS files (including TXT, PDF, DOCX, and other rich documents) into Solr.
Currently, I am using the DirectoryIngestMapper of the LucidWorks connector to achieve this.
https://github.com/lucidworks/hadoop-solr
But I cannot work with this because it has certain limitations (the main one being that you cannot specify the filetypes to be considered).
So now I am looking into the possibility of using MapReduceIndexerTool, but it doesn't have many beginner-level (I mean absolutely basic!) examples.
Could someone post some links with examples for starting with the MapReduceIndexerTool? Is there some other better or easier way to index files in HDFS?
On Cloudera I think that you have these options:
About MapReduceIndexerTool, here is a quick guide:
This guide shows you how to index/upload a CSV file to Solr using MapReduceIndexerTool.
The procedure reads the CSV from HDFS and writes the index directly inside HDFS.
See also https://www.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html .
Assuming that you have:
- a Cloudera host (THIS_IS_YOUR_CLOUDERA_HOST; if using the Docker Quickstart it should be quickstart.cloudera)
- a CSV file stored in HDFS (THIS_IS_YOUR_INPUT_CSV_FILE, like /your-hdfs-dir/your-csv.csv)
- a destination Solr collection (THIS_IS_YOUR_DESTINATION_COLLECTION)
- the collection's instanceDir (THIS_IS_YOUR_CORE_INSTANCEDIR), which should be an HDFS path

For this example we will process a TAB-separated file with uid, firstName and lastName columns. The first row contains the headers. The Morphlines configuration file will skip the first line, so the actual column names don't matter; columns are just expected in this order.
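For reference, a minimal input file could look like the sketch below (the rows are invented for illustration); the columns are TAB-separated and follow the same order as the readCSV mapping in the morphline further down (uid, lastName, firstName), with a header row that gets skipped:
id	last	first
1	Doe	John
2	Smith	Jane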
In Solr we should configure the fields with something similar to this:
<field name="_version_" type="long" indexed="true" stored="true" />
<field name="uid" type="string" indexed="true" stored="true" required="true" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" multiValued="true" />
Then you should create a Morphlines configuration file (csv-to-solr-morphline.conf) with the following code:
# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : THIS_IS_YOUR_DESTINATION_COLLECTION

  # ZooKeeper ensemble
  zkHost : "THIS_IS_YOUR_CLOUDERA_HOST:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        readCSV {
          separator : "\t"
          # These columns should map to the fields configured in Solr and are
          # expected in this order inside the CSV
          columns : [uid,lastName,firstName]
          ignoreFirstLine : true
          quoteChar : ""
          commentPrefix : ""
          trim : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that is not specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
To import, run the following command inside the cluster:
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
--morphline-file ./csv-to-solr-morphline.conf \
--zk-host quickstart.cloudera:2181/solr \
--solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
--collection THIS_IS_YOUR_DESTINATION_COLLECTION \
--go-live \
hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE
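Two handy checks: the tool also has a --dry-run flag, which executes the morphline locally and prints the resulting documents to stdout instead of building an index, so you can debug the configuration quickly; and once the --go-live merge has finished, you can verify that documents arrived with a plain Solr query. The port 8983 below is the Solr default and is only an assumption for the Quickstart host:
curl "http://THIS_IS_YOUR_CLOUDERA_HOST:8983/solr/THIS_IS_YOUR_DESTINATION_COLLECTION/select?q=*:*&rows=5&wt=json"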
Some considerations:
- You may need sudo -u hdfs to run the above command, because you may not have permission to write to the HDFS output directory.
- If the job runs out of memory, you may need to increase yarn.app.mapreduce.am.command-opts, mapreduce.map.java.opts and mapreduce.map.memory.mb inside /etc/hadoop/conf/map-red-sites.xml.
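If you prefer not to edit the cluster-wide configuration, the tool is launched through Hadoop's generic options parser, so (as far as I can tell) the same properties can be overridden per run with -D; the heap sizes below are made up for illustration:
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
   org.apache.solr.hadoop.MapReduceIndexerTool \
   -D mapreduce.map.memory.mb=4096 \
   -D mapreduce.map.java.opts=-Xmx3g \
   -D yarn.app.mapreduce.am.command-opts=-Xmx2g \
   --output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
   --morphline-file ./csv-to-solr-morphline.conf \
   --zk-host quickstart.cloudera:2181/solr \
   --collection THIS_IS_YOUR_DESTINATION_COLLECTION \
   --go-live \
   hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE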
Regarding "it has certain limitations (the main one being that you cannot specify the filetypes to be considered)": with https://github.com/lucidworks/hadoop-solr the input is a path.
So you can filter by file name:
-i /path/*.pdf
Edit:
You can add the add.subdirectories argument, but the *.pdf pattern is not applied recursively (see the project source on GitHub):
-Dadd.subdirectories=true
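For context, a full hadoop-solr invocation might look roughly like the sketch below; the jar name, class names and flags are based on my reading of the project's README, so double-check them against the version you are using:
hadoop jar solr-hadoop-job-*.jar \
   com.lucidworks.hadoop.ingest.IngestJob \
   -Dlww.commit.on.close=true \
   -Dadd.subdirectories=true \
   -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
   -c THIS_IS_YOUR_DESTINATION_COLLECTION \
   -i /your-hdfs-dir/*.pdf \
   -of com.lucidworks.hadoop.io.LWMapWritableOutputFormat \
   -zk THIS_IS_YOUR_CLOUDERA_HOST:2181/solr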