Solr - Best approach to import 20 million documents from a CSV file

My current task is to figure out the best approach to load millions of documents into Solr. The data file is a database export in CSV format.

Currently, I am thinking about splitting the file into smaller files and having a script post these smaller files using curl.

I have noticed that if you post a large amount of data, the request usually times out.

I am looking into the Data Import Handler, and it seems like a good option.

Any other ideas are highly appreciated.

Thanks

asked Feb 25 '12 by Bobby ...



1 Answer

Unless a database is already part of your solution, I wouldn't add that extra complexity. Quoting the Solr FAQ: it's your servlet container that is issuing the session timeout.

As I see it, you have a couple of options (in my order of preference):

Increase container timeout

Increase the container timeout (the "maxIdleTime" parameter, if you're using the embedded Jetty instance).

I'm assuming you only occasionally index such large files? Increasing the timeout temporarily might just be the simplest option.
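
As a rough sketch, assuming the stock Solr 3.x example distribution (the paths here are illustrative, adjust for your setup), the setting lives in the bundled Jetty config; raise the value (it's in milliseconds) and restart Solr:

# Locate the idle-timeout setting in the embedded Jetty's config
grep -n maxIdleTime example/etc/jetty.xml

# After editing the value upwards, restart the example server
cd example && java -jar start.jar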

Split the file

Here's a simple Unix shell script that will do the job (splitting the file into 500,000-line chunks):

# Split the CSV into 500,000-line chunks with numeric suffixes (split_files.00, split_files.01, ...)
split -d -l 500000 data.csv split_files.

# Post each chunk to Solr's CSV update handler; commit=true commits after each chunk
for file in split_files.*
do
  curl 'http://localhost:8983/solr/update/csv?fieldnames=id,name,category&commit=true' -H 'Content-type:text/plain; charset=utf-8' --data-binary @$file
done
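
Note that commit=true in the URL makes Solr commit after every chunk. If you would rather commit once at the end, drop that parameter from the loop and send a single commit to the XML update handler afterwards, e.g.:

curl 'http://localhost:8983/solr/update' -H 'Content-type:text/xml; charset=utf-8' --data-binary '<commit/>'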

Parse the file and load in chunks

The following Groovy script uses opencsv and SolrJ to parse the CSV file and commit the changes to Solr every 500,000 lines.

import au.com.bytecode.opencsv.CSVReader

import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer
import org.apache.solr.common.SolrInputDocument

// Grape pulls in opencsv, SolrJ and a logging backend at runtime
@Grapes([
    @Grab(group='net.sf.opencsv', module='opencsv', version='2.3'),
    @Grab(group='org.apache.solr', module='solr-solrj', version='3.5.0'),
    @Grab(group='ch.qos.logback', module='logback-classic', version='1.0.0'),
])
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/")

new File("data.csv").withReader { reader ->
    CSVReader csv = new CSVReader(reader)
    String[] result
    Integer count = 1
    Integer chunkSize = 500000

    // Read the CSV one row at a time and send each row to Solr as a document
    while ((result = csv.readNext()) != null) {
        SolrInputDocument doc = new SolrInputDocument()

        doc.addField("id",         result[0])
        doc.addField("name_s",     result[1])
        doc.addField("category_s", result[2])

        server.add(doc)

        // Commit every chunkSize rows so uncommitted documents don't pile up
        if (count.mod(chunkSize) == 0) {
            server.commit()
        }
        count++
    }
    // Final commit for the last, partial chunk
    server.commit()
}
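
If you save the script as, say, load_csv.groovy (the file name is just an example), Grape will fetch the dependencies on the first run and you can execute it directly:

groovy load_csv.groovy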
answered Oct 23 '22 by Mark O'Connor