Lookup speed: State or Database?

I have a bunch of wordlists on a server of mine, and I've been planning to make a simple open-source JSON API that returns whether a password is on the list [1], as a method of validation. I'm doing this in Python with Flask, and literally just returning whether the input is present.

One small problem: the wordlists total about 150 million entries, and 1.1GB of text.

My (minimal) API is below. Is it more efficient to store every row in MongoDB and query it on each request, or to hold the entire list in memory in a singleton, populated at startup before I call app.run? Or is the difference negligible in practice?
Furthermore, is the latter even good practice? I'm thinking the lookups might become taxing if I open this to the public. Someone has also suggested a trie for efficient searching.

Update: I've done a bit of testing, and document searching is painfully slow with such a high number of records. Is it justifiable to use a database with proper indexes for a single column of data that needs to be efficiently searched?

from flask import Flask, request, redirect
from flask.views import MethodView
from flask_pymongo import PyMongo
import json

app = Flask(__name__)
mongo = PyMongo(app)

class HashCheck(MethodView):

  def post(self):
    # find_one() returns None when the password is absent, so negating it
    # gives a real boolean. Error-handling + test cases to come.
    return json.dumps({'result':
      not mongo.db.passwords.find_one({'pass': request.form["password"]})})

  def get(self):
    return redirect('/')

if __name__ == "__main__":
  app.add_url_rule('/api/', view_func=HashCheck.as_view('api'))
  app.run(host="0.0.0.0", debug=True)

[1]: I'm a security nut. I'm using it in my login forms to reject common input. One of the wordlists is UNIQPASS.

Amelia


2 Answers

What I would suggest is a hybrid approach. As requests are made, do two checks: the first against a local in-memory cache, the second against the MongoDB store. If the first misses but the second hits, add the word to the in-memory cache. Over time the application will "fault in" the most common "bad password" records (a rough sketch follows the list of advantages below).

This has two advantages:
1) The common words are rejected very fast from within memory.
2) The startup cost is close to zero and amortized over many queries.
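
A minimal sketch of that idea, assuming the question's PyMongo setup (a mongo.db.passwords collection keyed on _id as described below) and a plain Python set as the local cache; the function name is illustrative:

# Hybrid lookup sketch: in-memory cache first, MongoDB second.
_cache = set()  # passwords already confirmed to be on the wordlist

def is_common_password(candidate):
  # 1. Fast path: the word has been seen before and is cached in memory.
  if candidate in _cache:
    return True
  # 2. Slow path: ask MongoDB; cache a hit so the next lookup stays in memory.
  if mongo.db.passwords.find_one({'_id': candidate}, {'_id': 1}) is not None:
    _cache.add(candidate)
    return True
  return False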

When storing the word list in MongoDB I would make the _id field hold each word. By default you get an ObjectId, which is a complete waste in this case, and using the word itself lets you leverage the automatic index on _id. I suspect the poor performance you saw was due to there being no index on the 'pass' field. You can try adding one with:

mongo.db.passwords.create_index("pass")

To complete the _id scenario, inserting a word looks like:

mongo.db.passwords.insert_one({ "_id" : "password" })

Queries then look like:

mongo.db.passwords.find_one({ "_id" : request.form["password"] })

As @Madarco mentioned, you can shave a bit more off the query time by limiting the returned fields to just _id ({ "_id" : 1 }), so results can be served straight from the index (a covered query):

mongo.db.passwords.find_one({ "_id" : request.form["password"] }, { "_id" : 1 })
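
For loading the 150 million words into that _id-keyed collection, a rough one-off loader might look like the following (this assumes one word per line in a file; the path and batch size are made up for illustration):

# Hypothetical bulk loader: each word becomes its own _id, so the automatic
# _id index handles lookups. ordered=False lets MongoDB skip duplicate words
# instead of aborting the whole batch.
from pymongo.errors import BulkWriteError

def _flush(batch):
  try:
    mongo.db.passwords.insert_many(batch, ordered=False)
  except BulkWriteError:
    pass  # duplicates already stored; ignore and continue

def load_wordlist(path, batch_size=100000):
  batch = []
  with open(path, encoding='utf-8', errors='ignore') as f:
    for line in f:
      word = line.rstrip('\n')
      if word:
        batch.append({'_id': word})
      if len(batch) >= batch_size:
        _flush(batch)
        batch = []
  if batch:
    _flush(batch)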

HTH - Rob

P.S. I am not a Python/PyMongo expert, so I might not have the syntax 100% correct. Hopefully it is still helpful.

Rob Moore


Given that your list is totally static and fits in memory, I don't see a compelling reason to use a database.

I agree that a Trie would be efficient for your goal. A hash table would work too.
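
As a rough illustration of the in-memory route (the wordlist path and route layout here are assumptions, not from the question): load the file once into a frozenset at startup and test membership per request, which is O(1) on average. Note that Python string and set overhead will use several times the 1.1GB of raw text.

# Hypothetical in-memory variant: one hash-set lookup per request.
from flask import Flask, request, jsonify

app = Flask(__name__)

def load_words(path):
  with open(path, encoding='utf-8', errors='ignore') as f:
    return frozenset(line.rstrip('\n') for line in f if line.strip())

WORDS = load_words('wordlist.txt')  # populated once, before app.run()

@app.route('/api/', methods=['POST'])
def check():
  # result is True when the password is NOT on the list, matching the
  # question's semantics.
  return jsonify(result=request.form['password'] not in WORDS)

if __name__ == '__main__':
  app.run(host='0.0.0.0')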

PS: it's too bad about Python's Global Interpreter Lock. If you used a language with real multithreading, you could take advantage of the unchanging data structure and run the server across multiple cores with shared memory.

japreiss