
Check whether a string is in a 2 GB list of strings in Python

Tags:

python

I have a large file (A.txt) of 2 GB containing a list of strings ['Question','Q1','Q2','Q3','Ans1','Format','links',...].

I also have another, larger file (1 TB) in which the strings mentioned above appear in the second position. Sample contents:

a, Question, b
The, quiz, is
This, Q1, Answer
Here, Ans1, is
King1, links, King2
programming,language,drupal,
.....

I want to retain the lines whose second position contains one of the strings in the list stored in file A.txt. That is, I want to retain (store in another file) the following lines:

a, Question, b
This, Q1, Answer
Here, Ans1, is
King1, links, King2

I know how to do this with 'any' when the list in file A.txt has about 100 entries, but I can't work out how to go about it when the list in file A.txt is 2 GB.

Rose Beck, asked May 29 '13 20:05


1 Answer

Don't use a list; use a set instead.

Read the first file into a set:

with open('A.txt') as file_a:
    words = {line.strip() for line in file_a}

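2 GB of words isn't too much to store in a set, provided the machine has enough RAM; the set takes noticeably more memory than the file itself because of Python's per-string overhead.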

Now you can test against words in O(1) (constant) time on average:

if second_word in words:
    # ....

Open the second file and process it line by line, perhaps using the csv module if the words on each line are comma-separated.
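Putting those steps together, a minimal sketch could look like the following. The function names load_words and filter_lines and the file paths are hypothetical, chosen only for illustration. It assumes the fields contain no quoted commas, so a plain str.split is enough; if they do, csv.reader would be the safer parser.

```python
def load_words(path):
    """Read one word per line into a set for fast membership tests."""
    with open(path, encoding='utf-8') as file_a:
        return {line.strip() for line in file_a}


def filter_lines(words, in_path, out_path):
    """Copy the lines whose second comma-separated field is in `words`.

    The input is streamed line by line, so only the word set has to
    fit in memory. Returns the number of lines written.
    """
    kept = 0
    with open(in_path, encoding='utf-8') as src, \
            open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            # Split off at most the first two fields; the rest is untouched.
            fields = line.split(',', 2)
            if len(fields) > 1 and fields[1].strip() in words:
                dst.write(line)  # keep the original line verbatim
                kept += 1
    return kept
```

Usage would be along the lines of words = load_words('A.txt') followed by filter_lines(words, 'big.txt', 'kept.txt'), where big.txt and kept.txt stand in for the real input and output paths.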

For a larger set of words, use a database instead; Python comes with the sqlite3 library:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE words (word UNIQUE)')

with open('A.txt') as file_a, conn:
    cursor = conn.cursor()
    for line in file_a:
        cursor.execute('INSERT OR IGNORE INTO words VALUES (?)', (line.strip(),))

then test against that:

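cursor = conn.cursor()
for line in second_file:  # second_file: the 1 TB file, opened for reading
    fields = line.split(',')
    if len(fields) < 2:
        continue  # no second field on this line
    second_word = fields[1].strip()
    cursor.execute('SELECT 1 FROM words WHERE word = ?', (second_word,))
    if cursor.fetchone():
        # ....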

Note that a :memory: database is held entirely in RAM; it is just a temporary, one-off database, so it only helps if the words still fit in memory. If they don't, connect with an empty filename (sqlite3.connect('')) to get a temporary on-disk database whose pages SQLite can flush to disk as it grows, or use a real filepath, which also lets you re-use the words database.
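As a rough sketch of the file-backed variant, the database approach could be wrapped up as below. The function names build_word_db and filter_with_db and any paths passed to them are hypothetical; the table layout matches the one above, and the line parsing assumes plain comma-separated fields without quoting.

```python
import sqlite3


def build_word_db(words_path, db_path):
    """Load one word per line into an SQLite table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS words (word UNIQUE)')
    # `with conn` commits the inserts as a single transaction.
    with open(words_path, encoding='utf-8') as file_a, conn:
        conn.executemany(
            'INSERT OR IGNORE INTO words VALUES (?)',
            ((line.strip(),) for line in file_a),
        )
    return conn


def filter_with_db(conn, in_path, out_path):
    """Copy the lines whose second field is present in the words table."""
    kept = 0
    cursor = conn.cursor()
    with open(in_path, encoding='utf-8') as src, \
            open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            fields = line.split(',', 2)
            if len(fields) < 2:
                continue  # no second field on this line
            cursor.execute('SELECT 1 FROM words WHERE word = ?',
                           (fields[1].strip(),))
            if cursor.fetchone():
                dst.write(line)
                kept += 1
    return kept
```

Because the UNIQUE constraint creates an index on word, each lookup is an index search rather than a table scan. One query per input line is still much slower than a set lookup, so this is mainly worth it when the words do not fit in memory.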

Martijn Pieters, answered Nov 14 '22 23:11