 

How to make searching a string in text files quicker

Tags:

python

pandas

I want to search for a list of strings (anywhere from 2k up to 10k strings in the list) in thousands of text files (there may be as many as 100k text files, each ranging from 1 KB to 100 MB in size) saved in a folder, and output a CSV file of the matched text filenames.

I have written code that does the job, but it takes around 8-9 hours to search for 2000 strings in around 2000 text files totalling ~2.5 GB.

Also, this method consumes a lot of memory, so I sometimes need to split the 2000 text files into smaller batches for the code to run.

The code is below (Python 2.7).

# -*- coding: utf-8 -*-
import pandas as pd
import os

def match(searchterm):
    # Record, for one search term, every file it appears in plus the match rate
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i]) + ";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])


def search(searchterm, content):
    # Case-insensitive substring check: 100 means a hit, 0 means no hit
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0


# Read every text file in the folder into memory up front
listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/" + txt, 'r') as txtfile:
        TextContent.append(txtfile.read())

# searchlist is the list of search strings (e.g. ["Blue Chip", ...]),
# built elsewhere in the script
result = []
for i, searchterm in enumerate(searchlist):
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)

df = pd.DataFrame(result, columns=["String", "Filename", "Hit%"])
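# (not shown in the original post) the DataFrame is presumably written out
# afterwards, e.g. df.to_csv("output.csv", index=False)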

Sample Input below.

List of strings -

["Blue Chip", "JP Morgan Global Healthcare","Maximum Horizon","1838 Large Cornerstone"]

Text file -

Usual text file containing different lines separated by \n

Sample Output below.

String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;

As in the example above, the first string was matched in 4 files (separated by ;), the second string was matched in 3 files, and the third string was not matched in any of the files.

Is there a quicker way to search without any splitting of text files?

asked Jul 15 '17 by fdabhi


1 Answer

Your code pushes large amounts of data around in memory because you load all the files into memory and then search them.

Performance aside, your code could use some cleaning up. Try to write functions that are as self-contained as possible, without depending on global variables (for input or output).

I rewrote your code using list comprehensions and it became a lot more compact.

# -*- coding: utf-8 -*-
from os import listdir
from os.path import isfile

def search_strings_in_files(path_str, search_list):
    """ Returns a list of lists, where each inner list contans three fields:
    the filename (without path), a string in search_list and the
    frequency (number of occurences) of that string in that file"""

    filelist = listdir(path_str)

    return [[filename, s, open(path_str+filename, 'r').read().lower().count(s)]
        for filename in filelist
            if isfile(path_str+filename)
                for s in [sl.lower() for sl in search_list] ]

if __name__ == '__main__':
    print search_strings_in_files('/some/path/', ['some', 'strings', 'here'])

Mechanisms that I use in this code:

  • list comprehension to loop through search_list and through the files.
  • compound statements to loop only through the files in a directory (and not through subdirectories).
  • method chaining to directly call a method of an object that is returned.

Tip for reading the list comprehension: try reading it from bottom to top, so:

  • I convert all items in search_list to lower case using a list comprehension.
  • Then I loop over that list (for s in...)
  • Then I filter out the directory entries that are not files using a compound statement (if isfile...)
  • Then I loop over all files (for filename...)
  • In the top line, I create the sublist containing three items:
    • filename
    • s, that is the lower case search string
    • a method chained call to open the file, read all its contents, convert it to lowercase and count the number of occurrences of s.
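The rows this function returns can then be reshaped into the CSV layout from the question with pandas. A sketch of that step (not part of the original answer; the column names and output filename are just placeholders):

import pandas as pd

rows = search_strings_in_files('/some/path/', ['some', 'strings', 'here'])
df = pd.DataFrame(rows, columns=['Filename', 'String', 'Count'])

# keep only real hits, then join filenames and counts per search string
hits = df[df['Count'] > 0]
out = hits.groupby('String').agg({
    'Filename': lambda x: ';'.join(x),
    'Count': lambda x: ';'.join(str(c) for c in x),
})
out.to_csv('matches.csv')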

This code uses all the power there is in "standard" Python functions. If you need more performance, you should look into specialised libraries for this task.
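For example, here is a minimal sketch of that idea (not from the original answer), assuming the third-party pyahocorasick package is installed: it builds an Aho-Corasick automaton from all the search terms, so each file is scanned once for every term at the same time.

# Sketch only: multi-pattern search with pyahocorasick (pip install pyahocorasick)
import os
import ahocorasick

def search_with_automaton(path_str, search_list):
    # build the automaton once from all (lower-cased) search terms
    automaton = ahocorasick.Automaton()
    for term in search_list:
        automaton.add_word(term.lower(), term)  # payload: the original term
    automaton.make_automaton()

    hits = []  # rows of [filename, matched term]
    for filename in os.listdir(path_str):
        full = os.path.join(path_str, filename)
        if not os.path.isfile(full):
            continue
        with open(full, 'r') as f:
            content = f.read().lower()
        # iter() yields (end_index, payload) for every occurrence of any term
        found = set(term for _, term in automaton.iter(content))
        hits.extend([filename, term] for term in found)
    return hits

Because the automaton matches all terms in a single pass, each file only has to be read and lower-cased once, no matter how many search terms there are.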

answered Oct 03 '22 by agtoever