I want to search for a list of strings (anywhere from 2k up to 10k strings) in thousands of text files (there may be as many as 100k files, each ranging from 1 KB to 100 MB) saved in a folder, and output a CSV file listing the filenames matched for each string.
I have written code that does the job, but it takes around 8-9 hours to search 2000 strings across around 2000 text files (~2.5 GB in total).
The method also exhausts the system's memory, so I sometimes have to split the 2000 text files into smaller batches for the code to run at all.
The code is below (Python 2.7).
# -*- coding: utf-8 -*-
import pandas as pd
import os

def match(searchterm):
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i]) + ";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])

def search(searchterm, content):
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0

# the list of strings to search for (sample list shown below)
searchlist = ["Blue Chip", "JP Morgan Global Healthcare", "Maximum Horizon", "1838 Large Cornerstone"]

listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/" + txt, 'r') as txtfile:
        TextContent.append(txtfile.read())

result = []
for i, searchterm in enumerate(searchlist):
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)

df = pd.DataFrame(result, columns=["String", "Filename", "Hit%"])
df.to_csv("output.csv", index=False)  # write the CSV output (filename here is just an example)
Sample Input below.
List of strings -
["Blue Chip", "JP Morgan Global Healthcare","Maximum Horizon","1838 Large Cornerstone"]
Text file -
An ordinary text file containing lines separated by \n
Sample Output below.
String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;
As in the example above, the first string was matched in 4 files (separated by ;), the second string in 3 files, and the third string was not matched in any file.
Is there a quicker way to search that does not require splitting the text files into batches?
One option is the grep command (or egrep), which searches the given input files for lines containing a match for a text string.
On Windows, the findstr command does the same kind of job: it finds a string in a file when given a specific pattern.
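For example (a rough sketch, not a drop-in replacement; the pattern-file name patterns.txt and the Txt/ folder are just illustrative), you could write the search strings to a file and hand the scanning over to grep from Python:

    # Sketch: let grep do the scanning; this reports files matching ANY pattern,
    # so a per-string mapping would still need one grep call per string.
    import subprocess

    search_list = ["Blue Chip", "JP Morgan Global Healthcare", "Maximum Horizon"]
    with open("patterns.txt", "w") as f:        # one fixed string per line
        f.write("\n".join(search_list))

    # -r recurse into Txt/, -i ignore case, -l list matching filenames only,
    # -F treat patterns as fixed strings, -f read the patterns from the file above.
    # Note: grep exits with status 1 when nothing matches, which makes
    # check_output raise CalledProcessError.
    matching_files = subprocess.check_output(
        ["grep", "-r", "-i", "-l", "-F", "-f", "patterns.txt", "Txt/"])
    print(matching_files)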
Your code pushes large amounts of data around in memory, because you load all the files into memory and then scan every one of them for every search term.
Performance aside, your code could also use some cleaning up. Try to write functions that are as self-contained as possible, without depending on global variables for their input or output.
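For instance, match() could take its inputs as parameters and return its row instead of touching the globals result, TextContent and listoftxtfiles (just a sketch of the idea, reusing the search() function from the question):

    def match(searchterm, filenames, contents):
        """Build one result row [searchterm, "file1;file2;", "100;100;"] from explicit inputs."""
        filename_text = ''
        matchrate_text = ''
        for name, content in zip(filenames, contents):
            rate = search(searchterm, content)
            if rate:
                filename_text += name + ";"
                matchrate_text += str(rate) + ";"
        return [searchterm, filename_text, matchrate_text]

    # The caller then collects the rows itself instead of relying on a global list:
    # result = [match(term, listoftxtfiles, TextContent) for term in searchlist]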
I rewrote your code using list comprehensions and it became a lot more compact.
# -*- coding: utf-8 -*-
from os import listdir
from os.path import isfile

def search_strings_in_files(path_str, search_list):
    """ Returns a list of lists, where each inner list contains three fields:
        the filename (without path), a string from search_list and the
        frequency (number of occurrences) of that string in that file. """
    filelist = listdir(path_str)
    return [[filename, s, open(path_str + filename, 'r').read().lower().count(s)]
            for filename in filelist
            if isfile(path_str + filename)
            for s in [sl.lower() for sl in search_list]]

if __name__ == '__main__':
    print search_strings_in_files('/some/path/', ['some', 'strings', 'here'])
Mechanisms that I use in this code: a list comprehension to loop over all files and all search strings, and the standard functions listdir, isfile and str.count instead of hand-written loops.
Tip for reading the list comprehension: try reading it from bottom to top, so:
- for s in ... loops over all (lower-cased) search strings,
- if isfile(...) skips directory entries that are not regular files,
- for filename in filelist loops over everything in the directory.
This code uses all the power there is in "standard" Python functions. If you need more performance, you should look into specialised libraries for this task.
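As one illustration of that direction (a sketch only; the folder Txt/ and the output name matches.csv are placeholders, and the standard re module stands in for a specialised library): combine all search strings into a single compiled regular expression so each file is scanned once instead of once per string, then write the rows straight to CSV in the format from the question.

    # Sketch: one case-insensitive regex pass per file instead of one substring
    # scan per search string; folder and file names below are illustrative only.
    import csv
    import os
    import re

    search_list = ["Blue Chip", "JP Morgan Global Healthcare",
                   "Maximum Horizon", "1838 Large Cornerstone"]

    # Longest strings first, so overlapping alternatives prefer the longer match.
    pattern = re.compile("|".join(re.escape(s) for s in
                                  sorted(search_list, key=len, reverse=True)),
                         re.IGNORECASE)

    hits = {s.lower(): [] for s in search_list}      # lower-cased string -> filenames
    for filename in os.listdir("Txt/"):
        path = os.path.join("Txt/", filename)
        if not os.path.isfile(path):
            continue
        with open(path, "r") as f:
            text = f.read()
        for found in set(m.group(0).lower() for m in pattern.finditer(text)):
            hits[found].append(filename)

    with open("matches.csv", "wb") as out:           # binary mode for csv on Python 2.7
        writer = csv.writer(out)
        writer.writerow(["String", "Filename", "Hit%"])
        for s in search_list:
            names = hits[s.lower()]
            writer.writerow([s,
                             ";".join(names) + ";" if names else "NA",
                             ";".join(["100"] * len(names)) + ";" if names else "NA"])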