
Fast Text Search Over Logs

Here's the problem I'm having: I've got a set of logs that can grow fairly quickly. They're split into individual files every day, and the files can easily grow up to a gig in size. To help keep the size down, entries older than 30 days or so are cleared out.

The problem comes when I want to search these files for a certain string. Right now, a plain Boyer-Moore scan is impractically slow. I know that applications like dtSearch can provide a really fast search using indexing, but I'm not sure how to implement that without taking up twice the space the logs already take up.

Are there any resources I can check out that can help? I'm really looking for a standard algorithm that'll explain what I should do to build an index and use it to search.

Edit:
Grep won't work, as this search needs to be integrated into a cross-platform application. There's no way I'll be able to swing bundling an external program with it.

The way it works is that there's a web front end that has a log browser. This talks to a custom C++ web server backend. This server needs to search the logs in a reasonable amount of time. Currently searching through several gigs of logs takes ages.

Edit 2: Some of these suggestions are great, but I have to reiterate that I can't integrate another application; it's part of the contract. To answer some questions: the data in the logs is either messages received in a healthcare-specific format or messages relating to those. I'm looking to rely on an index because, while it may take up to a minute to rebuild, searching currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data IS discarded before it's even recorded; unless some debug logging options are turned on, more than half of the log messages are ignored.

The search basically goes like this: a user on the web form is presented with a list of the most recent messages (streamed from disk as they scroll, yay for ajax). Usually they'll want to find messages containing some piece of information, maybe a patient ID or some string they've sent, so they enter that string into the search box. The search is sent asynchronously, and the custom web server linearly scans through the logs 1 MB at a time for results. This process can take a very long time when the logs get big, and it's what I'm trying to optimize.
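Roughly, the current scan is equivalent to something like this (a simplified sketch, not the actual server code; the function and variable names are just illustrative):

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Scan one log file in 1 MB chunks, looking for the query string.
bool fileContains(const std::string& path, const std::string& needle) {
    const std::size_t kChunkSize = 1 << 20; // 1 MB at a time
    std::ifstream in(path, std::ios::binary);
    std::vector<char> chunk(kChunkSize);
    std::string window; // carries the tail of the previous chunk

    while (in.read(chunk.data(), static_cast<std::streamsize>(chunk.size())) ||
           in.gcount() > 0) {
        window.append(chunk.data(), static_cast<std::size_t>(in.gcount()));
        if (window.find(needle) != std::string::npos)
            return true;
        // Keep just enough of the tail so a match spanning a chunk
        // boundary isn't missed on the next iteration.
        if (window.size() > needle.size())
            window.erase(0, window.size() - needle.size());
    }
    return false;
}
```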

Asked by ReaperUnreal on Oct 02 '08

3 Answers

grep usually works pretty well for me with big logs (sometimes 12G+). There are Windows builds of grep available as well.

Answered by changelog on Nov 19 '22

Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of these algorithms once upon a time, and some of them are very sophisticated.

If you can identify the "words" in the text you want to index, just build a large hash table that maps a hash of each word to its occurrences in each file. If users repeat the same search frequently, cache the search results. When a search runs, you can then check each candidate location to confirm the search term actually appears there, rather than just a word with a matching hash.
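A rough C++ sketch of what I mean, assuming whitespace-delimited words and an index small enough to hold in memory (all the names here are illustrative, and the tokenization is deliberately naive):

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct Posting {
    std::string file;      // which log file the word appears in
    std::streamoff offset; // byte offset of the line containing it
};

using Index = std::unordered_map<std::string, std::vector<Posting>>;

// Scan one log file and record where each word occurs.
// Assumes '\n' line endings when computing offsets.
void indexFile(const std::string& path, Index& index) {
    std::ifstream in(path);
    std::string line;
    std::streamoff offset = 0;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string word;
        while (words >> word)
            index[word].push_back({path, offset});
        offset += static_cast<std::streamoff>(line.size()) + 1; // +1 for '\n'
    }
}

// Look up candidate locations, then re-read each line to confirm the
// term really appears there (guards against tokenization mismatches).
std::vector<Posting> search(const Index& index, const std::string& term) {
    std::vector<Posting> hits;
    auto it = index.find(term);
    if (it == index.end()) return hits;
    for (const Posting& p : it->second) {
        std::ifstream in(p.file);
        in.seekg(p.offset);
        std::string line;
        if (std::getline(in, line) && line.find(term) != std::string::npos)
            hits.push_back(p);
    }
    return hits;
}
```

Since your logs are already split into daily files, you could keep one index per file; expiring a 30-day-old log then just means dropping its index along with it.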

Also, who really cares if the index is larger than the files themselves? If your system is really this big, with so much activity, is a few dozen gigs for an index the end of the world?

Answered by PeterAllenWebb on Nov 19 '22

You'll most likely want to integrate some type of indexing search engine into your application. There are dozens out there; Lucene seems to be very popular. Check these two questions for some more suggestions:

Best text search engine for integrating with custom web app?

How do I implement Search Functionality in a website?

Answered by davr on Nov 19 '22