
How to write a functional file "scanner"

First let me apologize for the scale of this problem but I'm really trying to think functionally and this is one of the more challenging problems I have had to work with.

I wanted to get some suggestions on how I might handle a problem in a functional manner, particularly in F#. I am writing a program that goes through a list of directories, uses a list of regex patterns to filter the files retrieved from those directories, and uses a second list of regex patterns to find matches in the text of the retrieved files. I want it to return the filename, line index, column index, pattern and matched value for each piece of text that matches a given regex pattern. Also, exceptions need to be recorded, and there are 3 possible exception scenarios: can't open the directory, can't open the file, and reading content from the file failed. The final requirement is that the volume of files "scanned" for matches could be very large, so the whole thing needs to be lazy. I'm not too worried about a "pure" functional solution; I'm more interested in a "good" solution that reads well and performs well. One final challenge is to make it interoperate with C#, because I would like to use the WinForms tools to attach this algorithm to a UI. Here is my first attempt, which will hopefully clarify the problem:

open System.Text.RegularExpressions
open System.IO

type Reader<'t, 'a> = 't -> 'a //=M['a], result varies

let returnM x _ = x 

let map f m = fun t -> t |> m |> f

let apply f m = fun t -> t |> m |> (t |> f)

let bind f m = fun t -> t |> (t |> m |> f)

let Scanner dirs =
    returnM dirs
    |> apply (fun dirExHandler ->
        Seq.collect (fun directory ->
            try
                Directory.GetFiles(directory, "*", SearchOption.AllDirectories)
            with | e ->
                dirExHandler e directory
                Array.empty))
    |> map (fun filenames ->
        returnM filenames
        |> apply (fun (filenamepatterns, lineExHandler, fileExHandler) ->
            Seq.filter (fun filename ->
                 filenamepatterns |> Seq.exists (fun pattern ->
                    let regex = new Regex(pattern)
                    regex.IsMatch(filename)))
            >> Seq.map (fun filename ->
                    let fileinfo = new FileInfo(filename)
                    try
                        use reader = fileinfo.OpenText()
                        Seq.unfold (fun ((reader : StreamReader), index) ->
                            if not reader.EndOfStream then
                                try
                                    let line = reader.ReadLine()
                                    Some((line, index), (reader, index + 1))
                                with | e -> 
                                    lineExHandler e filename index
                                    None
                            else
                                None) (reader, 0)        
                        |> (fun lines -> (filename, lines))
                    with | e -> 
                        fileExHandler e filename
                        (filename, Seq.empty))
            >> (fun files -> 
                returnM files
                |> apply (fun contentpatterns ->
                    Seq.collect (fun file ->
                        let filename, lines = file
                        lines |>
                            Seq.collect (fun line ->
                                let content, index = line
                                contentpatterns
                                |> Seq.collect (fun pattern ->    
                                    let regex = new Regex(pattern)
                                    regex.Matches(content)
                                    |> (Seq.cast<Match>
                                    >> Seq.map (fun contentmatch -> 
                                        (filename, 
                                            index, 
                                            contentmatch.Index, 
                                            pattern, 
                                            contentmatch.Value))))))))))

Thanks for any input.

Updated -- here is an updated solution based on feedback I received:

open System.Text.RegularExpressions
open System.IO

type ScannerConfiguration = {
    FileNamePatterns : seq<string>
    ContentPatterns : seq<string>
    FileExceptionHandler : exn -> string -> unit
    LineExceptionHandler : exn -> string -> int -> unit
    DirectoryExceptionHandler : exn -> string -> unit }

let scanner specifiedDirectories (configuration : ScannerConfiguration) = seq {
    let ToCachedRegexList = Seq.map (fun pattern -> new Regex(pattern)) >> Seq.cache

    let contentRegexes = configuration.ContentPatterns |> ToCachedRegexList

    let filenameRegexes = configuration.FileNamePatterns |> ToCachedRegexList

    let getLines exHandler reader = 
        Seq.unfold (fun ((reader : StreamReader), index) ->
            if not reader.EndOfStream then
                try
                    let line = reader.ReadLine()
                    Some((line, index), (reader, index + 1))
                with | e -> exHandler e index; None
            else
                None) (reader, 0)   

    for specifiedDirectory in specifiedDirectories do
        let files =
            try Directory.GetFiles(specifiedDirectory, "*", SearchOption.AllDirectories)
            with e -> configuration.DirectoryExceptionHandler e specifiedDirectory; [||]
        for file in files do
            if filenameRegexes |> Seq.exists (fun (regex : Regex) -> regex.IsMatch(file)) then
                let lines = 
                    let fileinfo = new FileInfo(file)
                    try
                        use reader = fileinfo.OpenText()
                        reader |> getLines (fun e index -> configuration.LineExceptionHandler e file index)
                    with | e -> configuration.FileExceptionHandler e file; Seq.empty
                for line in lines do
                    let content, index = line
                    for contentregex in contentRegexes do
                        for mmatch in content |> contentregex.Matches do
                            yield (file, index, mmatch.Index, contentregex.ToString(), mmatch.Value) }

Again, any input is welcome.

Brad asked Jan 10 '12 15:01


1 Answer

I think that the best approach is to start with the simplest solution and then extend it. Your current approach seems to be quite hard to read to me for two reasons:

  • The code uses a lot of combinators and function compositions in patterns that are not too common in F#. Some of the processing can be more easily written using sequence expressions.

  • The code is all written as a single function, but it is fairly complex and would be more readable if it was separated into multiple functions.
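As a small illustration of the first point, the Seq.unfold line reader from the question might be written as a sequence expression instead. This is only a sketch: readIndexedLines is a hypothetical name, and it takes a file path rather than an already opened reader. Placing `use` inside the sequence expression keeps the reader open until the sequence has been enumerated, whereas the question's version appears to dispose the reader before the lazy sequence is ever read.

```fsharp
open System.IO

// Hypothetical helper (not from the question): lazily yields
// (line, index) pairs from the file at 'path'.
let readIndexedLines (path : string) = seq {
    // The reader is disposed when enumeration finishes or is abandoned,
    // not when this function returns.
    use reader = new StreamReader(path)
    // A ref cell holds the running line index.
    let index = ref 0
    while not reader.EndOfStream do
        yield (reader.ReadLine(), index.Value)
        index.Value <- index.Value + 1 }
```

Nothing is opened until the result is enumerated, so any exception from opening or reading the file surfaces at the point where the sequence is consumed.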

I would probably start by splitting the code into a function that tests a single file (say fileMatches) and a function that walks over the files and calls fileMatches. The main iteration can be quite nicely written using F# sequence expressions:

// Checks whether a file name matches a filename pattern 
// and a content matches a content pattern.
let fileMatches fileNamePatterns contentPatterns 
                (fileExHandler, lineExHandler) file =
  // TODO: This can be implemented using
  // File.ReadLines which returns a sequence.


// Iterates over all the files and calls 'fileMatches'.
let scanner specifiedDirectories fileNamePatterns contentPatterns
            (dirExHandler, fileExHandler, lineExHandler) = seq {
  // Iterate over all the specified directories.
  for specifiedDir in specifiedDirectories do
    // Find all files in the directories (and handle exceptions).
    let files =
      try Directory.GetFiles(specifiedDir, "*", SearchOption.AllDirectories)
      with e -> dirExHandler e specifiedDir; [||]
    // Iterate over all files and report those that match.
    for file in files do
      if fileMatches fileNamePatterns contentPatterns 
                     (fileExHandler, lineExHandler) file then 
        // Matches! Return this file as part of the result.
        yield file }

The function is still quite complicated, because you need to pass a lot of parameters around. Wrapping the parameters in a simple type or a record could be a good idea:

type ScannerArguments = 
  { FileNamePatterns:seq<string> 
    ContentPatterns:seq<string>
    FileExceptionHandler:exn -> string -> unit
    LineExceptionHandler:exn -> string -> unit
    DirectoryExceptionHandler:exn -> string -> unit }

Then you can define both fileMatches and scanner as functions that take just two parameters, which will make your code a lot more readable. Something like:

// Iterates over all the files and calls 'fileMatches'.
let scanner specifiedDirectories (args:ScannerArguments) = seq {
  for specifiedDir in specifiedDirectories do
    let files =
      try Directory.GetFiles(specifiedDir, "*", SearchOption.AllDirectories)
      with e -> args.DirectoryExceptionHandler e specifiedDir; [||]
    for file in files do
      // No need to propagate all arguments explicitly to other functions.
      if fileMatches args file then yield file }
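The body of fileMatches is left as a TODO above. One possible way to fill it in with File.ReadLines is sketched below. The record is repeated so that the snippet stands alone, and it assumes the pattern fields hold sequences of regex strings. As a simplification, any failure to open or read the file is reported through FileExceptionHandler, and LineExceptionHandler is not used.

```fsharp
open System.IO
open System.Text.RegularExpressions

// Assumed shape of the arguments record (patterns as sequences).
type ScannerArguments =
  { FileNamePatterns : seq<string>
    ContentPatterns : seq<string>
    FileExceptionHandler : exn -> string -> unit
    LineExceptionHandler : exn -> string -> unit
    DirectoryExceptionHandler : exn -> string -> unit }

// Returns true when the file name matches some file-name pattern and
// at least one line of the file matches some content pattern.
let fileMatches (args : ScannerArguments) (file : string) =
  let nameMatches =
    args.FileNamePatterns |> Seq.exists (fun p -> Regex.IsMatch(file, p))
  if not nameMatches then false
  else
    try
      // File.ReadLines is lazy, and Seq.exists stops at the first hit,
      // so the rest of a large file is not read after a match.
      File.ReadLines file
      |> Seq.exists (fun line ->
           args.ContentPatterns
           |> Seq.exists (fun p -> Regex.IsMatch(line, p)))
    with e ->
      // Simplification: open and read failures share one handler.
      args.FileExceptionHandler e file
      false
```

Because the enumeration happens inside the try block, exceptions raised while reading are caught there as well. Reporting the individual matches with line and column indices, as the question asks, would need a function that returns a sequence of matches rather than a bool.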
Tomas Petricek answered Oct 30 '22 17:10