I occasionally work with text files in which some sections contain multiple paragraphs with the same structure. Here's an example:
Some unrelated preface I'm not interested in... Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Etiam scelerisque.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Etiam scelerisque. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam scelerisque.

001 [SomeTitle 1] - Some Subtitle 1
Name: SomeName
Area: SomeArea
Content: Some multi-line comment...Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Etiam scelerisque. Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Etiam scelerisque.

002 [SomeTitle 2] - Some Subtitle 2
Name: SomeOtherName
Area: SomeOtherArea
Content: Some other multi-line comment...Lorem ipsum dolor sit amet, consectetur
adipiscing elit.
I'm looking for an easy way to query files like this. For example, if I query it for "Area:SomeOtherArea", the result should be every block of the file with that area, meaning all four parts of each block: Header, Name, Area, Content. I could use grep with the -A and -B options, but the problem is that the Content paragraphs may consist of any number of lines. And this is just one specific example; the structure could be completely different.
I'm looking for a light-weight, easily adaptable solution, maybe a combination of CLI tools. I don't want to reinvent the wheel.
Sorry to say, but there's only so far you can go with this sort of problem. You seem to want a Swiss Army knife with an infinitely expandable set of features, but without any programming pain on your part :-) ! Such a thing is moderately possible, but given your wide-open specification, recall that people spend years building search engines like Lucene, Google, and many others to solve this sort of problem.
That said, if you can be happy with a search tool that has a very simple rule that must be obeyed, AND you're using or have access to a Unix/Linux/Cygwin system, the following can work.
Basic rule: Blocks of data will be searched based on a blank line separating each block (as in your sample data above).
cat paraSearch.ksh
#!/bin/ksh
# (or #!/bin/bash or likely others)
case $# in 0 ) echo "usage:paraSearch.ksh SearchTargetPattern file2search [file2 ....]" ; exit 1 ;;esac
# read the first pattern as the search target,
# use quotes on cmd-line if you want to use
# regexp chars like '*'
mySrchPat="$1" ; shift
#dbg set -vx
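awk -v mySrchPattern="$mySrchPat" \
'BEGIN{RS=""; ORS="\n\n"}   # RS="" is paragraph mode: records are separated by blank lines
#dbg {print "$0="$0; print "----------------------------------------------" }
# print every block (record) that matches the search pattern
$0 ~ mySrchPattern{ print $0}
' "${@}"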
chmod 755 paraSearch.ksh
Test it using your sample text and a search target; the output follows:
$ ./paraSearch.ksh SomeName multiLineTest.txt
001 [SomeTitle 1] - Some Subtitle 1
Name: SomeName
Area: SomeArea
Content: Some multi-line comment...Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Etiam scelerisque. Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Etiam scelerisque.
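Note that the pattern is matched against the whole block, so a target like SomeName would also hit a block that merely mentions it in its Content. If you want something closer to the "Area:SomeOtherArea" style of query, here is a minimal sketch of one way to restrict the match to a single field. It assumes every field starts its own line in the form "Key: value" (as in the sample); the function name para_field_search is made up for illustration and is not part of the script above.

```shell
# Hypothetical helper: print every blank-line-separated block in which
# the named field matches a regexp.
# Assumption: each field starts a line and looks like "Key: value".
para_field_search() {
    field="$1"; pattern="$2"; shift 2
    awk -v field="$field" -v pat="$pattern" '
        # Paragraph mode again; FS="\n" makes each line of a block one awk field.
        BEGIN { RS = ""; FS = "\n"; ORS = "\n\n" }
        {
            for (i = 1; i <= NF; i++) {
                sep = index($i, ": ")          # first "Key: value" separator on this line
                if (sep > 0 && substr($i, 1, sep - 1) == field && substr($i, sep + 2) ~ pat) {
                    print                       # print the whole block once
                    next                        # and move on to the next block
                }
            }
        }
    ' "$@"
}
```

Usage would look like: para_field_search Area '^SomeOtherArea$' multiLineTest.txt. The anchors (^ and $) are only needed when you want an exact match on the field value; without them the pattern matches anywhere in the value.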
To learn more about awk, read through (several times) this excellent tutorial: The Grymoire's Awk Tutorial.
IHTH