Simple way to parse and query multi-line semi-structured content

Question

I occasionally work with text files in which some sections contain multiple paragraphs with the same structure. Here's an example:

Some unrelated preface I'm not interested in... Lorem ipsum dolor sit amet, 
consectetur adipiscing elit. Etiam scelerisque. 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Etiam scelerisque. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam scelerisque. 

001 [SomeTitle 1] - Some Subtitle 1
  Name: SomeName
  Area: SomeArea
  Content: Some multi-line comment...Lorem ipsum dolor sit amet, consectetur 
           adipiscing elit. Etiam scelerisque. Lorem ipsum dolor sit amet, 
           consectetur adipiscing elit. Etiam scelerisque. 

002 [SomeTitle 2] - Some Subtitle 2
  Name: SomeOtherName
  Area: SomeOtherArea
  Content: Some other multi-line comment...Lorem ipsum dolor sit amet, consectetur 
           adipiscing elit.

I'm looking for an easy way to query files like this. For example, if I query it for "Area:SomeOtherArea", the result should be all blocks of the file with that area. By a block I mean all four parts: Header, Name, Area, Content. I could use grep with the -A and -B options, but the problem is that the content paragraphs may consist of any number of lines. And this is just one specific example; the structure could be completely different.

I'm looking for a light-weight, easily adaptable solution, maybe a combination of CLI tools. I don't want to reinvent the wheel.

shellter · Accepted Answer

Sorry to say, but you can only go so far with this sort of problem. You seem to want a Swiss Army knife with an infinitely expandable set of features, but without any programming pain on your part :-) Such a thing is possible up to a point, but given your wide-open specification, recall that people spend years building search engines like Lucene, Google, and thousands of others to solve this sort of problem.

That said, if you can be happy with a search tool that has a very simple rule that must be obeyed, AND you're using or have access to a Unix/Linux/Cygwin system, the following can work.

Basic rule: Blocks of data are searched as whole units, with a blank line separating each block (as in your sample data above).
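To see what that rule means in practice, here is a minimal sketch. The helper name count_blocks is made up for this illustration. Setting RS to the empty string puts awk into paragraph mode, so each run of text between blank lines becomes one record:

```shell
# Hypothetical helper (not part of the original answer):
# count blank-line-separated blocks on stdin or in the named files.
count_blocks() {
  # RS="" selects paragraph mode: records are separated by one or more
  # empty lines, and NR in the END block is the number of records read.
  awk 'BEGIN { RS = "" } END { print NR }' "$@"
}

# Three blocks: "a b", "c", and "d" (extra blank lines count as one separator).
printf 'a\nb\n\nc\n\n\nd\n' | count_blocks
```

Note that a separator line has to be truly empty. A line that only contains spaces or tabs does not end a block.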

cat paraSearch.ksh

#!/bin/ksh
#  (or #!/bin/bash or likely others)

case $# in 0 ) echo "usage:paraSearch.ksh SearchTargetPattern file2search [file2 ....]" ; exit 1 ;;esac

# read the first pattern as the search target, 
# use quotes on cmd-line if you want to use
# regexp chars like '*'
mySrchPat="$1" ; shift

#dbg set -vx
awk  -v mySrchPattern="$mySrchPat"   \
  'BEGIN{RS=""; ORS="\n\n"}   # paragraph mode on input, blank line after each printed block
  #dbg {print "$0="$0; print "----------------------------------------------" }
  $0 ~ mySrchPattern{ print $0}
' "${@}"

chmod 755 paraSearch.ksh

Test using your sample text and a search target, followed by the output:

$ ./paraSearch.ksh SomeName multiLineTest.txt
001 [SomeTitle 1] - Some Subtitle 1
  Name: SomeName
  Area: SomeArea
  Content: Some multi-line comment...Lorem ipsum dolor sit amet, consectetur
           adipiscing elit. Etiam scelerisque. Lorem ipsum dolor sit amet,
           consectetur adipiscing elit. Etiam scelerisque.
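Because paraSearch.ksh matches the pattern against the whole block, a value can also match when it merely appears somewhere in the Content text. If you need something closer to the "Area:SomeOtherArea" query from the question, one possible sketch follows. The function name field_search, and the assumption that every field sits on its own line in the form "Field: value", are mine for illustration only:

```shell
# Hypothetical helper (not part of paraSearch.ksh): print every block that
# contains a line of the form "FIELD: VALUE" with exactly that value.
# usage: field_search FIELD VALUE [file ...]
field_search() {
  fs_field=$1
  fs_value=$2
  shift 2
  awk -v field="$fs_field" -v value="$fs_value" '
    # One record per block; FS="\n" makes each line of the block one field.
    BEGIN { RS = ""; ORS = "\n\n"; FS = "\n" }
    {
      for (i = 1; i <= NF; i++) {
        line = $i
        sub(/^[ \t]+/, "", line)   # drop leading indentation
        sub(/[ \t]+$/, "", line)   # drop trailing blanks
        # Plain string comparison, so no regexp characters are interpreted.
        if (line == (field ": " value)) { print; next }
      }
    }
  ' "$@"
}
```

With the sample file from above, running field_search Area SomeOtherArea multiLineTest.txt should print only block 002. The trade-off is that this version does exact matching on a single line, so it cannot find values that wrap across lines the way the Content field does.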

To learn more about awk, read through (several times) this excellent tutorial: The Grymoire's Awk Tutorial.

IHTH

Tags: command-line, multiline, parsing, structured-data

Stimmy
