How to parse multi line records (with awk?)

I'm trying to figure out how to extract particular fields from multi line records separated by \n\n.

In this instance, it happens to be output from apt-cache, which resembles Debian control files. Here is the output of apt-cache show "$package":

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 641
Maintainer: Reuben Thomas <[email protected]>
Architecture: all
Version: 2.8.3
Depends: python3:any (>= 3.3.2-2~), python3, gir1.2-gtk-3.0, gir1.2-appindicator3-0.1, python3-xlib, python3-pkg-resources, libnet-dbus-perl
Filename: pool/main/c/caffeine/caffeine_2.8.3_all.deb
Size: 58774
MD5sum: 4438db3f6d1cf43a4f4b49cc7f24cda0
SHA1: e748370ac5ccd7de6fc9466ce0451d2e90d179d4
SHA256: ae303b4e32949cc1e1af80df7217e3406291679e3f18fa8f78a5bbb97504c4f6
Description-en: Prevent the desktop becoming idle in full-screen mode
 Caffeine stops the desktop becoming idle when an application
 is running full-screen. A desktop indicator ‘caffeine-indicator’
 supplies a manual toggle, and the command ‘caffeinate’ can be used
 to prevent idleness for the duration of any command.
Description-md5: 7c14f8adc007b10f6ecafed36260bedb

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 655
Maintainer: Reuben Thomas <[email protected]>
Architecture: all
Version: 2.6+555~ubuntu14.04.1
Depends: python:any (<< 2.8), python:any (>= 2.7.5-5~), python, gir1.2-gtk-2.0, gir1.2-appindicator3-0.1, x11-utils, python-dbus
Filename: pool/main/c/caffeine/caffeine_2.6+555~ubuntu14.04.1_all.deb
Size: 58604
MD5sum: 1051c3f7d40d344f986bb632d7436849
SHA1: 5e5f622595e8cbba8fb7468b3cffe2914b0ba110
SHA256: 11c5bbf2d28dcda6a7b82872195f740f1f79521b60d3c9acea3037bf0ab3a60e
Description: Prevent the desktop becoming idle
 Caffeine allows the user to prevent the desktop becoming idle,
 either manually or when certain applications are run. This
 prevents screen-blanking, locking, suspending, and so on.
Description-md5: 738866350e5086e77408d7a9c7ffa59b

Package: caffeine
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 794
Maintainer: Isaiah Heyer <[email protected]>
Architecture: all
Version: 2.4.1+478~raring1
Depends: dconf-gsettings-backend | gsettings-backend, python (>= 2.6), python-central (>= 0.6.11), python-xlib, python-appindicator, python-xdg, python-notify, python-kaa-metadata
Description: Caffeine
 A status bar application able to temporarily prevent the activation
 of both the screensaver and the "sleep" powersaving mode.
Description-md5: 1c29acf1ab0f2e6636db29fbde1d14a3
Homepage: https://launchpad.net/caffeine
Python-Version: >= 2.6

My desired output is one line per record in the format apt-get download $pkg=$ver -a=$arch. Basically a list of the installation commands for available packages...

So far what I've got is apt-cache show "$package" | awk '/^Package: / { print $2 } /^Version: / { print $2 } /^Architecture: / { print $2 }' | xargs -n3 | awk '{printf "apt-get download %s=%s -a=%s\n", $1, $3, $2}'

This is the actual output:

apt-get download caffeine=2.8.3 -a=all
apt-get download caffeine=2.6+555~ubuntu14.04.1 -a=all
apt-get download caffeine=2.4.1+478~raring1 -a=all

This is as desired, but it appears to be a fluke: it only works because the order of the fields is consistent in this example. It would break if the fields appeared in a different order.

I can do parsing like this using object orientation in Python, but I'm having difficulty getting it done in one awk command. The only way I can see to do this correctly would be to split each record into individual tmp files (using split or something along those lines) and then parse each file individually (which is straightforward). Obviously I'd really like to avoid unnecessary I/O, as this seems like something that awk is well equipped for. Do any awk pros know how to solve this? I'd even be open to a Perl one-liner or using bash, but I'm really interested in learning how to better leverage awk.

asked Mar 30 '15 23:03

Six

2 Answers

$ package=sed
$ apt-cache show "$package" | awk '/^Package: /{p=$2} /^Version: /{v=$2} /^Architecture: /{a=$2} /^$/{print "apt-get download "p"="v" -a="a}' 
apt-get download sed=4.2.1-10 -a=amd64

How it works

/^Package: /{p=$2}

Save the package information in variable p.

/^Version: /{v=$2}

Save the version information in variable v.

/^Architecture: /{a=$2}

Save the architecture information in variable a.

/^$/{print "apt-get download "p"="v" -a="a}

When we reach a blank line, print out the information in the desired form.

My version of apt-cache always outputs a blank line after each package. Your sample output is missing the last blank line. If your apt-cache genuinely does not produce that last blank line, then we will need to add a little bit more code to compensate.
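
If the final blank line really is missing, one possible way to compensate is to move the printing into a function and call it from an END block as well. This is only a sketch: the function name emit and the sample input are assumptions made for illustration, not part of apt-cache. Resetting the variables after each record also keeps a value from one record from leaking into the next when a field is absent.

```shell
# Sample input: two records, fields in different orders,
# and NO blank line after the last record.
sample='Package: caffeine
Architecture: all
Version: 2.8.3

Package: caffeine
Version: 2.4.1+478~raring1
Architecture: all'

out=$(printf '%s\n' "$sample" | awk '
# Print one command for the record collected so far, then reset the
# variables so a stale value is never reused by the next record.
function emit() {
    if (p != "") print "apt-get download " p "=" v " -a=" a
    p = v = a = ""
}
/^Package: /      { p = $2 }
/^Version: /      { v = $2 }
/^Architecture: / { a = $2 }
/^$/              { emit() }   # blank line: end of a record
END               { emit() }   # last record when no blank line follows
')
printf '%s\n' "$out"
```

When the input does end with a blank line, the call in END finds p empty and prints nothing, so the same program should handle both cases.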

As a matter of style, some may prefer printf to print. In that case, replace the last pattern-action pair above with:
```
/^$/{printf "apt-get download %s=%s -a=%s\n",p,v,a}
```

answered Oct 11 '22 00:10

John1024

I find the best way to deal with data that contains name-to-value pairs is to create an array of those pairs and then just access the values by their names:

$ cat tst.awk
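BEGIN { RS=""; FS="\n" }    # paragraph mode: one record per block, one field per line
{
    delete n2v              # forget the pairs from the previous record
    for (i=1;i<=NF;i++) {
        if ($i !~ /^ /) {   # skip continuation lines, which start with a space
            name = gensub(/:.*/,"",1,$i)           # text before the first colon
            value = gensub(/[^:]+:\s+/,"",1,$i)    # text after "name: "
            n2v[name] = value
        }
    }
    printf "apt-get download %s=%s -a=%s\n",
        n2v["Package"], n2v["Version"], n2v["Architecture"]
}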

$ awk -f tst.awk file
apt-get download caffeine=2.8.3 -a=all
apt-get download caffeine=2.6+555~ubuntu14.04.1 -a=all
apt-get download caffeine=2.4.1+478~raring1 -a=all

The above uses a couple of gawk extensions but is easily adapted to any awk if necessary.
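
For reference, here is a sketch of how the same idea might look without the gawk extensions (gensub and \s), assuming a POSIX awk. It splits each line at the first ": " using index and substr, and empties the array with split("", n2v). The sample input is made up for illustration, and a field written as "Name:value" with no space after the colon would be ignored by this version.

```shell
sample='Package: caffeine
Architecture: all
Version: 2.8.3
Description: Caffeine
 A status bar application: continuation lines start with a space.

Package: sed
Version: 4.2.1-10
Architecture: amd64'

out=$(printf '%s\n' "$sample" | awk '
BEGIN { RS = ""; FS = "\n" }       # paragraph mode: one record per block
{
    split("", n2v)                 # portable way to empty an array
    for (i = 1; i <= NF; i++) {
        pos = index($i, ": ")      # first ": " separates name from value
        if ($i !~ /^ / && pos > 0)
            n2v[substr($i, 1, pos - 1)] = substr($i, pos + 2)
    }
    printf "apt-get download %s=%s -a=%s\n",
        n2v["Package"], n2v["Version"], n2v["Architecture"]
}')
printf '%s\n' "$out"
```

Because the values are looked up by name, the order of the fields within a record does not matter, and paragraph mode handles the last record whether or not a blank line follows it.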

answered Oct 11 '22 01:10

Ed Morton

Tags: regex, linux, bash, sed, awk
