Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse multi line records (with awk?)

I'm trying to figure out how to extract particular fields from multi line records separated by \n\n.

In this instance, it happens to be output from apt-cache akin to DEBIAN control files. See output of apt-cache show "$package"

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 641
Maintainer: Reuben Thomas <[email protected]>
Architecture: all
Version: 2.8.3
Depends: python3:any (>= 3.3.2-2~), python3, gir1.2-gtk-3.0, gir1.2-appindicator3-0.1, python3-xlib, python3-pkg-resources, libnet-dbus-perl
Filename: pool/main/c/caffeine/caffeine_2.8.3_all.deb
Size: 58774
MD5sum: 4438db3f6d1cf43a4f4b49cc7f24cda0
SHA1: e748370ac5ccd7de6fc9466ce0451d2e90d179d4
SHA256: ae303b4e32949cc1e1af80df7217e3406291679e3f18fa8f78a5bbb97504c4f6
Description-en: Prevent the desktop becoming idle in full-screen mode
 Caffeine stops the desktop becoming idle when an application
 is running full-screen. A desktop indicator ‘caffeine-indicator’
 supplies a manual toggle, and the command ‘caffeinate’ can be used
 to prevent idleness for the duration of any command.
Description-md5: 7c14f8adc007b10f6ecafed36260bedb

Package: caffeine
Priority: optional
Section: misc
Installed-Size: 655
Maintainer: Reuben Thomas <[email protected]>
Architecture: all
Version: 2.6+555~ubuntu14.04.1
Depends: python:any (<< 2.8), python:any (>= 2.7.5-5~), python, gir1.2-gtk-2.0, gir1.2-appindicator3-0.1, x11-utils, python-dbus
Filename: pool/main/c/caffeine/caffeine_2.6+555~ubuntu14.04.1_all.deb
Size: 58604
MD5sum: 1051c3f7d40d344f986bb632d7436849
SHA1: 5e5f622595e8cbba8fb7468b3cffe2914b0ba110
SHA256: 11c5bbf2d28dcda6a7b82872195f740f1f79521b60d3c9acea3037bf0ab3a60e
Description: Prevent the desktop becoming idle
 Caffeine allows the user to prevent the desktop becoming idle,
 either manually or when certain applications are run. This
 prevents screen-blanking, locking, suspending, and so on.
Description-md5: 738866350e5086e77408d7a9c7ffa59b

Package: caffeine
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 794
Maintainer: Isaiah Heyer <[email protected]>
Architecture: all
Version: 2.4.1+478~raring1
Depends: dconf-gsettings-backend | gsettings-backend, python (>= 2.6), python-central (>= 0.6.11), python-xlib, python-appindicator, python-xdg, python-notify, python-kaa-metadata
Description: Caffeine
 A status bar application able to temporarily prevent the activation
 of both the screensaver and the "sleep" powersaving mode.
Description-md5: 1c29acf1ab0f2e6636db29fbde1d14a3
Homepage: https://launchpad.net/caffeine
Python-Version: >= 2.6

My desired output is one line per record in the format apt-get download $pkg=$ver -a=$arch. Basically a list of the installation commands for available packages...

So far what I've got is apt-cache show "$package" | awk '/^Package: / { print $2 } /^Version: / { print $2 } /^Architecture: / { print $2 }' | xargs -n3 | awk '{printf "apt-get download %s=%s -a=%s\n", $1, $3, $2}'

This is the actual output:

apt-get download caffeine=2.8.3 -a=all
apt-get download caffeine=2.6+555~ubuntu14.04.1 -a=all
apt-get download caffeine=2.4.1+478~raring1 -a=all

The is as desired but it appears to be a fluke only because the order of the fields is consistent in this example. It would break if the order of fields was different.

I can do parsing like this using object orientation in Python but I'm having difficulty getting this done in one awk command. The only way I can see doing this correctly would be to split each record into individual tmp files (using split or something along those lines) and then parse each file individually (which is straightforward). Obviously I'd really like to avoid unnecessary I/O as this seems like something that awk is well equipped for. Any awk pro's know how to solve this? I'd even be open to a Perl one-liner or utilizing bash but I'm really interested in learning how to better leverage awk.

like image 712
Six Avatar asked Mar 30 '15 23:03

Six


People also ask

What is awk '{ print $3 }'?

I. If you notice awk 'print $1' prints first word of each line. If you use $3, it will print 3rd word of each line.

How do I use NF in awk?

NF is a predefined variable whose value is the number of fields in the current record. awk automatically updates the value of NF each time it reads a record. No matter how many fields there are, the last field in a record can be represented by $NF . So, $NF is the same as $7 , which is ' example.

Does awk process line by line?

An Example with Two RulesThe awk utility reads the input files one line at a time. For each line, awk tries the patterns of each rule. If several patterns match, then several actions execute in the order in which they appear in the awk program. If no patterns match, then no actions run.


2 Answers

$ package=sed
$ apt-cache show "$package" | awk '/^Package: /{p=$2} /^Version: /{v=$2} /^Architecture: /{a=$2} /^$/{print "apt-get download "p"="v" -a="a}' 
apt-get download sed=4.2.1-10 -a=amd64

How it works

  • /^Package: /{p=$2}

    Save the package information in variable p.

  • /^Version: /{v=$2}

    Save the version information in variable v.

  • /^Architecture: /{a=$2}

    Save the architecture information in variable a.

  • /^$/{print "apt-get download "p"="v" -a="a}

    When we reach a blank line, print out the information in the desired form.

    My version of apt-cache always outputs a blank line after each package. Your sample output is missing the last blank line. If your apt-cache genuinely does not produce that last blank line, then we will need to add a little bit more code to compensate.

    As a matter of style, some may prefer printf to print. In which case, replace the above with:

    /^$/{printf "apt-get download %s=%s -a=%s\n",v,p,a}' 
    
like image 193
John1024 Avatar answered Oct 11 '22 00:10

John1024


I find the best way to deal with data that contains name to value pairings is to create an array of those pairs and then just access the values by their names:

$ cat tst.awk
BEGIN { RS=""; FS="\n" }
{
    delete n2v
    for (i=1;i<=NF;i++) {
        if ($i !~ /^ /) {
            name = gensub(/:.*/,"","",$i)
            value = gensub(/[^:]+:\s+/,"","",$i)
            n2v[name] = value
        }
    }
    printf "apt-get download %s=%s -a=%s\n",
        n2v["Package"], n2v["Version"], n2v["Architecture"]
}

$ awk -f tst.awk file
apt-get download caffeine=2.8.3 -a=all
apt-get download caffeine=2.6+555~ubuntu14.04.1 -a=all
apt-get download caffeine=2.4.1+478~raring1 -a=all

The above uses a couple of gawk extensions but is easily adapted to any awk if necessary.

like image 45
Ed Morton Avatar answered Oct 11 '22 01:10

Ed Morton