
gawk or grep: single line and ungreedy

I'd like to print headers of *.java files in all sub-directories recursively that have more than two type parameters (i.e. parameters within <R ... H> in the samples below). One of the files looks like (with names reduced for brevity):

multiple-lines.java

class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E,
    X extends F, Y extends G, Z extends H>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) { 
    // ... code ...
  }
}

with expected output:

ClazzA.java:10: class ClazzA<R extends A,
ClazzA.java:11:     S extends B<T>, T extends C<T>,
ClazzA.java:12:     U extends D, W extends E,
ClazzA.java:13:     X extends F, Y extends G, Z extends H>
ClazzA.java:14:     extends OtherClazz<S> implements I<T> {

but another could look like this, as well:

single-line.java

class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) { 
    // ... code ...
  }
}

with expected output:

ClazzB.java:42: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

Files that should not be considered/printed:

X-no-parameter.java

class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {

  public void method(Type<A, B> x) { 
    // ... code ...
  }
}

X-one-parameter.java

class ClazzD<R extends A>  // only one type parameter
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

X-two-parameters.java

class ClazzE<R extends A, S extends B<T>>  // only two type parameters
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

X-two-line-parameters.java

class ClazzF<R extends A,  // only two type parameters
    S extends B<T>>        // on two lines
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

All the spaces in the files could be \s+. extends [...] and implements [...] immediately prior to { are optional. extends [...] is also optional at each of the type parameters. See The Java® Language Specification, 8.1. Class Declarations for details.

I'm using gawk in the Git Bash:

$ gawk --version
GNU Awk 5.0.0, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)

with:

find . -type f -name '*.java' | xargs gawk -f ws-class-type-parameter.awk > ws-class-type-parameter.log

and ws-class-type-parameter.awk:

# /start/ , /end/ ... pattern

#/class ClazzA<.*,.*/      , /{/  {    # 5 lines, OK for ClazzA, but in real it prints classes with 2 or less type parameters, too
#/class ClazzA<.*,.*,/     , /{/  {    # no line with ClazzA, since there's no second ',' on its first line
#/class ClazzA<.*,.*,/s    , /{/  {    # 500.000+(!) lines
#/class ClazzA<.*,.*,/s    , /{/U {    # 500.000+(!) lines
#/class ClazzA<.*,.*,/sU   , /{/U {    # 500.000+(!) lines
 /(?s)class ClazzA<.*,.*,/ , /{/  {    # no line

    match( FILENAME, "/.*/.." )
    print substr( FILENAME, RLENGTH ) ":" FNR ": " $0
}

This finds all the *.java files and runs gawk on each of them, which is fine, but the results of my attempts are shown as comments after each pattern. Please note: the ClazzA literal is just for testing and for the MCVE here. In reality it could be \w+, but that produces 500.000+ lines from thousands of files when testing.

It works if I try it on regex101.com. Well, sort of. I didn't find how to define /start-regex/,/end-regex/ there, so I added another .* in between.

I took the flags from there, but I couldn't find a description of whether gawk supports the flag syntax /.../sU , /.../U, so I just gave it a try. A now deleted comment told me that no flavour of awk supports this.
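A possible explanation for the 500.000+ lines, offered as a sketch rather than a definitive diagnosis: since awk has no regex flag syntax, /re/s is most likely parsed as the match result (0 or 1) concatenated with an uninitialized variable named s. That gives the non-empty string "0" or "1", and a non-empty string is true when used as a pattern, so the range would start (and, with /{/U, also end) on every line. The sample input below is made up purely for illustration:

```shell
# Sample input is made up; any three lines will do.
input=$(printf 'one\ntwo\nthree\n')

# "/nomatch/s" is read as ($0 ~ /nomatch/) concatenated with the unset
# variable s. That yields the string "0", and a non-empty string used as a
# pattern is true, so every line is selected.
with_flag=$(printf '%s\n' "$input" | awk '/nomatch/s' | awk 'END { print NR }')

# The same regex without the stray "s" selects nothing.
without_flag=$(printf '%s\n' "$input" | awk '/nomatch/' | awk 'END { print NR }')

echo "lines selected with /nomatch/s: $with_flag"
echo "lines selected with /nomatch/:  $without_flag"
```

If the parse is as described, this reports 3 selected lines with the stray s and 0 without it.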

I also tried it with grep:

$ grep --version
grep (GNU grep) 3.1
...
$ grep -nrPf types.grep *.java

with types.grep:

(?s).*class\s+\w+\s*<.*,.*,.*>.*{

which results in output for single-line.java only.

(?s) is Perl-compatible syntax (--perl-regexp, -P), which grep --help claims to support.

UPDATE

The solution in Ed Morton's answer works well but it turned out that there are auto-generated files with methods like:

    /** more code before here */    
    public void setId(String value) {
        this.id = value;
    }

    /**
     * Gets a map that contains attributes that aren't bound to any typed property on this class.
     * 
     * <p>
     * the map is keyed by the name of the attribute and 
     * the value is the string value of the attribute.
     * 
     * the map returned by this method is live, and you can add new attribute
     * by updating the map directly. Because of this design, there's no setter.
     * 
     * 
     * @return
     *     always non-null
     */
    public Map<QName, String> getOtherAttributes() {
        return otherAttributes;
    }

which give an output of e.g.:

AbstractAddressType.java:81:      * Gets a map that contains attributes that aren't bound to any typed property on this class.
AbstractAddressType.java:82:      * 
AbstractAddressType.java:83:      * <p>
AbstractAddressType.java:84:      * the map is keyed by the name of the attribute and 
AbstractAddressType.java:85:      * the value is the string value of the attribute.
AbstractAddressType.java:86:      * 
AbstractAddressType.java:87:      * the map returned by this method is live, and you can add new attribute
AbstractAddressType.java:88:      * by updating the map directly. Because of this design, there's no setter.
AbstractAddressType.java:89:      * 
AbstractAddressType.java:90:      * 
AbstractAddressType.java:91:      * @return
AbstractAddressType.java:92:      *     always non-null
AbstractAddressType.java:93:      */
AbstractAddressType.java:94:     public Map<QName, String> getOtherAttributes() {

and others with class comments and annotations like:

/**
 * This class was generated by Apache CXF 3.3.4
 * 2020-11-30T12:03:21.251+01:00
 * Generated source version: 3.3.4
 *
 */
@WebService(targetNamespace = "urn:SZRServices", name = "SZR")
@XmlSeeAlso({at.gv.egov.pvp1.ObjectFactory.class, org.w3._2001._04.xmldsig_more_.ObjectFactory.class, ObjectFactory.class, org.xmlsoap.schemas.ws._2002._04.secext.ObjectFactory.class, org.w3._2000._09.xmldsig_.ObjectFactory.class, at.gv.e_government.reference.namespace.persondata._20020228_.ObjectFactory.class})
public interface SZR {
// more code after here

with an output of e.g.:

SZR.java:13:  * This class was generated by Apache CXF 3.3.4
SZR.java:14:  * 2020-10-12T11:51:35.175+02:00
SZR.java:15:  * Generated source version: 3.3.4
SZR.java:16:  *
SZR.java:17:  */
SZR.java:18: @WebService(targetNamespace = "urn:SZRServices", name = "SZR")
SZR.java:19: @XmlSeeAlso({at.gv.egov.pvp1.ObjectFactory.class, org.w3._2001._04.xmldsig_more_.ObjectFactory.class, ObjectFactory.class, org.xmlsoap.schemas.ws._2002._04.secext.ObjectFactory.class, org.w3._2000._09.xmldsig_.ObjectFactory.class, at.gv.e_government.reference.namespace.persondata._20020228_.ObjectFactory.class})

Gerold Broser asked Nov 20 '20 19:11


2 Answers

Using any POSIX awk in any shell on every UNIX box:

$ cat tst.awk
/[[:space:]]*class[[:space:]]*/ {
    inDef = 1
    fname = FILENAME
    sub(".*/","",fname)
    def = out = ""
}
inDef {
    out = out fname ":" FNR ": " $0 ORS

    # Remove comments (not perfect but should work for 99.9% of cases)
    sub("//.*","")
    gsub("/[*]|[*]/","\n")
    gsub(/\n[^\n]*\n/,"")

    def = def $0 ORS
    if ( /{/ ) {
        if ( gsub(/,/,"&",def) >= 2 ) {    # more than two type parameters means at least two commas
            printf "%s", out
        }
        inDef = 0
    }
}

$ find tmp -type f -name '*.java' -exec awk -f tst.awk {} +
multiple-lines.java:1: class ClazzA<R extends A,
multiple-lines.java:2:     S extends B<T>, T extends C<T>,
multiple-lines.java:3:     U extends D, W extends E,
multiple-lines.java:4:     X extends F, Y extends G, Z extends H>
multiple-lines.java:5:     extends OtherClazz<S> implements I<T> {
single-line.java:1: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

The above was run using this input:

$ head tmp/*
==> tmp/X-no-parameter.java <==
class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {

  public void method(Type<A, B> x) {
    // ... code ...
  }
}

==> tmp/X-one-parameter.java <==
class ClazzD<R extends A>  // only one type parameter
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/X-two-line-parameters.java <==
class ClazzF<R extends A,  // only two type parameters
    S extends B<T>>        // on two lines
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/X-two-parameters.java <==
class ClazzE<R extends A, S extends B<T>>  // only two type parameters
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/multiple-lines.java <==
class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E,
    X extends F, Y extends G, Z extends H>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
    // ... code ...
  }
}

==> tmp/single-line.java <==
class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
    // ... code ...
  }
}

The above is just a best effort without writing a parser for the language, with only the OP's posted sample input/output to go on for what needs to be handled.
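The false positives described in the question's UPDATE come from the word "class" appearing in comments, and the script above also counts every comma in the header, including those in nested generics and in the implements list. A possible refinement, as a rough sketch only: start a definition only where class is preceded by nothing but whitespace and modifiers, and count commas at nesting depth 1 of the <...> that directly follows the class name. The file names, the modifier list, and the helper typeParamCount are assumptions made for this illustration; annotations on the same line, interfaces, and block comments inside the header are not handled.

```shell
# Scratch directory with two made-up sample files (names are hypothetical).
dir=$(mktemp -d)

cat > "$dir/Three.java" <<'EOF'
/**
 * A comment that mentions this class, with commas, here, and here.
 */
public class Three<R extends A,
    S extends Map<K, V>, T>
    extends Other<S> {
}
EOF

cat > "$dir/Two.java" <<'EOF'
class Two<R extends Map<K, V>, S extends List<T>> implements I<A>, J<B>, K<C> {
}
EOF

cat > "$dir/tparams.awk" <<'EOF'
# Start only where "class" follows nothing but whitespace and modifiers.
/^[ \t]*((public|protected|private|abstract|static|final)[ \t]+)*class[ \t]+[A-Za-z0-9_$]+/ {
    inDef = 1
    fname = FILENAME
    sub(".*/", "", fname)
    def = out = ""
}
inDef {
    out = out fname ":" FNR ": " $0 ORS
    line = $0
    sub("//.*", "", line)            # rough removal of trailing // comments
    def = def line " "
    if ( index(line, "{") ) {
        if ( typeParamCount(def) > 2 ) {
            printf "%s", out
        }
        inDef = 0
    }
}
# Number of type parameters: commas at nesting depth 1 of the <...> that
# directly follows the class name, plus one. Returns 0 if there is none.
function typeParamCount(s,    i, c, depth, n) {
    match(s, /class[ \t]+[A-Za-z0-9_$]+[ \t]*/)
    s = substr(s, RSTART + RLENGTH)
    if ( substr(s, 1, 1) != "<" ) {
        return 0
    }
    n = 1
    for ( i = 1; i <= length(s); i++ ) {
        c = substr(s, i, 1)
        if ( c == "<" ) {
            depth++
        } else if ( c == ">" ) {
            if ( --depth == 0 ) {
                return n
            }
        } else if ( c == "," && depth == 1 ) {
            n++
        }
    }
    return 0                         # unbalanced angle brackets
}
EOF

result=$(awk -f "$dir/tparams.awk" "$dir/Three.java" "$dir/Two.java")
printf '%s\n' "$result"
rm -rf "$dir"
```

With these two sample files, only the three header lines of Three.java (lines 4 to 6) should be printed. Two.java has four commas in its header but only two type parameters, so it is skipped, and the comment in Three.java no longer starts a definition.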

Ed Morton answered Oct 17 '22 20:10


Note: Presence of comments can cause these solutions to fail.

With ripgrep (https://github.com/BurntSushi/ripgrep)

rg -nU --no-heading '(?s)class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java
  • -n enables line numbering (this is the default if output is to the terminal)
  • -U enables multiline matching
  • --no-heading: by default, ripgrep displays matching lines grouped under the filename as a header; this option makes ripgrep behave like GNU grep, with a filename prefix on each output line
  • [^{]* is used instead of .* to prevent matching , and > elsewhere in the file, otherwise lines like public void method(Type<Q, R> x) { will get matched
  • -m option can be used to limit number of matches per input file, which will give an additional benefit of not having to search entire input file

If you use the above regexp with GNU grep, note that:

  • grep matches only one line at a time. If you use the -z option, grep treats ASCII NUL as the record separator, which effectively lets you match across multiple lines, provided the input contains no NUL characters that would split it. Another effect of -z is that a NUL character is appended to each output result (this can be fixed by piping the results to tr '\0' '\n')
  • the -o option is needed to print only the matching portion, which means you won't be able to get a line number prefix
  • for the given task, -P isn't needed: grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java | tr '\0' '\n' gives a result similar to the ripgrep command. However, you won't get a line number prefix, the filename prefix appears only once per matching portion instead of on each matching line, and you won't get the rest of the line before class and after {
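
To make the shape of the grep -z output concrete, here is a small sketch that assumes GNU grep (for -z and for \s and \w in -E patterns). The scratch directory and the two sample files are made up for illustration and are shortened versions of the question's samples:

```shell
# Scratch directory with two made-up sample files (assumes GNU grep).
dir=$(mktemp -d)

cat > "$dir/multi.java" <<'EOF'
class ClazzA<R extends A,
    S extends B<T>, T extends C<T>>
    extends OtherClazz<S> {
  public void method(Type<Q, R> x) {
  }
}
EOF

cat > "$dir/two.java" <<'EOF'
class ClazzF<R extends A,
    S extends B<T>>
    extends OtherClazz<S> {
  public void method(Type<X, Y> x) {
  }
}
EOF

# -z: NUL-separated records, so each file is one record and [^{]* can cross newlines.
# -o: print only the matched part; tr turns the trailing NULs back into newlines.
matches=$(grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' "$dir"/*.java | tr '\0' '\n')
printf '%s\n' "$matches"
rm -rf "$dir"
```

Only the ClazzA header should be reported, with the filename prefix appearing once before class and the continuation lines following without a prefix. ClazzF has a single comma before its first {, so it does not match.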

Sundeep answered Oct 17 '22 19:10