
Listing files with curl

Tags: regex, curl

I am trying to list all the gz files from this website:

site=http://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/rdf/
curl -s "$site" --list-only | sed -n 's%.*href="rdf/uni([^"]*\.rdf.gz)".*%\1%p'

But I am getting this error:

sed: -e expression #1, char 40: invalid reference \1 on `s' command's RHS
jhon.smith asked Oct 23 '25 15:10


1 Answer

I would avoid using regex to parse HTML. Here is an alternative that uses Perl with Mojolicious as the parser:

perl -Mojo -E '
    g(q|http://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/rdf/|)
    ->dom
    ->find(q|a|)
    ->each(sub { 
        my $t =  $_->text; 
        say $t if $t =~ m/rdf\.gz\Z/ 
    })'

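But if you insist on sed, your regular expression has some problems. First, sed uses basic regular expressions by default, so parentheses must be escaped as \( and \) to do grouping; unescaped, they match literal characters and no group exists, which is why \1 is reported as an invalid reference. Second, rdf/uni does not match anything, because the href values on that page are bare file names such as citations.rdf.gz. Third, [^"]* is greedy and runs past the rdf.gz extension before backtracking. I changed it to match up to the first . and then check the extension, but remember that this is very fragile. It could fail in many ways, for example with a file that has an extra . in its name: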

curl -s "$site" --list-only | sed -n 's%.*href="\([^.]*\.rdf\.gz\)".*%\n\1%; ta; b; :a; s%.*\n%%; p'

Both commands yield:

citations.rdf.gz
databases.rdf.gz
diseases.rdf.gz
enzyme.rdf.gz
go.rdf.gz
journals.rdf.gz
keywords.rdf.gz
locations.rdf.gz
pathways.rdf.gz
taxonomy.rdf.gz
tissues.rdf.gz
uniparc.rdf.gz
uniprot.rdf.gz
uniref.rdf.gz
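
If the sed route is still preferred, a slightly less fragile variant can be sketched with extended regular expressions, where ( and ) group without backslashes. This is only a sketch under assumptions: it assumes a sed that accepts -E (GNU and BSD sed do), the function name list_rdf_gz is made up for illustration, and the sample HTML is invented to stand in for the directory listing. Because [^"]* stops at the closing quote and the engine backtracks to find .rdf.gz, a name with an extra . in it should still match.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): reads HTML on standard input
# and prints every href value that ends in .rdf.gz, one per line.
list_rdf_gz() {
    # -E: extended regular expressions, so ( ) create group 1 directly.
    # [^"]* cannot cross the closing quote, so \1 is the whole file name.
    sed -n -E 's%.*href="([^"]*\.rdf\.gz)".*%\1%p'
}

# Invented sample input, one anchor per line (assumed layout of the listing).
sample='<a href="citations.rdf.gz">citations.rdf.gz</a>
<a href="README">README</a>
<a href="extra.dot.rdf.gz">extra.dot.rdf.gz</a>'

# Only the two .rdf.gz names are printed; README is skipped.
printf '%s\n' "$sample" | list_rdf_gz
```

Against the real listing the function would be fed from curl instead of the sample, for example curl -s "$site" | list_rdf_gz. It still relies on one link per line and double-quoted href attributes, so the HTML parser approach remains the safer choice.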
Birei answered Oct 26 '25 04:10


