Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

extract matches of a regex capturing group from a file

I want to perform the title-named action under linux command-line(several ca bash script will also do). the command I tried is:

sed 's/href="([^"])"/$1/g' page.html > list.lst

but obviously it failed.

To be precise, here is my input:

<link rel="stylesheet" type="text/css" href="style/css/colors.css" />
<link rel="stylesheet" type="text/css" href="style/css/global.css" />
<link rel="stylesheet" type="text/css" href="style/css/icons.css" />

the output I want would be a comma-separated or space-separated list of all matches in the input file:

style/css/colors.css,style/css/global.css,style/css/icons.css

I think I got the right expression: href="([^"]*)"

but I have no clue how to perform this. sed would do a search/replace which is not exactly what I want.( to the contrary, I only need to keep matches and throw the rest away, and not to replace them )

like image 786
BiAiB Avatar asked Jul 26 '11 14:07

BiAiB


People also ask

How do I match a group in regex?

Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses ( and ) metacharacters. Any subpattern inside a pair of parentheses will be captured as a group.

How do you find the substring that matched the last capturing group of the regex in Python?

To get access to the text matched by each regex group, pass the group's number to the group(group_number) method. So the first group will be a group of 1. The second group will be a group of 2 and so on. So this is the simple way to access each of the groups as long as the patterns were matched.

When capturing regex groups what datatype does the groups method return?

The re. groups() method This method returns a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern.


1 Answers

grep href page.html | sed 's/^.*href="\([^"]*\)".*$/\1/' | xargs | sed 's/ /,/g'

This will extract all the lines that contain href in them and will only get the first href on each line. Also, refer to this post about parsing HTML with regular expressions.

like image 151
rid Avatar answered Oct 15 '22 08:10

rid