 

How can I extract all quotations in a text?

I'm looking for a SimpleGrepSedPerlOrPythonOneLiner that outputs all quotations in a text.


Example 1:

echo '"HAL," noted Frank, "said that everything was going extremely well."' | SimpleGrepSedPerlOrPythonOneLiner

stdout:

"HAL,"
"said that everything was going extremely well."

Example 2:

cat MicrosoftWindowsXPEula.txt | SimpleGrepSedPerlOrPythonOneLiner

stdout:

"EULA"
"Software"
"Workstation Computer"
"Device"
"DRM"

etc.

(link to the corresponding text).

asked Dec 05 '08 11:12 by secr




1 Answer

I like this:

perl -ne 'print "$_\n" foreach /"((?>[^"\\]|\\+[^"]|\\(?:\\\\)*")*)"/g;'

It's a little verbose, but it handles escaped quotes and backtracking a lot better than the simplest implementation. Note that it prints the text between the quotes, without the quote characters themselves. What it's saying is:

my $re = qr{
   "               # Begin with a literal quote
   (
     (?>           # atomic group: once one of the alternatives below has
                   # matched, the engine will not backtrack into it. It either
                   # matches or it does not, so we fail out of the branch quickly

         [^"\\]    # a character that is not a dquote or a backslash
     |   \\+       # OR one or more backslashes followed by
         [^"]      # something that is not a quote
     |   \\        # OR a single backslash
         (?>\\\\)* # followed by any number of *pairs* of backslashes (as units)
         "         # and a quote
     )*            # any number of these alternatives
   )               # all captured together
   "               # End with a literal quote
}x;
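
Since the question also allows Python, here is a rough sketch of the same idea in that language. It is not a literal port of the Perl pattern above: it swaps in the simpler, widely used "non-special character or backslash-plus-anything" idiom, which avoids atomic groups (Python's re module only gained them in 3.11). The names QUOTED and extract_quotations are made up for this example.

```python
import re

# Hypothetical name. Matches a double-quoted span in which a backslash
# escapes the character that follows it (so \" does not end the quote).
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"', re.DOTALL)

def extract_quotations(text):
    """Return every quoted span in text, with the surrounding quotes kept."""
    return [match.group(0) for match in QUOTED.finditer(text)]

sample = r'He said "a \"nested\" word" and then "plain".'
for quotation in extract_quotations(sample):
    print(quotation)
```

The two alternatives inside the group can never match the same character, so the pattern should not backtrack badly even without an atomic group.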

If you don't need that much power (say the text is mostly dialog rather than structured quotes), then

/"([^"]*)"/ 

probably works about as well as anything else.
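
For comparison, a minimal Python sketch of that simple pattern, run against Example 1 from the question. The function name simple_quotations is hypothetical, and the sketch assumes the input uses straight double quotes and contains no backslash-escaped quotes.

```python
import re

# Hypothetical name. Same idea as /"([^"]*)"/, but without the capture
# group so that findall() returns each match with its quotes included.
SIMPLE_QUOTE = re.compile(r'"[^"]*"')

def simple_quotations(text):
    """Return every "..." span in text, ignoring backslash escapes."""
    return SIMPLE_QUOTE.findall(text)

example = '"HAL," noted Frank, "said that everything was going extremely well."'
for quotation in simple_quotations(example):
    print(quotation)
```

Unlike perl -ne, which works one line at a time, this searches whatever string it is given, so a quotation that spans several lines is still found if the whole text is passed in at once.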

answered Oct 12 '22 05:10 by Axeman