Regex for getting text between two texts using command prompt

Question

I am trying to extract the contents within the body tag of my HTML file using the command prompt and the findstr command. My HTML is as below:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>AdminWeb</title>
  <base href="/wwwroot/admin-web/">

  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="icon" type="image/x-icon" href="favicon.ico">
</head>
<body><app-root></app-root><script src="/wwwroot/admin-web/runtime.js"></script><script src="/wwwroot/admin-web/file1.js" nomodule></script><script src="/wwwroot/admin-web/file2.js"></script><script src="/wwwroot/admin-web/styles.js"></script><script src="/wwwroot/admin-web/vendor.js"></script><script src="/wwwroot/admin-web/main.js"></script></body>
</html>

The output I want is:

<app-root></app-root>
<script src="/wwwroot/admin-web/runtime.js"></script><script src="/wwwroot/admin-web/file1.js" nomodule></script><script src="/wwwroot/admin-web/file2.js"></script><script src="/wwwroot/admin-web/styles.js"></script><script src="/wwwroot/admin-web/vendor.js"></script><script src="/wwwroot/admin-web/main.js"></script>

I am trying to achieve this using a regular expression:

findstr /R (?<=<body>)(.*)(?=</body>)  test.html

But this is not working in the command prompt, although the same regex works in JS.

Thanks in advance.

Wiktor Stribiżew · Accepted Answer

First of all, findstr supports only a small subset of regex syntax and lacks many features that modern regex engines offer, such as the lookbehind `(?<=...)`, the lookahead `(?=...)` and the lazy quantifier `.*?` used here. The latest ECMAScript 2018+ compliant JavaScript engines (Chrome, Node.js, etc.) do support them, so "this regex is working in js" does not mean the same pattern will work anywhere else. It certainly won't work in findstr.

You may take the hard way and go on to study how to write a batch script for this. However, there is a much simpler way with other built-in Windows apps.

I strongly suggest PowerShell, as it gives you access to many of the features .NET provides, including its regex engine.

Open a PowerShell console and use:

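# Input and output paths (placeholders; adjust to your files)
$pathToFile = 'c:\...\...\you_file.txt'
$output_file = 'c:\...\...\you_file_out.txt'
# (?s) lets . match newlines; the lookarounds keep <body> and </body> out of the match
$rx = '(?s)(?<=<body>).*?(?=</body>)'
# Read the whole file as one string, collect every match, and write the matched text
Get-Content $pathToFile -Raw | Select-String $rx -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file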
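
To see what the pattern does before pointing it at a real file, here is a minimal sketch that runs the same regex against an in-memory string. The `$html` sample and the `$inner` variable are made up for illustration; only the pattern itself comes from the command above.

```powershell
# Hypothetical sample input standing in for the contents of the HTML file
$html = @'
<html>
<body><app-root></app-root>
<script src="main.js"></script></body>
</html>
'@

# Same pattern as above: (?s) makes . match newlines, and the
# lookbehind/lookahead exclude the <body> and </body> tags themselves
$rx = '(?s)(?<=<body>).*?(?=</body>)'

# [regex]::Match returns the first match; .Value is the matched text
$inner = [regex]::Match($html, $rx).Value
$inner
```

Note that `(?<=<body>)` only matches a bare `<body>` tag. If the tag carries attributes, the pattern would need adjusting.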

NOTE: For HTML it is more robust to query the parsed document than to match it with a regex. One way is IE automation:

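$output_file = 'c:\...\...\you_file_out.txt'
$url = 'http://your_site_here.tld/...'
# Start Internet Explorer through COM and load the page
$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Visible = $true
$ie.Navigate($url)

# Wait until the page has finished loading (ReadyState 4 = complete)
while ($ie.Busy -eq $true -or $ie.ReadyState -ne 4) { Start-Sleep 2 }

# Write the inner HTML of the first <body> element to the output file
$doc = $ie.Document
$tags = $doc.getElementsByTagName("body")
$tags[0].innerHTML > $output_file
# Close the browser window that was opened above
$ie.Quit()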

Tags: regex, powershell

Dennis
