
Regex for getting text between two texts using command prompt

I am trying to extract the contents within the body tag of my HTML file using the command prompt and the findstr command. My HTML is as below:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>AdminWeb</title>
  <base href="/wwwroot/admin-web/">

  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="icon" type="image/x-icon" href="favicon.ico">
</head>
<body><app-root></app-root><script src="/wwwroot/admin-web/runtime.js"></script><script src="/wwwroot/admin-web/file1.js" nomodule></script><script src="/wwwroot/admin-web/file2.js"></script><script src="/wwwroot/admin-web/styles.js"></script><script src="/wwwroot/admin-web/vendor.js"></script><script src="/wwwroot/admin-web/main.js"></script></body>
</html>

The output I want is

<app-root></app-root>
<script src="/wwwroot/admin-web/runtime.js"></script><script src="/wwwroot/admin-web/file1.js" nomodule></script><script src="/wwwroot/admin-web/file2.js"></script><script src="/wwwroot/admin-web/styles.js"></script><script src="/wwwroot/admin-web/vendor.js"></script><script src="/wwwroot/admin-web/main.js"></script>

I am trying to achieve this using a regular expression.

findstr /R (?<=<body>)(.*)(?=</body>)  test.html

But this is not working in the command prompt, although the same regex works in JS.

Thanks in advance.

Dennis asked Mar 06 '26 15:03


1 Answer

First of all, findstr does not support all the regex features that modern regex engines offer, such as the lookarounds used in your pattern. Those are available in, for example, the ECMAScript 2018+ compliant JavaScript engines in Chrome and Node.js. So "this regex is working in js" does not mean the same pattern will work anywhere else. It certainly won't work in findstr.
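As a rough illustration of that limitation, the sketch below runs findstr from a PowerShell console on Windows against a small sample file. The file name findstr_demo.html and its contents are hypothetical stand-ins for the test.html in the question. It shows that even a pattern findstr does accept yields the whole matching line, not just the text between the tags.

```powershell
# Hypothetical sample file; stands in for test.html from the question.
$sample = Join-Path $env:TEMP 'findstr_demo.html'
Set-Content -Path $sample -Value '<html>', '<body><app-root></app-root></body>', '</html>'

# Literal search: findstr echoes the entire matching line, tags included.
findstr /C:"<body>" $sample

# Basic regex ('.' and '*' are supported) still returns the whole line,
# so findstr alone cannot isolate the text between the tags.
findstr /R /C:"<body>.*</body>" $sample
```

Both calls should print the full `<body>...</body>` line, which is why a different tool is needed to extract only the inner part.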

You may take the hard way and go on to study how to write a batch script for this. However, there is a much simpler way with other built-in Windows apps.

I strongly suggest PowerShell, as it gives you access to many of the features that .NET provides.

Here, open a PowerShell console and use

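# Input and output paths (fill in your own)
$pathToFile = 'c:\...\...\your_file.txt'
$output_file = 'c:\...\...\your_file_out.txt'
# (?s) lets . match line breaks; .*? stops at the first </body>
$rx = '(?s)(?<=<body>).*?(?=</body>)'
Get-Content $pathToFile -Raw | Select-String $rx -AllMatches | ForEach-Object { $_.Matches } | ForEach-Object { $_.Value } > $output_file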
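To see what the pattern does without touching any files, here is a minimal sketch that applies the same regex to an inline string. The sample HTML is a hypothetical, shortened stand-in for the file in the question, and it uses the .NET [regex]::Match method instead of Select-String, which is an alternative for the single-match case rather than a replacement for the command above.

```powershell
# Hypothetical sample input; stands in for the contents of the HTML file.
$html = @'
<html>
<body><app-root></app-root>
<script src="main.js"></script></body>
</html>
'@

# (?s) lets '.' match newlines; '.*?' is lazy, so it stops at the first </body>.
$rx = '(?s)(?<=<body>).*?(?=</body>)'

# [regex]::Match returns the first match; .Value holds the matched text.
$body = [regex]::Match($html, $rx).Value
$body
```

Note that the lookbehind expects a bare `<body>` tag; if the tag carried attributes, the pattern would need adjusting.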

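NOTE: For HTML, a DOM-based approach such as IE automation is more robust than a regex, because it keeps working when the body tag has attributes or different casing.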

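$output_file = 'c:\...\...\your_file_out.txt'
$url = 'http://your_site_here.tld/...'
# Start Internet Explorer through COM and load the page
$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Visible = $true
$ie.Navigate($url)

# Wait until the page has finished loading (ReadyState 4 = complete)
while ($ie.Busy -eq $true -or $ie.ReadyState -ne 4) { Start-Sleep 2 }

# Write the inner HTML of the first body element to the output file
$doc = $ie.Document
$tags = $doc.getElementsByTagName("body")
$tags[0].innerHTML > $output_file
# Close the browser window once done
$ie.Quit()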

Wiktor Stribiżew answered Mar 09 '26 07:03