Filter only uppercase words from file

I have an out.txt file containing around 1000 words that look like this:

SESSIONDAYOFWEEK
FILMTITLELONGALT
tblTrans_Ticket.
ADMITDETAILSALT2
MESSAGESTUB2ALT3
StartDayOfWeek
Description
MESSAGESTUB2ALT2
FILMTITLESHORTALT
Applications
TICKETTYPELONGALT

I need to filter that file so that only the words made up entirely of UPPER CASE characters remain, discarding any that contain lower case characters.

I ran this command in PowerShell:

Get-Content .\out.txt | ForEach-Object if ($_.IsUpper) {Write-Host $_}

and the shell goes through the words one by one and, for each word, prints:

ForEach-Object : Input name "if" cannot be resolved to a method.
At line:1 char:25
+ ... et-Content .\out.txt | ForEach-Object if ($_.IsUpper) {Write-Host $_}
+                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (TAIL:PSObject) [ForEach-Object], PSArgumentException
    + FullyQualifiedErrorId : MethodNotFound,Microsoft.PowerShell.Commands.ForEachObjectCommand

I don't understand where I went wrong.

Francesco Mantovani asked Dec 09 '25 17:12


2 Answers

Use the -cmatch operator for case-sensitive matching against a regex (regular expression):

Get-Content .\out.txt | Where-Object { $_ -cmatch '^\p{Lu}+$' }
  • -cmatch is the case-sensitive variant of the -match operator (whose alias is -imatch); given that -match is case-insensitive, -cmatch must be used in order to detect case distinctions.

  • \p{Lu} matches a single uppercase character - including accented non-ASCII characters such as Ü[1] - and adding + matches one or more in a row. Enclosing the expression in ^ (start of string) and $ (end of string) means that only lines entirely made up of uppercase characters are matched.

    • Ansgar Wiechers suggests -cnotmatch '\p{Ll}' instead, which works slightly differently: it would eliminate lines that contain at least one lowercase character, which means that lines are kept even if they (also) contain non-letter characters (as long as there's no lowercase letter).
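
To make the difference between the two patterns concrete, here is a small sketch on in-memory strings. The sample values are made up for illustration (one is built with [char]0x00DC so the script does not depend on file encoding). Note that a line containing digits, such as MESSAGESTUB2ALT3 from the question, passes only the second filter.

```powershell
# Hypothetical sample lines; the last one starts with U+00DC (uppercase U with diaeresis).
$samples = 'FILMTITLELONGALT', 'StartDayOfWeek', 'MESSAGESTUB2ALT3', "$([char]0x00DC)BER"

# Keep only lines made up entirely of uppercase letters.
$onlyUpper = $samples | Where-Object { $_ -cmatch '^\p{Lu}+$' }

# Keep lines that contain no lowercase letter (digits and punctuation are allowed).
$noLower = $samples | Where-Object { $_ -cnotmatch '\p{Ll}' }

$onlyUpper   # FILMTITLELONGALT and the U+00DC line
$noLower     # the same two lines plus MESSAGESTUB2ALT3
```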

An alternative with Select-String that may perform better:

Select-String -CaseSensitive '^\p{Lu}+$' .\out.txt | Select-Object -ExpandProperty Line

Select-String too is case-insensitive by default (as is PowerShell in general), so the -CaseSensitive switch is required here.

Note that, despite its name, Select-String as of PowerShell Core 6.1.0 doesn't support outputting the matched lines directly; instead, it outputs match-information objects whose .Line property contains the matched line, hence the need for Select-Object -ExpandProperty Line.
A GitHub issue proposed adding a new switch parameter to support direct output of the matched strings; PowerShell 7 and later provide a -Raw switch for this purpose.
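
As a rough, self-contained sketch of the Select-String approach, the snippet below writes a few sample lines to a temporary file and extracts the .Line property from the resulting match objects. The file name upper-demo.txt and the sample lines are assumptions made for the demonstration.

```powershell
# Write a few hypothetical sample lines to a temporary file.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'upper-demo.txt'
Set-Content -Path $path -Value 'SESSIONDAYOFWEEK', 'Description', 'FILMTITLESHORTALT'

# Select-String emits match-information objects; the matched text is in the Line property.
$matchInfo = Select-String -CaseSensitive -Pattern '^\p{Lu}+$' -Path $path
$lines = $matchInfo | Select-Object -ExpandProperty Line

$lines   # SESSIONDAYOFWEEK, FILMTITLESHORTALT

# Clean up the temporary file.
Remove-Item $path
```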


As for what you tried:

The code to be executed by the ForEach-Object cmdlet must be passed as a script block - i.e., a piece of code enclosed in { ... }.

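You neglected to do that, so ForEach-Object interpreted the bare word if as the name of a property or method to invoke on each input object, which caused the error you saw.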
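
For reference, a minimal sketch of how the original command might look with a script block, keeping ForEach-Object and using the -cmatch test in place of .IsUpper. The sample strings are assumptions that stand in for the lines of out.txt.

```powershell
# Hypothetical sample input standing in for: Get-Content .\out.txt
$words = 'SESSIONDAYOFWEEK', 'tblTrans_Ticket.', 'Applications', 'FILMTITLELONGALT'

# The if statement goes inside { ... }; a bare $_ emits the current line.
$result = $words | ForEach-Object { if ($_ -cmatch '^\p{Lu}+$') { $_ } }

$result   # SESSIONDAYOFWEEK, FILMTITLELONGALT
```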

Also, the [string] type (a .NET string) does not have an .IsUpper() method (and even if it did, you forgot the () after .IsUpper).

Only the [char] type has an .IsUpper() method, namely a static one, which you can call as follows: [char]::IsUpper('A') - but you'd have to call this method in a loop for each character in your input string:

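Get-Content .\out.txt | Where-Object {
  # Reject the line as soon as a character that is not an uppercase letter is found.
  foreach ($c in $_.ToCharArray()) { if (-not [char]::IsUpper($c)) { return $false } }
  # Every character passed the test, so keep the line.
  $true
}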

Finally, don't use Write-Host to return results: it prints to the console only, so you won't be able to capture or redirect such output[2]. Instead, use Write-Output or, better yet, rely on PowerShell's implicit output behavior: using $_ as a statement of its own outputs it, because any expression or command you neither capture nor redirect is automatically sent to the success output stream.
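
The difference can be seen by trying to capture the result of each approach in a variable. This is only an illustrative sketch; the variable names and sample strings are made up.

```powershell
# Write-Host prints to the console; nothing reaches the variable.
$viaHost = 'ABC', 'DEF' | ForEach-Object { Write-Host $_ }

# Implicit output goes to the success output stream and can be captured.
$viaOutput = 'ABC', 'DEF' | ForEach-Object { $_ }

$null -eq $viaHost    # True
$viaOutput.Count      # 2
```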


[1] By contrast, using character range expression [A-Z] would only recognize ASCII-range (English) uppercase characters.

[2] Never in PSv4-, but with additional effort you can in PSv5+ - but the point is that Write-Host is not meant for outputting results (data).

mklement0 answered Dec 12 '25 17:12


The easiest way to do this is probably with regex.

Get-Content .\out.txt | Where-Object { $_ -cmatch "\b[A-Z0-9_]+\b" }

Where-Object acts as a filter, allowing anything that matches through and discarding anything that doesn't match.

-cmatch performs a case-sensitive regex match.

Regex explanation:

+ (quantifier): matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

A-Z: a single character in the range between A (index 65) and Z (index 90)

0-9: a single character in the range between 0 (index 48) and 9 (index 57)

_: matches the character _ literally

\b: asserts position at a word boundary

You can remove 0-9 and _ if you don't want to allow words with those characters through the filter.
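
One thing to keep in mind is that \b only anchors at word boundaries, so a line holding several words passes if any one of them qualifies. The sketch below compares this pattern with a variant anchored by ^ and $; the anchored variant and the sample strings are assumptions added for illustration, not part of the answer above.

```powershell
# Hypothetical sample lines, including one with two words on the same line.
$samples = 'MESSAGESTUB2ALT2', 'StartDayOfWeek', 'FOO bar'

# \b...\b succeeds if any single word on the line qualifies.
$byWord = $samples | Where-Object { $_ -cmatch '\b[A-Z0-9_]+\b' }

# ^...$ requires the whole line to qualify.
$byLine = $samples | Where-Object { $_ -cmatch '^[A-Z0-9_]+$' }

$byWord   # MESSAGESTUB2ALT2, FOO bar
$byLine   # MESSAGESTUB2ALT2
```

For a file with exactly one word per line, as in the question, both patterns give the same result.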

See: https://regex101.com/r/CfgEmU/1

Jacob answered Dec 12 '25 17:12


