Noob here.
I'm trying to pare down a list of domains by eliminating all subdomains if the parent domain is present in the list. I've managed to cobble together a script that somewhat does this with PowerShell after some searching and reading. The output is not exactly what I want, but will work OK. The problem with my solution is that it takes so long to run because of the size of my initial list (tens of thousands of entries).
UPDATE: I've updated my example to clarify my question.
Example "parent.txt" list:
adk2.co
adk2.com
adobe.com
helpx.adobe.com
manage.com
list-manage.com
graph.facebook.com
Example output "repeats.txt" file:
adk2.com (different top level domain than adk2.co but that's ok)
helpx.adobe.com
list-manage.com (not subdomain of manage.com but that's ok)
I would then take and eliminate the repeats from the parent, leaving a list of "unique" subdomains and domains. I have this in a separate script.
Example final list with my current script:
adk2.co
adobe.com
manage.com
graph.facebook.com (it's not facebook.com because facebook.com wasn't in the original list.)
Ideal final list:
adk2.co
adk2.com (since adk2.co and adk2.com are actually distinct domains)
adobe.com
manage.com
graph.facebook.com
Below is my code:
I've taken my hosts list (parent.txt) and checked it against itself, and spit out any matches into a new file.
$parent = Get-Content("parent.txt")
$hosts = Get-Content("parent.txt")
$repeats =@()
$out_file = "$PSScriptRoot\repeats.txt"
$hosts | where {
$found = $FALSE
foreach($domains in $parent){
if($_.Contains($domains) -and $_ -ne $domains){
$found = $TRUE
$repeats += $_
}
if($found -eq $TRUE){
break
}
}
$found
}
$repeats = $repeats -join "`n"
[System.IO.File]::WriteAllText($out_file,$repeats)
This seems like a really inefficient way to do it since I'm going through each element of the array. Any suggestions on how to best optimize this? I have some ideas like putting more conditions on what elements to check and check against, but I feel like there's a drastically different approach that would be far better.
First, a solution based strictly on shared domain names (e.g., helpx.adobe.com and adobe.com are considered to belong to the same domain, but list-manage.com and manage.com are not).
This is not what you asked for, but perhaps more useful to future readers:
Get-Content parent.txt | Sort-Object -Unique { ($_ -split '\.')[-2,-1] -join '.' }
Assuming list.manage.com rather than list-manage.com in your sample input, the above command yields:
adk2.co
adk2.com
adobe.com
graph.facebook.com
manage.com
{ ($_ -split '\.')[-2,-1] -join '.' } sorts the input lines by the last 2 domain components (e.g., adobe.com):
-Unique discards duplicates.
A shared-suffix solution, as requested:
# Helper function for (naively) reversing a string.
# Note: Does not work properly with Unicode combining characters
# and surrogate pairs.
function reverse($str) { $a = $str.ToCharArray(); [Array]::Reverse($a); -join $a }
# * Sort the reversed input lines, which effectively groups them by shared suffix
# with the shortest entry first (e.g., the reverse of 'manage.com' before the
# reverse of 'list-manage.com').
# * It is then sufficient to output only the first entry in each group, using
# wildcard matching with -notlike to determine group boundaries.
# * Finally, sort the re-reversed results.
Get-Content parent.txt | ForEach-Object { reverse $_ } | Sort-Object |
ForEach-Object { $prev = $null } {
if ($null -eq $prev -or $_ -notlike "$prev*" ) {
reverse $_
$prev = $_
}
} | Sort-Object
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With