Unset array if preg_match does not match pattern?

I have a multidimensional array that looks like this:

Array
(
    [0] => Array
        (
            [0] => Title 1
            [1] => Some text ... US5801351017 ...
        )

    [1] => Array
        (
            [0] => Title 2
            [1] => Some text ... US0378331005 ...
        )

    [2] => Array
        (
            [0] => Title 3
            [1] => Some text ... //Note here that it does not contain an ISIN Code
        )
...

I am trying to keep only the arrays whose text matches my regex for an ISIN code, and drop the rest. The array above was produced by the following code:

$title = $html->find("h3.r a");
$titlearray = array_map(function($value){
    return trim($value->plaintext);
}, $title);

$description = $html->find("span.st");
$descriptionarray = array_map(function($value){
    $string = strip_tags($value);
    return $string;
}, $description);

$result1 = array();
foreach($titlearray as $key => $value) {
    $tmp = array($value);
    if (isset($descriptionarray[$key])) {
        $tmp[] = $descriptionarray[$key];
    }
    $result1[] = $tmp;
}

print_r($result1);

I have written some code that comes very close, but it does not actually unset the arrays that do not contain an ISIN code. The code I have is this:

$title = $html->find("h3.r a");
$titlearray = array_map(function($value){
    return trim($value->plaintext);
}, $title);

$description = $html->find("span.st");
$descriptionarray = array_map(function($value){
    $match = array();
    $string = strip_tags($value);
    $pattern = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";
    preg_match($pattern, $string, $match);
    return $match;
}, $description);

$merged = array();
$i=0;
foreach($descriptionarray as $value){
  $merged[$i] = $value;
  $merged[$i][] = $titlearray[$i];
  $i++;
}

print_r($merged);

which gives me these arrays:

Array
(
    [0] => Array
        (
            [0] => US5801351017
            [1] => Title 1
        )

    [1] => Array
        (
            [0] => US0378331005
            [1] => Title 2
        )

    [2] => Array
        (
            [0] => Title 3
        )
...

How can I get rid of the arrays that do not match my Regex? What I am looking for is this output:

Array
(
    [0] => Array
        (
            [0] => Title 1
            [1] => US5801351017
        )

    [1] => Array
        (
            [0] => Title 2
            [1] => US0378331005
        )
...
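
A minimal sketch of the desired filtering, not the asker's final code: it builds each output row only when preg_match() succeeds, so there is nothing to unset afterwards. The $rows sample data and the simplified $pattern (two capital letters followed by ten alphanumerics, with no country-prefix check) are assumptions made for illustration.

```php
<?php
// Hypothetical sample shaped like the $result1 array in the question.
$rows = [
    ['Title 1', 'Some text ... US5801351017 ...'],
    ['Title 2', 'Some text ... US0378331005 ...'],
    ['Title 3', 'Some text ...'],
];

// Simplified stand-in pattern (assumption): it checks the general shape
// of an ISIN but does not validate the two-letter country prefix.
$pattern = '/\b[A-Z]{2}[A-Z0-9]{10}\b/';

$filtered = [];
foreach ($rows as $row) {
    // Add a row only when its description contains a match;
    // rows without an ISIN are never added, so no unset() is needed.
    if (isset($row[1]) && preg_match($pattern, $row[1], $m)) {
        $filtered[] = [$row[0], $m[0]];
    }
}

print_r($filtered);
```

With the sample data above, only the first two rows survive, each paired with its matched code.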

EDIT

Following @CasimiretHippolyte's answer, I now have this code:

$titles = $html->find("h3.r a");

$descriptions = $html->find("span.st");

$ISIN_PATTERN = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";

$results = [];

foreach ($descriptions as $k => $v) {
    if (preg_match($ISIN_PATTERN, strip_tags($v), $m)) {
        $results[] = ['Title' => trim($titles[$k]->plaintext), 'ISIN' => $m[1]];
    }
}

print_r($results);

This narrows my array down to only the elements that match the regex, but it does not display the matches under 'ISIN' => $m[1]. It outputs this:

Array
(
    [0] => Array
        (
            [Title] => Title 1
            [ISIN] => 
        )

    [1] => Array
        (
            [Title] => Title 2
            [ISIN] => 
        )
...
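
A short sketch of why the ISIN slot comes back empty: preg_match() always stores the whole match in index 0, and index 1 exists only when the pattern contains a capturing group. The patterns here are simplified stand-ins (an assumption), not the full country list.

```php
<?php
$text = 'Some text ... US5801351017 ...';

// No parentheses in the pattern: only $m[0] (the whole match) is set.
preg_match('/[A-Z]{2}[A-Z0-9]{10}/', $text, $m);
var_dump(isset($m[1]));   // no capturing group, so index 1 is absent
echo $m[0], "\n";         // the whole match

// With a capturing group, $m2[1] holds the text matched by that group.
preg_match('/([A-Z]{2}[A-Z0-9]{10})/', $text, $m2);
echo $m2[1], "\n";
```

So a pattern without parentheses has to be read through $m[0], while a pattern that wraps the code in a capturing group can be read through $m[1].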

FURTHER EDIT

This code solves the issue:

$titles = $html->find("h3.r a");

$descriptions = $html->find("span.st");

$ISIN_PATTERN = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";

$results1 = [];

foreach ($descriptions as $k => $v) {
    if (preg_match($ISIN_PATTERN, strip_tags($v), $m)) {
        $results1[] = ['Title' => trim($titles[$k]->plaintext), 'ISIN' => $m[1]];
    }
}

$titlesarray = array_column($results1, 'Title');

$results2 = array_map(function($value){
    $match = array();
    $string = strip_tags($value);
    $pattern = "/[BE|BM|FR|BG|VE|DK|HR|DE|JP|HU|HK|JO|US|BR|XS|FI|GR|IS|RU|LB|"
            . "PT|NO|TW|UA|TR|LK|LV|LU|TH|NL|PK|PH|RO|EG|PL|AA|CH|CN|CL|EE|CA|"
            . "IR|IT|ZA|CZ|CY|AR|AU|AT|IN|CS|CR|IE|ID|ES|PE|TN|PA|SG|IL|US|MX|"
            . "SK|KRSI|KW|MY|MO|SE|GB|GG|KY|JE|VG|NG|SA|MU]{2}[A-Z0-9]{10}/";
    preg_match($pattern, $string, $match);
    return $match;
}, $descriptions);

$descriptionarray = array_column($results2, 0);

$result3 = array();
foreach($titlesarray as $key => $value) {
    $tmp = array($value);
    if (isset($descriptionarray[$key])) {
        $tmp[] = $descriptionarray[$key];
    }
    $result3[] = $tmp;
}

print_r($result3);

I scraped something together very fast as I needed a quick solution. This is highly inefficient given that I use an extra array_map(), flatten the arrays into simple arrays, and then join them back together. On top of that, I repeat my regex.


LAST EDIT

@CasimiretHippolyte's answer is the most efficient solution, and it works with either his pattern and $m[1] or my pattern and $m[0].
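
One caveat about the bracketed pattern, shown as a hedged sketch: square brackets define a character class, so [BE|BM|...]{2} accepts any two characters drawn from the listed letters and the pipe, rather than one of the listed two-letter prefixes. The shortened prefix list and the sample strings below are assumptions for illustration.

```php
<?php
// Shortened, hypothetical version of the bracketed prefix list.
// Inside [...] the letters and "|" are just individual characters.
$charClass = '/^[BE|BM|US]{2}[A-Z0-9]{10}$/';

// The same prefixes written as a grouped alternation.
$alternation = '/^(?:BE|BM|US)[A-Z0-9]{10}$/';

$candidates = ['US5801351017', 'SU5801351017', '||5801351017'];
foreach ($candidates as $c) {
    printf(
        "%s class=%d alternation=%d\n",
        $c,
        preg_match($charClass, $c),
        preg_match($alternation, $c)
    );
}
```

The character-class version accepts all three strings, while the alternation accepts only the one that starts with a listed prefix.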

Ava Barbilla asked Jun 29 '26 12:06


1 Answer

You can structure your code another way: use a simple foreach and build the result items one by one, only when an ISIN code is found:

$titles = $html->find("h3.r a");
$descriptions = $html->find("span.st");

define ('ISIN_PATTERN', '~
 \b  # there is probably a word boundary at the beginning of the ISIN code
 (?=([A-Z]{2}[A-Z0-9]{10})\b) # check the format before testing the whole alternation
                              # at the same time, the ISIN is captured in group 1
 (?: # so, this alternation is only here to make the pattern fail or succeed
     C[AHLNRSYZ]|I[DELNRST]|P[AEHKLT]|S[AEIGK]|A[ARTU]|B[EGMR]|L[BKUV]|M[OUXY]|T[HNRW]
     |E[EGS]|G[BGR]|H[KRU]|J[EOP]|K[RWY]|N[GLO]|D[EK]|F[IR]|R[OU]|U[AS]|V[EG]|XS|ZA
 )~x');

$results = [];

foreach ($descriptions as $k => $v) {
    if (preg_match(ISIN_PATTERN, strip_tags($v), $m))
        $results[] = [ 'ISIN' => $m[1], 'Title' => trim($titles[$k]->plaintext) ]; 
}

print_r($results);

Note: this code is not tested and can probably be improved. Several ideas:

  • stop using simplehtml and use DOMDocument and DOMXPath instead
  • the hand-written pattern assumes that all countries are equally likely. If that is not the case, rewrite it so that the most common countries are checked first
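
The first idea can be sketched roughly as follows. This is an untested-in-context illustration, not a drop-in replacement: the $html sample markup, the XPath expressions, and the simplified pattern are assumptions, and the real page structure may differ.

```php
<?php
// Hypothetical markup imitating the h3.r / span.st structure from the question.
$html = '<div><h3 class="r"><a href="#">Title 1</a></h3>'
      . '<span class="st">Some text ... US5801351017 ...</span></div>'
      . '<div><h3 class="r"><a href="#">Title 3</a></h3>'
      . '<span class="st">Some text ...</span></div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact class matches; a real page may need contains(@class, ...) instead.
$titles       = $xpath->query('//h3[@class="r"]/a');
$descriptions = $xpath->query('//span[@class="st"]');

$results = [];
for ($i = 0; $i < $descriptions->length; $i++) {
    $text = $descriptions->item($i)->textContent;
    // Simplified stand-in pattern (assumption): shape check only.
    if (preg_match('/\b[A-Z]{2}[A-Z0-9]{10}\b/', $text, $m) && $titles->item($i) !== null) {
        $results[] = [
            'ISIN'  => $m[0],
            'Title' => trim($titles->item($i)->textContent),
        ];
    }
}

print_r($results);
```

As in the code above, this pairs titles and descriptions by position, which only holds if every result has both elements.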
Casimir et Hippolyte answered Jul 01 '26 03:07