OK, so I'm working on extracting CAS numbers from uploaded SDS files (starting with docx before I move on to pdf). I have successfully converted the docx to a string in the page, but I need to extract several substrings from it if they exist. Here is the code I'm using, and I don't think I'm using preg_match_all correctly at all.
$docObj = new DocxConversion($_FILES["sdsFile"]["tmp_name"]);
$docText = $docObj->convertToText();
preg_match_all("/[0-9]{2,7}-[0-9]{2}-[0-9]{1}/", $docText, $matches);
print_r($matches);
This gives me Array ( [0] => Array ( ) ). Not very helpful when I'm looking for the CAS numbers in the text.
The output of $docText is:
IDENTIFICATION PRODUCT IDENTIFIER USED ON LABEL: Finished Product Item Number Customer Item Number LABEL DESCRIPTION ACTUAL BRAND SM5802EE ECHO POWERBLEND X EXTENDED LIFE OIL ECHO SMGR33EC 6450005 ECHO POWER BLEND X ECHO SMGR01EC 6450025 ECHO POWER BLEND X ECHO SMGR07EC 6450002 ECHO POWER BLEND X ECHO SM5101EC X6972270101/99988800086 ECHO POWER BLEND X ECHO SM5905EC 6450250 ECHO BAR & CHAIN OIL ECHO SM5818ER 6450114 ECHO POWER BLEND X HIGH PERFORMANCE 2 STROKE ENGINE ECHO SM5818EG 6450103 ECHO POWER BLEND X ECHO SM5238EC 99988800088 ECHO POWER BLEND X ECHO SM5218EC X6972270201/99988800085 ECHO POWER BLEND X ECHO SMGR25EC X6974100202 ECHO POWER BLEND X ECHO SMGR02EC 6450001 ECHO POWER BLEND X ECHO SMGR29EC 6450000 ECHO POWERBLEND X ECHO SM5818EE 6450102 ECHO POWER BLEND X LOW SMOKE ECHO SM5818EC 6450100/6450099 ECHO POWER BLEND X ECHO SM5818EM 6450060 ECHO POWER BLEND X ECHO SMGR34EE ECHO POWERBLEND X ECHO SM5906EC 6450050 ECHO POWER BLEND X ECHO SM5906EM 6450062 ECHO POWER BLEND X ECHO SM5943EE 6450116 ECHO POWER BLEND X ECHO SMGR33EK 6450118 ECHO POWERBLEND X ECHO SMGR34ER 6450109 ECHO POWER BLEND X ECHO SM5926EC 6450006 ECHO POWERBLEND X XTENDED LIFE OIL ECHO SMGR34EE ECHO POWER BLEND X ECHO SMGR34EC 6450108 ECHO POWER BLEND X ECHO SMGR12EC 99988800089 ECHO POWER BLEND X ECHO SMGR34EK 6450119 ECHO POWERBLEND X ECHO SM5834EM 6450061 ECHO POWER BLEND X ECHO Finished Product Item Number Customer Item Number LABEL DESCRIPTION ACTUAL BRAND SMGR34EG 6450115 ECHO POWER BLEND X ECHO SM5955EC 6452750 ECHO POWER BLEND X ECHO RECOMMENDED USE OF THE CHEMICAL AND RESTRICTIONS ON USE; PETROLEUM LUBRICATING OIL NO OTHER USES RECOMMENDED NAME, ADDRESS, AND TELEPHONE NUMBER OF THE CHEMICAL MANUFACTURER, IMPORTER, OR OTHER RESPONSIBLE PARTY: 1.3.1. 
Spectrum Lubricants Corporation 500 Industrial Park Drive Selmer, TN 38375‐3276 United States of America Product Information MSDS Requests: (800) 264‐6457 or +17316454972 Technical Information: (800) 264‐6457 or +17316454972 General Information: [email protected] PHONE NUMBER: 1.4.1. Emergency Response North America: CHEMTREC (800) 424‐9300 after 5:00pm CST Or +17035273887 Health Emergency USA: (800) 264‐6457 or +17316454972 HAZARD(S) IDENTIFICATION CLASSIFICATION OF THE CHEMICAL IN ACCORDANCE WITH PARAGRAPH (d) of §1910.1200: Acute Inhalation Category 4 Eye Irritant Category 2 Skin Corrosion/Irritation Category 2 Flammable Liquid Category 4 Signal Word: Warning Symbol: Hazard Statements: Harmful if Inhaled Causes serious eye irritation Causes skin irritation Combustible Liquid Precautionary Statements: Prevention: Avoid breathing mist or spray. Use only outdoors or in a well‐ventilated area. Wear eye/face protection Wear protective gloves Keep away from heat, hot surfaces, sparks, open flames and other ignition sources. No smoking. Response: If inhaled: Remove person to fresh air and keep comfortable for breathing. If in eyes: Rinse cautiously with water for several minutes. Remove contact lenses, if present and easy to do. Continue rinsing. If eye irritation persists get medical advice/attention. If on skin: wash with plenty of water, if irritation or rash occurs get medical advice/attention. Take off contaminated clothing and wash it before reuse. Call a poison center/doctor if you feel unwell. In case of fire: Use water fog, foam, dry chemical or carbon dioxide (CO2) to extinguish flames. Storage: Store in well‐ventilated place. Disposal: Dispose of contents/container in accordance with local/regional/national/international regulations. 
Composition/ information on ingredients The chemical name and concentration (exact percentage) or concentration ranges of all ingredients which are classified as health hazards in accordance with paragraph (d) of §1910.1200 3.1.1. COMPONENTS CAS Number EU Number Concentration (%) Hazard Statements (see Section 16) Distillates (petroleum), hydrotreated light 64742‐47‐8 265‐149‐8 10‐30 H226, H304, H315, Solvent‐dewaxed heavy paraffinic distillates 64742‐65‐0 265‐169‐7 40‐50 H315, H332 Polyiosbutylene 9003‐29‐6 Not available 40‐70 H315, H319, H332 FIRST AID MEASURES
There's more, but I'll spare you...
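The numbers in your text are separated by a Unicode hyphen (it looks like U+2010) rather than the ASCII hyphen-minus your pattern expects, so you need to match other kinds of hyphens as well: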
~\d{2,7}\p{Pd}\d{2}\p{Pd}\d~u
See a demo on regex101.com.
~ # pattern delimiter
\d{2,7} # digits, 2-7 times
\p{Pd} # matches any kind of hyphen or dash (including unicode characters)
\d{2} # 2 digits
\p{Pd} # same as above
\d # one digit
~ # pattern delimiter
u # unicode flag (pattern modifier)
preg_match_all('~\d{2,7}\p{Pd}\d{2}\p{Pd}\d~u', $docText, $matches);
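To see the difference between the two patterns, here is a minimal sketch run against a short fragment of the text above. The `$sample` string is a hypothetical excerpt typed with U+2010 hyphens, on the assumption that this is the character the docx conversion produces. The final `preg_replace` step, which normalizes the dashes to plain ASCII hyphens, is an optional addition and not part of the original question. The `\u{2010}` escape needs PHP 7.0 or later.

```php
<?php
// Hypothetical excerpt of $docText; \u{2010} is the Unicode HYPHEN character
// (assumed to be what the converted document contains).
$sample = "Distillates (petroleum), hydrotreated light 64742\u{2010}47\u{2010}8 "
        . "265\u{2010}149\u{2010}8 10\u{2010}30 "
        . "Polyiosbutylene 9003\u{2010}29\u{2010}6 Not available";

// Pattern from the question: only matches the ASCII hyphen-minus, so it finds nothing here.
preg_match_all('/[0-9]{2,7}-[0-9]{2}-[0-9]{1}/', $sample, $asciiMatches);

// Unicode-aware pattern: \p{Pd} matches any dash punctuation, the u flag enables UTF-8 mode.
preg_match_all('~\d{2,7}\p{Pd}\d{2}\p{Pd}\d~u', $sample, $unicodeMatches);

// Optional: normalize every kind of dash to "-" so the results are easier to store or compare.
$casNumbers = preg_replace('~\p{Pd}~u', '-', $unicodeMatches[0]);

print_r($asciiMatches[0]); // empty array
print_r($casNumbers);      // 64742-47-8 and 9003-29-6
```

The EU number (265‐149‐8) and the concentration range (10‐30) are not picked up, because neither has exactly two digits between two dashes followed by a digit.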