I want to extract email addresses from a large text file. What is the best way to do it?
My idea is to find each '@' in the text and then use a regex to look for an email address in a substring that starts (for example) 256 characters before that position and is 512 characters long.
P.S.: Put simply, I want to know the best and most efficient way to find a pattern (such as email addresses) in a huge text.
256 and 512 sound like arbitrary values.
The local-part of an e-mail address may be up to 64 characters long and the domain name may have a maximum of 255 characters.
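So 64 characters before the '@' and 255 characters after it would be better-founded window sizes.
Now combine both methods (the '@' search and the regex on that window) and voilà, you have your algorithm.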
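The combination described above could look roughly like the following Python sketch. The function name `find_emails_at_sign` and both patterns are assumptions made for illustration; the character sets are deliberately simple and do not cover everything the standard allows.

```python
import re

# Assumed, simplified character set for the local part (not a full RFC matcher).
LOCAL_TAIL = re.compile(r"[A-Za-z0-9.!#$%&*+/=?^_~-]{1,64}\Z")
# Assumed domain shape: dot-separated labels ending in an alphabetic TLD.
DOMAIN_HEAD = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")

def find_emails_at_sign(text):
    """Locate each '@', then run the regexes only on a small window around it."""
    found = []
    pos = text.find("@")
    while pos != -1:
        before = text[max(0, pos - 64):pos]   # at most 64 chars of local part
        after = text[pos + 1:pos + 1 + 255]   # at most 255 chars of domain
        local = LOCAL_TAIL.search(before)     # longest valid run ending at '@'
        domain = DOMAIN_HEAD.match(after)     # must start right after '@'
        if local and domain:
            found.append(local.group() + "@" + domain.group())
        pos = text.find("@", pos + 1)
    return found
```

Whether this beats a single regex pass over the whole text depends on the regex engine and the data, so it is worth measuring both on a sample of the real file.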
It depends on how many false positives and false negatives you can tolerate. Email addresses tend to be made up of letters, numbers, and certain symbols. However, while it is probably extremely rare to see characters outside that set in a real email address, the standard certainly allows them. So you really need to decide how many real addresses you are willing to miss and how many strings you are willing to accept that match your regular expression but are not actually email addresses.
Here's one pattern that excludes many valid cases and probably also matches too much:
[A-Za-z0-9!#$%&*+=?^_~-]{1,64}@[A-Za-z0-9.-]{1,255}\.[A-Za-z]{2,6}
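For a huge file, one possible approach is to compile the pattern once and scan the file line by line, so the whole text never has to be held in memory. The sketch below is in Python and uses a tidied, assumed variant of the pattern above; the names `iter_emails` and `emails_in_file` are hypothetical, and it assumes that an address never spans a line break.

```python
import re

# Assumed pattern for illustration only: simple local part, dot-separated
# domain labels, alphabetic TLD of 2 to 6 letters.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9!#$%&*+=?^_~.-]{1,64}"
    r"@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,6}"
)

def iter_emails(lines):
    """Yield every match found in an iterable of text lines."""
    for line in lines:
        for match in EMAIL_RE.finditer(line):
            yield match.group()

def emails_in_file(path, encoding="utf-8"):
    """Stream a file lazily; undecodable bytes are replaced, not fatal."""
    with open(path, encoding=encoding, errors="replace") as handle:
        yield from iter_emails(handle)
```

Because both functions are generators, results can be consumed one at a time, for example with `for address in emails_in_file("big.txt"): ...`, where the file name is a placeholder.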