
Extract IBAN from text with Python

I want to extract IBAN numbers from text with Python. The challenge is that an IBAN can be written in so many ways, with spaces between the numbers, that I find it difficult to translate this into a useful regex pattern.

I have written a demo version which tries to match all German and Austrian IBAN numbers from text.

^DE([0-9a-zA-Z]\s?){20}$

I have seen similar questions on Stack Overflow. However, the combination of the different ways to write IBAN numbers and the need to extract them from text makes it very difficult to solve my problem.

Hope you can help me with that!

PParker asked Jan 15 '21 11:01



2 Answers

Country   ISO country code   Check digits   Bank no.   Account no.
Germany   2a                 2n             8n         10n
Austria   2a                 2n             5n         11n

Note: a - letters only, n - digits only

So the main difference is really the length in digits. That means you could try:

\b(?:DE(?:\s*\d){20}|AT(?:\s*\d){18})\b(?!\s*\d)

See the online demo.


  • \b - Word-boundary.
  • (?: - Open 1st non-capturing group.
    • DE - Match uppercase "DE" literally.
    • (?: - Open 2nd non-capturing group.
      • \s*\d - Zero or more whitespace characters up to a single digit.
      • ){20} - Close 2nd non-capturing group and match it 20 times.
    • | - Or:
    • AT - Match uppercase "AT" literally.
    • (?: - Open 3rd non-capturing group.
      • \s*\d - Zero or more whitespace characters up to a single digit.
      • ){18} - Close 3rd non-capturing group and match it 18 times.
    • ) - Close 1st non-capturing group.
  • \b - Word-boundary.
  • (?!\s*\d) - Negative lookahead to prevent any trailing digits.

It does show that your Austrian IBAN numbers are invalid. If you wish to extract up to the point where they would still be valid, I guess you can remove the trailing \b(?!\s*\d).
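
To use the pattern above from Python, something like the following should work. This is only a minimal sketch: the helper name extract_ibans and the sample text are made up for illustration, and the account numbers are placeholders rather than data from the question. Internal whitespace is stripped from each match so the results are easier to compare.

```python
import re

# Pattern from this answer: DE + 20 digits or AT + 18 digits,
# allowing optional whitespace before each digit.
IBAN_RE = re.compile(r"\b(?:DE(?:\s*\d){20}|AT(?:\s*\d){18})\b(?!\s*\d)")


def extract_ibans(text):
    """Return every match with its internal whitespace removed."""
    return [re.sub(r"\s+", "", m.group()) for m in IBAN_RE.finditer(text)]


# Hypothetical sample text, not taken from the question.
sample = (
    "Please pay to DE89 3704 0044 0532 0130 00 by Friday, "
    "or use AT611904300234573201 instead."
)
print(extract_ibans(sample))
# ['DE89370400440532013000', 'AT611904300234573201']

# A trailing extra digit makes the whole candidate fail the lookahead.
print(extract_ibans("DE89 3704 0044 0532 0130 00 1"))
# []
```

Note that this only checks the shape of the number (country code plus digit count), not whether the check digits are correct.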

JvdV answered Oct 04 '22 17:10


In general, to match German and Austrian IBAN codes, you can use

codes = re.findall(r'\b(DE(?:\s*[0-9]){20}|AT(?:\s*[0-9]){18})\b(?!\s*[0-9])', text)

Details:

  • \b - word boundary
  • (DE(?:\s*[0-9]){20}|AT(?:\s*[0-9]){18}) - Group 1: DE and 20 repetitions of a digit, each optionally preceded by any amount of whitespace, or AT and then 18 repetitions of a digit, each optionally preceded by any amount of whitespace
  • \b(?!\s*[0-9]) - word boundary that is NOT immediately followed with zero or more whitespaces and an ASCII digit.

See this regex demo.
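
As a quick illustration (the text value here is a made-up sample, not the data from the question): since the pattern has exactly one capturing group, re.findall returns the Group 1 strings, and these still contain whatever whitespace or line breaks appeared in the input. A sketch of cleaning them up afterwards:

```python
import re

# Hypothetical input; the first code is wrapped across a line break.
text = "IBAN: DE89 3704 0044\n0532 0130 00\nAT61 1904 3002 3457 3201."

codes = re.findall(
    r'\b(DE(?:\s*[0-9]){20}|AT(?:\s*[0-9]){18})\b(?!\s*[0-9])', text
)
print(codes)
# ['DE89 3704 0044\n0532 0130 00', 'AT61 1904 3002 3457 3201']

# str.split() with no arguments splits on any whitespace run,
# so joining the pieces removes spaces and newlines alike.
normalized = ["".join(code.split()) for code in codes]
print(normalized)
# ['DE89370400440532013000', 'AT611904300234573201']
```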

For the data you showed in the question that includes non-proper IBAN codes, you can use

\b(?:DE|AT)(?:\s?[0-9a-zA-Z]){18}(?:(?:\s?[0-9a-zA-Z]){2})?\b

See the regex demo. Details:

  • \b - word boundary
  • (?:DE|AT) - DE or AT
  • (?:\s?[0-9a-zA-Z]){18} - eighteen occurrences of an optional whitespace and then an alphanumeric char
  • (?:(?:\s?[0-9a-zA-Z]){2})? - an optional occurrence of two sequences of an optional whitespace and an alphanumeric char
  • \b - word boundary.
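
A small sketch of how this more lenient pattern behaves in Python. The function name find_candidates and the inputs are hypothetical, chosen only to show the 18-character and 20-character cases:

```python
import re

# Lenient pattern from this answer: DE or AT, then 18 alphanumerics
# (each optionally preceded by one whitespace), then optionally 2 more.
LENIENT_RE = re.compile(
    r"\b(?:DE|AT)(?:\s?[0-9a-zA-Z]){18}(?:(?:\s?[0-9a-zA-Z]){2})?\b"
)


def find_candidates(text):
    """Return the raw matched substrings, whitespace included."""
    return [m.group() for m in LENIENT_RE.finditer(text)]


print(find_candidates("Konto: DE89 3704 0044 0532 0130 00."))
# ['DE89 3704 0044 0532 0130 00']
print(find_candidates("Konto: AT61 1904 3002 3457 3201."))
# ['AT61 1904 3002 3457 3201']

# Caveat: letters are allowed, so a two-letter word directly after an
# 18-character code is absorbed by the optional trailing group.
print(find_candidates("AT611904300234573201 is due"))
# ['AT611904300234573201 is']
```

Because of that last case, it is probably worth post-filtering the candidates (for example, by length after removing whitespace) when the surrounding text is free-form prose.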
Wiktor Stribiżew answered Oct 04 '22 16:10