
How to create a regex for parsing Arabic Dates

I'm working on a program that runs a series of regexes to find a date within the DOM of a webpage. For example, in www.engadget.com/2010/07/19/windows-phone-7-in-depth-preview/, I would match "Jul 19th 2010" with my regex. Things were going fine across multiple formats and languages until I hit an Arabic webpage. As an example, consider http://islammaktoob.maktoobblog.com/. The date July 18, 2010 appears in Arabic at the top of the post, but I can't figure out how to match it. Does anyone have experience matching Arabic dates? If someone could post an example or the regex they would use to match that Arabic date, it would be very helpful. Thank you!

Update:

Getting closer:

    String fromTheSite = "كتبها اسلام مكتوب ، في 18 تموز 2010 الساعة: 09:42 ص";
    NamedMatcher infoMatcher = NamedPattern.compile("(?<Day>[0-3]?[0-9]) (?<Month>يناير|فبراير|مارس|أبريل|إبريل|مايو|يونيو|يونيه|يوليو|يوليه|أغسطس|سبتمبر|أكتوبر|نوفمبر|ديسمبر|كانون الثاني|شباط|آذار|نيسان|أيار|حزيران|تموز|آب|أيلول|تشرين الأول|تشرين الثاني|كانون الأول) (?<Year>[1-2][0-9][0-9][0-9]) ", Pattern.CANON_EQ).matcher(fromTheSite);
    while(infoMatcher.find()){
        System.out.println(infoMatcher.group());
        System.out.println(infoMatcher.group("Day"));
        System.out.println(infoMatcher.group("Month"));
        System.out.println(infoMatcher.group("Year"));
    }

Gives me

18 تموز 2010
18
تموز
2010

Why does the match appear out of order?

chsbellboy asked Jul 19 '10 20:07


1 Answer

If you look at the raw bytes of the text you copied, you can see that the sentence is stored in reading order: the first letter of the sentence, which is displayed at the far right, comes first in the file.
The renderer reorders the text for display so that it appears written from right to left (this is also what causes the strange selection behavior).
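
To make the stored order visible, here is a minimal Java sketch that inspects the date part of the string from the question. The class name `LogicalOrderDemo` and the helper `firstRtlIndex` are made up for illustration, and the Arabic month تموز is written with `\u` escapes so the source file stays plain ASCII.

```java
class LogicalOrderDemo {
    // Index of the first character whose Unicode directionality is
    // right-to-left (R or Arabic AL), or -1 if there is none.
    static int firstRtlIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            byte d = Character.getDirectionality(s.charAt(i));
            if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "18 <month> 2010" with the month spelled in Arabic letters.
        String date = "18 \u062A\u0645\u0648\u0632 2010";
        System.out.println(date.charAt(0));        // prints 1
        System.out.println(firstRtlIndex(date));   // prints 3
        System.out.println(date.indexOf("2010"));  // prints 8
    }
}
```

In memory the day comes first, the month second and the year last, no matter how a browser or console lays the line out.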

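Therefore you have to read the displayed line from right to left to see the order in which the regex actually matches it.
Additionally, note that numbers aren't reversed: the digits within a number keep their left-to-right order.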
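
As a rough sketch of what that means for the regex: a pattern written in stored order (day, then month, then year) finds the groups in that order. This example swaps the `NamedPattern` library from the question for the named groups built into `java.util.regex` (available since Java 7), which is my substitution. The class name, the `parse` helper and the shortened list of three months (حزيران, تموز, آب) are assumptions for illustration, and the `CANON_EQ` flag is left out for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArabicDateRegexDemo {
    // Shortened month list (hypothetical subset): June, July, August
    // in their Levantine names, written as \u escapes in stored order.
    private static final String MONTHS =
        "\u062D\u0632\u064A\u0631\u0627\u0646"   // haziran (June)
        + "|\u062A\u0645\u0648\u0632"            // tammuz (July)
        + "|\u0622\u0628";                       // ab (August)

    private static final Pattern DATE = Pattern.compile(
        "(?<Day>[0-3]?[0-9]) (?<Month>" + MONTHS + ") (?<Year>[12][0-9]{3})");

    // Returns {day, month, year} for the first date found, or null.
    static String[] parse(String text) {
        Matcher m = DATE.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("Day"), m.group("Month"), m.group("Year") };
    }

    public static void main(String[] args) {
        // Fragment of the sentence from the question, in stored order.
        String text = "\u0641\u064A 18 \u062A\u0645\u0648\u0632 2010 "
                    + "\u0627\u0644\u0633\u0627\u0639\u0629";
        String[] date = parse(text);
        System.out.println(date[0]);  // prints 18
        System.out.println(date[1]);  // the Arabic month name
        System.out.println(date[2]);  // prints 2010
    }
}
```

The groups come back as day, month, year. Only the full match looks shuffled when it is printed, because the display layer reorders the Arabic run.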

Example:

If what you see on screen is "txet emos 20 yluJ 2016 srahc modnar",
it is stored in the file as "random chars 2016 July 20 some text".
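
The example can be reproduced with a toy model in Java: reverse the whole string, then put each run of digits back in left-to-right order. This is only a simplified sketch of the idea, not the real Unicode Bidirectional Algorithm, and the names `VisualOrderDemo` and `flip` are made up for illustration.

```java
class VisualOrderDemo {
    // Toy model: reverse every character, then re-reverse each run of
    // digits so that numbers still read left to right.
    static String flip(String s) {
        StringBuilder out = new StringBuilder(s).reverse();
        int i = 0;
        while (i < out.length()) {
            if (Character.isDigit(out.charAt(i))) {
                int j = i;
                while (j < out.length() && Character.isDigit(out.charAt(j))) {
                    j++;
                }
                // Restore the digit run occupying [i, j).
                String run = new StringBuilder(out.substring(i, j)).reverse().toString();
                out.replace(i, j, run);
                i = j;
            } else {
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: random chars 2016 July 20 some text
        System.out.println(flip("txet emos 20 yluJ 2016 srahc modnar"));
    }
}
```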

Snow bunting answered Sep 18 '22 15:09