Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing complex HTML tables

I'm trying to parse the class schedule provided by my university in order to import the information into some kind of calendar. An example of the schedule can be seen here:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIA13-7.%20Block.html

The auto-generated HTML-content is, in my opinion, a mess and very hard to grasp. E.g. the tables are mainly built with rowspans and colspans (the positions of cells in the code compared with their actual visual position in the browser seem partially arbitrary).

What I've already tried:

  1. Asking the university's administration office to provide a simpler, easier to read file separately. Of course this wasn't possible, after all it would mean one minute of additional effort.
  2. Researching the original tool used to generate the HTML. It is called "sked Stundenplan Software". I couldn't find any hints or tools to "reverse" the generation process.
  3. Looking for an existing solution, at which point I found some tools (e.g. http://code.google.com/p/skd-schedule-parser/) that do not work for my schedule. After studying the codes of these tools I concluded that they must have been designed for an other/outdated version of sked.
  4. Parsing the HTML with PHP (mostly using DOMDocument). That worked sometimes, but was way too unreliable...The exceptions to take into account seem indefinite.

Right now I don't think that conventional HTML parsing will get me far, at least not in an acceptable developing time. What I am looking for are other methods to fetch information from complex HTML tables, something like YQL, or maybe utilities that can normalize such tables with col-/rowspans. Because I don't have anything concrete in mind, I am mainly asking for some tips or hints for another approach.

Are there other, more suitable methods to parse such tables or am I stuck with conventional HTML parsing?

Edit:

On behalf of a request, I'll paste an example of raw code...

This week:
Schedule week

Results from this code:
http://pastebin.com/BJduUVtU

Edit 2:
Because of some parsing discussions I'll also add my PHP code. It's my first time with PHP so it's not very sophisticated. It should rather give an insight on how far I've come with parsing the tables in theory. The actual work happens in the function parseSkedTable(), please concentrate on this one. Also, I would like to point out the term "double courses" appearing in the comments, which describes two different courses happening at the same time (the class would be split in such moments). An example of these courses can be found here in week two:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIB14-4.%20Block.html

It looks like this: Double courses

The corresponding HTML-code of that week can also be accessed here:
http://pastebin.com/gLTWz5KU

And now the PHP-code (I had a hard time translating the comments since I already struggled expressing them in my first language...I hope they may still be helpful):
http://pastebin.com/Nzi8m2v8

Update

So far, there have been some solutions to my parsing problem, each of them using JavaScript. Since JavaScript (being especially powerful here because of the ability to use browser-rendered data) seems to be the only efficient way to retrieve reliable information from the HTML, I am now looking for a way to implement some kind of headless browser or rendering engine on my free server at x10hosting.com. Sadly, I am neither able to install software other than provided by softaculous nor allowed to use PHP's exec() command.
Any idea would be appreciated!

For the sake of completeness, I'll post both solutions, existing until now:

  1. jQuery parser by Pierre Dubois:

    (function ($) { $(document).ready(function() {

        var _pe = window.pe || {
            fn : {}
        };
    
        var tblNumber = 0; // Just a incremental number to identify the schedule item with the table
    
        // For each table
        $('table').each(function () {
    
            $('#output').append('Parsing the table number: ' + tblNumber + '<br>');
            // console.log('Parsing the table number: ' + tblNumber);
            tblNumber += 1;
    
            var currentTable = this;
    
    
            // Parser the complex table
            _pe.fn.parsertable.parse($(currentTable));
    
            // Retrieve the parsed data
            var parsedData = $(currentTable).data().tblparser;
    
            //
            // Information about the column structure, nice that is consistent
            //
    
            // Day: Cell index position (0 based)
            // Mo: 3
            // Di: 7
            // Mi: 11
            // Do: 15
            // Fr: 19
            // Sa: 23
    
            // Title Location at Row index position "0"
    
            // "i" represent the middle column position
            for (var i = 3; i < 24; i += 4) {
    
                var currentDay;
    
                // Get the day
                currentDay = $(parsedData.row[0].cell[i].elem).text();
    
                $('#output').append('  Day: ' + currentDay + '<br>');
                // console.log('Day: ' + currentDay);
    
                // Get all the events for that day, excluding the first row and the last row
                for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) {
    
                    // First column 
                    if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i - 1].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
    
                    // Second Column
                    if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid &&
                        parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
    
                    // Third Column
                    if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid &&
                        parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i + 1].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
                } 
    
            }
    
        });
    
    
    });
    

    }(jQuery));

  2. JS parser using positional information by me, realizing rambo coder's idea

like image 697
MCL Avatar asked Oct 20 '12 13:10

MCL


2 Answers

You could make use of a browsers rendering/layout engine here.

Use http://phantomjs.org/ to get access to a headless browser that lets you execute javascript on a webpage's dom.

A dash of jquery would make the remaining pseudocode easy to implement:

foreach (td.t as dateElement) {
    //parse date from element text
    //use pixel position + dimensions to calc pixel coord of center
    // save this center in a list along with the date
}

foreach (td.v as calendarEntryElement) {
    //parse time + other stuff from element text
    //use pixel position to find the closest date element in that list(it must be the closest one above)
}

I feel positional information would be very reliable here, because everything is a nested rectangle and its all done via tables.

You don't need to use phantomjs, you could just as easily execute a browser manually, and let it send a request to a local server to collect the results.

Some shell command roughly like

firefox file://foo123.html

Where you've appended some custom <script> to the end of one of their webpages and saved it.

like image 127
goat Avatar answered Oct 20 '22 05:10

goat



I study at the same university and a few weeks ago I faced the same problem to parse this time table and convert it to an ICS file. Finally I found my own solution and generalized the code, so that students from other universities, using the Sked software and have a much more complex time table, can import their time table too.
I also created a website, where students can sign up and configure the urls to the time tables which they want to subscribe. In the background runs a cronjob which ensures, that the subscribed calendars are always up to date. You can find the result of the project on my website:
http://calendar.pineappledeveloper.com/
(it is only in German available).

like image 27
Johannes Schuh Avatar answered Oct 20 '22 06:10

Johannes Schuh