I'm trying to parse the class schedule provided by my university in order to import the information into some kind of calendar.
An example of the schedule can be seen here:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIA13-7.%20Block.html
The auto-generated HTML-content is, in my opinion, a mess and very hard to grasp. E.g. the tables are mainly built with rowspans and colspans (the positions of cells in the code compared with their actual visual position in the browser seem partially arbitrary).
What I've already tried:
Right now I don't think that conventional HTML parsing will get me far, at least not in an acceptable developing time. What I am looking for are other methods to fetch information from complex HTML tables, something like YQL, or maybe utilities that can normalize such tables with col-/rowspans. Because I don't have anything concrete in mind, I am mainly asking for some tips or hints for another approach.
Are there other, more suitable methods to parse such tables or am I stuck with conventional HTML parsing?
Edit:
On behalf of a request, I'll paste an example of raw code...
This week:
Results from this code:
http://pastebin.com/BJduUVtU
Edit 2:
Because of some parsing discussions I'll also add my PHP code. It's my first time with PHP so it's not very sophisticated. It should rather give an insight on how far I've come with parsing the tables in theory. The actual work happens in the function parseSkedTable(), please concentrate on this one. Also, I would like to point out the term "double courses" appearing in the comments, which describes two different courses happening at the same time (the class would be split in such moments). An example of these courses can be found here in week two:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIB14-4.%20Block.html
It looks like this:
The corresponding HTML-code of that week can also be accessed here:
http://pastebin.com/gLTWz5KU
And now the PHP-code (I had a hard time translating the comments since I already struggled expressing them in my first language...I hope they may still be helpful):
http://pastebin.com/Nzi8m2v8
Update
So far, there have been some solutions to my parsing problem, each of them using JavaScript. Since JavaScript (being especially powerful here because of the ability to use browser-rendered data) seems to be the only efficient way to retrieve reliable information from the HTML, I am now looking for a way to implement some kind of headless browser or rendering engine on my free server at x10hosting.com. Sadly, I am neither able to install software other than provided by softaculous nor allowed to use PHP's exec() command.
Any idea would be appreciated!
For the sake of completeness, I'll post both solutions, existing until now:
jQuery parser by Pierre Dubois:
(function ($) { $(document).ready(function() {
var _pe = window.pe || {
fn : {}
};
var tblNumber = 0; // Just a incremental number to identify the schedule item with the table
// For each table
$('table').each(function () {
$('#output').append('Parsing the table number: ' + tblNumber + '<br>');
// console.log('Parsing the table number: ' + tblNumber);
tblNumber += 1;
var currentTable = this;
// Parser the complex table
_pe.fn.parsertable.parse($(currentTable));
// Retrieve the parsed data
var parsedData = $(currentTable).data().tblparser;
//
// Information about the column structure, nice that is consistent
//
// Day: Cell index position (0 based)
// Mo: 3
// Di: 7
// Mi: 11
// Do: 15
// Fr: 19
// Sa: 23
// Title Location at Row index position "0"
// "i" represent the middle column position
for (var i = 3; i < 24; i += 4) {
var currentDay;
// Get the day
currentDay = $(parsedData.row[0].cell[i].elem).text();
$('#output').append(' Day: ' + currentDay + '<br>');
// console.log('Day: ' + currentDay);
// Get all the events for that day, excluding the first row and the last row
for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) {
// First column
if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) {
// Get the content of that cell and remove ending space
var event = $(parsedData.col[i - 1].cell[j].elem).text().trim();
if (event.length > 0) {
$('#output').append(' + Event: ' + event + '<br>');
// console.log('Event: ' + event);
}
}
// Second Column
if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid &&
parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) {
// Get the content of that cell and remove ending space
var event = $(parsedData.col[i].cell[j].elem).text().trim();
if (event.length > 0) {
$('#output').append(' + Event: ' + event + '<br>');
// console.log('Event: ' + event);
}
}
// Third Column
if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid &&
parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) {
// Get the content of that cell and remove ending space
var event = $(parsedData.col[i + 1].cell[j].elem).text().trim();
if (event.length > 0) {
$('#output').append(' + Event: ' + event + '<br>');
// console.log('Event: ' + event);
}
}
}
}
});
});
}(jQuery));
JS parser using positional information by me, realizing rambo coder's idea
You could make use of a browsers rendering/layout engine here.
Use http://phantomjs.org/ to get access to a headless browser that lets you execute javascript on a webpage's dom.
A dash of jquery would make the remaining pseudocode easy to implement:
foreach (td.t as dateElement) {
//parse date from element text
//use pixel position + dimensions to calc pixel coord of center
// save this center in a list along with the date
}
foreach (td.v as calendarEntryElement) {
//parse time + other stuff from element text
//use pixel position to find the closest date element in that list(it must be the closest one above)
}
I feel positional information would be very reliable here, because everything is a nested rectangle and its all done via tables.
You don't need to use phantomjs, you could just as easily execute a browser manually, and let it send a request to a local server to collect the results.
Some shell command roughly like
firefox file://foo123.html
Where you've appended some custom <script>
to the end of one of their webpages and saved it.
I study at the same university and a few weeks ago I faced the same problem to parse this time table and convert it to an ICS file. Finally I found my own solution and generalized the code, so that students from other universities, using the Sked software and have a much more complex time table, can import their time table too.
I also created a website, where students can sign up and configure the urls to the time tables which they want to subscribe. In the background runs a cronjob which ensures, that the subscribed calendars are always up to date.
You can find the result of the project on my website:
http://calendar.pineappledeveloper.com/
(it is only in German available).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With