Parsing/scanning through a 17 GB XML file

I am trying to parse the Stack Overflow dump file (Posts.xml, 17 GB). It has the form:

<posts>
<row Id="15228715" PostTypeId="1" />
.
<row Id="15228716" PostTypeId="2" ParentId="1600647" LastActivityDate="2013-03-05T16:13:24.897"/>
</posts>

I have to 'group' each question with its answers: find a question (PostTypeId=1), find its answers via the ParentId attribute of other rows, and store the result in a database.

I tried doing this with QueryPath (DOM), but it kept exiting with code 139. My guess is that the file is too large for my PC to handle, even with a huge swap.

I considered XMLReader, but as I see it, the program would have to read through the file many times (find a question, look for its answers, repeat), which is not viable. Am I wrong?

Is there any other way?

Help!

It is a one-time parse.

gyaani_guy asked Feb 16 '23 09:02


1 Answer

I considered XMLReader, but as I see it, the program would have to read through the file many times (find a question, look for its answers, repeat), which is not viable. Am I wrong?

Yes, you are wrong. With XMLReader you decide yourself how often you traverse the file (normally once). In your case I see no reason why you could not insert each <row> element 1:1 as you encounter it. You can decide from its attributes which database (or table) to insert into.
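To illustrate the single pass, here is a minimal sketch that uses only PHP's built-in XMLReader. The function name splitRows, the two-row sample file and the in-memory $answersByParent array are assumptions made for this example; for the real 17 GB dump you would hand each row to an importer instead of collecting IDs in an array.

```php
<?php
// Sketch: read the file once and route each <row> by its ParentId attribute.
function splitRows(string $xmlFile): array
{
    $reader = new XMLReader();
    if (!$reader->open($xmlFile)) {
        throw new RuntimeException("Cannot open $xmlFile");
    }

    $counts = ['questions' => 0, 'answers' => 0];
    $answersByParent = []; // question Id => list of answer Ids

    // read() moves forward node by node; the file is never loaded as a whole.
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT || $reader->name !== 'row') {
            continue;
        }
        $parentId = $reader->getAttribute('ParentId'); // null when absent
        if ($parentId === null) {
            $counts['questions']++;
        } else {
            $counts['answers']++;
            $answersByParent[$parentId][] = $reader->getAttribute('Id');
        }
    }
    $reader->close();

    return [$counts, $answersByParent];
}

// Hypothetical two-row sample in the shape of Posts.xml.
$sample = <<<XML
<posts>
<row Id="1600647" PostTypeId="1" />
<row Id="15228716" PostTypeId="2" ParentId="1600647" />
</posts>
XML;

$path = tempnam(sys_get_temp_dir(), 'posts');
file_put_contents($path, $sample);
[$counts, $answersByParent] = splitRows($path);
unlink($path);

echo json_encode($counts), "\n";
```

Because the reader only moves forward, memory use stays roughly constant regardless of file size; only what you choose to keep (here the answer IDs) accumulates.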

I normally suggest a set of iterators that make traversing with XMLReader easier. The library is called XMLReaderIterator and lets you foreach over the XMLReader, so the code is often easier to read and write:

$reader = new XMLReader();
$reader->open($xmlFile);

/* @var $posts XMLReaderNode[] - iterate over all <posts><row> elements */
$posts = new XMLElementIterator($reader, 'row');
foreach ($posts as $post)
{
    $isAnswerInsteadOfQuestion = (bool)$post->getAttribute('ParentId');

    $importer = $isAnswerInsteadOfQuestion 
                ? $importerAnswers 
                : $importerQuestions;

    $importer->importRowNode($post);
}

If you are concerned about the order (e.g. you fear that some answers appear before their parent questions are available), I would handle that inside the importer layer, not inside the traversal.

Which strategy I would use depends on whether that happens often, very often, never, or almost never. If never, I would insert directly into database tables with foreign key constraints activated. If often, I would wrap the whole import in one insert transaction in which the key constraints are lifted and re-activated at the end.
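The transaction strategy could be sketched roughly as follows. This is only an illustration under assumptions: it uses an in-memory SQLite database through PDO (requires the pdo_sqlite extension) and SQLite's deferred foreign key checking as a stand-in for "lifting" the constraints, and the table and column names are made up. On MySQL you would more likely toggle the foreign_key_checks session variable around the import.

```php
<?php
// Sketch: import rows in arbitrary order inside one transaction,
// with foreign key checks postponed until COMMIT.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('PRAGMA foreign_keys = ON'); // SQLite enforces FKs only when enabled

// Hypothetical schema: every answer must reference an existing question.
$db->exec('CREATE TABLE questions (id INTEGER PRIMARY KEY)');
$db->exec('CREATE TABLE answers (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL REFERENCES questions(id)
)');

// Sample rows in which the answer arrives before its question.
$rows = [
    ['Id' => 15228716, 'ParentId' => 1600647],
    ['Id' => 1600647,  'ParentId' => null],
];

$db->beginTransaction();
// Defer FK enforcement until the transaction commits.
$db->exec('PRAGMA defer_foreign_keys = ON');

$insertQuestion = $db->prepare('INSERT INTO questions (id) VALUES (?)');
$insertAnswer   = $db->prepare('INSERT INTO answers (id, parent_id) VALUES (?, ?)');

foreach ($rows as $row) {
    if ($row['ParentId'] === null) {
        $insertQuestion->execute([$row['Id']]);
    } else {
        $insertAnswer->execute([$row['Id'], $row['ParentId']]);
    }
}

// The commit succeeds only if every answer now has its question.
$db->commit();

$answerCount = (int) $db->query('SELECT COUNT(*) FROM answers')->fetchColumn();
```

If an answer's question never shows up, the commit fails, so orphaned rows are detected at the end of the import rather than on each insert.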

hakre answered Feb 24 '23 17:02