I am trying to parse the stackoverflow dump file (Posts.xml- 17gb) .It is of the form:
<posts>
<row Id="15228715" PostTypeId="1" />
.
<row Id="15228716" PostTypeId="2" ParentId="1600647" LastActivityDate="2013-03-05T16:13:24.897"/>
</posts>
I have to 'group' each question with their answers. Basically find a question (posttypeid=1) find its answers using parentId of another row and store it in db .
I tried doing this using querypath (DOM), but it kept exiting(139) . My guess is because of the large size of the file, my PC couldn't handle it, even with huge swap.
I considered xmlreader, but as I see it using xmlreader, the program would be reading through the file a whole lot of times(find question, look for answers, repeat a lot of times) and hence is not viable. Am I wrong ?
Is there any other method/way ?
Help!
It is a one time parsing.
I considered xmlreader, but as I see it using xmlreader, the program would be reading through the file a whole lot of times(find question, look for answers, repeat a lot of times) and hence is not viable. Am I wrong ?
Yes you are wrong. With XMLReader you specify your own how often your want to traverse the file (you normally do it once). For your case I see no reason why you should not be able to even insert this 1:1 on each <row>
element. You can decide per the attribute which database (table?) you would like to insert into.
I normally suggest a set of Iterators that make traversing with XMLReader easier. It's called XMLReaderIterator and allows to foreach
over the XMLReader
so that the code is often easier to read and write:
$reader = new XMLReader();
$reader->open($xmlFile);
/* @var $users XMLReaderNode[] - iterate over all <post><row> elements */
$posts = new XMLElementIterator($reader, 'row');
foreach ($posts as $post)
{
$isAnswerInsteadOfQuestion = (bool)$post->getAttribute('ParentId')
$importer = $isAnswerInsteadOfQuestion
? $importerAnswers
: $importerQuestions;
$importer->importRowNode($post);
}
If you are concerned about the order (e.g. you might fear that some answers parent's aren't available while the answers are), I would take care inside the importer layer, not inside the traversal.
Depending if that happens often, very often, never or quite never I would use a different strategy. E.g. for never I would insert directly into database tables with foreign key constraints activated. If often, I would create an insert transaction for the whole import in which the key constraints are lifted and re-activated at the end.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With