
How do I split a large xml file?

Tags: windows, xml

We export “records” to an xml file; one of our customers has complained that the file is too big for their other system to process. Therefore I need to split up the file, while repeating the “header section” in each of the new files.

So I am looking for something that lets me define XPaths for the section(s) that should always be output, plus another XPath for the “rows”, with parameters that say how many rows to put in each file and how to name the files.

Before I start writing custom .NET code to do this: is there a standard command-line tool that works on Windows and does it?

(As I know how to program in C#, I am more inclined to write code than to mess about with complex XSL etc., but an "off the shelf" solution would be better than custom code.)

Ian Ringrose asked Feb 04 '23 00:02


2 Answers

First, download the foxe XML editor from http://www.firstobject.com/foxe242.zip

Then watch the video at http://www.firstobject.com/xml-splitter-script-video.htm, which explains how the split code works.

That page also contains the script (it starts with split()). Copy it, then in the XML editor choose "New Program" from the "File" menu, paste the code, and save it. The code is:

split()
{
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "50MB.xml", MDF_READFILE );
  int nObjectCount = 0, nFileCount = 0;
  while ( xmlInput.FindElem("//ACT") )
  {
    if ( nObjectCount == 0 )
    {
      ++nFileCount;
      xmlOutput.Open( "piece" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( "root" );
      xmlOutput.IntoElem();
    }
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == 5 )
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}

Change these values to suit your needs: the input file name ("50MB.xml"), the element path ("//ACT"), the output file name prefix ("piece"), the root element name ("root"), and the number of elements per file (5). (This is also explained on the video page.)

Right-click in the XML editor window and choose Run (or simply press F9). The output bar in that window shows the number of files that were generated.

Note: backslashes in file names must be doubled, so the input file name can be "C:\\Users\\AUser\\Desktop\\a_xml_file.xml" and the output file name "C:\\Users\\AUser\\Desktop\\anoutputfolder\\piece" + nFileCount + ".xml"
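
For readers who do not have foxe to hand, the same chunking logic can be sketched in Python using only the standard library. This is an illustration rather than a port of the script above: the function name split_xml and its parameters are made up for this example, it assumes the record elements are not nested inside one another, and, like the script, it wraps each chunk in a new root element without repeating any header section.

```python
import xml.etree.ElementTree as ET


def split_xml(input_path, record_tag, per_file, output_prefix, root_tag="root"):
    """Write every `per_file` elements named `record_tag` to a new file.

    Returns the number of files written. All names here are hypothetical;
    the function assumes record elements do not nest inside each other.
    """
    file_count = 0
    batch = []

    def flush():
        nonlocal file_count
        file_count += 1
        root = ET.Element(root_tag)
        root.extend(batch)
        ET.ElementTree(root).write(
            f"{output_prefix}{file_count}.xml",
            encoding="utf-8",
            xml_declaration=True,
        )
        # Empty the written records so their content can be garbage-collected.
        for elem in batch:
            elem.clear()
        batch.clear()

    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(input_path, events=("end",)):
        if elem.tag == record_tag:
            batch.append(elem)
            if len(batch) == per_file:
                flush()

    if batch:  # write the final, partly filled file
        flush()
    return file_count
```

Calling split_xml("50MB.xml", "ACT", 5, "piece") would correspond to the values used in the script above.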

ewroman answered Feb 07 '23 09:02


There's no general-purpose solution to this, because there are so many different ways that your source XML could be structured.

It's reasonably straightforward to build an XSLT transform that will output a slice of an XML document. For instance, given this XML:

<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>

you can output a copy of the file containing only data elements within a certain range with this XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="startPosition"/>
  <xsl:param name="endPosition"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="header">
    <xsl:copy>
      <xsl:apply-templates select="data"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="data">
    <xsl:if test="position() &gt;= $startPosition and position() &lt;= $endPosition">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

(Note, by the way, that because this is based on the identity transform, it works even if header isn't the top-level element.)

You still need to count the data elements in the source XML, and run the transform repeatedly with the values of $startPosition and $endPosition that are appropriate for the situation.
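
The counting step and the choice of parameter values can be sketched in Python as follows. This is only a rough illustration: the function names count_elements and position_ranges and the chunk size of 4 are invented for this example, and actually running the transform with each pair is left to whichever XSLT processor you use.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` while streaming through the document."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # discard content that has already been counted
    return count


def position_ranges(total, per_file):
    """Return 1-based, inclusive (startPosition, endPosition) pairs covering 1..total."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    return [
        (start, min(start + per_file - 1, total))
        for start in range(1, total + 1, per_file)
    ]


# The sample document from this answer, with a hypothetical chunk size of 4.
sample = b"""<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>"""

total = count_elements(io.BytesIO(sample), "data")
ranges = position_ranges(total, 4)
print(ranges)  # [(1, 4), (5, 6)]
```

Each pair would be passed as $startPosition and $endPosition to one run of the transform, producing one output file per pair.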

Robert Rossney answered Feb 07 '23 10:02