
Horrible Performance Parsing XHTML File with Doctype as XML Document

Tags: java, xml, xhtml

When I parse this XHTML file as XML, the parse takes approximately 2 minutes, even though the file is very simple. If I remove the doctype declaration, it parses almost instantly. What is causing this file to take so long to parse?

Java Example

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );

XHTML Example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ex="http://www.example.com/schema/v1_0_0">
    <head><title>Test</title></head>
    <body>
        <h1>Test</h1>
        <p>Hello, World!</p>
        <p><ex:test>Text</ex:test></p>
    </body>
</html>

Thanks

Edit: Solution

Based on the explanation of why this was happening, I fixed the problem with these steps:

  1. Downloaded the DTD-related files to a src/main/resources folder
  2. Created a custom EntityResolver to read these files from the classpath
  3. Told my DocumentBuilder to use my new EntityResolver

I referenced this SO answer in doing so: how to validate XML using java?

New EntityResolver

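import java.io.IOException;
import java.io.InputStream;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class LocalXhtmlDtdEntityResolver implements EntityResolver {

    /* (non-Javadoc)
     * @see org.xml.sax.EntityResolver#resolveEntity(java.lang.String, java.lang.String)
     */
    @Override
    public InputSource resolveEntity( String publicId, String systemId )
            throws SAXException, IOException {
        // Look up the DTD/entity file on the classpath by its bare file name
        String fileName = systemId.substring( systemId.lastIndexOf( "/" ) + 1 );
        InputStream stream = getClass().getClassLoader().getResourceAsStream( fileName );
        if ( stream == null ) {
            // No local copy: returning null lets the parser resolve the system ID itself
            return null;
        }
        InputSource source = new InputSource( stream );
        // Keep the original system ID so relative references inside the DTD still resolve
        source.setSystemId( systemId );
        return source;
    }

}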

How to use new EntityResolver:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
bob.setEntityResolver( new LocalXhtmlDtdEntityResolver() );
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );
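
One way to check that a resolver really intercepts the DTD lookup, without needing the DTD files or a network connection, is to hand the parser a tiny in-memory stand-in and record what it asks for. This is only a sketch: the class name ResolverCheck, the SAMPLE document, and the one-line STUB_DTD are made up for the demo and are not the real XHTML DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ResolverCheck {

    // Hypothetical stand-in for the DTD: declares one entity and nothing else.
    static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    // Small test document that points at the W3C DTD and uses &nbsp;.
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>Test</title></head>"
            + "<body><p>a&nbsp;b</p></body></html>";

    // Parses xml, answering every external entity request with STUB_DTD and
    // appending each requested system ID to 'requested'.
    static Document parse( String xml, List<String> requested ) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware( true );
            DocumentBuilder bob = dbf.newDocumentBuilder();
            bob.setEntityResolver( ( publicId, systemId ) -> {
                requested.add( systemId );
                InputSource stub = new InputSource( new StringReader( STUB_DTD ) );
                stub.setSystemId( systemId );
                return stub;
            } );
            return bob.parse( new InputSource( new StringReader( xml ) ) );
        } catch ( Exception e ) {
            // Wrapped only to keep the demo easy to call
            throw new IllegalStateException( e );
        }
    }

    public static void main( String[] args ) {
        List<String> requested = new ArrayList<>();
        Document doc = parse( SAMPLE, requested );
        System.out.println( "Requested: " + requested );
        System.out.println( "Root element: " + doc.getDocumentElement().getLocalName() );
    }
}
```

If the recorded list contains the DTD's system ID and the parse returns quickly, the resolver is being used instead of a download from w3.org.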
Marshmellow1328 asked Jan 20 '26 07:01


1 Answer

Java is downloading the specified DTD and the files it includes. Even when the parser is not validating, it still loads the external DTD so that it can resolve entity references and apply default attribute values. Using Charles proxy, I recorded the following requests, with the time each took to load:

  • http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd - 30.4 sec
  • http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent - 30.19 sec
  • http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent - 30.23 sec
  • http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent - 30.20 sec
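
If the document uses no named entities from the DTD (the sample in the question does not), another option is to tell the parser not to load the external DTD at all. The sketch below assumes a Xerces-based JAXP implementation, such as the one bundled with the JDK, that recognises the load-external-dtd feature; other implementations may reject it. The class name SkipExternalDtd and the SAMPLE string are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SkipExternalDtd {

    // Xerces-specific feature name (assumption: the JAXP implementation in use supports it)
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Trimmed-down version of the XHTML file from the question
    static final String SAMPLE =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>Test</title></head>"
            + "<body><p>Hello, World!</p></body></html>";

    static Document parse( String xml ) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware( true );
            dbf.setValidating( false );
            // Do not fetch the DTD named in the DOCTYPE declaration
            dbf.setFeature( LOAD_EXTERNAL_DTD, false );
            DocumentBuilder bob = dbf.newDocumentBuilder();
            return bob.parse( new InputSource( new StringReader( xml ) ) );
        } catch ( Exception e ) {
            // Wrapped only to keep the demo easy to call
            throw new IllegalStateException( e );
        }
    }

    public static void main( String[] args ) {
        Document doc = parse( SAMPLE );
        System.out.println( "Root element: " + doc.getDocumentElement().getLocalName() );
    }
}
```

The trade-off is that named entities such as &nbsp; are declared in those DTD files, so they cannot be expanded when the DTD is skipped. Documents that rely on them are better served by a local EntityResolver.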
Charlie answered Jan 21 '26 20:01