Monthly Archive for December, 2009

DTD calling home

Facing the entrance of the office, we have a monitor displaying several development-related indicators, such as recent build results and the latest commit statistics. Walking into the office one morning, I noticed that the statuses of several of our overnight unit tests had turned crimson red. On closer inspection, it became clear that all the failed tests were related to parsing XML configuration files for Struts.

Going through the logs showed each test case was throwing the same exception:

java.io.FileNotFoundException: http://struts.apache.org/dtds/struts-config_1_1.dtd

The exception originated inside a SAX parser that was parsing several different struts-config.xml files for those test cases. A simple text search revealed that the Struts configuration files all started with the following preamble:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE struts-config PUBLIC
   "-//Apache Software Foundation//DTD Struts Configuration 1.1//EN"
   "http://jakarta.apache.org/struts/dtds/struts-config_1_1.dtd">

That DOCTYPE declaration references an external DTD, given by the URL pointing to the Apache web site, which the SAX parser tries to read as well when it hits the declaration: it opens an HTTP connection to fetch the file. That operation failed, causing the test cases to fail as well. This raises two questions: why was the parser unable to connect to the above URL (which names a valid server and document), and why did it try to do so in the first place?

The first question was quickly dismissed, as the failure probably resulted from some network glitch which could not be reproduced, but the second one was rather alarming. The software tested by the unit tests is supposed to run in a rather restricted environment and should not establish any outbound network connections that are not explicitly requested by the user. So how come the Java SAXParser started to connect to remote web sites? The answer is simple: That’s what it’s supposed to do.

Document Type Definitions

XML has become the major data exchange format and even though shockingly many applications don’t seem to care, the provider and consumer of the XML should both have a clear understanding of how the document they create or parse should be structured. This was originally achieved through a document type definition (DTD), which has been replaced with the more flexible XML schema language in recent years.

A DTD can appear inside the XML document it defines or in an external file, which is referenced through a document type declaration (DOCTYPE) as seen above. In both cases the DTD can contain the following information:

  • Element declarations: These define the names of elements (tags) that can appear in the document, the content that can appear inside an element (text, nothing, other elements) and what kind of attributes an element can have (see next item)
  • Attribute list declarations: These define groups of attributes that can be assigned to elements, including what content each attribute can contain, whether it’s optional or required and what default values are assigned
  • Entity declarations: Entities can be thought of as placeholders that should be replaced with other text. The most commonly known entities are the special character entities in HTML, e.g., &lt; which is replaced by <, but you can also define custom ones, for example that &company; should be replaced with Snake Oil International, Inc.
  • Notation declarations: These define the formats of non-XML data, mostly using MIME types, e.g., for images or other binary data that is referenced in the document and should be handled by the parsing application

To parse the XML data correctly and present the document content to the application reading the XML, the parser needs to take the DTD into consideration. If the DTD defines that there is an implicit attribute on an element with a default value, that attribute should be reported to the application when encountering that type of element, even though the attribute might not exist in the document on that element. In the same way, entities should be expanded and replaced with the text they represent, as the application wouldn’t know what text to substitute for the entity.
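To make the effect of DTD processing concrete, here is a minimal sketch (not taken from our code base) that parses a small document carrying its DTD inline. The class name DtdDefaultsDemo, the greeting element and its lang attribute are made up for illustration; the company entity is the one from the list above. It assumes the JDK's bundled JAXP parser with default settings, and it never touches the network because the DTD is internal.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DtdDefaultsDemo {

    // Hypothetical document with an internal DTD subset: one attribute with a
    // default value and one custom entity.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE greeting [\n"
        + "  <!ELEMENT greeting (#PCDATA)>\n"
        + "  <!ATTLIST greeting lang CDATA \"en\">\n"
        + "  <!ENTITY company \"Snake Oil International, Inc.\">\n"
        + "]>\n"
        + "<greeting>Welcome to &company;</greeting>";

    /** Parses XML and returns what the handler was told, as "lang|text". */
    static String parse() throws Exception {
        final StringBuilder lang = new StringBuilder();
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The document never writes lang="..."; the value can only
                // come from the ATTLIST default in the DTD.
                String value = attrs.getValue("lang");
                if (value != null) {
                    lang.append(value);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times; the entity arrives expanded.
                text.append(ch, start, length);
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(XML)), handler);
        return lang + "|" + text;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse());
    }
}
```

With the bundled parser this should print en|Welcome to Snake Oil International, Inc., even though the greeting element itself contains neither the lang attribute nor the expanded company name.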

Thus, the XML parser should process an inline DTD, or follow the reference to an external one, whenever it encounters either. This is also the default behavior of many SAX parsers, including Xerces, which is used by our test cases. However, in some cases the application might not care about validating the XML, or should at least not try to read external entities or document type definitions.

Configuring the XML Parser

Besides loading external DTDs, there are many other aspects of an XML parser that you might want to configure: Should it validate the document as it parses it, or ignore whether the content conforms? Are XML namespaces important to you? If you do want to load external schema definitions, how should their URLs be resolved? The Java XML APIs call these settings “features”. Each feature is identified by a URI, and there are several standard features defined for SAX-compliant parsers. Specific parser implementations extend that standard set; the feature that turns off external DTD loading is part of the Xerces-specific feature list and is called


http://apache.org/xml/features/nonvalidating/load-external-dtd

To deactivate this feature, you can tell the SAXParserFactory directly:

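import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
// Every parser created by this factory will skip external DTDs
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
SAXParser parser = factory.newSAXParser();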

Alternatively, you can set it on the XMLReader after the parser has been created:

// Only the reader behind this one parser instance is affected
SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
parser.getXMLReader().setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

These settings can also be applied to other XML-related classes, such as DocumentBuilderFactory, as this stackoverflow.com post describes.
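For the DOM case, a sketch along the following lines should work with the JDK's bundled parser. The class name OfflineDomParsing, the sample document and its deliberately unresolvable DTD URL (the .invalid top-level domain is reserved and never resolves) are assumptions made for illustration, not part of our code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class OfflineDomParsing {

    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Hypothetical document whose DOCTYPE points at a host that cannot exist,
    // so any attempt to fetch the DTD would fail.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE struts-config PUBLIC\n"
        + "  \"-//Example//DTD Unreachable//EN\"\n"
        + "  \"http://dtd.example.invalid/unreachable.dtd\">\n"
        + "<struts-config/>";

    /** Builds a DOM without loading the external DTD; returns the root name. */
    static String rootName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Same Xerces feature as in the SAX examples, set on the DOM factory
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootName(XML));
    }
}
```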

These examples omit exception handling and similar details, but they should make clear where and how to apply the settings.
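For completeness, here is one way the factory variant might look with the exceptions handled. This is a sketch under assumptions: the class name OfflineSaxParsing, the method parsesOffline and the sample document are hypothetical, and a real application would probably log or rethrow rather than print to System.err.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class OfflineSaxParsing {

    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Hypothetical document; the .invalid domain is reserved and never
    // resolves, so fetching this DTD would always fail.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE struts-config PUBLIC\n"
        + "  \"-//Example//DTD Unreachable//EN\"\n"
        + "  \"http://dtd.example.invalid/unreachable.dtd\">\n"
        + "<struts-config/>";

    /** Returns true if the document parses without the external DTD. */
    static boolean parsesOffline(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)),
                         new DefaultHandler());
            return true;
        } catch (ParserConfigurationException e) {
            // The factory could not create a parser with this configuration
            System.err.println("Parser configuration problem: " + e.getMessage());
            return false;
        } catch (SAXException | IOException e) {
            // Unrecognized feature, malformed XML, or an I/O failure
            System.err.println("Parsing failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesOffline(XML));
    }
}
```

Because the DTD URL in the sample can never be fetched, a successful parse indicates that the parser did not try to load it.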

Disabling the feature fixed our test cases, and we reviewed all other places where we handle XML to see whether other features needed adjusting as well. As mentioned above, deliberately ignoring DTDs or XML schemas may drop important implicit data from the XML document, so you should always check whether a failure to resolve the document definitions is just an annoyance to be eliminated or a serious problem that needs to be addressed.
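If the DTD does carry data you need, one alternative to switching it off (not the route we took) is to keep DTD processing enabled and hand the parser a local copy through an org.xml.sax.EntityResolver. The sketch below assumes the JDK's bundled parser; the class name LocalDtdResolverDemo, the public identifier and the inline DTD text are all hypothetical stand-ins for a DTD you would bundle with your application.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LocalDtdResolverDemo {

    // Hypothetical public identifier and DTD content
    static final String PUBLIC_ID = "-//Example//DTD Greeting 1.0//EN";
    static final String LOCAL_DTD =
        "<!ELEMENT greeting (#PCDATA)>\n"
        + "<!ATTLIST greeting lang CDATA \"en\">\n";

    // The system identifier uses the reserved .invalid domain on purpose
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE greeting PUBLIC \"" + PUBLIC_ID + "\"\n"
        + "  \"http://dtd.example.invalid/greeting.dtd\">\n"
        + "<greeting>hello</greeting>";

    /** Parses XML with a local DTD copy; returns the reported lang value. */
    static String parseLang() throws Exception {
        final StringBuilder lang = new StringBuilder();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        reader.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                if (PUBLIC_ID.equals(publicId)) {
                    // Serve the bundled copy instead of opening the URL
                    return new InputSource(new StringReader(LOCAL_DTD));
                }
                // Returning null would let the parser open the URL itself,
                // so unknown entities get an empty source instead.
                return new InputSource(new StringReader(""));
            }
        });

        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                String value = attrs.getValue("lang");
                if (value != null) {
                    lang.append(value);
                }
            }
        });

        reader.parse(new InputSource(new StringReader(XML)));
        return lang.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseLang());
    }
}
```

The default value en still reaches the handler, so the implicit data from the DTD is preserved while no outbound connection is needed.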