DTD calling home

Facing the entrance of the office, we have a monitor displaying several development-related indicators, like recent build results, the latest commit statistics and so forth. Walking into the office one morning, I noticed that the status of several of our overnight unit tests had turned crimson red. Upon closer inspection it became clear that all failing tests were related to parsing XML configuration files for Struts.

Going through the logs showed each test case was throwing the same exception:

java.io.FileNotFoundException: http://struts.apache.org/dtds/struts-config_1_1.dtd

This originated inside a SAX parser that was parsing several different struts-config.xml files for those test cases. A simple text search revealed that the Struts configuration files all started with the following preamble:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE struts-config PUBLIC
   "-//Apache Software Foundation//DTD Struts Configuration 1.1//EN"
   "http://jakarta.apache.org/struts/dtds/struts-config_1_1.dtd">

That DOCTYPE declaration references an external DTD – the URL points to the Apache web site – which the SAX parser tries to fetch as well when it hits the declaration: it opens an HTTP connection in order to read the file. That operation failed, causing the test cases to fail as well. This actually raises two questions: Why was the parser not able to connect to the above URL (which is a valid server and document name), and why did it try to do so in the first place?

The first question was quickly dismissed, as the failure probably resulted from some network glitch that could not be reproduced, but the second one was rather alarming. The software covered by the unit tests is supposed to run in a rather restricted environment and should not establish any outbound network connections that are not explicitly requested by the user. So why did the Java SAXParser start connecting to remote web sites? The answer is simple: That’s what it’s supposed to do.

Document Type Definitions

XML has become the major data exchange format, and even though shockingly many applications don’t seem to care, the producer and consumer of an XML document should both have a clear understanding of how the document they create or parse is structured. This was originally achieved through a document type definition (DTD), which has largely been replaced by the more flexible XML Schema language in recent years.

A DTD can appear inside the XML document it defines or in an external file, which is referenced through a document type declaration (DOCTYPE) as seen above. In both cases the DTD can contain the following information:

  • Element declarations: These define the names of the elements (tags) that can appear in the document, the content that can appear inside an element (text, nothing, other elements) and which attributes an element can have (see next item)
  • Attribute list declarations: These define groups of attributes that can be assigned to elements, including what content each attribute can contain, whether it is optional or required and what default values are assigned
  • Entity declarations: Entities can be thought of as placeholders that should be replaced with other text. The most commonly known entities are the special character entities in HTML, e.g., &lt; which is replaced by <, but you can also define custom ones, for example that &company; should be replaced with Snake Oil International, Inc.
  • Notation declarations: These allow defining formats for non-XML data, mostly using MIME types, e.g., for images or other binary data that is referenced in the document and should be handled by the parsing application
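To make these declaration types concrete, here is a minimal sketch of a document with an internal DTD; all names in it are made up for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE report [
  <!-- element declarations: a report contains a title, which contains text -->
  <!ELEMENT report (title)>
  <!ELEMENT title (#PCDATA)>
  <!-- attribute list declaration: "lang" is optional and defaults to "en" -->
  <!ATTLIST report lang CDATA "en">
  <!-- entity declaration: &company; expands to the full name -->
  <!ENTITY company "Snake Oil International, Inc.">
  <!-- notation declaration: names a non-XML data format -->
  <!NOTATION gif SYSTEM "image/gif">
]>
<report>
  <title>Annual results of &company;</title>
</report>
```

A DTD-aware parser will report the report element with lang="en" even though the attribute is absent, and will hand the application the expanded company name instead of the entity reference.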

To parse the XML data correctly and present the document content to the application reading the XML, the parser needs to take the DTD into consideration. If the DTD defines an implicit attribute on an element with a default value, that attribute should be reported to the application whenever that type of element is encountered, even if the attribute does not appear on the element in the document. In the same way, entities need to be expanded and replaced with the text they represent, as the application wouldn’t know what text to substitute for the entity.

Thus, the XML parser should parse an inline DTD or an external reference to one whenever it encounters either. This is also the default setting in many SAX parsers, including Xerces, which is used by our test cases. In some cases, however, the application might not care about validating the XML, or at least should not try to read external entities or document type definitions.

Configuring the XML Parser

In addition to loading external DTDs, there are many other aspects of an XML parser you might want to configure: Should it validate the document as it parses, or ignore the content model? Do XML namespaces matter to you? If you do want to load external schema definitions, how should the URLs be resolved? These switches are called “features” in the Java XML APIs; they are identified by a URL, and there is a set of standard features every SAX-compliant parser must support. In addition, specific parser implementations provide extensions to that standard set. The feature that turns off external DTD loading is listed among the Xerces-specific features and is called


http://apache.org/xml/features/nonvalidating/load-external-dtd

To deactivate this feature, you can tell the SAXParserFactory directly:

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
SAXParser parser = factory.newSAXParser();

Alternatively, you can set it at the XMLReader after the parser was created:

SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
parser.getXMLReader().setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

These settings can also be applied to other XML-related classes, like the DocumentBuilderFactory, as described in a stackoverflow.com post.

These examples skip exception handling, but they should make clear where and how to apply the settings.
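To round this out with a complete, compilable sketch: the following program applies the same Xerces feature to a DocumentBuilderFactory and parses a document whose DOCTYPE points at an external DTD without ever opening a network connection. The class and method names are my own; only the feature URL comes from the Xerces documentation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class NoDtdExample {

    /** Parses the given XML string without fetching the external DTD. */
    public static Document parseWithoutDtd(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Same Xerces feature as in the SAX examples above:
        factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
            + "<!DOCTYPE struts-config PUBLIC\n"
            + "   \"-//Apache Software Foundation//DTD Struts Configuration 1.1//EN\"\n"
            + "   \"http://jakarta.apache.org/struts/dtds/struts-config_1_1.dtd\">\n"
            + "<struts-config/>";
        Document doc = parseWithoutDtd(xml);
        // With the feature disabled, the DTD URL is never contacted and parsing succeeds:
        System.out.println(doc.getDocumentElement().getTagName()); // struts-config
    }
}
```

Without the setFeature call, the same parse would attempt the HTTP request and fail in an offline environment, just like our test cases did.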

Disabling the feature fixed our test cases, and we reviewed all other places where we handled XML to see whether other features had to be adjusted as well. As mentioned above, deliberately ignoring DTDs or XML schemas might discard important implicit data inside the XML document, so it should always be verified whether the failure to resolve the document definitions is just an annoyance to be eliminated or a serious problem that needs to be addressed.

Grails, Acegi and the authority ROLE_ prefix

After spending way too much time figuring this one out, I am not sure if it’s an RTFM issue or if the Grails part of the documentation is incomplete/unclear about it.

The story goes like this: I was trying to set up a simple Grails application that should include user authentication via LDAP against a number of roles stored in the application database. Fortunately, there is a convenient plugin available that makes Acegi Security available for Grails.

I had the business objects and logic pretty much done before I started with the authentication part – this is a small internal project, so it’s rather simple stuff – and as this was a quick rewrite of a legacy application, the users and roles already existed in the database. So I followed the appropriate tutorial, and in the beginning everything worked fine. The users and roles were mapped correctly, and the LDAP connection worked out of the box, as I was able to log in with an existing user (I haven’t been able to get authentication through bind working yet, but that’s a separate issue).

I had defined two roles, namely DEVELOPER and MANAGER:

Roles in the database

The configuration was straightforward as well; here is just the part of SecurityConfig.groovy that affects the user and role mappings:

active = true
loginUserDomainClass = "User"
authorityDomainClass = "Role"
//... some more stuff
useLdap = true
useControllerAnnotations = true
ldapRetrieveGroupRoles = false
ldapRetrieveDatabaseRoles = true

For testing purposes, one of the business object controllers was annotated to accept only the MANAGER role:

import org.codehaus.groovy.grails.plugins.springsecurity.Secured
 
@Secured(['MANAGER']) // THIS DOES NOT WORK
class MyController {
   def index = {
      // do something useful
   }
}

But even though the user authentication worked fine, I could not access the secured controller after login; instead I got the “Sorry, you’re not authorized to view this page” error message, even though the log showed that the user was granted the correct authorities: Granted Authorities: MANAGER.

I had obviously noticed that in the examples all roles had the format ROLE_XYZ, and it was mentioned that roles retrieved through LDAP would be converted into that format automatically, but it didn’t seem to be a requirement. After fixing the setup accordingly, however, everything worked great:

Fixed roles in database

import org.codehaus.groovy.grails.plugins.springsecurity.Secured
 
@Secured(['ROLE_MANAGER']) // the ROLE_ prefix is a must
class MyController {
   def index = {
      // do something useful
   }
}

Behind the scenes

The requirement for the prefix originates from the default AccessDecisionVoter implementation inside the Spring Security code, RoleVoter.

That implementation compares the ConfigAttributes in the session against the granted authorities. However, there might be attributes that are not roles and should not be considered in the comparison – as that might introduce security vulnerabilities – so the voter only takes attributes into consideration that start with a certain prefix. The corresponding snippet from RoleVoter.java makes that pretty obvious:

private String rolePrefix = "ROLE_";

// ...

if ((attribute.getAttribute() != null) && attribute.getAttribute().startsWith(getRolePrefix())) {
    return true;
}

It is of course possible to provide a different implementation of AccessDecisionVoter. Such a class would not have to rely on a prefix and could use a completely different comparison, or you could simply pass a different prefix, e.g., an empty String, to the RoleVoter. Unless you have special requirements or cannot prefix your authority names for some odd reason, though, that shouldn’t be necessary. Looking at the implementation a couple of lines further down also makes it apparent that the authority name comparison is case sensitive – another little gotcha to look out for. Some more details about how Acegi Security handles authorization internally are covered in this JavaWorld article.
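To illustrate why a MANAGER authority without the prefix never gets granted, here is a simplified, self-contained sketch of the prefix logic described above. This is my own toy class, not the actual Spring Security RoleVoter; the constant values mirror the AccessDecisionVoter convention (grant = 1, deny = -1, abstain = 0).

```java
import java.util.List;

// Simplified sketch of the RoleVoter prefix logic (hypothetical class, for illustration):
// only attributes that start with the configured prefix take part in the vote,
// and the comparison against the granted authorities is case sensitive.
public class PrefixVoterSketch {

    static final int ACCESS_GRANTED = 1;
    static final int ACCESS_DENIED = -1;
    static final int ACCESS_ABSTAIN = 0;

    private final String rolePrefix;

    public PrefixVoterSketch(String rolePrefix) {
        this.rolePrefix = rolePrefix;
    }

    public int vote(List<String> grantedAuthorities, List<String> requiredAttributes) {
        int result = ACCESS_ABSTAIN;
        for (String attribute : requiredAttributes) {
            if (attribute == null || !attribute.startsWith(rolePrefix)) {
                continue; // not a role attribute – ignored entirely
            }
            result = ACCESS_DENIED; // at least one role attribute was considered
            if (grantedAuthorities.contains(attribute)) { // case-sensitive match
                return ACCESS_GRANTED;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixVoterSketch voter = new PrefixVoterSketch("ROLE_");
        System.out.println(voter.vote(List.of("ROLE_MANAGER"), List.of("ROLE_MANAGER"))); // 1
        System.out.println(voter.vote(List.of("MANAGER"), List.of("MANAGER")));           // 0
        System.out.println(voter.vote(List.of("ROLE_manager"), List.of("ROLE_MANAGER"))); // -1
    }
}
```

The middle case is exactly what happened here: with an unprefixed MANAGER attribute, the voter abstains instead of granting access, and the request is denied elsewhere in the chain.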

I guess I would have looked in the right direction earlier had I created the roles through the generated interfaces instead of using a prepopulated database, as the forms generated by the Grails plugin don’t accept role names that do not contain "ROLE" in the authority name. Here is a snippet of the UserController template in the Acegi Security plugin 0.5.2:

private void addRoles(person) {
    for (String key in params.keySet()) {
        if (key.contains('ROLE') && 'on' == params.get(key)) {
            ${authorityClassName}.findByAuthority(key).addToPeople(person)
        }
    }
}

Even though this code was probably added with good intentions, it has several major flaws:

  1. Any authority name not matching the requirement is silently dropped without any feedback
  2. The constraint is not obvious, documented or explained in the UI
  3. The check is semantically wrong, as the role name "PAROLEE" would be accepted as well – which would give developers working for the judicial system some headaches, I guess. Here, key.startsWith('ROLE_') would be an improvement
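The third flaw is easy to demonstrate in isolation. The helper names below are hypothetical; they simply contrast the plugin's contains check with the suggested startsWith variant:

```java
// Contrasts the plugin's lenient substring check with a stricter prefix check.
public class RoleKeyCheck {

    /** The plugin's check: any key merely containing "ROLE" passes. */
    static boolean lenientCheck(String key) {
        return key.contains("ROLE");
    }

    /** The suggested improvement: only properly prefixed names pass. */
    static boolean strictCheck(String key) {
        return key.startsWith("ROLE_");
    }

    public static void main(String[] args) {
        System.out.println(lenientCheck("PAROLEE"));     // true – wrongly accepted
        System.out.println(strictCheck("PAROLEE"));      // false
        System.out.println(strictCheck("ROLE_MANAGER")); // true
    }
}
```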

Running into this issue would at least have been an indicator that the name cannot be chosen arbitrarily. The documentation of the Spring Security classes involved is pretty straightforward, but the tutorials and manuals of the Grails plugin could be a bit more explicit in that respect. I guess this is just another case of leaky abstractions.

The Ominous First Post

The first step is always the hardest. Mustering the extra energy necessary to get moving automatically leads to the realization that the decision to try something new has actually been made. This is especially true if you are trying something you weren’t really convinced of not too long ago.

I’ve been a little skeptical about the whole blogosphere, as a lot of it seemed like random noise largely revolving around itself. However, after stumbling over solutions to technical problems I was facing in posts that came up in Google, and discovering the value of reading development-related blogs for keeping up with current industry issues, I started to reconsider my point of view. Suddenly there was an urge to chime into some of the discussions or to give back fix suggestions for issues I had to figure out myself.

Opinions

To dip my toes into the water, I started an internal blog for the development team in the company I work for. At first this was just intended to improve communication in the team, which worked out great. After some time, though, I caught myself putting more general thoughts into some posts, which went beyond the character of internal communication. Some of these topics reflected current debates raised on popular programming blogs or in podcasts, which were picked up within our team and resulted in interesting discussions.

Some of the internal contributions to those debates were quite intriguing, and I realized the value of getting feedback and the potential a blog has as a medium. So part of the reason for putting up a public blog is to be able to participate in some of that distributed pondering of ideas and opinions.

Being a German software developer working in an international team in Taiwan, in an English- and Chinese-speaking environment, there is of course also the faint hope that the rather uncommon circumstances of my professional life could provide some interesting insight to other people out there. For now I would consider that more a nice by-product than an integral part of the experiment, though.

Technical notes to self

Everything mentioned above about the exchange of ideas and participation in debates of course assumes that somebody is actually reading what I come up with here. At this point, though, I am pretty sure that I am writing for nobody but myself and Google – provided that their algorithm deems anything here worthy of indexing.

As mentioned before, having a Google-indexed scratch pad with notes of a technical nature is the other major motivation for starting this experiment. For now I will consider the descriptions of technical challenges and the solutions I’ve found as notes to myself; the blog simply serves as an incentive to actually write them down, which hopefully makes it easier to remember them or to find them again more efficiently. If somebody else finds them helpful, that’s even better. I could count that as good karma, as I would finally be able to give back to the hordes of software developers from whose posts and forum contributions I’ve profited so often before.

It seems the first step has been taken: there is now a real post that somewhat outlines the purpose of this blog and serves as a reminder to myself. Now I “just” need to follow up with some real content – this feels a bit like a new feature that is almost finished because the prototype works; it’s “just” the actual implementation that’s missing.