What is attoparser?

attoparser is a Java parser for XML and HTML markup.
It is a SAX-style, event-based parser (although it does not implement the SAX standard), and it can also act as a DOM-style parser.

Its goals are:

  • To be easy to use. Only a few lines of code are needed, and there is no more parser library hell: no worrying about which parser API versions your JDK ships.
  • To be fast. As fast as the fastest standard parsers, and in many scenarios faster.
  • To offer a powerful interface. Optional well-formedness, line and column location, the ability to reconstruct the original document, etc.
  • To simplify your parsing experience. By removing the need to worry about validation or entity resolution, both of which are unneeded in many cases.

Can it be my parser instead of the standard ones?

The answer is simple: if you need neither DTD/Schema validation nor entity resolution, then yes, it can.

What does it look like?

First, you should create an implementation of IAttoHandler, usually by extending one of its predefined abstract implementations:

public class MyHandler extends AbstractStandardMarkupAttoHandler {
    /*
     * Provide implementations for the events you are interested in.
     */
}

Then simply execute the parser using your handler:

final Reader documentReader = ...;

final IAttoParser parser = new MarkupAttoParser(); // this is thread-safe and can be reused
final IAttoHandler handler = new MyHandler();

parser.parse(documentReader, handler);
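
The handler in the snippet above extends an abstract base class and overrides only the events it cares about. The pattern behind that can be sketched in a stand-alone way. Every name below (SketchHandler, AbstractSketchHandler, ElementCounter, fireSampleEvents) is hypothetical and is not an attoparser type, and the fixed event sequence merely stands in for what a real parser would emit:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-alone illustration of the handler pattern; none of these are attoparser types. */
final class HandlerPatternSketch {

    /** Hypothetical event interface: one method per kind of event. */
    interface SketchHandler {
        void openElement(String name, int line, int col);
        void closeElement(String name, int line, int col);
        void text(String content, int line, int col);
    }

    /** Hypothetical base class: every event defaults to doing nothing. */
    abstract static class AbstractSketchHandler implements SketchHandler {
        public void openElement(String name, int line, int col) { }
        public void closeElement(String name, int line, int col) { }
        public void text(String content, int line, int col) { }
    }

    /** A concrete handler that overrides only the single event it needs. */
    static final class ElementCounter extends AbstractSketchHandler {
        final Map<String, Integer> counts = new LinkedHashMap<String, Integer>();

        @Override
        public void openElement(String name, int line, int col) {
            final Integer seen = counts.get(name);
            counts.put(name, seen == null ? 1 : seen + 1);
        }
    }

    /** Stands in for a parser: fires the events for "<ul><li>a</li><li>b</li></ul>". */
    static void fireSampleEvents(SketchHandler handler) {
        handler.openElement("ul", 1, 1);
        handler.openElement("li", 1, 5);
        handler.text("a", 1, 9);
        handler.closeElement("li", 1, 10);
        handler.openElement("li", 1, 15);
        handler.text("b", 1, 19);
        handler.closeElement("li", 1, 20);
        handler.closeElement("ul", 1, 25);
    }

    public static void main(String[] args) {
        final ElementCounter counter = new ElementCounter();
        fireSampleEvents(counter);
        System.out.println(counter.counts);
    }
}
```

Because the base class absorbs every event the subclass does not override, new event types can be ignored without touching existing handlers.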

The features

  • Java-based. Requires Java SE 5.0 or newer.
  • Easy to deploy. attoparser is a single .jar library with no additional dependencies. There is no need to worry about which versions of SAX, DOM or any other XML-related standard your JDK build includes.
  • Light. attoparser's only .jar file weighs about 85 KB.
  • Event-based (SAX style). attoparser offers an event-based interface: it calls methods on a user-provided handler class that implements a specific interface, usually by extending one of the provided abstract classes, which offer different levels of event detail. This is equivalent to implementing the ContentHandler interface when using standard SAX parsers.
  • HTML-specific intelligence. attoparser includes specific logic for parsing HTML markup correctly. For example, it can report an <img src="..."> element as a standalone element even if it is not minimized (<img src="..." />) and has no closing tag.
  • Optional DOM-style. attoparser also offers a prebuilt handler class that translates parsing events into a fully-featured attoDOM (attoparser-customized Document Object Model) tree of nodes, which can be modified and written back to markup if needed.
  • Optional well-formedness. Users are not restricted to parsing only well-formed markup (from an XML standpoint). attoparser can be configured to ignore well-formedness rules such as tag balancing, attribute values delimited by quotes, correct XML/XHTML/HTML prolog specification, etc. This makes attoparser especially well-suited for parsing HTML code.
  • Small memory footprint. Unless the user's handler implementation specifically requires it, attoparser avoids copying the document contents in memory: it always works on the original char[] buffer and provides (offset,len) pairs that delimit event artifacts.
  • Full event location. Each event artifact (and attoDOM node) provides its location in the original document as a line and column number.
  • Several levels of detail. Users can specify the level of detail they need for their events by choosing a specific abstract base class for their handler implementations. For example, users who are not interested in delimiting element (tag) names or attributes can choose a detail level that ignores tag contents, which improves performance.
  • Document reconstruction. attoparser takes all the required measures to ensure that, when needed, the original markup can be completely reconstructed after parsing. At the most detailed level, no character or artifact is ignored or left out of event reporting. This is useful when the parser is used for processing templates.
  • No escaping/unescaping. No text escaping or unescaping is applied to parsed artifacts, and no entity substitution (e.g. &aacute; to á) is performed, so users can apply their own rules where required. This frees the parser from making possibly invalid assumptions about markup due to differences between XML and HTML escaping rules, and it also allows a complete reconstruction of the original document after parsing, if needed.
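
Three of the features above fit together: (offset,len) pairs over a shared char[] buffer, line and column location, and document reconstruction. The toy scanner below is a minimal sketch of that idea and not attoparser's actual algorithm or API. It assumes a deliberately naive notion of markup (anything between '<' and '>' is a tag, everything else is text), and OffsetEventSketch, RangeHandler and Collector are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only; not attoparser's real classes or parsing rules. */
final class OffsetEventSketch {

    /** Hypothetical handler: receives ranges of the shared buffer, never copies of it. */
    interface RangeHandler {
        void onTag(char[] buffer, int offset, int len, int line, int col);
        void onText(char[] buffer, int offset, int len, int line, int col);
    }

    /** Splits the buffer into "<...>" artifacts and text runs, reporting each as a range. */
    static void scan(final char[] buffer, final RangeHandler handler) {
        int line = 1;  // 1-based line of the artifact about to be reported
        int col = 1;   // 1-based column of the artifact about to be reported
        int i = 0;
        while (i < buffer.length) {
            final int start = i;
            final boolean tag = buffer[i] == '<';
            if (tag) {
                // Consume up to and including '>' (or to the end if unterminated).
                while (i < buffer.length && buffer[i] != '>') { i++; }
                if (i < buffer.length) { i++; }
                handler.onTag(buffer, start, i - start, line, col);
            } else {
                // Consume text up to, but not including, the next '<'.
                while (i < buffer.length && buffer[i] != '<') { i++; }
                handler.onText(buffer, start, i - start, line, col);
            }
            // Advance the location over the characters just reported.
            for (int j = start; j < i; j++) {
                if (buffer[j] == '\n') { line++; col = 1; } else { col++; }
            }
        }
    }

    /** Rebuilds the input from the reported ranges and logs each artifact's location. */
    static final class Collector implements RangeHandler {
        final StringBuilder rebuilt = new StringBuilder();
        final List<String> log = new ArrayList<String>();

        public void onTag(char[] buffer, int offset, int len, int line, int col) {
            record("TAG", buffer, offset, len, line, col);
        }

        public void onText(char[] buffer, int offset, int len, int line, int col) {
            record("TEXT", buffer, offset, len, line, col);
        }

        private void record(String kind, char[] buffer, int offset, int len, int line, int col) {
            rebuilt.append(buffer, offset, len);  // characters are copied only here, by choice
            log.add(kind + "@" + line + ":" + col + " len=" + len);
        }
    }

    public static void main(String[] args) {
        final String markup = "<p>Hi &aacute;\n<img src=\"a.png\"></p>";
        final Collector collector = new Collector();
        scan(markup.toCharArray(), collector);
        System.out.println(collector.log);
        System.out.println(collector.rebuilt.toString().equals(markup));
    }
}
```

Since every character belongs to exactly one reported range and nothing is unescaped along the way (the &aacute; entity passes through untouched), appending the ranges in order yields the original input again.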

How is it distributed?

attoparser is Open Source Software, and it is distributed under the terms of the Apache License 2.0.

Project status

attoparser is stable and production-ready. The current version is 1.3.