attoparser: powerful and easy Java parser for XML and HTML markup

Glossary

In order to fully understand how attoparser works, you will first need to know some basic concepts:

markup	For the sake of conciseness, in attoparser we will consider this term as a synonym of "XML and/or HTML".
structure	A structure is an artifact in the parsed document that is not simply text, that is, some kind of directive, format or metadata. Examples: elements, DOCTYPE clauses, comments, etc.
element	The term element is just the official standard name for a markup tag.
`(offset,len)` pair	An `(offset,len)` pair is a pair of integers that delimits a subsequence of an array. The first component (offset) is the first position in the array included in the subsequence, and the second component (len) is the length of the subsequence. These pairs are used extensively in attoparser to delimit parsed artifacts on the original `char[]` buffer. Converting an `(offset,len)` pair into a `String` object is easy: just do `new String(buffer, offset, len)`.
attoparser DOM	A DOM-style interface offered by attoparser similar to the standard DOM, but implemented with classes from the `org.attoparser.dom` package.
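
To make the `(offset,len)` convention concrete, here is a minimal sketch that uses only the JDK. The OffsetLenDemo class and its asString method are hypothetical names made up for this illustration and are not part of attoparser; the buffer contents are just an example.

```java
final class OffsetLenDemo {

    // Materializes the subsequence delimited by an (offset,len) pair as a String.
    // Hypothetical helper, not an attoparser method.
    static String asString(final char[] buffer, final int offset, final int len) {
        return new String(buffer, offset, len);
    }

    public static void main(final String[] args) {
        final char[] buffer = "<p>Hello</p>".toCharArray();

        // The text "Hello" starts at index 3 and is 5 chars long.
        System.out.println(asString(buffer, 3, 5)); // prints Hello

        // The element name "p" in the open tag: offset 1, len 1.
        System.out.println(asString(buffer, 1, 1)); // prints p
    }
}
```

Note that a String is only created when we ask for one; until then, the pair simply points into the existing buffer.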

Handlers

The first thing we need to do in order to use attoparser is create an event handler. This event handler will be an implementation of the IMarkupHandler interface (or of its simplified counterpart, ISimpleMarkupHandler), but we will normally not implement these interfaces directly. Instead, we will create a subclass of one of the no-op abstract base classes provided by attoparser.

attoparser offers two handler interfaces, each with its own no-op abstract base class:

`AbstractMarkupHandler`	No-op base class for the `IMarkupHandler` interface, which offers the full level of detail: every artifact (element names, each attribute, inner whitespace, comments, CDATA, DOCTYPE, etc.) is reported directly on the original `char[]` buffer by means of `(offset,len)` pairs, avoiding the creation of intermediate objects.
`AbstractSimpleMarkupHandler`	No-op base class for the `ISimpleMarkupHandler` interface, a simplified set of events more similar to the standard SAX `ContentHandler`: fewer events, element names reported as `String` and attributes already built as a `Map<String,String>`.

Note that, unlike in older (1.x) versions of attoparser, the difference between XML and HTML parsing is no longer a matter of handler class, but of configuration: you simply pass a ParseConfiguration.xmlConfiguration() or ParseConfiguration.htmlConfiguration() object to the parser (more on this below).

Creating a handler

For example, we could extend AbstractSimpleMarkupHandler and create a very simple handler for counting the number of standalone elements in our parsed documents, like:

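import java.util.Map;

import org.attoparser.ParseException;
import org.attoparser.simple.AbstractSimpleMarkupHandler;

public class StandaloneCountingHandler extends AbstractSimpleMarkupHandler {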

    // Let's count the number of standalone elements in our document!
    private int standaloneCount = 0;

    public int getStandaloneCount() {
        return this.standaloneCount;
    }

    @Override
    public void handleStandaloneElement(
            final String elementName, final Map<String, String> attributes,
            final boolean minimized, final int line, final int col)
            throws ParseException {
        this.standaloneCount++;
    }

}

Looking at the code above, note that attoparser reports different events for open elements and standalone elements, a differentiation that is not easy to achieve using standard SAX parsers.

Also note the line and col arguments, specifying the exact position of these standalone elements in the document.

Whether the parsed document is treated as XML or HTML is decided later, by the configuration we give to the parser, not by the handler. That choice determines whether the document is required to be well-formed from an XML standpoint or is instead parsed with HTML-specific intelligence. For example, HTML parsing knows that <img src="..."> is a standalone tag (and not an open tag) even if it isn't written as <img src="..." />.

Just one more...

What if we wanted to strip our document of markup tags, leaving only the text? We could easily create a handler for this by overriding just the handleText event:

import org.attoparser.ParseException;
import org.attoparser.simple.AbstractSimpleMarkupHandler;

public class TagStrippingHandler extends AbstractSimpleMarkupHandler {

    private final StringBuilder strBuilder = new StringBuilder();

    public String getTagStrippedText() {
        return this.strBuilder.toString();
    }

    @Override
    public void handleText(
            final char[] buffer, final int offset, final int len,
            final int line, final int col)
            throws ParseException {
        this.strBuilder.append(buffer, offset, len);
    }

}

Quite easy, right? In fact, this is such a common need that attoparser already ships a ready-made TextOutputMarkupHandler that does exactly this out of the box.

Ready-made handlers

Besides creating your own, attoparser provides several useful handler implementations out of the box:

`OutputMarkupHandler`	Writes the received events back as markup to a `java.io.Writer`, without any loss of information. Useful at the end of a handler chain to output the result of a filtering/transformation operation.
`TextOutputMarkupHandler`	Writes only the text of the document to a `java.io.Writer`, effectively stripping all markup.
`DOMBuilderMarkupHandler`	Builds an attoparser DOM tree (classes in `org.attoparser.dom`) as a result of parsing. More easily applied through the `DOMMarkupParser` convenience class.
`SimplifierMarkupHandler`	Converts the full set of `IMarkupHandler` events into the simplified `ISimpleMarkupHandler` ones. This is what the `SimpleMarkupParser` convenience class uses underneath.
`MinimizeHtmlMarkupHandler`	Minimizes (compacts) HTML markup: removes excess white space, unquotes attributes when possible, etc.
`BlockSelectorMarkupHandler` / `NodeSelectorMarkupHandler`	Apply markup selectors (a CSS/DOM-like syntax) in order to select fragments of markup and route selected vs. non-selected events to different handlers.
`DuplicateMarkupHandler`	Sends every event to two different handlers at the same time.

Parsers

attoparser offers a main parser interface called IMarkupParser, with a single implementation: MarkupParser. It works with the full-detail IMarkupHandler interface.

There are also two convenience parsers for the most common scenarios: SimpleMarkupParser (interface ISimpleMarkupParser), which works with the simplified ISimpleMarkupHandler handlers, and DOMMarkupParser (interface IDOMMarkupParser), which produces an attoparser DOM tree.

Parser instances are meant to be used directly, without extending them. They are created by passing them the ParseConfiguration we want to use, most commonly one of the pre-initialized ParseConfiguration.htmlConfiguration() or ParseConfiguration.xmlConfiguration() objects. Parser instances are thread-safe, so they can be safely reused without synchronization. Note that this thread-safety usually does not apply to handlers, which tend to keep per-document state (such as a counter or a StringBuilder).

Parsing our document

Parsers allow us to specify the document to be parsed in several useful ways: as a java.io.Reader, a String or a char[].

Let's say we have a document in our classpath and we want to parse it using our recently created handler in order to count the number of standalone elements it contains. As our handler is an ISimpleMarkupHandler, we will use a SimpleMarkupParser. For the sake of simplicity, we will ignore the try..finally code required to adequately close the streams:

final InputStream is = Thread.currentThread().getContextClassLoader().getResourceAsStream(fileName);
// We know our file's encoding is ISO-8859-1, and we need that info to create a Reader
final Reader reader = new BufferedReader(new InputStreamReader(is, "ISO-8859-1"));

// Parsers are thread-safe and reusable. We choose the HTML configuration here.
final ISimpleMarkupParser parser = new SimpleMarkupParser(ParseConfiguration.htmlConfiguration());

final StandaloneCountingHandler handler = new StandaloneCountingHandler();
parser.parse(reader, handler);

final int standaloneCount = handler.getStandaloneCount();

And we are done! Had we wanted to require the document to be well-formed XML instead, we would simply have created the parser with ParseConfiguration.xmlConfiguration().
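
The try..finally handling we skipped above could look roughly like the following sketch, which uses try-with-resources and only JDK classes. ReaderClosingDemo, ReaderConsumer and withReader are hypothetical names introduced for this illustration (they are not part of attoparser); the consumer stands in for a call such as parser.parse(reader, handler).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class ReaderClosingDemo {

    // Hypothetical stand-in for whatever consumes the Reader,
    // e.g. a lambda wrapping parser.parse(reader, handler).
    interface ReaderConsumer {
        void consume(Reader reader) throws Exception;
    }

    static void withReader(final InputStream is, final ReaderConsumer consumer)
            throws Exception {
        // try-with-resources closes the reader (and the stream it wraps)
        // whether consume(...) returns normally or throws.
        try (Reader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.ISO_8859_1))) {
            consumer.consume(reader);
        }
    }

    public static void main(final String[] args) throws Exception {
        // An in-memory ISO-8859-1 document, standing in for the classpath resource.
        final byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        final StringBuilder seen = new StringBuilder();

        withReader(new ByteArrayInputStream(bytes), reader -> {
            int c;
            while ((c = reader.read()) != -1) {
                seen.append((char) c);
            }
        });

        System.out.println(seen);
    }
}
```

Using the StandardCharsets constant instead of the "ISO-8859-1" string also avoids having to deal with UnsupportedEncodingException.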

Using the DOM features

In addition to its main SAX-style parsing capabilities, attoparser offers us a DOM-style interface that enables us to handle a document as an attoparser DOM tree (classes in the org.attoparser.dom package).

Using it is easy: we just need the convenience DOMMarkupParser, which returns a Document object directly:

final Reader reader = ...;

final IDOMMarkupParser parser = new DOMMarkupParser(ParseConfiguration.htmlConfiguration());

// Parse the document into an attoparser DOM tree
final Document doc = parser.parse("My document", reader);

// Navigate the tree using the typed-child accessors
final DocType docType = doc.getFirstChildOfType(DocType.class);
final Element rootElement = doc.getFirstChildOfType(Element.class);
final List<Element> children = rootElement.getChildrenOfType(Element.class);
...

Writing markup from an attoparser DOM tree

attoparser provides, out-of-the-box, a utility capable of writing an attoparser DOM tree as markup code again. It's the DOMWriter class:

final Document doc = ...;

// Modify our document if we wish
...

final StringWriter stringWriter = new StringWriter();

// Execute the writer (DOMWriter exposes static write(...) methods)
DOMWriter.write(doc, stringWriter);

// Obtain the resulting markup
final String markup = stringWriter.toString();