attoparser: powerful and easy Java parser for XML and HTML markup

Glossary

In order to fully understand how attoparser works, you will first need to know some basic concepts:

markup	For the sake of conciseness, in attoparser we will consider this term as a synonym of "XML and/or HTML".
structure	A structure is an artifact in the parsed document that is not simply text, that is, some kind of directive, format or metadata. For example: elements, DOCTYPE clauses, comments, etc.
element	The term element is just the official standard name for a markup tag.
`(offset,len)` pair	An `(offset,len)` pair is a pair of integers that specify a subsequence of elements in an array. The first component (offset) is the first position in the array to be included in the subsequence, and the second component (len) is the length of the subsequence. These pairs are used extensively in attoparser to delimit parsed artifacts on the original `char[]` buffer. Converting an `(offset,len)` pair into a `String` object is easy: just call `new String(buffer, offset, len)`.
attoDOM	A DOM-style interface offered by attoparser similar to the standard DOM, but implemented with classes from the `org.attoparser.markup.dom` package.
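
As a minimal sketch of how an `(offset,len)` pair maps onto a buffer, the following plain-Java example slices a small `char[]` by hand. The class and method names (`OffsetLenExample`, `slice`) are hypothetical and not part of attoparser; only the `new String(buffer, offset, len)` call comes from the glossary entry above.

```java
final class OffsetLenExample {

    // Hypothetical helper: turns an (offset,len) pair into a String.
    static String slice(final char[] buffer, final int offset, final int len) {
        return new String(buffer, offset, len);
    }

    public static void main(final String[] args) {
        final char[] buffer = "<p>Hello</p>".toCharArray();
        // The open tag starts at index 0 and is 3 chars long
        System.out.println(slice(buffer, 0, 3)); // prints "<p>"
        // The text starts at index 3 and is 5 chars long
        System.out.println(slice(buffer, 3, 5)); // prints "Hello"
        // The close tag starts at index 8 and is 4 chars long
        System.out.println(slice(buffer, 8, 4)); // prints "</p>"
    }
}
```

Note that no copy of the text is made until `slice` is called, which is the point of reporting artifacts as positions on the original buffer.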

Handlers

The first thing we need to do to use attoparser is create an event handler. This event handler will be an implementation of the IAttoHandler interface, but we will not normally implement this interface directly. Instead, we will create a subclass of one of the abstract base classes already provided by attoparser.

Each of these abstract base classes provides a set of overridable methods, all of them with an empty default implementation, and each set offers a different level of detail:

Example general handlers

`AbstractAttoHandler`	Basic implementation only differentiating between text and structures.
`AbstractBasicMarkupAttoHandler`	Abstract handler able to differentiate among different types of markup structures: elements, comments, CDATA, DOCTYPE, etc. without breaking them down (for example, elements will be offered as a whole, without differentiating name and attributes).
`AbstractDetailedMarkupAttoHandler`	Abstract handler able not only to differentiate among different types of markup structures, but also to report lower-level detail inside elements (name, attributes, inner whitespace) and DOCTYPE clauses (keyword, root element name, public and system ID, etc.).
`AbstractStandardMarkupAttoHandler`	Higher-level abstract handler that offers an interface more similar to the Standard SAX `ContentHandler`s (fewer events, use of `String` instead of `char[]`, attributes reported as `Map<String,String>`, etc).

Example XML handlers

`AbstractDetailedXmlAttoHandler`	Abstract handler with the same level of detail as `AbstractDetailedMarkupAttoHandler`, using specific XML configuration.
`AbstractStandardXmlAttoHandler`	Higher-level abstract handler similar to `AbstractStandardMarkupAttoHandler`, using specific XML configuration.
`DOMXmlAttoHandler`	Specialized handler that converts SAX-style events into an attoDOM tree.

Example HTML handlers

`AbstractDetailedNonValidatingHtmlAttoHandler`	Abstract handler with the same level of detail as `AbstractDetailedMarkupAttoHandler`, using specific HTML configuration and intelligence.
`AbstractStandardNonValidatingHtmlAttoHandler`	Higher-level abstract handler similar to `AbstractStandardMarkupAttoHandler`, using specific HTML configuration and intelligence.
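
The "empty default implementation" idea behind these base classes can be sketched in plain Java. The types below (`AbstractSketchHandler`, `TextLengthHandler`) are stand-ins written for this illustration only: they are not attoparser classes, and their method signatures are assumptions loosely modeled on the text/structure split described for `AbstractAttoHandler`.

```java
// Stand-in base class for illustration only; NOT an attoparser class.
abstract class AbstractSketchHandler {

    // Every event method has an empty body, so subclasses override only what they need.
    public void handleText(final char[] buffer, final int offset, final int len) {
        // Nothing to do by default
    }

    public void handleStructure(final char[] buffer, final int offset, final int len) {
        // Nothing to do by default
    }
}

// Overrides only the text event; structure events fall through to the empty default.
final class TextLengthHandler extends AbstractSketchHandler {

    private int textChars = 0;

    @Override
    public void handleText(final char[] buffer, final int offset, final int len) {
        this.textChars += len;
    }

    public int getTextChars() {
        return this.textChars;
    }
}

final class EmptyDefaultsExample {

    static int countTextChars() {
        final char[] buffer = "<b>hi</b> there".toCharArray();
        final TextLengthHandler handler = new TextLengthHandler();
        // Events are fired by hand here; a real parser would fire them while scanning.
        handler.handleStructure(buffer, 0, 3); // "<b>"
        handler.handleText(buffer, 3, 2);      // "hi"
        handler.handleStructure(buffer, 5, 4); // "</b>"
        handler.handleText(buffer, 9, 6);      // " there"
        return handler.getTextChars();
    }

    public static void main(final String[] args) {
        System.out.println(countTextChars()); // prints 8
    }
}
```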

Creating a handler

For example, we could choose AbstractStandardXmlAttoHandler and create a very simple handler that counts the number of standalone elements in our parsed documents:

public class StandaloneCountingAttoHandler extends AbstractStandardXmlAttoHandler {

    // Let's count the number of standalone elements in our document!
    private int standaloneCount = 0;
    
    public StandaloneCountingAttoHandler() {
        super();
    }
    
    public int getStandaloneCount() {
        return this.standaloneCount;
    }

    @Override
    public void handleXmlStandaloneElement(
            final String elementName, final Map<String, String> attributes, 
            final int line, final int col)
            throws AttoParseException {
        this.standaloneCount++;
    }

}

Looking at the code above, note that most attoparser handlers offer separate events for open elements and standalone elements, a distinction that is not easy to achieve with standard SAX parsers.

Also note the line and col arguments, specifying the exact position of these standalone elements in the document.

Finally, note that we are using an XML-specific handler, which instructs attoparser to require the parsed document to be well-formed from an XML standpoint. This means a well-formed prolog, balanced tags, correctly formatted attribute values, etc.

If our document were HTML instead of XML, we could have created our handler by extending, for example, AbstractStandardNonValidatingHtmlAttoHandler. It offers events similar to those of its XML counterpart, but it removes many format restrictions (on attributes, for instance) and adds some HTML-specific intelligence, like knowing that <img src="..."> is a standalone tag (and not an open tag) even if it isn't written as <img src="..." />.
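
To make the `<img>` case concrete, here is a rough sketch of how an HTML-aware component could decide whether a tag is standalone. It is not attoparser's implementation: the class `VoidElementSketch` and its method are hypothetical, the element list is the set of void elements from the HTML standard, and treating any tag written with `/>` as standalone is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; NOT how attoparser is implemented.
final class VoidElementSketch {

    // Void elements in the HTML standard: they never take a closing tag.
    private static final Set<String> VOID_ELEMENTS = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"));

    // 'minimized' is true when the tag was written with a trailing "/>".
    static boolean isStandalone(final String elementName, final boolean minimized) {
        // HTML element names are case-insensitive, so normalize before the lookup.
        return minimized || VOID_ELEMENTS.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(final String[] args) {
        System.out.println(isStandalone("img", false)); // prints true
        System.out.println(isStandalone("div", false)); // prints false
    }
}
```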

Just one more...

What if we wanted to strip our document of markup tags, leaving only the text? We could easily create a handler for this by extending AbstractBasicMarkupAttoHandler:

public class TagStrippingAttoHandler extends AbstractBasicMarkupAttoHandler {

    private final StringBuilder strBuilder;
    
    public TagStrippingAttoHandler() {
        super();
        this.strBuilder = new StringBuilder();
    }
    
    public String getTagStrippedText() {
        return this.strBuilder.toString();
    }

    @Override
    public void handleText(
            final char[] buffer, final int offset, final int len, 
            final int line, final int col)
            throws AttoParseException {
        this.strBuilder.append(buffer, offset, len);
    }
    

}

Quite easy, right?

Parsers

attoparser offers a parser interface called IAttoParser, and only one implementation for it: MarkupAttoParser.

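The MarkupAttoParser class is meant to be used directly, without extending it. Its instances are thread-safe, so a single instance can be shared and reused without synchronization. Handlers, on the other hand, are usually not thread-safe: they tend to keep per-document state (like the counter in our example above), so each parsing operation should normally get its own handler instance.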
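
The usage pattern this implies (one shared parser, one handler per parsing operation) can be sketched with stand-in types. `SketchParser` and `SketchCountingHandler` below are hypothetical and exist only for this illustration; they are not attoparser classes, and the "parsing" they do is just counting `<` characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in handler: keeps mutable per-document state, so it must not be shared.
final class SketchCountingHandler {
    private int count = 0;
    void onOpenBracket() { this.count++; }
    int getCount() { return this.count; }
}

// Stand-in parser: no fields, all working state is local, so one instance can be shared.
final class SketchParser {
    void parse(final String document, final SketchCountingHandler handler) {
        for (int i = 0; i < document.length(); i++) {
            if (document.charAt(i) == '<') {
                handler.onOpenBracket();
            }
        }
    }
}

final class SharedParserExample {

    static List<Integer> parseAll(final List<String> documents) {
        final SketchParser parser = new SketchParser(); // shared by all tasks
        final ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            final List<Future<Integer>> futures = new ArrayList<>();
            for (final String document : documents) {
                futures.add(pool.submit(() -> {
                    // A fresh handler per parsing operation
                    final SketchCountingHandler handler = new SketchCountingHandler();
                    parser.parse(document, handler);
                    return handler.getCount();
                }));
            }
            final List<Integer> results = new ArrayList<>();
            for (final Future<Integer> future : futures) {
                results.add(future.get()); // results keep the order of 'documents'
            }
            return results;
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (final ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```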

Parsing our document

MarkupAttoParser allows us to specify the document to be parsed in several useful ways: as a java.io.Reader, a String or a char[].

Let's say we have a document in our classpath and we want to parse it using our recently created handler in order to count the number of standalone elements it contains. For the sake of simplicity, we will ignore the try..finally code required to adequately close the streams:

final InputStream is = Thread.currentThread().getContextClassLoader().getResourceAsStream(fileName);
// We know our file's encoding is ISO-8859-1, and we need that info to create a Reader
final Reader reader = new BufferedReader(new InputStreamReader(is, "ISO-8859-1"));
                
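final StandaloneCountingAttoHandler handler = new StandaloneCountingAttoHandler();
// Parser instances are thread-safe, so this one could also be created once and shared
final IAttoParser parser = new MarkupAttoParser();
parser.parse(reader, handler);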

final int standaloneCount = handler.getStandaloneCount();

And we are done!
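
For completeness, the stream-closing code that the example above leaves out could look like the following sketch, which uses try-with-resources (Java 7 and later) instead of an explicit try..finally. The `consume` method is a hypothetical stand-in for the `parser.parse(reader, handler)` call and simply counts the characters it reads, and the input comes from an in-memory byte array rather than the classpath so that the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReaderClosingExample {

    // Hypothetical stand-in for parser.parse(reader, handler): counts the chars read.
    static int consume(final Reader reader) throws IOException {
        int count = 0;
        while (reader.read() != -1) {
            count++;
        }
        return count;
    }

    static int readAll(final InputStream is) {
        // The Reader (and the stream it wraps) is closed when the block exits,
        // whether consume() returns normally or throws.
        try (Reader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.ISO_8859_1))) {
            return consume(reader);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        // "\u00e9" is a single byte in ISO-8859-1
        final byte[] bytes = "<a>caf\u00e9</a>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readAll(new ByteArrayInputStream(bytes))); // prints 11
    }
}
```

Passing a `Charset` constant instead of the `"ISO-8859-1"` string also avoids the checked `UnsupportedEncodingException`. When reading from the classpath, keep in mind that `getResourceAsStream` returns `null` if the resource is not found.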

Using the DOM features

In addition to its main SAX-style parsing capabilities, attoparser offers a DOM-style interface that lets us handle a document as an attoDOM tree. Note that, currently, only an XML version of the DOM conversion facilities is offered out-of-the-box.

Using it is easy: we just need to use the prebuilt DOMXmlAttoHandler:

final Reader reader = ...
                
final DOMXmlAttoHandler handler = new DOMXmlAttoHandler();
parser.parse(reader, handler);

final Document doc = handler.getDocument();
final DocType docType = doc.getDocType();
final List<Element> elements = doc.getRootElement().getElementChildren();
...

Writing markup from an attoDOM tree

attoparser provides, out-of-the-box, a writer object capable of writing an attoDOM tree back out as markup. It's the XmlDOMWriter class:

final Document doc = ...

// Modify our document if we wish 
...
                
final StringWriter stringWriter = new StringWriter();
final XmlDOMWriter domWriter = new XmlDOMWriter(); 

// Execute the writer
domWriter.writeDocument(doc, stringWriter);

// Obtain the result of executing the writer
final String markup = stringWriter.toString();