In order to fully understand how attoparser works, you will first need to know some basic concepts:
| markup | For the sake of conciseness, in attoparser we will use this term as a synonym for "XML and/or HTML". |
| structure | A structure is any artifact in the parsed document that is not plain text, that is, some kind of directive, format or metadata. For example: elements, DOCTYPE clauses, comments, etc. |
| element | The term element is just the official standard name for a markup tag. |
| (offset,len) pair | An (offset,len) pair is a pair of integers that specifies a subsequence of elements in an array. The first component (offset) is the first position in the array to be included in the subsequence, and the second component (len) is the length of the subsequence. These pairs are used extensively in attoparser to delimit parsed artifacts on the original char[] buffer. Converting an (offset,len) pair into a String object is easy: just do new String(buffer, offset, len). |
| attoparser DOM | A DOM-style interface offered by attoparser similar to the standard DOM, but implemented with classes from the org.attoparser.dom package. |
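To make the (offset,len) idea concrete, here is a minimal sketch that uses only the JDK. The class name OffsetLenDemo and the helper slice are illustrative names, not part of attoparser; the offsets are written by hand for this specific buffer, whereas in real use the parser computes them and passes them to your handler.

```java
final class OffsetLenDemo {

    // Materializes the subsequence described by an (offset,len) pair as a String.
    // This is the only point at which a new object is created.
    static String slice(final char[] buffer, final int offset, final int len) {
        return new String(buffer, offset, len);
    }

    public static void main(final String[] args) {
        // The original document, held once in memory as a char[]
        final char[] buffer = "<p class=\"x\">Hello</p>".toCharArray();

        // Hand-computed pairs for this buffer (illustrative only)
        System.out.println(slice(buffer, 1, 1));   // element name: p
        System.out.println(slice(buffer, 3, 5));   // attribute name: class
        System.out.println(slice(buffer, 10, 1));  // attribute value: x
        System.out.println(slice(buffer, 13, 5));  // text: Hello
    }
}
```

Because every artifact is described by two integers over the same buffer, a handler that does not need a given artifact never pays for creating a String for it.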
The first thing we need to do in order to use attoparser is to create an event handler. This event handler will be an implementation of the IMarkupHandler interface, but we will not normally implement this interface directly. Instead, we will create a subclass of one of the no-op abstract base classes provided by attoparser.
attoparser offers two handler interfaces, each with its own no-op abstract base class:
| AbstractMarkupHandler | No-op base class for the IMarkupHandler interface, which offers the full level of detail: every artifact (element names, each attribute, inner whitespace, comments, CDATA, DOCTYPE, etc.) is reported directly on the original char[] buffer by means of (offset,len) pairs, avoiding the creation of intermediate objects. |
| AbstractSimpleMarkupHandler | No-op base class for the ISimpleMarkupHandler interface, a simplified set of events more similar to the standard SAX ContentHandler: fewer events, element names reported as String and attributes already built as a Map<String,String>. |
Note that, unlike in older (1.x) versions of attoparser, the difference between XML and HTML parsing is no longer a matter of handler class, but of configuration: you simply pass a ParseConfiguration.xmlConfiguration() or ParseConfiguration.htmlConfiguration() object to the parser (more on this below).
For example, we could extend AbstractSimpleMarkupHandler and create a very simple handler for counting the number of standalone elements in our parsed documents, like:
    import java.util.Map;

    import org.attoparser.ParseException;
    import org.attoparser.simple.AbstractSimpleMarkupHandler;

    public class StandaloneCountingHandler extends AbstractSimpleMarkupHandler {

        // Let's count the number of standalone elements in our document!
        private int standaloneCount = 0;

        public int getStandaloneCount() {
            return this.standaloneCount;
        }

        // Fired once for every standalone element found in the document
        @Override
        public void handleStandaloneElement(
                final String elementName, final Map<String, String> attributes,
                final boolean minimized, final int line, final int col)
                throws ParseException {
            this.standaloneCount++;
        }

    }
Looking at the code above, note that attoparser reports different events for open elements and standalone elements, a differentiation that is not easy to achieve using standard SAX parsers.
Also note the line and col arguments, specifying the exact position
of these standalone elements in the document.
Whether the parsed document is treated as XML or HTML (and therefore whether it is required to be well-formed from an XML standpoint, or is instead parsed with HTML-specific intelligence) is decided later, by the configuration we give to the parser, not by the handler. For example, HTML parsing knows that <img src="..."> is a standalone tag (and not an open tag) even if it isn't written as <img src="..." />.
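The following JDK-only sketch illustrates the kind of decision described above. It is not attoparser's implementation: the class VoidElementSketch and the method isStandalone are hypothetical names, and the rule that a minimized tag always counts as standalone is a simplifying assumption of the sketch. The set of names is the list of void elements defined by the HTML standard.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class VoidElementSketch {

    // Void elements as listed by the HTML standard: they never have a closing tag.
    private static final Set<String> VOID_ELEMENTS =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
                    "area", "base", "br", "col", "embed", "hr", "img",
                    "input", "link", "meta", "source", "track", "wbr")));

    // Hypothetical helper: should a tag be reported as a standalone element?
    //   minimized: the tag was written with a trailing slash, like <br/>
    //   htmlMode:  HTML-specific knowledge is enabled (false means plain XML rules)
    static boolean isStandalone(
            final String elementName, final boolean minimized, final boolean htmlMode) {
        if (minimized) {
            // Assumption of this sketch: a minimized tag is standalone in both modes
            return true;
        }
        // Only HTML mode knows that some non-minimized tags have no body.
        // HTML element names are case-insensitive, so normalize before the lookup.
        return htmlMode && VOID_ELEMENTS.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(final String[] args) {
        System.out.println(isStandalone("img", false, true));   // true
        System.out.println(isStandalone("img", false, false));  // false
        System.out.println(isStandalone("div", false, true));   // false
    }
}
```

Under XML rules the same non-minimized img tag would be an open element that must be closed later, which is why the choice of configuration changes the events a handler receives.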
What if we wanted to strip our document of markup tags, leaving only the text? We could easily create a handler for this by overriding just the handleText event:
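    import org.attoparser.ParseException;
    import org.attoparser.simple.AbstractSimpleMarkupHandler;

    public class TagStrippingHandler extends AbstractSimpleMarkupHandler {

        private final StringBuilder strBuilder = new StringBuilder();

        public String getTagStrippedText() {
            return this.strBuilder.toString();
        }

        // Text arrives as an (offset,len) pair on the original buffer,
        // so we can append it without creating an intermediate String
        @Override
        public void handleText(
                final char[] buffer, final int offset, final int len,
                final int line, final int col)
                throws ParseException {
            this.strBuilder.append(buffer, offset, len);
        }

    }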
Quite easy, right? In fact, this is such a common need that attoparser already ships a ready-made TextOutputMarkupHandler that does exactly this out of the box.
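To see the event flow behind the handler above without needing the library, here is a self-contained toy. It is deliberately not attoparser's parser: TextEventSketch, TextListener and scan are hypothetical names, and the scanner naively treats everything between '<' and '>' as a tag (it knows nothing about comments, CDATA sections or a '>' inside an attribute value). Its only purpose is to show text being reported as (offset,len) pairs on a single buffer and accumulated with StringBuilder.append(char[], int, int).

```java
final class TextEventSketch {

    // Hypothetical stand-in for a text event: same (buffer, offset, len) shape
    interface TextListener {
        void onText(char[] buffer, int offset, int len);
    }

    // Naive scanner: everything outside '<' ... '>' is reported as text.
    static void scan(final char[] buffer, final TextListener listener) {
        int textStart = 0;
        boolean inTag = false;
        for (int i = 0; i < buffer.length; i++) {
            final char c = buffer[i];
            if (!inTag && c == '<') {
                // A tag starts: report the text accumulated so far, if any
                if (i > textStart) {
                    listener.onText(buffer, textStart, i - textStart);
                }
                inTag = true;
            } else if (inTag && c == '>') {
                // The tag ends: the next text run starts right after it
                inTag = false;
                textStart = i + 1;
            }
        }
        // Report any trailing text after the last tag
        if (!inTag && buffer.length > textStart) {
            listener.onText(buffer, textStart, buffer.length - textStart);
        }
    }

    // Same idea as the tag-stripping handler: keep only the text events
    static String stripTags(final String markup) {
        final StringBuilder out = new StringBuilder();
        scan(markup.toCharArray(), (buf, off, len) -> out.append(buf, off, len));
        return out.toString();
    }

    public static void main(final String[] args) {
        System.out.println(stripTags("<b>Hi</b> there"));  // Hi there
    }
}
```

Note that the listener never receives a String: it decides for itself whether to copy the characters, which is the same contract the handleText event offers.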
Besides creating your own, attoparser provides several useful handler implementations out of the box:
| OutputMarkupHandler | Writes the received events back as markup to a java.io.Writer, without any loss of information. Useful at the end of a handler chain to output the result of a filtering/transformation operation. |
| TextOutputMarkupHandler | Writes only the text of the document to a java.io.Writer, effectively stripping all markup. |
| DOMBuilderMarkupHandler | Builds an attoparser DOM tree (classes in org.attoparser.dom) as a result of parsing. More easily applied through the DOMMarkupParser convenience class. |
| SimplifierMarkupHandler | Converts the full set of IMarkupHandler events into the simplified ISimpleMarkupHandler ones. This is what the SimpleMarkupParser convenience class uses underneath. |
| MinimizeHtmlMarkupHandler | Minimizes (compacts) HTML markup: removes excess white space, unquotes attributes when possible, etc. |
| BlockSelectorMarkupHandler / NodeSelectorMarkupHandler | Apply markup selectors (a CSS/DOM-like syntax) in order to select fragments of markup and route selected vs. non-selected events to different handlers. |
| DuplicateMarkupHandler | Sends every event to two different handlers at the same time. |
attoparser offers a main parser interface called IMarkupParser, with a single implementation: MarkupParser. It works with the full-detail IMarkupHandler interface.
There are also two convenience parsers for the most common scenarios: SimpleMarkupParser (interface ISimpleMarkupParser), which works with the simplified ISimpleMarkupHandler handlers, and DOMMarkupParser (interface IDOMMarkupParser), which produces an attoparser DOM tree.
Parser instances are meant to be used directly, without extending them. They are created by passing them the ParseConfiguration we want to use, most commonly one of the pre-initialized ParseConfiguration.htmlConfiguration() or ParseConfiguration.xmlConfiguration() objects. Parser instances are thread-safe, so they can be safely reused without synchronization. Note that this thread-safety usually does not extend to handlers: a handler that accumulates state, like the examples above, should not be shared between concurrent parsing operations.
Parsers allow us to specify the document to be parsed in several useful ways: as a java.io.Reader, a String or a char[].
Let's say we have a document in our classpath and we want to parse it using our recently created
handler in order to count the number of standalone elements it contains. As our handler
is an ISimpleMarkupHandler, we will use a SimpleMarkupParser. For the sake
of simplicity, we will ignore the try..finally code required to adequately close
the streams:
    final InputStream is =
            Thread.currentThread().getContextClassLoader().getResourceAsStream(fileName);

    // We know our file's encoding is ISO-8859-1, and we need that info to create a Reader
    final Reader reader = new BufferedReader(new InputStreamReader(is, "ISO-8859-1"));

    // Parsers are thread-safe and reusable. We choose the HTML configuration here.
    final ISimpleMarkupParser parser = new SimpleMarkupParser(ParseConfiguration.htmlConfiguration());

    final StandaloneCountingHandler handler = new StandaloneCountingHandler();
    parser.parse(reader, handler);

    final int standaloneCount = handler.getStandaloneCount();
And we are done! Had we wanted to require the document to be well-formed XML instead, we would simply have created the parser with ParseConfiguration.xmlConfiguration().
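The example above skips the code that closes the streams. One way to handle that part is try-with-resources, sketched below with the JDK only. ReaderClosingSketch, ReaderAction, withReader and readAll are hypothetical names introduced for this illustration; the ReaderAction callback marks the place where a call such as parser.parse(reader, handler) would go, and here it simply reads the characters so that the sketch runs without the library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ReaderClosingSketch {

    // Hypothetical callback standing in for the parsing step
    interface ReaderAction {
        void accept(Reader reader) throws IOException;
    }

    // Opens a Reader with the known encoding, runs the action, and always closes.
    static void withReader(final InputStream is, final ReaderAction action) {
        // try-with-resources closes the reader, and with it the wrapped stream,
        // whether the action returns normally or throws
        try (Reader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.ISO_8859_1))) {
            action.accept(reader);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo action: drain the reader into a String instead of parsing it
    static String readAll(final InputStream is) {
        final StringBuilder sb = new StringBuilder();
        withReader(is, reader -> {
            final char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        });
        return sb.toString();
    }

    public static void main(final String[] args) {
        // 0xE9 is "e with acute accent" in ISO-8859-1
        final byte[] bytes = {'c', 'a', 'f', (byte) 0xE9};
        System.out.println(readAll(new ByteArrayInputStream(bytes)).length());  // 4
    }
}
```

Using the StandardCharsets constant instead of the "ISO-8859-1" string also avoids the checked UnsupportedEncodingException that the String-based InputStreamReader constructor declares.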
In addition to its main SAX-style parsing capabilities, attoparser offers a DOM-style interface that lets us handle a document as an attoparser DOM tree (classes in the org.attoparser.dom package).
Using it is easy: we just need the convenience DOMMarkupParser, which returns a Document object directly:
final Reader reader = ...;
final IDOMMarkupParser parser = new DOMMarkupParser(ParseConfiguration.htmlConfiguration());
// Parse the document into an attoparser DOM tree
final Document doc = parser.parse("My document", reader);
// Navigate the tree using the typed-child accessors
final DocType docType = doc.getFirstChildOfType(DocType.class);
final Element rootElement = doc.getFirstChildOfType(Element.class);
final List<Element> children = rootElement.getChildrenOfType(Element.class);
...
attoparser provides, out-of-the-box, a utility capable of writing an attoparser DOM tree as markup code again. It's the DOMWriter class:
    final Document doc = ...;

    // Modify our document if we wish
    ...

    final StringWriter stringWriter = new StringWriter();

    // Execute the writer (DOMWriter exposes static write(...) methods)
    DOMWriter.write(doc, stringWriter);

    // Obtain the resulting markup
    final String markup = stringWriter.toString();