Overview (attoparser 2.0.6.RELEASE API)

Packages
Package	Description
org.attoparser	Main parser and handler artifacts: basic interfaces and implementations.
org.attoparser.config	Parser configuration artifacts.
org.attoparser.discard	Handlers for discarding markup.
org.attoparser.dom	Handlers for creating DOM trees as a result of parsing.
org.attoparser.duplicate	Handlers for duplicating events between more than one handler.
org.attoparser.minimize	Handlers for minimizing (compacting) HTML markup.
org.attoparser.output	Handlers for outputting markup as a result of parsing.
org.attoparser.prettyhtml	Handlers for creating a pretty-HTML representation of parsing events.
org.attoparser.select	Handlers for filtering a part or several parts of markup during parsing in a fast and efficient way.
org.attoparser.simple	Artifacts for parsing using a simplified version of the handler interfaces.
org.attoparser.trace	Handlers for creating traces of parsing events (for testing/debugging).
org.attoparser.util	Utility classes.

attoparser is a Java parser for XML and HTML markup.

Main features

The main features of attoparser are:

Fast, lightweight and easy to use.
Supports and understands both XML and HTML (including HTML5).
Powerful API. Does not implement the official SAX or DOM standard XML APIs. On purpose.
Event-based (SAX-style): uses markup handler objects to process parsing events. Handlers can be (and often are) chained to achieve the desired final result.
Although it is event-based, it offers an out-of-the-box handler that turns events into a DOM-style object tree.
Does not perform any DTD / XSD validation, namespace processing, entity resolution or escaping / unescaping operations. All of this on purpose, too.
Allows ill-formed markup (XML or HTML) if configured to do so.
Performs auto-balancing of tags if configured to do so, in both XML and HTML parsing modes (following the HTML5 specification when in HTML mode).
Zero loss parsing. Does not lose any information during parsing (keyword case, attribute quoting...), so that the exact original markup can be reconstructed at the handler layer.
Can perform fast fragment selection operations during parsing, based on powerful markup selection expressions like //div/p#content.
Loaded with other useful goodies like HTML minimization, event trace building or pretty-HTML reporting.

How to use it

Using attoparser can be as simple as:


  // Obtain a java.io.Reader on the document to be parsed
  final Reader documentReader = ...;

  // Create the handler instance. Extending the no-op AbstractMarkupHandler is a good start
  final IMarkupHandler handler = new AbstractMarkupHandler() {
      ... // some events implemented
  };

  // Create or obtain the parser instance (can be reused). Example uses the default configuration for HTML
  final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());

  // Parse it!
  parser.parse(documentReader, handler);
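
The "extend a no-op base class and implement only some events" idea used above can be illustrated with a minimal, library-independent sketch. Every type in it (SketchHandler, AbstractSketchHandler, NoOpHandlerSketch) is hypothetical and deliberately much simpler than attoparser's real IMarkupHandler and AbstractMarkupHandler; the hard-coded event sequence stands in for a parser. It only shows the pattern, not the library's actual method signatures.

```java
// Library-independent sketch: all types here are hypothetical, NOT attoparser's API.
class NoOpHandlerSketch {

    // Counts the open-element events in a hard-coded event sequence.
    static int countOpenElements() {
        final int[] count = {0};

        // Override only the event of interest; every other event stays a no-op.
        final SketchHandler handler = new AbstractSketchHandler() {
            @Override
            public void onOpenElement(final String name) {
                count[0]++;
            }
        };

        // Stand-in for a parser: the events for <div><p>hi</p></div>.
        handler.onOpenElement("div");
        handler.onOpenElement("p");
        handler.onText("hi");
        handler.onCloseElement("p");
        handler.onCloseElement("div");

        return count[0];
    }

    public static void main(final String[] args) {
        System.out.println(countOpenElements()); // prints 2
    }
}

// Hypothetical, simplified event interface.
interface SketchHandler {
    void onOpenElement(String name);
    void onCloseElement(String name);
    void onText(String text);
}

// No-op base class: subclasses implement only the events they care about.
abstract class AbstractSketchHandler implements SketchHandler {
    @Override public void onOpenElement(final String name) { }
    @Override public void onCloseElement(final String name) { }
    @Override public void onText(final String text) { }
}
```

With a real parser, the anonymous subclass would be passed to the parse call instead of being driven by hand.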

A more complex example: say you want to extract only the <div> elements with class "content" from an HTML file and write them to another file. You will need a BlockSelectorMarkupHandler instance to do the selection, chained to an OutputMarkupHandler that writes the selected markup somewhere:


    // Obtain a java.io.Reader on the document to be parsed
    final Reader documentReader = ...;

    // Obtain a java.io.Writer on the resource you want the results to be written to
    final Writer documentWriter = ...;

    // Last step of the chain will be the OutputMarkupHandler, which will write events as markup to the writer
    final OutputMarkupHandler outputHandler = new OutputMarkupHandler(documentWriter);

    // Before outputting, we need to select those divs by means of a "markup selector expression", so we chain it
    final BlockSelectorMarkupHandler selectorHandler = new BlockSelectorMarkupHandler(outputHandler, "div.content");

    // Create or obtain the parser instance (can be reused). We will use the default configuration for HTML
    final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());

    // Parse it!
    parser.parse(documentReader, selectorHandler);
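
The chaining in the example above (a selecting handler that forwards events to an output handler) can be pictured with a small, library-independent sketch. All types in it (ChainHandler, WritingHandler, ElementFilterHandler, HandlerChainSketch) are hypothetical stand-ins, not attoparser classes: the filter matches on a plain element name rather than a markup selector expression, and it assumes the incoming events are balanced.

```java
// Library-independent sketch: all types here are hypothetical, NOT attoparser's API.
class HandlerChainSketch {

    // Feeds a fixed event sequence through filter -> writer and returns the output.
    static String run() {
        final StringBuilder out = new StringBuilder();

        // Last step of the chain: writes events back as markup.
        final ChainHandler writer = new WritingHandler(out);
        // First step of the chain: forwards only events inside <p> elements.
        final ChainHandler filter = new ElementFilterHandler(writer, "p");

        // Stand-in for a parser: the events for <div><p>kept</p>dropped</div>.
        filter.onOpenElement("div");
        filter.onOpenElement("p");
        filter.onText("kept");
        filter.onCloseElement("p");
        filter.onText("dropped");
        filter.onCloseElement("div");

        return out.toString();
    }

    public static void main(final String[] args) {
        System.out.println(run()); // prints <p>kept</p>
    }
}

// Hypothetical, simplified event interface.
interface ChainHandler {
    void onOpenElement(String name);
    void onCloseElement(String name);
    void onText(String text);
}

// Terminal handler: serializes each event into the given buffer.
class WritingHandler implements ChainHandler {
    private final StringBuilder out;

    WritingHandler(final StringBuilder out) {
        this.out = out;
    }

    @Override public void onOpenElement(final String name) {
        out.append('<').append(name).append('>');
    }

    @Override public void onCloseElement(final String name) {
        out.append("</").append(name).append('>');
    }

    @Override public void onText(final String text) {
        out.append(text);
    }
}

// Intermediate handler: delegates to the next handler only while inside a selected element.
class ElementFilterHandler implements ChainHandler {
    private final ChainHandler next;
    private final String selectedName;
    private int depth = 0; // greater than 0 while inside a selected element

    ElementFilterHandler(final ChainHandler next, final String selectedName) {
        this.next = next;
        this.selectedName = selectedName;
    }

    @Override public void onOpenElement(final String name) {
        if (depth > 0 || name.equals(selectedName)) {
            depth++;
            next.onOpenElement(name);
        }
    }

    @Override public void onCloseElement(final String name) {
        if (depth > 0) {
            next.onCloseElement(name);
            depth--;
        }
    }

    @Override public void onText(final String text) {
        if (depth > 0) {
            next.onText(text);
        }
    }
}
```

Because each handler only knows the next one through the shared interface, further steps can be inserted into the chain without changing the existing ones.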

Where to start

The best place to start learning about attoparser from these docs is the IMarkupParser interface and, especially, the IMarkupHandler interface.