
attoparser 2.0.6.RELEASE API

attoparser is a Java parser for XML and HTML markup.


Packages

Package                   Description
org.attoparser            Main parser and handler artifacts: basic interfaces and implementations.
org.attoparser.config     Parser configuration artifacts.
org.attoparser.discard    Handlers for discarding markup.
org.attoparser.dom        Handlers for creating DOM trees as a result of parsing.
org.attoparser.duplicate  Handlers for duplicating events between more than one handler.
org.attoparser.minimize   Handlers for minimizing (compacting) HTML markup.
org.attoparser.output     Handlers for outputting markup as a result of parsing.
org.attoparser.prettyhtml Handlers for creating a pretty-HTML representation of parsing events.
org.attoparser.select     Handlers for filtering a part or several parts of markup during parsing in a fast and efficient way.
org.attoparser.simple     Artifacts for parsing using a simplified version of the handler interfaces.
org.attoparser.trace      Handlers for creating traces of parsing events (for testing/debugging).
org.attoparser.util       Utility classes.


Main features

The main features of attoparser are:

How to use it

Using attoparser can be as simple as:


  // Obtain a java.io.Reader on the document to be parsed
  final Reader documentReader = ...;

  // Create the handler instance. Extending the no-op AbstractMarkupHandler is a good start
  final IMarkupHandler handler = new AbstractMarkupHandler() {
      ... // some events implemented
  };

  // Create or obtain the parser instance (can be reused). Example uses the default configuration for HTML
  final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());

  // Parse it!
  parser.parse(documentReader, handler);

For a more complex example, suppose you want to extract only the <div> elements with class "content" from an HTML file and write them to another file. You will need a BlockSelectorMarkupHandler instance to do the selection, and an OutputMarkupHandler chained after it to write the selected markup to the destination:


    // Obtain a java.io.Reader on the document to be parsed
    final Reader documentReader = ...;

    // Obtain a java.io.Writer on the resource you want the results to be written to
    final Writer documentWriter = ...;

    // The last step of the chain is the OutputMarkupHandler, which writes events as markup to the writer
    final OutputMarkupHandler outputHandler = new OutputMarkupHandler(documentWriter);

    // Before outputting, we need to select those divs by means of a "markup selector expression", so we chain it
    final BlockSelectorMarkupHandler selectorHandler = new BlockSelectorMarkupHandler(outputHandler, "div.content");

    // Create or obtain the parser instance (can be reused). We will use the default configuration for HTML
    final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());

    // Parse it!
    parser.parse(documentReader, selectorHandler);
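
The chaining idea used above (a selecting handler that forwards only some events to an output handler) can be illustrated without the library. The following is a minimal, self-contained sketch under stated assumptions: SketchHandler, SketchOutputHandler, SketchBlockSelector and HandlerChainSketch are hypothetical names invented for this illustration and are not part of the attoparser API, the event interface is far simpler than the real handler interfaces, and the events are fired by hand instead of by a parser.

```java
// Hypothetical, simplified event interface (NOT attoparser's IMarkupHandler).
interface SketchHandler {
    void open(String name, String cssClass);
    void text(String content);
    void close(String name);
}

// Last link of the chain: writes every event it receives back out as markup.
final class SketchOutputHandler implements SketchHandler {
    private final StringBuilder out;

    SketchOutputHandler(StringBuilder out) {
        this.out = out;
    }

    public void open(String name, String cssClass) {
        out.append('<').append(name);
        if (cssClass != null) {
            out.append(" class=\"").append(cssClass).append('"');
        }
        out.append('>');
    }

    public void text(String content) {
        out.append(content);
    }

    public void close(String name) {
        out.append("</").append(name).append('>');
    }
}

// Middle link: forwards events to the next handler only while inside a matching block.
final class SketchBlockSelector implements SketchHandler {
    private final SketchHandler next;
    private final String name;
    private final String cssClass;
    private int depth = 0; // greater than 0 while inside a selected block

    SketchBlockSelector(SketchHandler next, String name, String cssClass) {
        this.next = next;
        this.name = name;
        this.cssClass = cssClass;
    }

    public void open(String n, String c) {
        // Start forwarding at a matching element; keep forwarding for everything nested in it.
        if (depth > 0 || (name.equals(n) && cssClass.equals(c))) {
            depth++;
            next.open(n, c);
        }
    }

    public void text(String content) {
        if (depth > 0) {
            next.text(content);
        }
    }

    public void close(String n) {
        if (depth > 0) {
            next.close(n);
            depth--;
        }
    }
}

final class HandlerChainSketch {
    static String demo() {
        StringBuilder out = new StringBuilder();
        SketchHandler chain =
                new SketchBlockSelector(new SketchOutputHandler(out), "div", "content");
        // Hand-written events standing in for what a parser might fire for:
        // <body><p>skip</p><div class="content"><b>keep</b></div></body>
        chain.open("body", null);
        chain.open("p", null);
        chain.text("skip");
        chain.close("p");
        chain.open("div", "content");
        chain.open("b", null);
        chain.text("keep");
        chain.close("b");
        chain.close("div");
        chain.close("body");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the selector only decides what to forward and the output handler only decides how to write, so each link can be swapped or reused independently. That separation is the design point the chained example above relies on.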

Where to start

The best way to start learning about attoparser from these docs is to have a look at the IMarkupParser and, especially, the IMarkupHandler interfaces.


Copyright © 2022 The ATTOPARSER team. All rights reserved.