Package | Description |
---|---|
org.attoparser | Main parser and handler artifacts: basic interfaces and implementations. |
org.attoparser.config | Parser configuration artifacts. |
org.attoparser.discard | Handlers for discarding markup. |
org.attoparser.dom | Handlers for creating DOM trees as a result of parsing. |
org.attoparser.duplicate | Handlers for duplicating events between more than one handler. |
org.attoparser.minimize | Handlers for minimizing (compacting) HTML markup. |
org.attoparser.output | Handlers for outputting markup as a result of parsing. |
org.attoparser.prettyhtml | Handlers for creating a pretty-HTML representation of parsing events. |
org.attoparser.select | Handlers for filtering a part or several parts of markup during parsing in a fast and efficient way. |
org.attoparser.simple | Artifacts for parsing using a simplified version of the handler interfaces. |
org.attoparser.trace | Handlers for creating traces of parsing events (for testing/debugging). |
org.attoparser.util | Utility classes. |
attoparser is a Java parser for XML and HTML markup.
The main features of attoparser are:
Using attoparser can be as simple as:
// Obtain a java.io.Reader on the document to be parsed
final Reader documentReader = ...;
// Create the handler instance. Extending the no-op AbstractMarkupHandler is a good start
final IMarkupHandler handler = new AbstractMarkupHandler() {
... // some events implemented
};
// Create or obtain the parser instance (can be reused). Example uses the default configuration for HTML
final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());
// Parse it!
parser.parse(documentReader, handler);
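The example above leaves open how the java.io.Reader is obtained. As a minimal sketch using only the JDK, the helper below shows two common sources: markup already held in memory and a file on disk. The class name DocumentReaders, its method names, and the UTF-8 charset are assumptions made for illustration; none of them are part of attoparser.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of attoparser) for obtaining document readers. */
public final class DocumentReaders {

    private DocumentReaders() {
        // static helpers only
    }

    /** Reader over markup that is already in memory. */
    public static Reader fromString(final String markup) {
        return new StringReader(markup);
    }

    /** Buffered reader over a file. UTF-8 is an assumption: use the document's real encoding. */
    public static Reader fromFile(final Path path) {
        try {
            return Files.newBufferedReader(path, StandardCharsets.UTF_8);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains and closes a reader; included only to show what a reader delivers. */
    public static String drain(final Reader reader) {
        try (Reader r = reader) {
            final StringBuilder sb = new StringBuilder();
            final char[] buf = new char[1024];
            for (int n = r.read(buf); n != -1; n = r.read(buf)) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Either reader could then be passed as documentReader to parser.parse(...). Opening it in a try-with-resources block is a safe default so that the underlying resource is released once parsing finishes.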
A more complex example: say you want to extract only the <div> elements with class "content" from an HTML file and write them to another file. You will need a BlockSelectorMarkupHandler instance to do the selection, and an OutputMarkupHandler chained after it to write the selected markup somewhere:
// Obtain a java.io.Reader on the document to be parsed
final Reader documentReader = ...;
// Obtain a java.io.Writer on the resource you want the results to be written to
final Writer documentWriter = ...;
// Last step of the chain will be the OutputMarkupHandler, which will write events as markup to the writer
final OutputMarkupHandler outputHandler = new OutputMarkupHandler(documentWriter);
// Before outputting, we need to select those divs by means of a "markup selector expression", so we chain a selector handler in front of the output handler
final BlockSelectorMarkupHandler selectorHandler = new BlockSelectorMarkupHandler(outputHandler, "div.content");
// Create or obtain the parser instance (can be reused). We will use the default configuration for HTML
final IMarkupParser parser = new MarkupParser(ParseConfiguration.htmlConfiguration());
// Parse it!
parser.parse(documentReader, selectorHandler);
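To make the chaining idea concrete, here is a conceptual sketch in plain Java of a selector handler that forwards events to the next handler only while it is inside a matching block. Everything in it is a hypothetical, heavily simplified stand-in: EventHandler, OutputHandler and BlockSelector are not attoparser types, events carry Strings for readability, and matching is a plain name-and-class comparison rather than the markup selector expression that BlockSelectorMarkupHandler accepts.

```java
public final class HandlerChainSketch {

    /** Hypothetical, simplified event handler (NOT attoparser's IMarkupHandler). */
    interface EventHandler {
        void openElement(String name, String cssClass);
        void text(String text);
        void closeElement(String name);
    }

    /** Last link of the chain: writes every event it receives back out as markup. */
    static final class OutputHandler implements EventHandler {
        private final StringBuilder out;

        OutputHandler(final StringBuilder out) {
            this.out = out;
        }

        public void openElement(final String name, final String cssClass) {
            out.append('<').append(name);
            if (cssClass != null) {
                out.append(" class=\"").append(cssClass).append('"');
            }
            out.append('>');
        }

        public void text(final String text) {
            out.append(text);
        }

        public void closeElement(final String name) {
            out.append("</").append(name).append('>');
        }
    }

    /** Middle link: forwards events only while inside a selected element block. */
    static final class BlockSelector implements EventHandler {
        private final EventHandler next;
        private final String name;
        private final String cssClass;
        private int depth = 0; // 0 means "outside any selected block"

        BlockSelector(final EventHandler next, final String name, final String cssClass) {
            this.next = next;
            this.name = name;
            this.cssClass = cssClass;
        }

        public void openElement(final String n, final String c) {
            // Forward if already inside a selected block, or if this element starts one
            if (depth > 0 || (name.equals(n) && cssClass.equals(c))) {
                depth++;
                next.openElement(n, c);
            }
        }

        public void text(final String t) {
            if (depth > 0) {
                next.text(t);
            }
        }

        public void closeElement(final String n) {
            if (depth > 0) {
                next.closeElement(n);
                depth--;
            }
        }
    }

    /** Feeds the events of <p>skip</p><div class="content"><b>keep</b></div> through the chain. */
    static String demo() {
        final StringBuilder out = new StringBuilder();
        final EventHandler chain = new BlockSelector(new OutputHandler(out), "div", "content");
        chain.openElement("p", null);
        chain.text("skip");
        chain.closeElement("p");
        chain.openElement("div", "content");
        chain.openElement("b", null);
        chain.text("keep");
        chain.closeElement("b");
        chain.closeElement("div");
        return out.toString();
    }
}
```

In this sketch the parser only ever talks to the first handler in the chain, and each handler decides which events reach the next one. That is why the order of construction in the example above runs backwards, from the output handler to the selector.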
The best place to start learning about attoparser from these docs is the IMarkupParser interface and, especially, the IMarkupHandler interface.
Copyright © 2022 The ATTOPARSER team. All rights reserved.