A parser for HTML5 based on the official W3C specification.

Given the following HTML source text:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>My test page</title>
</head>
<body>
<img src="images/firefox-icon.png" alt="My test image">
</body>
</html>

we can use this code to parse the HTML source into an HtmlNode list:

let sourceText = "<html>...</html>"
let nodes: HtmlNode list = HtmlUtils.parseDoc sourceText

Here nodes is an HtmlNode list. The doctype is a string extracted from the doctype tag. The HtmlNode type is:

type HtmlNode =
    | HtmlElement of
        name: string *
        attributes: list<string * string> *
        elements: HtmlNode list
    | HtmlComment of string
    | HtmlCData of string
    | HtmlText of string
    | HtmlDoctype of string
    | HtmlWS of string

All parsing stages in the package are public, and you are free to compose them to meet your own requirements. The parser is highly configurable; see the source code of HtmlUtils:

module FSharp.HTML.HtmlUtils

let parseDoc (txt: string) =
    txt
    |> HtmlCompiler.compileText
    |> Whitespace.trimWhitespace
    |> List.map CharacterReference.processCharRefs

You can parse a string through the functions in the HtmlUtils module.

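Because the result is an ordinary F# list of a discriminated union, your own post-processing steps can be composed after parsing with plain pattern matching. The following is only a minimal sketch: it redeclares the HtmlNode type locally so that it compiles without the package, the helpers stripNoise and attributeValues are hypothetical names that are not part of the library, and the sample tree is built by hand, so the real output of parseDoc may look different.

```fsharp
// Local copy of the HtmlNode type from this README, so the sketch stands alone.
type HtmlNode =
    | HtmlElement of name: string * attributes: list<string * string> * elements: HtmlNode list
    | HtmlComment of string
    | HtmlCData of string
    | HtmlText of string
    | HtmlDoctype of string
    | HtmlWS of string

// Hypothetical helper: drop whitespace and comment nodes at every depth.
let rec stripNoise (nodes: HtmlNode list) : HtmlNode list =
    nodes
    |> List.choose (fun node ->
        match node with
        | HtmlWS _
        | HtmlComment _ -> None
        | HtmlElement (name, attrs, children) ->
            Some (HtmlElement (name, attrs, stripNoise children))
        | other -> Some other)

// Hypothetical helper: collect the value of `attr` from every element named `tag`.
let rec attributeValues (tag: string) (attr: string) (nodes: HtmlNode list) : string list =
    nodes
    |> List.collect (fun node ->
        match node with
        | HtmlElement (name, attrs, children) ->
            let own =
                if name = tag then
                    attrs
                    |> List.filter (fun (key, _) -> key = attr)
                    |> List.map snd
                else
                    []
            own @ attributeValues tag attr children
        | _ -> [])

// Hand-built tree standing in for parser output (an assumption, not real parseDoc output).
let img =
    HtmlElement ("img", [ ("src", "images/firefox-icon.png"); ("alt", "My test image") ], [])

let body = HtmlElement ("body", [], [ HtmlComment " demo "; img ])

let sample = [ HtmlDoctype "html"; HtmlElement ("html", [], [ HtmlWS "\n"; body ]) ]

// Compose the two steps the same way the library composes its own stages.
let sources = sample |> stripNoise |> attributeValues "img" "src"

printfn "%A" sources
```

With the package referenced, the same helpers could be piped after HtmlUtils.parseDoc instead of the hand-built sample, provided the library's HtmlNode type is used in place of the local copy.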
EncodeUtils