home / 2020.12.10 12:00 / scala / html / tree / epub / javascript
This summer, I worked on developing a web-based comic-book reading application, and once everything was running well with comic books, I decided to extend the functionality to reading e-books. This proved to be a more complicated problem. The format I focused on was EPUB, which is an archive containing html/xhtml files. To create a good e-book reading experience, I needed to split the book into pages, but how much fits into a page depends on a lot of things: the size of the viewport, font sizes, margins; these constraints mean we have to split the book into pages directly on the UI side. But in this first implementation, I still needed to parse the book on the server side as well, in order to compute the number of positions in the book.
A position, in this case, would be one character. To compute the number of positions in a book stored as html files, I would only need to extract the text (without tags) from the html file and compute its size. Html-parsing libraries for Java and the Javascript DOM parser in the browser both have this functionality. The problem, however, is that they never give the same results. This meant that the book size computed on the server was almost always different from the size computed on the UI, which translated to the UI reporting that reading progress had reached a certain position that was completely meaningless relative to the positions recorded on the server. And this is completely ignoring more complex cases where, for the logic in the page computation algorithm, certain elements like images and tables should be considered as occupying a single position. To elaborate on that, when deciding how much content fits into a page, from position X to position Y, we can't split an image or a table between two pages, so these elements must be considered as a single position. That image or table either fits in the current page, or it will be displayed in the next page.
For the original implementation I found a hacky way to sort of make this work, with the progress in the book being just an approximation, but it didn't really work well. I couldn't really let this go, especially now that the reading season is upon us (December holidays), so I went back and rewrote everything related to EPUB parsing. The necessary approach was clear: to get the same results on both the backend and the UI, I had to write my own parser, implemented in both Scala and Javascript. These parser implementations would have to treat spaces the same way, compute positions in book sections the same way, and treat special one-position elements like images and tables the same way.
A position in the book represents one displayable element in that book. In most cases, this is a character, a letter or a space. In some special cases, this can be a more complex element, like an image. Positions are important in our algorithm because they are used to split the book into pages. Every page in the book has a start position and an end position. When computing a page, we start with the element at the start position and keep adding elements to the page until no more elements fit in the page. If those elements are characters, we just keep adding characters to the page until the page is filled. If those elements take more space, like images, fewer of them will fit in a page. A page may fit 1000 characters, or 500 characters and an image.
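As an illustration of this position model, here is a small JavaScript sketch (hypothetical code, not from the project) that counts positions for a flat list of elements, where text contributes one position per character and special elements contribute a single position each:

```javascript
// Hypothetical illustration: text nodes contribute one position per
// character, while special elements like images count as a single position.
function countPositions(nodes) {
  let total = 0;
  for (const node of nodes) {
    if (node.type === "text") total += node.content.length;
    else total += 1; // image, table row, etc. occupy a single position
  }
  return total;
}

// "Hello" (5) + one image (1) + " world" (6) = 12 positions
console.log(countPositions([
  { type: "text", content: "Hello" },
  { type: "img" },
  { type: "text", content: " world" },
])); // 12
```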
An ideal parsing algorithm, one that would put us in the best position to deliver both performance and accuracy when parsing e-books, would need the functionality detailed in the rest of this article.
The data structure for this algorithm will look as follows:
class BookNode {

  @BeanProperty
  var name: String = _

  @JsonBackReference
  @BeanProperty
  var parent: BookNode = _

  var children: Seq[BookNode] = Seq()

  @BeanProperty
  var content: String = _

  @BeanProperty
  var start: Long = _

  @BeanProperty
  var end: Long = _

  // ...
}
The name field holds the element type. content holds the actual text for text elements, but the entire opening tag, with all attributes, for other nodes. start and end represent the start and end positions of the node. Every node has a reference to its parent and may contain children.
To parse an HTML string we have a companion object for the BookNode class with the main parsing code:
object BookNode {

  // ...

  private def parseBody(body: String,
                        entrancePosition: Long): Option[BookNode] = {
    val bodyNode = new BookNode("body", "")
    var current: BookNode = bodyNode
    var content = ""

    for (i <- 0 until body.length) {
      val c = body(i)

      if (c == '<') {
        // starting a new tag
        // save what we have in content
        if (isTag(content)) throw new Throwable("this should not happen")
        else {
          // can only be a text node or nothing
          if (content.length > 0) {
            current.addChild(new BookNode("text", content))
            content = ""
          }
        }
      }

      // accumulate content
      content += c

      if (c == '>') {
        // ending a tag
        if (isTag(content)) {
          val name = getTagName(content)
          if (isComment(content)) {
            // ignoring comment
          } else if (isEndTag(content)) {
            // we check that this tag closes the current node correctly
            if (isVoidElement(name)) {
              // the last child should have the correct name
              if (name != current.children.last.name) {
                throw new Throwable("incompatible end for void tag")
              } else {
                current.children.last.content += content
              }
            } else {
              // the current node should have the correct name, and it is
              // getting closed
              if (name != current.name) {
                throw new Throwable("incompatible end tag " + current.name)
              }
              // move current node up
              current = current.parent
            }
          } else if (isBothTag(content) || isVoidElement(name)) {
            // just add this tag without content
            current.addChild(new BookNode(name, content))
          } else {
            // a start tag
            val newNode = new BookNode(name, content)
            current.addChild(newNode)
            current = newNode
          }
          // reset content
          content = ""
        } else throw new Throwable("wild > encountered")
      }
    }

    // add the last text node, if there is still such a thing remaining
    if (content.nonEmpty) {
      if (isTag(content)) throw new Throwable("this should not happen")
      else {
        // can only be a text node or nothing
        current.addChild(new BookNode("text", content))
      }
    }

    bodyNode.collapseLeaves()
    bodyNode.updatePositions(entrancePosition)
    Some(bodyNode)
  }

  // ...
}
The parseBody method receives the HTML body contents as a string, and a start position. This start position is the start position of the book section, and it is used to compute absolute positions in the entire book. The algorithm starts by creating the book section root node of type body, which it uses as the current node. It then goes over the text iteratively, performing the following actions:
- when it encounters a < character, it is starting a new tag; if there is accumulated content, which at this point can only be a text node, it creates a leaf node for that text content and adds it as a child of the current node;
- when it encounters a > character, it is ending a tag: comments are ignored, an end tag closes the current node (moving current back up to its parent, or appending the end tag to a void last child), a self-closing or void tag is added as a child with no content of its own, and a start tag creates a new node that is added as a child and becomes the current node.

There is also a special case to handle remaining text content at the end of an HTML file, followed by some cleanup work and computing the positions in this tree.
class BookNode {

  // ...

  private def collapseLeaves(): Unit = {
    if (shouldBeLeafElement(this.name) && this.children.nonEmpty) {
      // extract content from children
      this.content = this.extractContent()
      this.children = Seq()
    } else {
      this.children.foreach(_.collapseLeaves())
    }
  }

  // ...
}
The collapseLeaves method goes over the tree recursively and transforms some structures into leaf elements. This is done to ensure that some structures in the HTML document are treated as a single position when pagination is done. In the current implementation of this algorithm, images IMG and table rows TR are the structures considered single-position for pagination. This allows us to split a long table between pages, but not clip a row of that table.
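A simplified JavaScript sketch of this collapsing step (assumed names, with img and tr hardcoded as the leaf elements) could look like this:

```javascript
// Serialize a subtree back to HTML: text nodes hold plain text, other
// nodes hold their opening tag in `content` and need a closing tag.
function serialize(node) {
  if (node.name === "text") return node.content;
  return node.content + node.children.map(serialize).join("")
    + "</" + node.name + ">";
}

// Collapse single-position structures (images, table rows) into leaves
// whose content is the full markup of the subtree.
function collapseLeaves(node) {
  if ((node.name === "img" || node.name === "tr") && node.children.length > 0) {
    node.content = serialize(node); // keep the whole row as one blob
    node.children = [];
  } else {
    node.children.forEach(collapseLeaves);
  }
}

const row = {
  name: "tr", content: "<tr>",
  children: [{ name: "td", content: "<td>",
               children: [{ name: "text", content: "cell", children: [] }] }],
};
collapseLeaves(row);
console.log(row.content); // "<tr><td>cell</td></tr>"
console.log(row.children.length); // 0
```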
class BookNode {

  // ...

  private def updatePositions(entrancePosition: Long): Unit = {
    var position = entrancePosition
    this.start = position
    if (this.name == "text") {
      this.end = this.start + this.content.length - 1
    } else if (shouldBeLeafElement(this.name)) {
      // occupies a single position
      this.end = this.start
    } else if (this.children.isEmpty) {
      // an element without children, maybe used for spacing, should occupy
      // a single position
      this.end = this.start
    } else {
      // compute for children and update
      for (i <- this.children.indices) {
        val child = this.children(i)
        child.updatePositions(position)
        position = child.end + 1
      }
      this.end = this.children.last.end
    }
  }

  // ...
}
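Since the same position rules must hold on the UI side, a simplified JavaScript mirror of this method might look as follows (a sketch with assumed names, not the actual UI code):

```javascript
// Minimal sketch of the position rules: text spans one position per
// character; single-position and childless elements span one position;
// other nodes derive their range from their children.
class Node {
  constructor(name, content = "", children = []) {
    this.name = name;
    this.content = content;
    this.children = children;
    this.start = 0;
    this.end = 0;
  }
  updatePositions(entrancePosition) {
    this.start = entrancePosition;
    if (this.name === "text") {
      this.end = this.start + this.content.length - 1;
    } else if (this.name === "img" || this.name === "tr"
               || this.children.length === 0) {
      this.end = this.start; // single-position elements
    } else {
      let position = entrancePosition;
      for (const child of this.children) {
        child.updatePositions(position);
        position = child.end + 1;
      }
      this.end = this.children[this.children.length - 1].end;
    }
  }
}

const body = new Node("body", "", [
  new Node("p", "<p>", [new Node("text", "Hello")]),
  new Node("img", "<img/>"),
]);
body.updatePositions(0);
// "Hello" occupies positions 0..4, the image occupies position 5
console.log(body.end); // 5
```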
The updatePositions method also recursively traverses the tree and updates the start and end fields of each node as follows:

- for every node, the start position is the entrancePosition it receives;
- for text nodes, the end position is the start position plus the content length minus one;
- for non-text leaf nodes, and for nodes without children, the end position is equal to the start position, since these nodes are treated as taking a single position;
- for all other nodes, the end position is obtained by recursively computing the start and end positions of their children, with every child starting at the previous child's end position plus one; the node's end position is then the last child's end position.

As a final piece of code to detail in this section, the utility methods that determine the tag type, extract the tag name, and pull information out of tags are implemented using regex:
object BookNode {

  private val VOID_ELEMENTS = Seq("area", "base", "br", "col", "hr", "img",
    "input", "link", "meta", "param", "keygen", "source")
  private val LEAF_ELEMENTS = Seq("img", "tr")

  private def isTag(str: String) = "^</?[^>]+>$".r matches str

  private def isComment(str: String) = "^<!\\-\\-[^>]+\\-\\->$".r matches str

  private def isEndTag(str: String) = "^</[^>]+>$".r matches str

  private def isBothTag(str: String) = "^<[^>/]+/>$".r matches str

  private def getTagName(str: String) = {
    "</?([^>\\s]+)".r findFirstMatchIn str match {
      case Some(m) => m.group(1)
      case None => null
    }
  }

  private def isVoidElement(tagName: String) =
    VOID_ELEMENTS.contains(tagName.toLowerCase())

  private def shouldBeLeafElement(tagName: String) =
    LEAF_ELEMENTS.contains(tagName.toLowerCase())

  // ...
}
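For reference, possible JavaScript equivalents of these helpers, using the same regular expressions, might look like this (a sketch, not necessarily the project's UI code):

```javascript
// Tag-classification helpers mirroring the Scala regex utilities.
const isTag = (s) => /^<\/?[^>]+>$/.test(s);
const isComment = (s) => /^<!--[^>]+-->$/.test(s);
const isEndTag = (s) => /^<\/[^>]+>$/.test(s);
const isBothTag = (s) => /^<[^>\/]+\/>$/.test(s); // self-closing tag
const getTagName = (s) => {
  const m = s.match(/<\/?([^>\s]+)/);
  return m ? m[1] : null;
};

console.log(isTag('<p class="x">')); // true
console.log(isEndTag("</p>")); // true
console.log(isBothTag("<br/>")); // true
console.log(getTagName("</div>")); // "div"
```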
Now that we have our book in the correct structure, we need two more functions to build our page computing algorithm: find the next space after a given position and extract book content between given positions.
class BookNode {

  // ...

  def findSpaceAfter(position: Long): Long = {
    val spacePattern = "\\s".r
    // first get leaf at position
    var leaf = leafAtPosition(position)
    // for a text node, next space may be in the text node, next space
    // character after position
    // if other kind of node, next space is the start of next leaf
    if (leaf != null && leaf.end == position) {
      // we need to look in the next node
      leaf = leaf.nextLeaf()
    }
    if (leaf != null && leaf.name == "text") {
      val searchStartPosition = if (position - leaf.start + 1 > 0) {
        position - leaf.start + 1
      } else 0
      val searchContent = leaf.content.substring(searchStartPosition.toInt)
      val m = spacePattern.findFirstMatchIn(searchContent)
      if (m.isDefined) {
        return m.get.start + position + 1
      }
    }
    if (leaf != null) leaf.end
    else getDocumentEnd()
  }

  // ...
}
All content is always in leaves, so the space we are looking for, if it exists, will be in a leaf. Our first task is to find the leaf that may contain the desired space: either the current leaf, if it is a text node and we are not at its end, or the next leaf. Once we have the correct leaf, we search it for the next space. If the leaf is a text node, the next space is the first space character after the current position, or the end of the leaf if there is none. If the leaf is not a text node, it is some other kind of structured node that is always treated as occupying a single position, so the next space is the end of the leaf. Finally, if there is no valid leaf to search, we are at the end of the document, so we return the document end position.
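The text-node search step can be sketched in JavaScript like this (hypothetical code, with an assumed node shape of start, end and content):

```javascript
// Find the next whitespace at or after `position` inside a text leaf,
// mirroring the search step of findSpaceAfter. Returns the absolute
// position of the space, or the leaf end if no space is found.
function findSpaceInLeaf(leaf, position) {
  const searchStart = Math.max(position - leaf.start + 1, 0);
  const m = /\s/.exec(leaf.content.substring(searchStart));
  if (m !== null) return m.index + position + 1;
  return leaf.end; // no space: the next break is the end of the leaf
}

const leaf = { start: 0, end: 10, content: "Hello world" };
console.log(findSpaceInLeaf(leaf, 0)); // 5, the space between the words
console.log(findSpaceInLeaf(leaf, 5)); // 10, no space left after position 5
```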
class BookNode {

  // ...

  def nextLeaf(): BookNode = {
    // is this a leaf?
    var current = this
    if (current.children.isEmpty) {
      // go up the parent line until we find next sibling
      var parent = current.parent
      while (parent != null
          && parent.children.indexOf(current) == parent.children.size - 1) {
        current = parent
        parent = current.parent
      }
      if (parent != null) {
        // we have the next sibling in current, must find first leaf
        current = parent.children(parent.children.indexOf(current) + 1)
      } else {
        // we have reached root, this was the last leaf, there is no other
        return null
      }
    }
    // find first child of the current node
    while (current.children.nonEmpty) {
      current = current.children.head
    }
    current
  }

  // ...
}
I am also including the code for finding the next leaf in a tree, a standard tree algorithm which we can implement without recursion because we have references to both the children and the parent of a node. The code also contains methods for finding the previous space and previous leaves, but since these are never used in the final page computation algorithm, I am not including them in this article.
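A JavaScript version of the same next-leaf walk might look like this (a sketch with assumed names):

```javascript
// A tree node with parent references, so the next-leaf walk needs no
// recursion: climb up until a next sibling exists, then descend to its
// first leaf.
class Node {
  constructor(name, children = []) {
    this.name = name;
    this.children = children;
    this.parent = null;
    for (const c of children) c.parent = this;
  }
  nextLeaf() {
    let current = this;
    if (current.children.length === 0) {
      let parent = current.parent;
      while (parent !== null &&
             parent.children.indexOf(current) === parent.children.length - 1) {
        current = parent;
        parent = current.parent;
      }
      if (parent === null) return null; // this was the last leaf
      current = parent.children[parent.children.indexOf(current) + 1];
    }
    // descend to the first leaf of the current node
    while (current.children.length > 0) current = current.children[0];
    return current;
  }
}

const a = new Node("text-a");
const b = new Node("text-b");
const tree = new Node("body", [new Node("p", [a]), new Node("p", [b])]);
console.log(a.nextLeaf().name); // "text-b"
console.log(b.nextLeaf()); // null
```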
With the current code, it becomes easy to compute pages in a section, and I will go over that algorithm in the next article. But once we have those pages, we need to extract the content from the section while preserving the content structure. This refers not only to relative structure, but to the full structure starting at the root. For example, if we have a paragraph P node containing a child text node starting at position 0 and ending at position 1000, and we want to extract the page from position 100 to position 200, we don't want only the text between those two positions. The result should contain the text between those two positions wrapped in the correct paragraph P tags, a partial paragraph keeping the full structure of the document. This should apply to all complex nested structures in our document: quotes, spans, links, tables.
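To make the trimming arithmetic concrete, here is a small JavaScript sketch (hypothetical code, handling text nodes only) of how a range copy trims a text node's content to the requested positions:

```javascript
// Copy the part of a text node that falls inside [from, to]. Returns a
// new node with trimmed content, or null if the node is outside the range.
function copyText(node, from, to) {
  if (node.end < from || node.start > to) return null; // outside the range
  const begin = Math.max(from, node.start) - node.start;
  const finish = Math.min(to, node.end) - node.start + 1;
  return {
    start: Math.max(from, node.start),
    end: Math.min(to, node.end),
    content: node.content.substring(begin, finish),
  };
}

const text = { start: 0, end: 10, content: "Hello world" };
console.log(copyText(text, 6, 10).content); // "world"
console.log(copyText(text, 0, 4).content); // "Hello"
console.log(copyText(text, 20, 30)); // null
```

In the real tree, the recursive copy wraps these trimmed text nodes back in their parent tags, which is what preserves the document structure.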
class BookNode {

  // ...

  def copy(from: Long, to: Long): BookNode = {
    if (this.name == "text") {
      if (from <= this.start && this.end <= to) {
        // this node is copied whole
        new BookNode("text", this.content, this.start, this.end)
      } else if (from <= this.start && this.start <= to && to <= this.end) {
        // copy ends at this node
        new BookNode(
          this.name,
          this.content.substring(0, (to - this.start + 1).toInt),
          this.start,
          to
        )
      } else if (this.start <= from && from <= this.end && this.end <= to) {
        // copy starts at this node
        new BookNode(
          this.name,
          this.content.substring((from - this.start).toInt),
          from,
          this.end
        )
      } else if (this.start <= from && to < this.end) {
        // we only copy part of this node
        new BookNode(
          this.name,
          this.content.substring((from - this.start).toInt,
            (to - this.start + 1).toInt),
          from,
          to
        )
      } else {
        null
      }
    } else if (shouldBeLeafElement(this.name)) {
      if (from <= this.start && this.end <= to) {
        // include element in selection
        new BookNode(this.name, this.content, this.start, this.end)
      } else {
        null
      }
    } else {
      if (this.end < from || this.start > to) {
        // this node is outside the range and should not be copied
        null
      } else {
        val newNode = new BookNode(this.name, this.content)
        newNode.children = this.children
          .map(_.copy(from, to))
          .filter(_ != null)
        newNode.children.foreach(_.parent = newNode)
        if (newNode.children.isEmpty) {
          newNode.start = this.start
          newNode.end = this.end
        } else {
          newNode.start = newNode.children.head.start
          newNode.end = newNode.children.last.end
        }
        newNode
      }
    }
  }

  // ...
}
The desired functionality is achieved with the copy method, which gives us a new BookNode structure representing a part of the original tree. The recursive algorithm handles different kinds of nodes in different ways:

- for text nodes, we compare the node's start and end positions with the from and to positions of the copy method, and based on this we return a copy of the full text node, a partial copy, or null if this text node is not included in the selection at all;
- non-text leaf nodes are included whole if they fall inside the selection, otherwise we return null;
- for other nodes, we check whether they intersect the from and to limits: if they do, we recursively copy their children and filter out the null results, and once we know the new children we can compute the start and end positions of the new parent node.

This method will effectively give us a copy of the tree that contains the content between positions from and to while keeping all the structure for the leaves included in those limits. And once we have this subtree, we can use the extractContent method to get the HTML for the page:
class BookNode {

  // ...

  @JsonIgnore
  def extractContent(): String = {
    if (this.name == "text") this.content
    else if (this.name == "body") {
      this.children.map(_.extractContent()).mkString("")
    } else if (shouldBeLeafElement(this.name) && this.children.isEmpty) {
      this.content
    } else {
      this.content +
        this.children.map(_.extractContent()).mkString("") +
        "</" + this.name + ">"
    }
  }

  // ...
}
The full implementation of this class can be viewed in the project repository, where there is more functionality included. One use case not discussed here is keeping links inside a book functional by replacing paths inside the EPUB archive with book positions. There is also code for scanning an entire book, parsing it, and computing its length before saving it to the database.
The repository also contains the Javascript implementation of the Scala algorithms, used on the UI side. With this custom parser implementation we can now guarantee consistent results on both the server side and the UI side when navigating the book based on positions, when saving and displaying reading progress, and when detecting that a book has been read.