2020.12.10 12:00 / scala / html / tree / epub / javascript

Custom Ebook Parser

This summer, I worked on developing a web-based comic-book reading application, and after everything was running well with comic books, I decided to extend the functionality to reading e-books. This proved to be a more complicated problem. The format I focused on was EPUB, which is essentially an archive containing HTML/XHTML files. To create a good e-book reading experience, I needed to split the book into pages, but how much fits on a page depends on a lot of things: the size of the viewport, font sizes, margins. These constraints meant we had to split the book into pages directly on the UI side. But in this first implementation, I still needed to parse the book on the server side as well, in order to compute the number of positions in the book.

A position, in this case, would be one character. To compute the number of positions in a book stored as HTML files, I would only need to extract the text (without tags) from the HTML and compute its size. HTML-parsing libraries for Java and the JavaScript DOM parser in the browser both have this functionality. The problem, however, is that they never give the same results. This meant that the book size computed on the server was almost always different from the size computed on the UI, so the UI would report reading progress at a position that was completely meaningless relative to the positions recorded on the server. And this completely ignores more complex cases where, for the logic in the page computation algorithm, certain elements like images and tables should be considered as occupying a single position. To elaborate on that, when deciding how much content fits on a page, from position X to position Y, we can't split an image or a table between two pages, so these elements must be considered as a single position. That image or table either fits on the current page, or it will be displayed on the next page.

For the original implementation I found a hacky way to sort of make this work, with the progress in the book being just an approximation, but it didn't really work well. I couldn't let this go, especially now that the reading season is upon us (December holidays), so I went back and rewrote everything related to EPUB parsing. The necessary approach was clear: to get the same results on both the backend and the UI I had to write my own parser, implemented in both Scala and JavaScript. These parser implementations would have to treat spaces the same way, compute positions in book sections the same way, and treat special one-position elements like images and tables the same way.

The Ideal Position

A position in the book represents one displayable element of that book. In most cases, this is a character, a letter or a space. In some special cases, it can be a more complex element, like an image. Positions are important in our algorithm because they are used to split the book into pages. Every page in the book has a start position and an end position. When computing a page, we start with the element at the start position and keep adding elements to the page until no more elements fit on the page. If those elements are characters, we just keep adding characters until the page is filled. If those elements take more space, like images, fewer of them will fit on a page. A page may fit 1000 characters, or 500 characters and an image.
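To make the idea concrete, here is a deliberately simplified sketch of packing elements into pages by position. Element, Page, paginate and pageCapacity are hypothetical names introduced only for this illustration; the real pagination algorithm depends on the rendered size of each element in the viewport and is covered in the next article.

import scala.collection.mutable.ArrayBuffer

// Hypothetical types, used only for this sketch: a displayable element at a
// book position, with a relative size, and a page defined by its positions.
case class Element(position: Long, size: Int)   // 1 for a character, larger for an image
case class Page(start: Long, end: Long)

def paginate(elements: Seq[Element], pageCapacity: Int): Seq[Page] = {
  val pages = ArrayBuffer[Page]()
  var used = 0
  var pageStart = elements.headOption.map(_.position).getOrElse(0L)
  var last = pageStart
  for (e <- elements) {
    if (used > 0 && used + e.size > pageCapacity) {
      pages += Page(pageStart, last)  // current page is full, close it
      pageStart = e.position          // the next page starts at this element
      used = 0
    }
    used += e.size
    last = e.position
  }
  if (elements.nonEmpty) pages += Page(pageStart, last)
  pages.toSeq
}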

An ideal parsing algorithm, which would put us in an ideal position to deliver the best performance and accuracy for parsing e-books, would need the following functionality:

- parse the HTML of a book section into a tree of nodes;
- assign absolute start and end positions to every node in that tree;
- treat special one-position elements, like images and table rows, consistently;
- find the next space after a given position;
- extract the content between two positions while preserving the document structure.

The data structure for this algorithm will look as follows:

import scala.beans.BeanProperty
import com.fasterxml.jackson.annotation.JsonBackReference

class BookNode {
  @BeanProperty var name: String = _
  @JsonBackReference @BeanProperty var parent: BookNode = _
  var children: Seq[BookNode] = Seq()
  @BeanProperty var content: String = _
  @BeanProperty var start: Long = _
  @BeanProperty var end: Long = _
  // ...
}

The name holds the element type. content will hold the actual text for text nodes, but the entire opening tag, with all its attributes, for other nodes. start and end represent the start and end positions of the node. Every node has a reference to its parent and may contain children.
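As a small illustration of what these fields end up holding (a hand-built sketch relying on the constructors and the addChild helper used in the parsing code below, not output of the actual parser):

// For the fragment <p class="intro">Hello</p> the tree holds roughly this:
val body = new BookNode("body", "")                 // name = "body", content = ""
val p = new BookNode("p", "<p class=\"intro\">")    // content keeps the full opening tag
body.addChild(p)                                    // p becomes a child of body
p.addChild(new BookNode("text", "Hello"))           // text node: content is the raw text
// start and end are filled in later by updatePositions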

Parsing the HTML

To parse an HTML string we have a companion object for the BookNode class with the main parsing code:

object BookNode {
  // ...
  private def parseBody(body: String, entrancePosition: Long): Option[BookNode] = {
    val bodyNode = new BookNode("body", "")
    var current: BookNode = bodyNode
    var content = ""
    for (i <- 0 until body.length) {
      val c = body(i)
      if (c == '<') {
        // starting a new tag
        // save what we have in content
        if (isTag(content)) throw new Throwable("this should not happen")
        else {
          // can only be a text node or nothing
          if (content.length > 0) {
            current.addChild(new BookNode("text", content))
            content = ""
          }
        }
      }
      // accumulate content
      content += c
      if (c == '>') {
        // ending a tag
        if (isTag(content)) {
          val name = getTagName(content)
          // if it's comment, ignore
          if (isComment(content)) {
            // ignoring comment
          } else if (isEndTag(content)) {
            // we check that this tag closes the current node correctly
            if (isVoidElement(name)) {
              // the last child should have the correct name
              if (name != current.children.last.name) {
                throw new Throwable("incompatible end for void tag")
              } else {
                current.children.last.content += content
              }
            } else {
              // the current node should have the correct name, and it is
              // getting closed
              if (name != current.name) {
                throw new Throwable("incompatible end tag " + current.name)
              }
              // move current node up
              current = current.parent
            }
          } else if (isBothTag(content) || isVoidElement(name)) {
            // just add this tag without content
            current.addChild(new BookNode(name, content))
          } else {
            // a start tag
            val newNode = new BookNode(name, content)
            current.addChild(newNode)
            current = newNode
          }
          // reset content
          content = ""
        } else throw new Throwable("wild > encountered")
      }
    }
    // add the last text node, if there is still such a thing remaining
    if (content.nonEmpty) {
      if (isTag(content)) throw new Throwable("this should not happen")
      else {
        // can only be a text node or nothing
        current.addChild(new BookNode("text", content))
      }
    }
    bodyNode.collapseLeaves()
    bodyNode.updatePositions(entrancePosition)
    Some(bodyNode)
  }
  // ...
}

The parseBody method receives the HTML body contents as a string, along with a start position. This start position is the start position of the book section, and it is used to compute absolute positions across the entire book. The algorithm starts by creating the book section root node, of type body, which it uses as the current node. It then goes over the text character by character, performing the following actions (a sketch of the resulting tree follows this list):

- when a < character is encountered, a new tag is starting, so any text accumulated so far is saved as a text node under the current node;
- characters are accumulated into the content buffer;
- when a > character is encountered, a tag has just been completed: comments are ignored, an end tag closes the current node (or, for void elements, is appended to the content of the last child), a self-closing or void tag is added as a child without changing the current node, and a start tag creates a new node that becomes the current node.
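As a minimal illustration of these rules (the expected tree shape, not actual program output):

// Parsing the body fragment
//   <p>one<br/><!-- a note -->two</p>
// produces (the comment is dropped, the void br element has no children):
//
// body
//   p        content = "<p>"
//     text   content = "one"
//     br     content = "<br/>"
//     text   content = "two"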

There is also a special case to handle text content remaining at the end of an HTML file, followed by some cleanup work and the computation of positions in the resulting tree.

class BookNode {
  // ...
  private def collapseLeaves(): Unit = {
    if (shouldBeLeafElement(this.name) && this.children.nonEmpty) {
      // extract content from children
      this.content = this.extractContent()
      this.children = Seq()
    } else {
      this.children.foreach(_.collapseLeaves())
    }
  }
  // ...
}

The collapseLeaves method goes over the tree recursively and transforms some structures into leaf elements. This is done to ensure that certain structures in the HTML document are treated as a single position when pagination is done. In the current implementation of this algorithm, images (IMG) and table rows (TR) are the structures considered single-position for pagination. This allows us to split a long table between pages without clipping a row of that table.
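For example (illustrative), a table row with two cells is collapsed like this:

// Before collapseLeaves:
//   tr         content = "<tr>"
//     td
//       text   content = "1"
//     td
//       text   content = "2"
//
// After collapseLeaves:
//   tr         content = "<tr><td>1</td><td>2</td></tr>", no children
//   (updatePositions will later assign this node a single position)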

class BookNode {
  // ...
  private def updatePositions(entrancePosition: Long): Unit = {
    var position = entrancePosition
    this.start = position
    if (this.name == "text") {
      this.end = this.start + this.content.length - 1
    } else if (shouldBeLeafElement(this.name)) {
      // occupies a single position
      this.end = this.start
    } else if (this.children.isEmpty) {
      // an element without children, maybe used for spacing, should occupy a
      // single position
      this.end = this.start
    } else {
      // compute for children and update
      for (i <- this.children.indices) {
        val child = this.children(i)
        child.updatePositions(position)
        position = child.end + 1
      }
      this.end = this.children.last.end
    }
  }
  // ...
}

The updatePositions method also traverses the tree recursively, updating the start and end fields of each node as follows (a worked example is included after this list):

- a text node starts at the current position and ends at start + content.length - 1;
- a leaf element, like an image or a collapsed table row, occupies a single position, so its end equals its start;
- an element without children, maybe used for spacing, also occupies a single position;
- for any other node, positions are computed recursively for its children, each child starting right after the previous one ends, and the node ends where its last child ends.
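Here is the expected outcome (illustrative) for a small fragment whose section starts at entrance position 0:

// Positions for the body fragment
//   <p>Hi <img src="i.png"/> there</p>
//
// body                       start = 0, end = 9
//   p                        start = 0, end = 9
//     text "Hi "             start = 0, end = 2   (3 characters)
//     img (leaf element)     start = 3, end = 3   (single position)
//     text " there"          start = 4, end = 9   (6 characters)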

As a final piece of code to detail in this section, the utility methods that determine the tag type and tag name and extract information from tags are implemented using regular expressions:

object BookNode {

  private val VOID_ELEMENTS = Seq("area","base","br","col","hr","img","input",
    "link","meta","param","keygen","source")
  private val LEAF_ELEMENTS = Seq("img", "tr")

  private def isTag(str: String) = "^</?[^>]+>$".r matches str
  private def isComment(str: String) = "^<!\\-\\-[^>]+\\-\\->$".r matches str
  private def isEndTag(str: String) = "^</[^>]+>$".r matches str
  private def isBothTag(str: String) = "^<[^>/]+/>$".r matches str
  private def getTagName(str: String) = {
    "</?([^>\\s]+)".r findFirstMatchIn str match {
      case Some(m) => m.group(1)
      case None => null
    }
  }
  private def isVoidElement(tagName: String) =
    VOID_ELEMENTS.contains(tagName.toLowerCase())
  private def shouldBeLeafElement(tagName: String) =
    LEAF_ELEMENTS.contains(tagName.toLowerCase())

  // ...
}
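A few examples of what these helpers return, matching the regular expressions above:

// isTag("<p class=\"intro\">")  -> true
// isEndTag("</p>")              -> true
// isBothTag("<br/>")            -> true
// isComment("<!-- ignored -->") -> true
// isVoidElement("img")          -> true
// getTagName("</div>")          -> "div"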

Tree algorithms for leaves

Now that we have our book in the correct structure, we need two more functions to build our page computing algorithm: find the next space after a given position and extract book content between given positions.

class BookNode {
  // ...
  def findSpaceAfter(position: Long): Long = {
    val spacePattern = "\\s".r
    // first get leaf at position
    var leaf = leafAtPosition(position)
    // for a text node, next space may be in the text node, next space character
    // after position
    // if other kind of node, next space is the start of next leaf
    if (leaf != null && leaf.end == position) {
      // we need to look in the next node
      leaf = leaf.nextLeaf()
    }
    if (leaf != null && leaf.name == "text") {
      val searchStartPosition = if (position - leaf.start + 1 > 0) {
        position - leaf.start + 1
      } else 0
      val searchContent = leaf.content.substring(searchStartPosition.toInt)
      val m = spacePattern.findFirstMatchIn(searchContent)
      if (m.isDefined) {
        return m.get.start + position + 1
      }
    }
    if (leaf != null) leaf.end
    else getDocumentEnd()
  }
  // ...
}

All content is always in leaves, so the space we are looking for, if it exists, will be in a leaf. Our first task is to find the leaf that may contain the desired space. This is either the current leaf, if the current leaf is a text node and we are not at its end, or the next leaf. Once we have the correct leaf, we search for the next space inside it. If the leaf is a text node, the next space is the first space character after the current position, or the end of the leaf if there is none. If the leaf is not a text node, meaning it is some other kind of structured node that is always treated as occupying a single position, the next space is the end of this leaf. Finally, if there is no valid leaf in which to search, we are at the end of the document, so we return the document end position.
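Using the positions from the earlier worked example, the expected results would be (illustrative):

// <p>Hi <img src="i.png"/> there</p>   positions 0 to 9
//
// findSpaceAfter(0) -> 2   (the space inside the text leaf "Hi ")
// findSpaceAfter(3) -> 4   (past the img leaf, the space starting " there")
// findSpaceAfter(4) -> 9   (no further space, so the end of the last leaf)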

class BookNode {
  // ...
  def nextLeaf(): BookNode = {
    // is this a leaf?
    var current = this
    if (current.children.isEmpty) {
      // go up the parent line until we find next sibling
      var parent = current.parent
      while (parent != null && parent.children.indexOf(current) == parent.children.size - 1) {
        current = parent
        parent = current.parent
      }
      if (parent != null) {
        // we have the next sibling in current, must find first leaf
        current = parent.children(parent.children.indexOf(current) + 1)
      } else {
        // we have reached root, this was the last leaf, there is no other
        return null
      }
    }
    // find first child of the current node
    while (current.children.nonEmpty) {
      current = current.children.head
    }
    current
  }
  // ...
}

I am also including the code for finding the next leaf in a tree, a standard tree algorithm that we can implement without recursion because every node holds references to both its children and its parent.
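On the same example tree, repeatedly calling nextLeaf visits the leaves in document order (illustrative):

// text "Hi "  ->  img  ->  text " there"  ->  null (this was the last leaf)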

The code also contains methods for finding the previous space and the previous leaf, but the final page computation algorithm never uses them, so I am not including them in this article.

With the current code, it becomes easy to compute pages in a section, and I will go over that algorithm in the next article. But once we have those pages, we need to extract the content from the section while preserving the content structure. This does not refer only to the relative structure, but to the full structure starting at the root. For example, if we have a paragraph P node containing a child text node starting at position 0 and ending at position 1000, and we want to extract the page from position 100 to position 200, we don't want only the text between those two positions. The result should contain the text between those two positions wrapped in the correct paragraph P tags, a partial paragraph keeping the full structure of the document. This should apply to all complex nested structures in our document: quotes, spans, links, tables.

class BookNode {
  // ...
  def copy(from: Long, to: Long): BookNode = {
    if (this.name == "text") {
      if (from <= this.start && this.end <= to) {
        // this node is copied whole
        new BookNode("text", this.content, this.start, this.end)
      } else if (from <= this.start && this.start <= to && to <= this.end) {
        // copy ends at this node
        new BookNode(
          this.name,
          this.content.substring(0, (to - this.start + 1).toInt),
          this.start,
          to
        )
      } else if (this.start <= from && from <= this.end && this.end <= to) {
        // copy starts at this node
        new BookNode(
          this.name,
          this.content.substring((from - this.start).toInt),
          from,
          this.end
        )
      } else if (this.start <= from && to < this.end) {
        // we only copy part of this node
        new BookNode(
          this.name,
          this.content.substring((from - this.start).toInt, (to - this.start + 1).toInt),
          from,
          to
        )
      } else {
        null
      }
    } else if (shouldBeLeafElement(this.name)) {
      if (from <= this.start && this.end <= to) {
        // include element in selection
        new BookNode(this.name, this.content, this.start, this.end)
      } else {
        null
      }
    } else {
      if (this.end < from || this.start > to) {
        // this node is outside the range and should not be copied
        null
      } else {
        val newNode = new BookNode(this.name, this.content)
        newNode.children = this.children
          .map(_.copy(from, to))
          .filter(_ != null)
        newNode.children.foreach(_.parent = newNode)
        if (newNode.children.isEmpty) {
          newNode.start = this.start
          newNode.end = this.end
        } else {
          newNode.start = newNode.children.head.start
          newNode.end = newNode.children.last.end
        }
        newNode
      }
    }
  }
  // ...
}

The desired functionality is achieved with the copy method, which gives us a new BookNode structure representing a part of the original tree. The recursive algorithm handles different kinds of nodes in different ways (see the worked example after this list):

- a text node is copied whole if it fits entirely between from and to; otherwise only the overlapping substring is copied, with its start and end adjusted, or the node is dropped if it does not overlap the range at all;
- a leaf element, like an image or a collapsed table row, is copied only if it lies entirely inside the range, otherwise it is dropped;
- any other node is dropped if it is completely outside the range; otherwise a copy of it is created, its children are copied recursively with the null results filtered out, the parent references are pointed at the new node, and its start and end are recomputed from the copied children (or taken from the original node if no children remain).
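Continuing the earlier example (illustrative, not actual program output), copying positions 1 to 5 keeps the partial text nodes, the whole img leaf, and the enclosing paragraph structure:

// original section: <p>Hi <img src="i.png"/> there</p>   positions 0 to 9
//
// calling copy(1, 5) on the body node and then extractContent() on the result
// yields: <p>i <img src="i.png"/> t</p>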

This method will effectively give us a copy of the tree that contains the content between positions from and to while keeping all the structure for the leaves included in those limits. And once we have this subtree, we can use the extractContent method to get the HTML for the page:

class BookNode {
  // ...
  @JsonIgnore
  def extractContent(): String = {
    if (this.name == "text") this.content
    else if (this.name == "body") {
      this.children.map(_.extractContent()).mkString("")
    } else if (shouldBeLeafElement(this.name) && this.children.isEmpty) {
      this.content
    } else {
      this.content +
        this.children.map(_.extractContent()).mkString("") +
        "</" + this.name + ">"
    }
  }
  // ...
}
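Putting the two methods together, here is a minimal sketch of how one page's HTML could be produced; pageHtml, sectionRoot, pageStart and pageEnd are names introduced only for this illustration and are not part of the project code:

// Hypothetical helper: extract the HTML for one page of a parsed section.
// sectionRoot is assumed to be the body BookNode produced by the parser.
def pageHtml(sectionRoot: BookNode, pageStart: Long, pageEnd: Long): String = {
  val pageTree = sectionRoot.copy(pageStart, pageEnd)
  if (pageTree != null) pageTree.extractContent() else ""
}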

Conclusion

The full implementation of this class can be viewed in the project repository, and there is more functionality included there. One use case not discussed here is keeping links inside a book functional by replacing paths inside the EPUB archive with book positions. There is also code for scanning the entire book, parsing it, and computing its length before saving it to the database.

The repository also contains the JavaScript implementation of the Scala algorithms, which is used on the UI side. With this custom parser implementation we can now guarantee consistent results on both the server side and the UI side when navigating the book based on positions, when saving and displaying read progress, and when detecting that a book has been read.