2020.12.10 12:00 / scala / html / tree / epub / javascript

Custom Ebook Parser

This summer, I worked on developing a web-based comic-book reading application, and after everything was running well with comic books, I decided to extend the functionality to reading e-books. This proved to be a more complicated problem. The format I focused on was EPUB, which is essentially an archive containing HTML/XHTML files. To create a good e-book reading experience, I needed to split the book into pages, but how much fits on a page depends on a lot of things: the size of the viewport, font sizes, margins. These constraints meant we had to split the book into pages directly on the UI side. But in this first implementation, I still needed to parse the book on the server side as well, in order to compute the number of positions in the book.

A position, in this case, would be one character. To compute the number of positions in a book stored as HTML files, I would only need to extract the text (without tags) from the HTML and compute its size. HTML-parsing libraries for Java and the JavaScript DOM parser in the browser both have this functionality. The problem, however, is that they never give the same results. This meant that the book size computed on the server was almost always different from the size computed on the UI, so the UI would report reading progress at a position that was completely meaningless relative to the positions recorded on the server. And this completely ignores more complex cases where, for the logic in the page computation algorithm, certain elements like images and tables should be considered as occupying a single position. To elaborate on that, when deciding how much content fits on a page, from position X to position Y, we can't split an image or a table between two pages, so these elements must be considered as a single position. That image or table either fits on the current page, or it will be displayed on the next page.

For the original implementation I found a hacky way to sort of make this work, with the progress in the book being just an approximation, but it didn't really work well. I couldn't let this go, especially now that the reading season is upon us (December holidays), so I went back and rewrote everything related to EPUB parsing. The necessary approach was clear: to get the same results on both the backend and the UI I had to write my own parser, implemented in both Scala and JavaScript. These parser implementations would have to treat spaces the same way, compute positions in book sections the same way, and treat special one-position elements like images and tables the same way.

The Ideal Position

A position in the book represents one displayable element of that book. In most cases, this is a character, a letter or a space. In some special cases, it can be a more complex element, like an image. Positions are important in our algorithm because they are used to split the book into pages. Every page in the book has a start position and an end position. When computing a page, we start with the element at the start position and keep adding elements to the page until no more elements fit on the page. If those elements are characters, we just keep adding characters until the page is filled. If those elements take more space, like images, fewer of them will fit on a page. A page may fit 1000 characters, or 500 characters and an image.
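To make the idea concrete, here is a deliberately simplified sketch of packing elements into pages by position. Element, Page, paginate and pageCapacity are hypothetical names introduced only for this illustration; the real pagination algorithm depends on the rendered size of each element in the viewport and is covered in the next article.

import scala.collection.mutable.ArrayBuffer

// Hypothetical types, used only for this sketch: a displayable element at a
// book position, with a relative size, and a page defined by its positions.
case class Element(position: Long, size: Int)   // 1 for a character, larger for an image
case class Page(start: Long, end: Long)

def paginate(elements: Seq[Element], pageCapacity: Int): Seq[Page] = {
  val pages = ArrayBuffer[Page]()
  var used = 0
  var pageStart = elements.headOption.map(_.position).getOrElse(0L)
  var last = pageStart
  for (e <- elements) {
    if (used > 0 && used + e.size > pageCapacity) {
      pages += Page(pageStart, last)  // current page is full, close it
      pageStart = e.position          // the next page starts at this element
      used = 0
    }
    used += e.size
    last = e.position
  }
  if (elements.nonEmpty) pages += Page(pageStart, last)
  pages.toSeq
}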

An ideal parsing algorithm, which would put us in an ideal position to deliver the best performance and accuracy for parsing e-books, would need the following functionality:

- parse the HTML of a book section into a tree of nodes;
- assign absolute start and end positions to every node in that tree;
- treat special one-position elements, like images and table rows, consistently;
- find the next space after a given position;
- extract the content between two positions while preserving the document structure.

The data structure for this algorithm will look as follows:

import scala.beans.BeanProperty
import com.fasterxml.jackson.annotation.JsonBackReference

class BookNode {
  @BeanProperty var name: String = _
  @JsonBackReference @BeanProperty var parent: BookNode = _
  var children: Seq[BookNode] = Seq()
  @BeanProperty var content: String = _
  @BeanProperty var start: Long = _
  @BeanProperty var end: Long = _
  // ...
}

The name holds the element type. content will hold the actual text for text nodes, but the entire opening tag, with all its attributes, for other nodes. start and end represent the start and end positions of the node. Every node has a reference to its parent and may contain children.
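As a small illustration of what these fields end up holding (a hand-built sketch relying on the constructors and the addChild helper used in the parsing code below, not output of the actual parser):

// For the fragment <p class="intro">Hello</p> the tree holds roughly this:
val body = new BookNode("body", "")                 // name = "body", content = ""
val p = new BookNode("p", "<p class=\"intro\">")    // content keeps the full opening tag
body.addChild(p)                                    // p becomes a child of body
p.addChild(new BookNode("text", "Hello"))           // text node: content is the raw text
// start and end are filled in later by updatePositions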

Parsing the HTML

To parse an HTML string we have a companion object for the BookNode class with the main parsing code:

object BookNode {
  // ...
  private def parseBody(body: String, entrancePosition: Long): Option[BookNode] = {
    val bodyNode = new BookNode("body", "")
    var current: BookNode = bodyNode
    var content = ""
    for (i <- 0 until body.length) {
      val c = body(i)
      if (c == '<') {
        // starting a new tag
        // save what we have in content
        if (isTag(content)) throw new Throwable("this should not happen")
        else {
          // can only be a text node or nothing
          if (content.length > 0) {
            current.addChild(new BookNode("text", content))
            content = ""
          }
        }
      }
      // accumulate content
      content += c
      if (c == '>') {
        // ending a tag
        if (isTag(content)) {
          val name = getTagName(content)
          // if it's comment, ignore
          if (isComment(content)) {
            // ignoring comment
          } else if (isEndTag(content)) {
            // we check that this tag closes the current node correctly
            if (isVoidElement(name)) {
              // the last child should have the correct name
              if (name != current.children.last.name) {
                throw new Throwable("incompatible end for void tag")
              } else {
                current.children.last.content += content
              }
            } else {
              // the current node should have the correct name, and it is
              // getting closed
              if (name != current.name) {
                throw new Throwable("incompatible end tag " + current.name)
              }
              // move current node up
              current = current.parent
            }
          } else if (isBothTag(content) || isVoidElement(name)) {
            // just add this tag without content
            current.addChild(new BookNode(name, content))
          } else {
            // a start tag
            val newNode = new BookNode(name, content)
            current.addChild(newNode)
            current = newNode
          }
          // reset content
          content = ""
        } else throw new Throwable("wild > encountered")
      }
    }
    // add the last text node, if there is still such a thing remaining
    if (content.nonEmpty) {
      if (isTag(content)) throw new Throwable("this should not happen")
      else {
        // can only be a text node or nothing
        current.addChild(new BookNode("text", content))
      }
    }
    bodyNode.collapseLeaves()
    bodyNode.updatePositions(entrancePosition)
    Some(bodyNode)
  }
  // ...
}

The parseBody method receives the HTML body contents as a string, along with a start position. This start position is the start position of the book section, and it is used to compute absolute positions across the entire book. The algorithm starts by creating the book section root node, of type body, which it uses as the current node. It then goes over the text character by character, performing the following actions (a sketch of the resulting tree follows this list):

- when a < character is encountered, a new tag is starting, so any text accumulated so far is saved as a text node under the current node;
- characters are accumulated into the content buffer;
- when a > character is encountered, a tag has just been completed: comments are ignored, an end tag closes the current node (or, for void elements, is appended to the content of the last child), a self-closing or void tag is added as a child without changing the current node, and a start tag creates a new node that becomes the current node.
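As a minimal illustration of these rules (the expected tree shape, not actual program output):

// Parsing the body fragment
//   <p>one<br/><!-- a note -->two</p>
// produces (the comment is dropped, the void br element has no children):
//
// body
//   p        content = "<p>"
//     text   content = "one"
//     br     content = "<br/>"
//     text   content = "two"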

There is also a special case to handle text content remaining at the end of an HTML file, followed by some cleanup work and the computation of positions in the resulting tree.

class BookNode {
  // ...
  private def collapseLeaves(): Unit = {
    if (shouldBeLeafElement(this.name) && this.children.nonEmpty) {
      // extract content from children
      this.content = this.extractContent()
      this.children = Seq()
    } else {
      this.children.foreach(_.collapseLeaves())
    }
  }
  // ...
}

The collapseLeaves method goes over the tree recursively and transforms some structures into leaf elements. This is done to ensure that certain structures in the HTML document are treated as a single position when pagination is done. In the current implementation of this algorithm, images (IMG) and table rows (TR) are the structures considered single-position for pagination. This allows us to split a long table between pages without clipping a row of that table.
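For example (illustrative), a table row with two cells is collapsed like this:

// Before collapseLeaves:
//   tr         content = "<tr>"
//     td
//       text   content = "1"
//     td
//       text   content = "2"
//
// After collapseLeaves:
//   tr         content = "<tr><td>1</td><td>2</td></tr>", no children
//   (updatePositions will later assign this node a single position)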

class BookNode {
  // ...
  private def updatePositions(entrancePosition: Long): Unit = {
    var position = entrancePosition
    this.start = position
    if (this.name == "text") {
      this.end = this.start + this.content.length - 1
    } else if (shouldBeLeafElement(this.name)) {
      // occupies a single position
      this.end = this.start
    } else if (this.children.isEmpty) {
      // an element without children, maybe used for spacing, should occupy a
      // single position
      this.end = this.start
    } else {
      // compute for children and update
      for (i <- this.children.indices) {
        val child = this.children(i)
        child.updatePositions(position)
        position = child.end + 1
      }
      this.end = this.children.last.end
    }
  }
  // ...
}

The updatePositions method also traverses the tree recursively, updating the start and end fields of each node as follows (a worked example is included after this list):

- a text node starts at the current position and ends at start + content.length - 1;
- a leaf element, like an image or a collapsed table row, occupies a single position, so its end equals its start;
- an element without children, maybe used for spacing, also occupies a single position;
- for any other node, positions are computed recursively for its children, each child starting right after the previous one ends, and the node ends where its last child ends.
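Here is the expected outcome (illustrative) for a small fragment whose section starts at entrance position 0:

// Positions for the body fragment
//   <p>Hi <img src="i.png"/> there</p>
//
// body                       start = 0, end = 9
//   p                        start = 0, end = 9
//     text "Hi "             start = 0, end = 2   (3 characters)
//     img (leaf element)     start = 3, end = 3   (single position)
//     text " there"          start = 4, end = 9   (6 characters)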

As a final piece of code to detail in this section, the utility methods that determine the tag type and tag name and extract information from tags are implemented using regular expressions:

object BookNode {

  private val VOID_ELEMENTS = Seq("area","base","br","col","hr","img","input",
    "link","meta","param","keygen","source")
  private val LEAF_ELEMENTS = Seq("img", "tr")

  private def isTag(str: String) = "^</?[^>]+>$".r matches str
  private def isComment(str: String) = "^<!\\-\\-[^>]+\\-\\->$".r matches str
  private def isEndTag(str: String) = "^</[^>]+>$".r matches str
  private def isBothTag(str: String) = "^<[^>/]+/>$".r matches str
  private def getTagName(str: String) = {
    "</?([^>\\s]+)".r findFirstMatchIn str match {
      case Some(m) => m.group(1)
      case None => null
    }
  }
  private def isVoidElement(tagName: String) =
    VOID_ELEMENTS.contains(tagName.toLowerCase())
  private def shouldBeLeafElement(tagName: String) =
    LEAF_ELEMENTS.contains(tagName.toLowerCase())

  // ...
}
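A few examples of what these helpers return, matching the regular expressions above:

// isTag("<p class=\"intro\">")  -> true
// isEndTag("</p>")              -> true
// isBothTag("<br/>")            -> true
// isComment("<!-- ignored -->") -> true
// isVoidElement("img")          -> true
// getTagName("</div>")          -> "div"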

Tree algorithms for leaves

Now that we have our book in the correct structure, we need two more functions to build our page computing algorithm: find the next space after a given position and extract book content between given positions.

class BookNode {
  // ...
  def findSpaceAfter(position: Long): Long = {
    val spacePattern = "\\s".r
    // first get leaf at position
    var leaf = leafAtPosition(position)
    // for a text node, next space may be in the text node, next space character
    // after position
    // if other kind of node, next space is the start of next leaf
    if (leaf != null && leaf.end == position) {
      // we need to look in the next node
      leaf = leaf.nextLeaf()
    }
    if (leaf != null && leaf.name == "text") {
      val searchStartPosition = if (position - leaf.start + 1 > 0) {
        position - leaf.start + 1
      } else 0
      val searchContent = leaf.content.substring(searchStartPosition.toInt)
      val m = spacePattern.findFirstMatchIn(searchContent)
      if (m.isDefined) {
        return m.get.start + position + 1
      }
    }
    if (leaf != null) leaf.end
    else getDocumentEnd()
  }
  // ...
}

All content is always in leaves, so the space we are looking for, if it exists, will be in a leaf. Our first task is to find the leaf that may contain the desired space. This is either the current leaf, if the current leaf is a text node and we are not at its end, or the next leaf. Once we have the correct leaf, we search for the next space inside it. If the leaf is a text node, the next space is the first space character after the current position, or the end of the leaf if there is none. If the leaf is not a text node, meaning it is some other kind of structured node that is always treated as occupying a single position, the next space is the end of this leaf. Finally, if there is no valid leaf in which to search, we are at the end of the document, so we return the document end position.
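Using the positions from the earlier worked example, the expected results would be (illustrative):

// <p>Hi <img src="i.png"/> there</p>   positions 0 to 9
//
// findSpaceAfter(0) -> 2   (the space inside the text leaf "Hi ")
// findSpaceAfter(3) -> 4   (past the img leaf, the space starting " there")
// findSpaceAfter(4) -> 9   (no further space, so the end of the last leaf)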

class BookNode {
  // ...
  def nextLeaf(): BookNode = {
    // is this a leaf?
    var current = this
    if (current.children.isEmpty) {
      // go up the parent line until we find next sibling
      var parent = current.parent
      while (parent != null && parent.children.indexOf(current) == parent.children.size - 1) {
        current = parent
        parent = current.parent
      }
      if (parent != null) {
        // we have the next sibling in current, must find first leaf
        current = parent.children(parent.children.indexOf(current) + 1)
      } else {
        // we have reached root, this was the last leaf, there is no other
        return null
      }
    }
    // find first child of the current node
    while (current.children.nonEmpty) {
      current = current.children.head
    }
    current
  }
  // ...
}

I am also including the code for finding the next leaf in a tree, a standard tree algorithm that we can implement without recursion because every node holds references to both its children and its parent.
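On the same example tree, repeatedly calling nextLeaf visits the leaves in document order (illustrative):

// text "Hi "  ->  img  ->  text " there"  ->  null (this was the last leaf)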

The code also contains methods for finding the previous space and the previous leaf, but the final page computation algorithm never uses them, so I am not including them in this article.

With the current code, it becomes easy to compute pages in a section, and I will go over that algorithm in the next article. But once we have those pages, we need to extract the content from the section while preserving the content structure. This does not refer only to the relative structure, but to the full structure starting at the root. For example, if we have a paragraph P node containing a child text node starting at position 0 and ending at position 1000, and we want to extract the page from position 100 to position 200, we don't want only the text between those two positions. The result should contain the text between those two positions wrapped in the correct paragraph P tags, a partial paragraph keeping the full structure of the document. This should apply to all complex nested structures in our document: quotes, spans, links, tables.

class BookNode {
  // ...
  def copy(from: Long, to: Long): BookNode = {
    if (this.name == "text") {
      if (from <= this.start && this.end <= to) {
        // this node is copied whole
        new BookNode("text", this.content, this.start, this.end)
      } else if (from <= this.start && this.start <= to && to <= this.end) {
        // copy ends at this node
        new BookNode(
          this.name,
          this.content.substring(0, (to - this.start + 1).toInt),
          this.start,
          to
        )
      } else if (this.start <= from && from <= this.end && this.end <= to) {
        // copy starts at this node
        new BookNode(
          this.name,
          this.content.substring((from - this.start).toInt),
          from,
          this.end
        )
      } else if (this.start <= from && to < this.end) {
        // we only copy part of this node
        new BookNode(
          this.name,
          this.content.substring((from - this.start).toInt, (to - this.start + 1).toInt),
          from,
          to
        )
      } else {
        null
      }
    } else if (shouldBeLeafElement(this.name)) {
      if (from <= this.start && this.end <= to) {
        // include element in selection
        new BookNode(this.name, this.content, this.start, this.end)
      } else {
        null
      }
    } else {
      if (this.end < from || this.start > to) {
        // this node is outside the range and should not be copied
        null
      } else {
        val newNode = new BookNode(this.name, this.content)
        newNode.children = this.children
          .map(_.copy(from, to))
          .filter(_ != null)
        newNode.children.foreach(_.parent = newNode)
        if (newNode.children.isEmpty) {
          newNode.start = this.start
          newNode.end = this.end
        } else {
          newNode.start = newNode.children.head.start
          newNode.end = newNode.children.last.end
        }
        newNode
      }
    }
  }
  // ...
}

The desired functionality is achieved with the copy method, which gives us a new BookNode structure representing a part of the original tree. The recursive algorithm handles different kinds of nodes in different ways (see the worked example after this list):

- a text node is copied whole if it fits entirely between from and to; otherwise only the overlapping substring is copied, with its start and end adjusted, or the node is dropped if it does not overlap the range at all;
- a leaf element, like an image or a collapsed table row, is copied only if it lies entirely inside the range, otherwise it is dropped;
- any other node is dropped if it is completely outside the range; otherwise a copy of it is created, its children are copied recursively with the null results filtered out, the parent references are pointed at the new node, and its start and end are recomputed from the copied children (or taken from the original node if no children remain).
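Continuing the earlier example (illustrative, not actual program output), copying positions 1 to 5 keeps the partial text nodes, the whole img leaf, and the enclosing paragraph structure:

// original section: <p>Hi <img src="i.png"/> there</p>   positions 0 to 9
//
// calling copy(1, 5) on the body node and then extractContent() on the result
// yields: <p>i <img src="i.png"/> t</p>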

This method will effectively give us a copy of the tree that contains the content between positions from and to while keeping all the structure for the leaves included in those limits. And once we have this subtree, we can use the extractContent method to get the HTML for the page:

class BookNode {
  // ...
  @JsonIgnore
  def extractContent(): String = {
    if (this.name == "text") this.content
    else if (this.name == "body") {
      this.children.map(_.extractContent()).mkString("")
    } else if (shouldBeLeafElement(this.name) && this.children.isEmpty) {
      this.content
    } else {
      this.content +
        this.children.map(_.extractContent()).mkString("") +
        "</" + this.name + ">"
    }
  }
  // ...
}
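Putting the two methods together, here is a minimal sketch of how one page's HTML could be produced; pageHtml, sectionRoot, pageStart and pageEnd are names introduced only for this illustration and are not part of the project code:

// Hypothetical helper: extract the HTML for one page of a parsed section.
// sectionRoot is assumed to be the body BookNode produced by the parser.
def pageHtml(sectionRoot: BookNode, pageStart: Long, pageEnd: Long): String = {
  val pageTree = sectionRoot.copy(pageStart, pageEnd)
  if (pageTree != null) pageTree.extractContent() else ""
}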

Conclusion

The full implementation of this class can be viewed in the project repository, and there is more functionality included there. One use case not discussed here is keeping links inside a book functional by replacing paths inside the EPUB archive with book positions. There is also code for scanning the entire book, parsing it, and computing its length before saving it to the database.

The repository also contains the JavaScript implementation of the Scala algorithms, which is used on the UI side. With this custom parser implementation we can now guarantee consistent results on both the server side and the UI side when navigating the book based on positions, when saving and displaying read progress, and when detecting that a book has been read.