How to Create an e-Book from an Online Reading Site

I recently bought an Amazon Kindle, primarily to read more Chinese. It has turned out to be a great investment, since I am no longer tied to my computer screen for reading things I find online. I had been collecting bookmarks to online book sites for a long time without making much use of them. Now that I read more, I’m starting to put them to use. In particular, I need sites that allow downloading the raw text, so that I can convert it into a formatted book. Thus, I’m not interested in sites that use Flash for online-only reading, or sites that only publish EXE or PDF files (the latter being somewhat clunky on an e-reader).

Below I describe the steps that have worked for me for turning an online text into a portable e-book. Be forewarned that I have applied the changeLocale hack to my Kindle, in order to avoid known issues with displaying Chinese on it. Almost everything I describe should work for you, but you may need to tinker with the encoding or file naming if any issues appear.

Online Reading Sites

The following list of reading sites is quite small compared to the number of sites that exist, but it is made up of the links I have collected over the years that have interesting content.

shuku.net: Contains books from many prominent writers
Tianyabook 天涯在线书库
readnovel 小说阅读网: User-contributed novels, with a good selection of light fiction organized by genre. Note that the genres come in separate 男生, 女生, and 校园 editions
du8.com: Looks like current books; some content is local while some is linked to off-site locations
Hongxiu 红袖添香小说网
xiaoshuo.com 小说网
Shucang 书仓网
cnread 中国读书网
cnepub 掌上书苑: Text can’t be read online; formatted books can only be downloaded after registering. Has some manga, so it may be worth registering to get the formatted books
HiFiWiKi: Not much selection, but books are preformatted to many e-reader formats. LOL, includes full Harry Potter books for download
Yeeyan 译言网: (honorable mention) Doesn’t contain novels, but user translations of English texts; useful for parallel text studies or checking comprehension

Making an E-Book

Starting from a website with individual chapters online, there are a few steps I take to turn them into a formatted e-book. The general process is: 1) access every chapter and extract the text; 2) concatenate the texts into a single file; 3) mark up the chapter titles which will become table of contents entries; and 4) use Calibre to format the text file as Mobipocket (for Kindle) or ePUB (for other e-readers). For smaller books with few chapters, the first three steps can be done simply by clicking on each chapter, highlighting the text, and pasting it into a running text file. For larger books with many chapters, automating the process becomes increasingly useful.

Downloading the Chapters

Many of the sites listed above split their content into chapters, each linked separately from the main item page. Some subdivide the chapter pages further into separate web pages for each page of the physical book. Whatever the format, each content page will be surrounded by the usual web page banners, footers, navigation, and advertisements. The actual content needs to be extracted from the web pages and combined into the final text document before further processing. For books with few chapters, it is simple enough to click on each link, highlight the text, copy it, and paste it into the target document. But the latest books I converted each had over 50 chapters. At that point, automation not only makes the process faster, but also reduces the risk of errors such as accidentally missing a chapter link or incompletely highlighting the text.

For batch downloading of files, there are two tools I find particularly useful. The first is a Firefox add-on called DownThemAll!. This extension presents all the links and images contained in a web page in a convenient list, where you can check off any or all of the items to download at once. Especially useful is the Fast Filtering text box, which can select all the relevant chapters at once with a simple matching pattern. Instead of, or in combination with, the match pattern, you can select files individually by clicking on them. DownThemAll will handle the batch downloading of all the selected files, similar to many download accelerators.

[Image: DownThemAll - item selection]

The other tool is the command line program wget. This utility is included by default in many Unix-like systems, and can be downloaded for many other operating systems. The main power of wget is in its multitude of parameters, which can do anything from downloading a single file to spidering a whole website. For the purpose of grabbing chapters from a book site, I will often create a file listing all the URLs that contain the text, one per line, and then type wget --no-directories -i files.txt to download them all to the current directory. This method is especially useful when the chapter URLs are the same except for an incrementing number; the list can be quickly generated with some help from Excel’s Fill Series command.
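As an alternative to a spreadsheet, the URL list can also be generated with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and the URL pattern is a hypothetical placeholder that has to be replaced with the real pattern from the book's index page.

```python
import sys


def chapter_urls(template: str, first: int, last: int) -> list[str]:
    """Return one URL per chapter number from first to last, inclusive.

    The template uses Python format syntax with a field named n,
    for example "{n}.html" or "{n:03d}.html" for zero-padded numbers.
    """
    return [template.format(n=n) for n in range(first, last + 1)]


def url_list_text(urls: list[str]) -> str:
    """Format the URLs one per line, the layout that wget -i expects."""
    return "\n".join(urls) + "\n"


if __name__ == "__main__":
    # Hypothetical URL pattern and chapter range; substitute the real ones.
    urls = chapter_urls("http://example.com/book/123/{n}.html", 1, 50)
    sys.stdout.write(url_list_text(urls))
```

Saved under a name such as make_urls.py (also just an example), the script can be run as python make_urls.py > files.txt, and the resulting file passed to wget as described above.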

Extracting the Text

At this point, I should have a large number of HTML files on my hard drive. Now I can use scripts to go through each file in turn and extract the important text from the surrounding web page data. Each site lays out its pages differently, but within a particular site the page content is almost certainly template-driven, so every page will have the same markers signaling the start and end of the real content. For example, the tag <div id="content"> may signal the start of the text, the next <h2> tag may be the chapter title, and the next </div> the end of the text. If I can identify the chapter title, I use the opportunity to insert a row of dashes under it, for reasons that will be clear later.
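The marker-based extraction described above might look roughly like the following in Python. It is a sketch under stated assumptions, not a general HTML parser: it assumes the example markers from the paragraph above (<div id="content">, <h2>, and the next </div>), assumes there are no nested <div> tags inside the content, and the function name extract_chapter is invented for illustration. Real sites will need their own patterns.

```python
import html
import re

# Assumed markers, taken from the example in the text; adjust per site.
CONTENT_RE = re.compile(r'<div id="content">(.*?)</div>', re.DOTALL)
TITLE_RE = re.compile(r"<h2>(.*?)</h2>", re.DOTALL)
BREAK_RE = re.compile(r"(?i)<br\s*/?>|</p>")
TAG_RE = re.compile(r"<[^>]+>")


def extract_chapter(page: str) -> str:
    """Return the chapter as plain text, with dashes under the title."""
    match = CONTENT_RE.search(page)
    if match is None:
        raise ValueError("content markers not found")
    content = match.group(1)

    # Pull out the chapter title, if the page has one.
    title = ""
    title_match = TITLE_RE.search(content)
    if title_match:
        title = html.unescape(TAG_RE.sub("", title_match.group(1))).strip()
        content = content[title_match.end():]

    # Turn line-break and paragraph-end tags into newlines,
    # then drop all remaining tags and decode HTML entities.
    content = BREAK_RE.sub("\n", content)
    body = html.unescape(TAG_RE.sub("", content))
    paragraphs = [line.strip() for line in body.splitlines() if line.strip()]

    parts = []
    if title:
        # A row of dashes under the title marks it as a chapter heading.
        parts.append(title + "\n" + "-" * 10)
    parts.extend(paragraphs)
    return "\n\n".join(parts) + "\n"
```

Each paragraph ends up on its own line, separated by blank lines, and a page without the expected markers raises an error instead of silently producing an empty chapter.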

You can see some example Perl scripts to do this parsing in this directory. The three scripts in this directory are each tuned to the format of one particular book from tianyabook or readnovel, so it’s possible they won’t work well even with other content from the same site. However, they could be useful for illustrating certain issues, such as GB to Unicode conversion and parsing of text.
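The scripts mentioned above are written in Perl. For comparison, here is how the GB to Unicode step could be sketched in Python instead. It assumes the downloaded pages are encoded in GB2312 or GBK; Python's gb18030 codec is a superset of both, so it is used for decoding. The function names are illustrative only.

```python
from pathlib import Path


def decode_gb(raw: bytes, errors: str = "strict") -> str:
    """Decode GB-encoded bytes (GB2312, GBK, or GB18030) into a string."""
    return raw.decode("gb18030", errors=errors)


def convert_file(src: str, dst: str) -> None:
    """Read a GB-encoded file and write its contents back out as UTF-8.

    Undecodable bytes are replaced rather than aborting the conversion,
    which is an assumption about what is acceptable for a novel.
    """
    text = decode_gb(Path(src).read_bytes(), errors="replace")
    Path(dst).write_bytes(text.encode("utf-8"))
```

If a page turns out to be UTF-8 already, decoding it as GB will produce garbage, so it is worth checking the charset declared in the page header before converting.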

Making the Book

Calibre is one of those amazing do-it-all programs that is practically only limited by what you haven’t yet discovered it can do. At minimum, it can manage, catalog, and tag e-book material, and sync it between computer and multiple e-readers. But it can also do file conversion between many different e-book formats: text, PDF, MOBI, ePub, RTF, and many others specific to particular devices.

Importing the text file is as simple as dragging the file onto the Calibre icon or into the window of the open program. Some prerequisites are necessary to get the whole process to work well. I have only gotten the import to work when a) the file name ends in .txt; and b) the file is encoded in UTF-8. It is also useful to unwrap the text so that each paragraph takes up a single line. Calibre does have a conversion option to undo hard line breaks, but I haven’t needed it so far. One trick I just discovered (via the source code) is that if the first line of the text is the title and, after two blank lines, the fourth line is the author, Calibre will detect these and automatically set the title and author of the imported book. Alternatively, if you name the file in the format “title – author.txt”, Calibre will also set the title and author based on this.
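The prerequisites above can be handled in a short Python script. This is a rough sketch with invented function names: unwrap joins hard-wrapped lines so that each paragraph sits on a single line, and book_text puts the title on the first line and the author on the fourth. It assumes paragraphs are separated by blank lines, and unwrap should be applied to the body text before any rows of dashes are added under chapter titles, since it would otherwise join the dashes onto the title line.

```python
from pathlib import Path


def unwrap(text: str, joiner: str = "") -> str:
    """Join the lines of each paragraph into a single line.

    Paragraphs are assumed to be separated by blank lines. The default
    joiner is empty, which suits Chinese; use " " for English text.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            paragraphs.append(joiner.join(current))
            current = []
    if current:
        paragraphs.append(joiner.join(current))
    return "\n\n".join(paragraphs)


def book_text(title: str, author: str, body: str) -> str:
    """Put the title on line 1 and the author on line 4, then the body."""
    return f"{title}\n\n\n{author}\n\n{body}\n"


def save_utf8(path: str, text: str) -> None:
    """Write the text as UTF-8; the path should end in .txt."""
    Path(path).write_bytes(text.encode("utf-8"))
```

For example, save_utf8("nahan.txt", book_text("呐喊", "鲁迅", unwrap(body))) would produce a file that meets both import conditions, where body is the combined chapter text and the file name is just an example.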

After dragging the book into Calibre, I will then edit the resulting entry to add more information using the “edit metadata” option. This is where I can edit the title to the actual Chinese name, add the author, and include other notes such as the source URL, publication date, and summary from its original source page. There is also an option to add an image as a book cover, but as my Kindle doesn’t show thumbnails in its book listing, I don’t usually bother with it.

[Image: Calibre - editing metadata]

To convert the book, highlight the entry and choose “Convert book(s)”. The input format will be text, and you can choose the output format from a number of options. The default settings for text import are generally fine, but there is one important setting to change. In the conversion dialog box, under “TXT Input”, there is an option called “process using markdown”. Markdown is a simple formatting syntax for text files, indicating where headings, bold and italics, lists, and other markup should go when converting to HTML or another format. This is where adding the dashes under the chapter titles becomes useful. In Markdown syntax, a line of text with a row of dashes directly beneath it is a level 2 heading, which Calibre recognizes as a chapter title. Unless told not to, Calibre will not only format these titles as bold and underlined, but also create a table of contents from them and insert it at the beginning of the book.
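To make the heading syntax concrete, here is a small Python sketch that produces a level 2 Markdown heading for a chapter title. Standard Markdown has two equivalent spellings for it: the row of dashes under the title used in this post, and the form with two # symbols that a reader reports in the comments below. The function name and the style labels are invented for illustration, and which form a given Calibre version accepts is something to verify by trial.

```python
def chapter_heading(title: str, style: str = "dashes") -> str:
    """Return a Markdown level 2 heading for a chapter title.

    style "dashes": the title followed by a row of dashes on the next line.
    style "hashes": the title wrapped as "## title ##".
    """
    if style == "dashes":
        # Any number of dashes works; use at least three for readability.
        return f"{title}\n{'-' * max(3, len(title))}"
    if style == "hashes":
        return f"## {title} ##"
    raise ValueError(f"unknown heading style: {style}")
```

In either form, the heading should be separated from the surrounding paragraphs by a blank line so that it is not read as part of the body text.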

The Result

[Image: Kindle - reading a page from 呐喊]

This is a picture of Lu Xun’s story collection 《呐喊》 loaded onto a Kindle as an e-book created by following the above process. Note the chapter markers at the bottom of the window, indicating my current location in the book. I downloaded the individual chapters from here, and subsequently used this script to create the combined text file. I’ve made the text file, as well as the .mobi and .epub files, available for reference.

Resources

Nahan – Lu Xun.txt
Nahan – Lu Xun.mobi
Nahan – Lu Xun.epub
ebook_creator.zip

This Post Has 2 Comments

Carmel James January 15, 2011

This is a fabulous idea. Thanks for taking the time to share. I will definitely try it … and get myself a kindle!!
Chad May 26, 2011

The Markdown trick of using a row of dashes to indicate a chapter heading doesn’t seem to be working in the current version of Calibre (0.8.2). However, the alternate Markdown format of “## Chapter title ##” (two ‘#’ symbols and a space before the title) works instead.
