michielovertoom.com

Project Gutenberg for my eBook • 12 Jul 2009

Project Gutenberg has a nice collection of free books, but downloading them manually is tedious. I made a small program to download all the Dutch books, and do some processing on them to make them more usable with my eBook reader.

Below is a picture of my eBook reader: it's a Hanlin V3, rebranded as 'BEBOOK' by Endless Ideas. I chose this model because it is more open than other brands, and not expensive. It can run OpenInkpot (a Linux distribution for ebook readers), so it is possible to tinker with it. Maybe someday I can even run Python on it ;-)

[Image: BeBook reader]

This is what text looks like on the reader. It's perfectly readable even in bright sunlight because of the E Ink technology. I prefer plain TXT files because they render fast, give the most readable output, and honour my custom font preference:

[Image: BeBook reader]

Navigation through the directory structure is pretty simple. The reader shows eight subdirectories per screen, and you can select one by pressing the corresponding button. My program creates the directory structure in such a way that the ebooks are spread roughly evenly over the subdirectories:

[Image: BeBook reader]

And this is how the book titles look. Titles from the Gutenberg etexts are used to name the files appropriately. Also, the internal Gutenberg number is shown after the title:

[Image: BeBook reader]

The source code is on GitHub. This is from the README:

This is a set of python scripts which downloads all Dutch ebooks from Project Gutenberg, renames them to human-readable filenames, formats them so they display well on my ebook reader, and tosses them into subdirectories for easier navigation.

How to use:

- Run bulkdownload.py to download the raw texts from Project Gutenberg's website.
- Run gutenberg.py to reformat and rename the raw texts.
- Run toss.py to distribute them over subdirectories.

After that, upload them to your eBook reader, and enjoy!

How it works

Downloading

Each book has its own unique Gutenberg number. The download script bulkdownload.py requests the Dutch index page from the Gutenberg website, then uses a simple regexp to pull out the book ids, which are five-digit numbers. For this simple case I don't use a full HTML parser like BeautifulSoup or lxml.html (I will in the next version, I promise, since parsing HTML using regexps is a cardinal sin ;-).

Book ids appear like this in the index: "/etext/17077"

    import re
    import urllib2

    hrefpat = re.compile("href=\"\/etext\/([0-9]{5})\"")
    ids = set()
    f = urllib2.urlopen("http://www.gutenberg.org/browse/languages/nl")
    for line in f:
        m = hrefpat.search(line)
        if m:
            ids.add(m.group(1))
    f.close()
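The pattern can be tried out without touching the network. The following is a minimal sketch in Python 3 (the scripts in this post are Python 2) that applies the same regexp to a made-up HTML fragment. The sample markup and the function name extract_ids are assumptions for illustration, not part of bulkdownload.py. Unlike the line-by-line loop above, findall picks up every match, including several on one line.

```python
import re

# Same pattern as in bulkdownload.py, written as a raw string.
HREF_PATTERN = re.compile(r'href="/etext/([0-9]{5})"')


def extract_ids(html):
    """Return the set of five-digit book ids linked from the given HTML."""
    return set(HREF_PATTERN.findall(html))


# Made-up fragment resembling an index page; not real Gutenberg markup.
sample = '''
<li><a href="/etext/17077">Some Dutch book</a></li>
<li><a href="/etext/17077">Same book, linked twice</a></li>
<li><a href="/etext/9999">Four-digit id, not matched by this pattern</a></li>
<li><a href="/etext/20123">Another book</a></li>
'''

ids = extract_ids(sample)
print(sorted(ids))  # ['17077', '20123']
```

Using a set means a book that is linked more than once is still fetched only once.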

After the ids are collected, the books themselves are fetched. Books which have been downloaded earlier are skipped to save time and bandwidth. The 5-digit Dutch books all have URLs like http://www.gutenberg.org/files/99999/99999-8.txt

    import os

    for id in ids:
        ofn = "%s-8.txt" % id
        if os.path.isfile(ofn):
            print "Already exists:", ofn
            continue
        url = "http://www.gutenberg.org/files/%s/%s-8.txt" % (id, id)
        print url
        try:
            f = urllib2.urlopen(url)
        except urllib2.HTTPError:
            print "Warning: Can't fetch:", url
            continue
        of = open(ofn, "wb")
        of.write(f.read())
        of.close()
        f.close()
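The skip-if-present logic is easy to exercise offline. Below is a rough Python 3 sketch in which the HTTP request is replaced by a hypothetical fetch callable, so the behaviour on a second run can be checked without downloading anything. The names download_missing and fake_fetch, and the convention that a failed fetch returns None, are assumptions made for this example.

```python
import os
import tempfile


def download_missing(ids, fetch, directory):
    """Fetch each book not yet present in directory; return the ids fetched."""
    fetched = []
    for book_id in sorted(ids):
        path = os.path.join(directory, "%s-8.txt" % book_id)
        if os.path.isfile(path):
            continue  # downloaded on an earlier run
        data = fetch(book_id)
        if data is None:
            continue  # fetch failed; skip this book
        with open(path, "wb") as out:
            out.write(data)
        fetched.append(book_id)
    return fetched


def fake_fetch(book_id):
    """Stand-in for the HTTP request; pretends book 11111 is unavailable."""
    if book_id == "11111":
        return None
    return ("text of book %s" % book_id).encode("ascii")


workdir = tempfile.mkdtemp()
wanted = {"17077", "20123", "11111"}
first_run = download_missing(wanted, fake_fetch, workdir)
second_run = download_missing(wanted, fake_fetch, workdir)
print(first_run)   # ['17077', '20123']
print(second_run)  # []
```

The second run fetches nothing: two files already exist and the third book still fails, which mirrors the "Already exists" and "Can't fetch" branches above.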

Tidying up

After downloading, the script gutenberg.py processes the raw texts and produces properly named and formatted output files. It also removes redundant boilerplate text: when opening a book, I don't want to flip through multiple pages before I arrive at the reading material.

The input files are determined, and each one is 'beautified':

    sourcepattern = re.compile("^[0-9]{4,5}\-[0-9]\.txt$")
    for fn in os.listdir("."):
        if sourcepattern.match(fn):
            beautify(fn)

The beautify() function processes the input file line by line:

    def beautify(fn):
        ''' Reads a raw Project Gutenberg etext, reformats paragraphs, and removes fluff.
            Determines the title of the book and uses it as a filename to write the
            resulting output text. '''
        lines = [line.strip() for line in open(fn)]

Basically it's a big state machine which scans through all lines from start to end, and collects only the middle part between the START and END markers. It also grabs the title and subtitle of the text.

        collect = False
        lookforsubtitle = False
        outlines = []
        startseen = endseen = False
        title = ""
        for line in lines:
            if line.startswith("Title: "):
                title = line[7:]
                lookforsubtitle = True
                continue
            if lookforsubtitle:
                if not line.strip():
                    lookforsubtitle = False
                else:
                    subtitle = line.strip()
                    subtitle = subtitle.strip(".")
                    title += ", " + subtitle
            if ("*** START" in line) or ("***START" in line):
                collect = startseen = True
                paragraph = ""
                continue
            if ("*** END" in line) or ("***END" in line):
                endseen = True
                break
            if not collect:
                continue
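To see the marker scanning in isolation, here is a simplified Python 3 sketch of the same idea. It leaves out the subtitle handling, and the function name extract_body and the sample lines are made up for the example; only the "Title: " prefix and the START/END marker checks are taken from the code above.

```python
def extract_body(lines):
    """Return (title, body_lines) for a Gutenberg-style list of lines."""
    title = ""
    body = []
    collect = False
    for line in lines:
        line = line.strip()
        if line.startswith("Title: "):
            title = line[len("Title: "):]
            continue
        if "*** START" in line or "***START" in line:
            collect = True  # everything after this marker is book text
            continue
        if "*** END" in line or "***END" in line:
            break  # everything after this marker is license text
        if collect:
            body.append(line)
    return title, body


# Made-up sample input in the general shape of a Gutenberg etext.
sample = [
    "The Project Gutenberg EBook of Example Book",
    "Title: Example Book",
    "",
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE BOOK ***",
    "First line.",
    "Second line.",
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE BOOK ***",
    "License text that should not be kept.",
]

title, body = extract_body(sample)
print(title)  # Example Book
print(body)   # ['First line.', 'Second line.']
```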

Lines that are to appear in the output file are collected and concatenated into paragraphs. Each paragraph is output as one line, which the eBook reader will wrap in a readable way. An empty line is seen as the start of a new paragraph. Without this concatenation, the result would look horribly fragmented on the eBook reader.

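            # 'remove' holds the boilerplate prefixes to drop; it is defined
            # elsewhere in the script (not shown here).
            if not line:
                paragraph = paragraph.strip()
                for term in remove:
                    if paragraph.startswith(term):
                        paragraph = ""
                if paragraph:
                    outlines.append(paragraph)
                    outlines.append("")
                    paragraph = ""
            else:
                paragraph += " " + line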
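The paragraph joining can be sketched as a small standalone Python 3 function. The name join_paragraphs, the sample lines and the "Produced by" prefix are assumptions for illustration. One deliberate difference from the loop above: this version appends an empty line to the input so that a final paragraph without a trailing blank line is flushed too.

```python
def join_paragraphs(lines, remove=()):
    """Join consecutive non-empty lines into single-line paragraphs."""
    out = []
    paragraph = ""
    # The extra "" flushes a last paragraph that has no trailing blank line.
    for line in list(lines) + [""]:
        line = line.strip()
        if line:
            paragraph += " " + line
            continue
        paragraph = paragraph.strip()
        # Drop paragraphs that start with one of the unwanted prefixes.
        if paragraph and not paragraph.startswith(tuple(remove)):
            out.append(paragraph)
            out.append("")
        paragraph = ""
    return out


# Made-up input: one boilerplate paragraph, one wrapped paragraph, one short one.
raw = [
    "Produced by volunteers",
    "",
    "It was a dark",
    "and stormy night.",
    "",
    "The end.",
]

joined = join_paragraphs(raw, remove=["Produced by"])
print(joined)  # ['It was a dark and stormy night.', '', 'The end.', '']
```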

When the text has been read and processed, it's time to write the output. The output file is made recognizable by using the title of the book as the file name. The filename can't be too long, and characters that are not allowed in filenames are replaced with alternatives.

        ofn = title[:150] + ", " + fn
        ofn = ofn.replace("&", "en")
        ofn = ofn.replace("/", "-")
        ofn = ofn.replace("\"", "'")
        ofn = ofn.replace(":", ";")
        ofn = ofn.replace(",,", ",")
        f = open(ofn, "wt")
        f.write("\n".join(outlines))
        f.close()
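To make the effect of the replacements concrete, here is a small Python 3 sketch that wraps the same substitutions in a function and runs them on two made-up titles. The function name make_filename and its max_title_length parameter are hypothetical; the replacement pairs and the 150-character limit come from the code above.

```python
def make_filename(title, source_name, max_title_length=150):
    """Build an output filename from a book title and the source filename."""
    name = title[:max_title_length] + ", " + source_name
    replacements = (
        ("&", "en"),   # Dutch for "and"
        ("/", "-"),    # path separator
        ('"', "'"),
        (":", ";"),
        (",,", ","),   # a title ending in a comma would otherwise double up
    )
    for old, new in replacements:
        name = name.replace(old, new)
    return name


print(make_filename('Jan & Piet: een "verhaal"', "17077-8.txt"))
# Jan en Piet; een 'verhaal', 17077-8.txt
print(make_filename("Gedichten,", "20123-8.txt"))
# Gedichten, 20123-8.txt
```

Keeping the original filename at the end is what makes the Gutenberg number show up after the title on the reader.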

Organizing

All 400 texts in one directory would make navigation on the eBook reader cumbersome, so I decided to distribute the texts evenly over a set of subdirectories. Each subdirectory is named after the first and last starting letters of the titles it contains. The script toss.py first determines which files need to be tossed:

    fns = []
    skippattern = re.compile("^[0-9]{4,5}\-[0-9]\.txt$")
    for fn in glob.glob("*.txt"):
        if skippattern.match(fn):
            continue
        fns.append(fn)

...then it tallies the files by starting letter:

    startlettercount = [0] * 26
    for fn in fns:
        startletter = fn[0].lower()
        # Clamp anything outside a..z (digits, accented letters) into range.
        startletter = min(max(startletter, 'a'), 'z')
        bin = ord(startletter) - ord('a')
        startlettercount[bin] += 1
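The min/max clamp is worth a closer look: a filename starting with a digit or punctuation that sorts before 'a' is counted under 'a', and one starting with a character that sorts after 'z' (an accented letter, for instance) is counted under 'z'. A minimal Python 3 sketch, with made-up filenames and the hypothetical helper names letter_bin and tally:

```python
def letter_bin(filename):
    """Map a filename to a bin index 0..25 based on its first character."""
    first = filename[0].lower()
    first = min(max(first, "a"), "z")  # clamp into the range a..z
    return ord(first) - ord("a")


def tally(filenames):
    """Count filenames per starting letter."""
    counts = [0] * 26
    for fn in filenames:
        counts[letter_bin(fn)] += 1
    return counts


# Made-up filenames; "\u00c9" is a capital E with an acute accent.
names = ["Aap.txt", "1001 dingen.txt", "zebra.txt", "\u00c9mile.txt", "Boek.txt"]
counts = tally(names)
print(counts[0], counts[1], counts[25])  # 2 1 2
```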

Next, the files are distributed over the destination directories.

    subdircount = 8
    # For each subdirectory, the letters that will be tossed into it.
    subdirletters = [[] for x in range(subdircount)]
    approxfilespersubdir = len(fns) / subdircount
    currentsubdir = 0
    filesinsubdir = 0
    for i in range(26):
        startletter = chr(ord('a') + i)
        subdirletters[currentsubdir].append(startletter)
        filesinsubdir += startlettercount[i]
        if filesinsubdir > approxfilespersubdir:
            currentsubdir += 1
            # Clamp to the last valid index.
            currentsubdir = min(currentsubdir, subdircount - 1)
            filesinsubdir = 0
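A worked example helps to see how the letters end up grouped. The Python 3 sketch below repeats the greedy loop as a function (the name distribute is an assumption) and runs it on a made-up tally of two files per letter. With 52 files and 8 subdirectories the target is 6 files per directory, so a directory is closed once it holds more than 6, which here means after four letters. Note that Python 3 needs // for the integer division that / performs in the Python 2 code above.

```python
def distribute(counts, subdircount=8):
    """Greedily group the letters a..z so each group holds a similar file count."""
    approx = sum(counts) // subdircount
    groups = [[] for _ in range(subdircount)]
    current = 0
    filled = 0
    for i, count in enumerate(counts):
        groups[current].append(chr(ord("a") + i))
        filled += count
        if filled > approx:
            # This group is full: move on, but never past the last group.
            current = min(current + 1, subdircount - 1)
            filled = 0
    return groups


# Made-up tally: exactly two files for every starting letter.
groups = distribute([2] * 26)
for letters in groups:
    print("".join(letters) or "(empty)")
```

For this input the groups come out as abcd, efgh, ijkl, mnop, qrst, uvwx and yz, and the eighth group stays empty. Because a group is only closed after it exceeds the target, the result is approximately even rather than optimal.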

Finally, the subdirectories are created, and the files are moved into them.

    for s in subdirletters:
        if not s:
            continue
        if len(s) == 1:
            subdirname = s[0]
        else:
            subdirname = s[0] + "-" + s[-1]
        if not os.path.isdir(subdirname):
            os.mkdir(subdirname)
        for letter in s:
            print "Tossing files starting with '%s' to '%s'" % (letter, subdirname)
            for fn in fns:
                startletter = fn[0].lower()
                startletter = min(max(startletter, 'a'), 'z')
                if startletter == letter:
                    os.rename(fn, os.path.join(subdirname, fn))
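Before moving files for real, it can be reassuring to do a dry run. The following Python 3 sketch computes where each file would go without touching the filesystem. The function names subdir_name and plan_moves, the letter groups and the filenames are all made up for the example; the naming rule and the letter clamp follow the code above.

```python
import os


def subdir_name(letters):
    """Name a subdirectory after its first and last letter, e.g. 'a-d'."""
    if len(letters) == 1:
        return letters[0]
    return letters[0] + "-" + letters[-1]


def plan_moves(filenames, groups):
    """Return {filename: destination path} without moving anything."""
    destination = {}
    for letters in groups:
        if not letters:
            continue  # unused group, no directory needed
        for letter in letters:
            destination[letter] = subdir_name(letters)
    moves = {}
    for fn in filenames:
        first = min(max(fn[0].lower(), "a"), "z")
        moves[fn] = os.path.join(destination[first], fn)
    return moves


# Made-up letter groups covering a..z, and made-up filenames.
groups = [list("abcd"), ["e"], list("fghijklmnopqrstuvwxyz"), []]
files = [
    "Dagboek, 12345-8.txt",
    "Een reis, 23456-8.txt",
    "1001 dingen, 34567-8.txt",
    "Zomer, 45678-8.txt",
]

moves = plan_moves(files, groups)
for source in files:
    print(source, "->", moves[source])
```

The path separator in the printed destinations depends on the platform, since os.path.join is used just as in toss.py.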

That's it! All that remains is to upload the material to the eBook reader. The computer sees the eBook reader as an external USB disk when you plug it in, so that can be done with a simple copy or move.

Update: Matthew L. Jockers has modified these scripts to output TEI-XML compatible format. You can find his version here.

content last edited on March 2, 2011, 17:59