An idea had been mooted on the Internet that there should be a copy of the Anna Raccoon site for posterity.
With a background in programming, WordPress and hosting, I offered my services, if required, as I had always been a great admirer of Anna and her writing. When Anna died, the idea became a more urgent reality. My offer of help was accepted. I didn’t realise quite what was ahead of me!
The first port of call was the hosting company that had hosted annaraccoon.com, in the hope that there would be an archived backup of the site’s database. A copy of that would have made restoration the simplest of jobs. Unfortunately, they had deleted all records of the site, including all backups. The only remnant was the domain name.
Nevertheless, I took over the domain and set up a server to receive any efforts at a new site.
In the meantime, offers of help had come in, along with news of copies of the site. I received those copies and found myself with a series of archives, all in slightly different formats. I was now the proud possessor of several gigabytes of highly compressed files.
The files were essentially in two formats – MHTML (a single-file archive format which can only be read with difficulty and is not at all suitable for a full website) and plain HTML files which had been downloaded using HTTrack or WGET. At least the latter were readable in any browser.
I found a way of extracting the information within the MHTML files: each file was unpacked into its own folder containing the original HTML, style sheets and images. Each folder held between 16 and 133 individual files.
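For the curious, an MHTML file is just a MIME-encoded bundle of a page and its resources, so it can be split apart along the following lines. This is a simplified sketch in Python rather than the actual programme, and the function and file names are illustrative only.

```python
import email
import pathlib
from email import policy

EXT = {"text/html": ".html", "text/css": ".css",
       "image/jpeg": ".jpg", "image/png": ".png", "image/gif": ".gif"}

def unpack_mhtml(mht_path: str, out_dir: str) -> None:
    """Split one MHTML archive into its component HTML, style sheet and image files."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(mht_path, "rb") as fh:
        msg = email.message_from_binary_file(fh, policy=policy.default)
    for i, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue                      # container parts hold no content themselves
        payload = part.get_content()      # str for text parts, bytes for images
        if isinstance(payload, str):
            payload = payload.encode("utf-8")
        # Use the original filename where one is given, otherwise invent one.
        name = part.get_filename() or f"part-{i}{EXT.get(part.get_content_type(), '.bin')}"
        (out / pathlib.Path(name).name).write_bytes(payload)
```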
So I now had thousands of individual files, each of which had to be converted from a static file to an entry in a database.
I wrote some programmes with the following objectives:
– Cycle through all the folders, extracting the one file in each that held the original post.
– Parse that file, extracting the Title, Date of posting, Author, and the Post itself along with its comments, while stripping out all the surrounding page code and surplus elements (such as Facebook/Twitter/etc. buttons).
– Generate a “name” for the post as well as the Title – for example, the title “Sample Post” would get the name “sample-post” – naturally with a check for duplicates (sketched below).
– Meanwhile, another programme had to move any image files up to a central collection point and rewrite all the links to those images.
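To show what the “name” generation involved, here is a simplified sketch in Python rather than the actual code; the `seen` set is a stand-in for the duplicate check against the database.

```python
import re

def make_name(title: str, seen: set) -> str:
    """Turn a post title into a post name, e.g. "Sample Post" -> "sample-post"."""
    name = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Duplicate check: append -2, -3, ... until the name is unique.
    candidate, n = name, 2
    while candidate in seen:
        candidate = f"{name}-{n}"
        n += 1
    seen.add(candidate)
    return candidate
```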
All the extracted information was then inserted into the database.
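In outline, each parsed post became one row in a posts table. The sketch below uses SQLite purely for illustration – the table and column names are my own shorthand, not those of the live database.

```python
import sqlite3

conn = sqlite3.connect("annaraccoon.db")
conn.execute("""CREATE TABLE IF NOT EXISTS posts (
                   id        INTEGER PRIMARY KEY,
                   name      TEXT UNIQUE,   -- e.g. "sample-post"
                   title     TEXT,
                   author    TEXT,
                   posted_on TEXT,          -- date of posting
                   body      TEXT,          -- the post itself
                   comments  TEXT)""")

def store_post(post: dict) -> None:
    """Insert one parsed post; the UNIQUE name backs up the duplicate check."""
    conn.execute(
        "INSERT INTO posts (name, title, author, posted_on, body, comments) "
        "VALUES (:name, :title, :author, :posted_on, :body, :comments)",
        post,
    )
    conn.commit()
```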
These programmes had to be modified for each of the archives I had received due to their different layouts.
To give an idea of the reasons behind all this work, we can compare the two approaches.
I uploaded one of the archives in its raw state onto the server. It took around 5 GB of space and was almost impossible to navigate. There were no search facilities either.
I replaced it with the new version, which has an 88 MB database, is fully searchable and can be indexed in different ways. So it is searchable, nearly 5 GB smaller and much, much faster.
Enjoy!
Richard [Curratech]
a.k.a.
Grandad [Head Rambles]