Wesley R. Elsberry
Joined: May 2002
Given the propensity for re-writing the record on various antievolution weblogs, I'm looking at means of archiving the moment-to-moment versions of posts and comments on weblogs.
One intriguing approach is the FeedWordPress plugin for WordPress. You point the plugin at an RSS feed, and it adds a post to a WordPress weblog for each item in the feed. I've already set up a WordPress blog with the plugin to try this out. It isn't a perfect solution for this particular application yet, though, as I commented on that site:
I have a slightly different interest in the project. What I want to accomplish is to completely mirror current posts and commentary at a particular weblog, without updating past items if they are modified. Having new entries made would be OK when changes occur. So far, FeedWordPress is about as close a solution as I have seen, but the big things standing between what it is and what I want are: 1) incomplete posts/comments (it appears to store what comes across in the RSS feed and doesn’t get the more extensive original text) and 2) updating of items when changes are found in the feed (I want a log of what each version was).
If I assume that I get to make modifications to get what I want, then there are some things to do. Turning off updates will be specific to the FeedWordPress plugin. Beyond that, there is the matter of getting the full page rather than just the RSS description.
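To make the "log of what each version was" idea concrete, here is a minimal sketch in Python rather than PHP. It is not how FeedWordPress works; the VersionLog class and its record method are names I made up for illustration, and it keeps everything in memory where a real archive would need a database or files. The point is only that a changed item gets appended as a new version and nothing is ever overwritten.

```python
import hashlib
from datetime import datetime, timezone


class VersionLog:
    """Append-only log: each distinct body of an item becomes a new version.

    Hypothetical helper for illustration; not part of FeedWordPress.
    """

    def __init__(self):
        # item_id -> list of (sha256 hex digest, UTC timestamp, body)
        self.versions = {}

    def record(self, item_id, body):
        """Store body as a new version if it differs from the latest one.

        Returns True when a new version was appended, False otherwise.
        """
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        history = self.versions.setdefault(item_id, [])
        if history and history[-1][0] == digest:
            return False  # unchanged since the last poll
        # Earlier versions are never modified, only added to.
        history.append((digest, datetime.now(timezone.utc), body))
        return True
```

Calling record() on every poll of the feed with the item's permalink as item_id would leave one entry per edit. Because the comparison is against the latest version only, a post that is changed and later changed back shows up as three versions, which is what I would want from a record of rewrites.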
I'm thinking that perhaps the way to handle this is to produce a mirror of each version of a post. This may require a fair amount of programming, and possibly stepping outside of PHP. Breaking this down, there are several tasks: reading and parsing the original page, downloading each element of the page, relativizing links to elements in the page, creating an MHT archive of the collected elements of the page, and linking to the locally-stored copy of the MHT archive.
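As a rough sketch of the parsing and link-relativizing steps, again in Python and using only the standard library: the AssetCollector class and relativize function below are hypothetical names, and the approach is deliberately naive. It only looks at img, script, and link tags, only rewrites double-quoted attribute values, does not handle attribute values containing character entities, and does not guard against two elements ending up with the same local file name. Downloading is left out; the function just returns the list of URLs to fetch.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urljoin, urlparse

# (tag, attribute) pairs treated as page elements to download (an assumption)
ASSET_ATTRS = {("img", "src"), ("script", "src"), ("link", "href")}


class AssetCollector(HTMLParser):
    """Collect the element references found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = {}  # raw attribute value -> (absolute URL, local name)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in ASSET_ATTRS:
                absolute = urljoin(self.base_url, value)
                local = basename(urlparse(absolute).path) or "asset"
                self.assets[value] = (absolute, local)


def relativize(html, base_url):
    """Return (rewritten HTML, sorted list of absolute URLs to download)."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    for raw, (_, local) in collector.assets.items():
        # Point each reference at a file in the same directory as the page.
        html = html.replace('"%s"' % raw, '"%s"' % local)
    urls = sorted(absolute for absolute, _ in collector.assets.values())
    return html, urls
```

Ordinary hyperlinks to other pages are left alone, since those are not elements of the page being archived.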
There is a PHP solution to making an MHT archive once the pieces are collected: MHT FileMaker.
A big part of the remaining work would be to match up a mirroring tool with this code. That implies a tool that does the retrieval and URL fixup while putting all the elements into one directory. The PHP MHTFileMaker class could then be called to generate the MHT file from the files in the directory. It would be best if the mirroring software could rename the main HTML file to "index.html", so that the first file does not have to be tracked on a case-by-case basis.
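For comparison, the directory-to-archive step can also be sketched outside of PHP. An MHT file is a MIME multipart/related message, so Python's standard email package can assemble one. This is a stand-in for illustration, not the MHTFileMaker class, and build_mht and its base_url parameter are names I invented. It assumes the mirroring step has already left a flat directory whose main page is named index.html, and it base64-encodes every part for simplicity. I have not checked how tolerant the various browsers are of archives built this way.

```python
import mimetypes
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from pathlib import Path


def build_mht(directory, base_url, main_name="index.html"):
    """Pack the files in a flat directory into one multipart/related archive.

    base_url is prefixed to each file name to form its Content-Location.
    Returns the archive as bytes.
    """
    directory = Path(directory)
    archive = MIMEMultipart("related", type="text/html")
    others = sorted(
        p for p in directory.iterdir() if p.is_file() and p.name != main_name
    )
    # The main page goes first so a reader knows where to start.
    for path in [directory / main_name] + others:
        ctype = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        maintype, subtype = ctype.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(path.read_bytes())
        encoders.encode_base64(part)
        part["Content-Location"] = base_url + path.name
        archive.attach(part)
    return archive.as_bytes()
```

Writing the returned bytes to a file with an .mht extension, one file per captured version, would give the locally stored copy that the weblog entry links to.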
So, basically, I wanted to put these problems out in front of people to see if anyone else has suggestions or is moved to actually put in some programming work on this project.
"You can't teach an old dogma new tricks." - Dorothy Parker