Posted: Dec. 04 2008, 14:02

What I would do is:

1. Download only the HTML; images and other assets aren't required.

2. Track only the 'n' most recent threads, to save bandwidth and time.

3. Feed the HTML for each post and comment thread into a Subversion repository, which will track changes. Possibly run it through a script first, to strip out everything except the post and comments (ads, sidebar stuff, etc.) and to detect 404s.
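
Roughly, the three steps above might look like the sketch below. It is only an outline under assumptions of mine: the marker strings passed to strip_page, the value of N_RECENT, the one-file-per-thread layout, and an already checked-out Subversion working copy are all hypothetical. The string-marker stripping is a stand-in for whatever page-specific cleanup the site actually needs.

```python
import subprocess
import urllib.error
import urllib.request
from pathlib import Path

N_RECENT = 20  # hypothetical value for 'n'; tune to taste


def fetch_html(url):
    """Return the page HTML as text, or None if the server answers 404."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # thread has gone missing
        raise


def strip_page(html, start_marker, end_marker):
    """Keep only the span from start_marker to end_marker, inclusive.

    The markers are assumed to bracket the post and its comments.
    If either is absent, the whole page is kept rather than losing data.
    """
    start = html.find(start_marker)
    if start == -1:
        return html
    end = html.find(end_marker, start + len(start_marker))
    if end == -1:
        return html
    return html[start:end + len(end_marker)]


def save_snapshot(working_copy, thread_id, content):
    """Write one thread to <working_copy>/<thread_id>.html and return the path."""
    path = Path(working_copy) / f"{thread_id}.html"
    path.write_text(content, encoding="utf-8")
    return path


def commit(working_copy, message):
    """Add any new files and commit, so Subversion records what changed."""
    subprocess.run(["svn", "add", "--force", "."], cwd=working_copy, check=True)
    subprocess.run(["svn", "commit", "-m", message], cwd=working_copy, check=True)
```

A cron job could loop over the N_RECENT newest thread URLs, call fetch_html, log any None result as a 404, pass the rest through strip_page and save_snapshot, then call commit once per run. After that, svn log and svn diff show when a thread changed and how.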

You could also store posts and comments in a database, keyed by comment ID, so that when a comment goes missing it can be pulled from the database with a simple query. This wouldn't track changes without some more work, but it would save deleted comments and posts.
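
As a rough sketch of the database idea, here is one way it could go using SQLite. The table layout, column names, and function names are all made up for illustration, and it assumes the scraper can extract a stable comment ID from each page.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the comment store. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments ("
        " comment_id TEXT PRIMARY KEY,"
        " thread_url TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " first_seen TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def remember(conn, comment_id, thread_url, body):
    """Store a comment the first time it is seen.

    INSERT OR IGNORE keeps the first stored copy, so later edits to the
    same comment are not recorded; that is the 'more work' part.
    """
    conn.execute(
        "INSERT OR IGNORE INTO comments (comment_id, thread_url, body)"
        " VALUES (?, ?, ?)",
        (comment_id, thread_url, body),
    )
    conn.commit()


def recover(conn, comment_id):
    """Return the stored body for a comment ID, or None if never seen."""
    row = conn.execute(
        "SELECT body FROM comments WHERE comment_id = ?", (comment_id,)
    ).fetchone()
    return row[0] if row else None


def missing_ids(conn, thread_url, ids_now_on_page):
    """Return IDs stored for a thread that no longer appear on the page."""
    stored = {
        r[0]
        for r in conn.execute(
            "SELECT comment_id FROM comments WHERE thread_url = ?", (thread_url,)
        )
    }
    return stored - set(ids_now_on_page)
```

On each run, call remember for every comment found, then compare with missing_ids to see which ones have vanished since the last run; recover pulls the saved text back out.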

