Hello,
I am using a Python script (that I found somewhere) to recover all of the site's cached content.
The problem is, would it be useful? I mean, I am getting the pages as "1.html", "2.html"...
So, is there an automatic web builder that would point every link at the right file? (Or upload them)
In other words, if I get the entire site as HTML files, would it be useful? (There are a lot of files)
I don't know a lot about web pages, sorry :/
Ok, I think I found a way to make it work.
1. Gather all the forum and wiki pages.
2. Parse each file in order to erase the Google cache header.
3. Parse each file (with a program, of course) to rename it after the path that links to it. Example:
Forum:
This forum thread has the following url: http://www.netgore.com/forums/post/recovering-netgores-forum-and-wiki.html
And the link in the categories is: href="/forums/post/recovering-netgores-forum-and-wiki.html"
We can see the link uses the "same" topic name as the file's name, so the name can be extracted from the gathered HTML file and the file renamed to match (with the parser).
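A rough sketch of steps 2 and 3 in Python, in case it helps. It rests on two guesses about the saved pages that should be checked against a real file: that the cache banner contains the phrase "This is Google's cache of" followed by a link to the original page, and that the original page begins at its own `<!DOCTYPE` or `<html` tag after the banner. The function names are made up for this example.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Assumption: the banner links to the original page right after this phrase.
CACHE_LINK = re.compile(r"This is Google's cache of\s*<a href=\"([^\"]+)\"", re.IGNORECASE)
# Assumption: the original document starts at its own <!DOCTYPE or <html tag.
PAGE_START = re.compile(r"<!DOCTYPE|<html", re.IGNORECASE)


def original_url(cached_html):
    """Return the original URL named in the cache banner, or None."""
    match = CACHE_LINK.search(cached_html)
    return match.group(1) if match else None


def strip_cache_header(cached_html):
    """Drop everything before the original page; unchanged if no banner is found."""
    banner = CACHE_LINK.search(cached_html)
    if banner is None:
        return cached_html
    start = PAGE_START.search(cached_html, banner.end())
    return cached_html[start.start():] if start else cached_html


def local_path(url):
    """Turn the original URL into the relative path the site's links use."""
    return urlparse(url).path.lstrip("/") or "index.html"


def restore(src_dir, out_dir):
    """Rewrite 1.html, 2.html, ... under out_dir using their original paths."""
    for src in sorted(Path(src_dir).glob("*.html")):
        html = src.read_text(encoding="utf-8", errors="replace")
        url = original_url(html)
        if url is None:
            continue  # no banner recognised; leave the file for manual review
        dest = Path(out_dir) / local_path(url)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(strip_cache_header(html), encoding="utf-8")
```

Because the files end up at the same relative paths as on the live site, links such as href="/forums/post/recovering-netgores-forum-and-wiki.html" should resolve again once the output folder is served as the web root.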
Thoughts?
Don't have much to add on how to do that. Did find a list of other cache archivers (with mixed success), if it helps.
So far I've only found a couple of archives on one of the sites, but each page has to be entered manually (it won't list all of the pages, just the one you type in explicitly), so you need the exact URL of the page you want to view. Time consuming, especially for someone new like me who didn't know the site that well.
Not sure how much time I'll have to work on this in the next few weeks with my work schedule, but if there's anything I can do, I'm enthusiastic about having netgore continue, now that I've gotten to poke through the code a bit. Hopefully I'll get the full thing up and running later today (I've had problems of my own getting the dependencies running).
Thanks Trollhammer,
Unfortunately, the script fails after gathering about 60 pages, since Google sees it as "suspicious" and blocks the connection.
I am unsure how I can gather all the pages right now. If anyone has any idea how to avoid Google's block, it would be appreciated (or knows of another full archive to gather with a bot).
Do you think they would want to charge for a service or otherwise have issues if they were contacted directly, explaining the issue? They might even have the ability to mirror the data somehow so it can just be dropped back onto netgore.com. I can't seem to get to the Google cache option to see what it looks like. It's blocked out for me, or I'm doing something wrong (more likely).
If you drop the rate at which it requests pages, it might be able to pull more pages before dying. Not sure, though.
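Something like the following could be used to slow the requests down. It is only a sketch: `fetch_all` and its parameters are made up for this example, the delay values are guesses rather than known limits, and there is no guarantee that any pacing avoids the block. The `fetch` argument is whatever function the existing script already uses to download one page, assumed here to raise an exception when the request fails.

```python
import random
import time


def fetch_all(urls, fetch, delay=10.0, jitter=5.0, max_retries=3, sleep=time.sleep):
    """Download each URL slowly, backing off after every failed attempt.

    fetch: callable taking a URL and returning the page text (raises on failure).
    delay/jitter: base pause in seconds plus a random extra, both guesses.
    """
    pages = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                pages[url] = fetch(url)
                break
            except Exception:
                # Wait twice as long after each failure before trying again.
                sleep(delay * 2 ** (attempt + 1))
        # Pause between pages; the random part makes the timing less regular.
        sleep(delay + random.uniform(0, jitter))
    return pages
```

URLs that still fail after `max_retries` attempts are simply missing from the returned dictionary, so the run can be repeated later for just those pages.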
I thought all the forum posts were saved as a record in the database rather than an HTML page.
They are. If someone were to grab all forum posts, it would be difficult to restore them, especially since not all the accounts exist anymore. Possible, but probably not worth the effort.
Oh, I was wondering why Torraske was trying to restore forum posts, since it's rather difficult.
I didn't know it was so hard :/
I thought that with the info (lost pages), I could make a program to parse all the gathered info into a suitable data format.
Nvm then
Well, if you get the data, then I can parse out each forum post, then somehow stick it in the DB with the correct timestamp. Could make for a fun little project. So if you (or anyone else) gets all that data, I'll give it a shot. I just don't know how easy it will be to add the data to the DB, since I'm not sure how picky Drupal's schema is.
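For the parsing half, a sketch along these lines might be a starting point. Everything about the markup is an assumption: the class names "post", "author", "date" and "body" and the date format are placeholders that would need to be swapped for whatever the saved pages actually contain. Dates are treated as UTC and converted to integer Unix timestamps, which should be checked against the real schema before inserting anything. The output is plain text only, so any original formatting is lost.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

FIELDS = {"author", "date", "body"}  # hypothetical class names
VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


class PostParser(HTMLParser):
    """Collect the text of each field inside every element with class "post"."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self.field = None  # field currently being read, if any
        self.depth = 0     # nesting depth inside that field

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            if tag == "br" and self.field is not None:
                self.posts[-1][self.field] += "\n"
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.field is not None:
            self.depth += 1
        elif "post" in classes:
            self.posts.append({name: "" for name in FIELDS})
        elif self.posts:
            found = FIELDS.intersection(classes)
            if found:
                self.field, self.depth = found.pop(), 0

    def handle_endtag(self, tag):
        if self.field is None or tag in VOID:
            return
        if self.depth == 0:
            self.field = None  # closing tag of the field itself
        else:
            self.depth -= 1

    def handle_data(self, data):
        if self.field is not None:
            self.posts[-1][self.field] += data


def to_epoch(text, fmt="%Y-%m-%d %H:%M"):
    """Parse a date string (format is a guess) as UTC and return a Unix timestamp."""
    parsed = datetime.strptime(text.strip(), fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def extract_posts(html):
    """Return one dict per post with author, created (epoch seconds) and body."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return [
        {"author": p["author"].strip(), "created": to_epoch(p["date"]), "body": p["body"].strip()}
        for p in parser.posts
    ]
```

The resulting list of dicts could be dumped to a staging table or a CSV first, so the records can be inspected before deciding how they map onto Drupal's tables.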
Good luck trying to sort out every post and its timestamp and format without BBCode tags. :/
Could be worth it. Might as well give it a shot.