I would like to know what is the best software solution (free or
paid, it doesn't matter) that will allow me to download the entire
contents of a large website.
The only content I am interested in is the documents that are hosted
on the pages (Word, Excel, PowerPoint, Acrobat, zip, arj...). Basically
all documents that are downloadable by clicking on a link or on a
"click here to download" button. I am not interested in storing the
pages.
The content must be downloaded and stored in a structured manner (not
all just dumped in one "incoming" folder). One possibility would be to
store the content in a structure that resembles that of the pages the
content was downloaded from (i.e. all documents from the root page
would be stored in a root directory, all documents from the "about us"
page would be stored in the "about us" directory, and so on). Another
option would be to create a searchable database that is linked to the
documents (like Google Desktop does).
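For illustration, the "folder per page" idea above could look roughly
like the Python sketch below. It is only a sketch under assumptions:
the name local_path_for, the "downloads" root folder and the
example.com URLs are made up here, and a packaged tool may well lay
its files out differently.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def local_path_for(page_url, document_url, root="downloads"):
    """Return a storage path that mirrors the page a document was linked from."""
    page = urlsplit(page_url)
    # Turn the page path into folder names, skipping empty and dot segments.
    parts = [
        segment
        for segment in (unquote(p) for p in page.path.split("/"))
        if segment and segment not in (".", "..")
    ]
    # Heuristic: a last segment containing a dot is treated as a page file
    # name (for example "default.aspx") and is not used as a folder.
    if parts and "." in parts[-1]:
        parts.pop()
    # Keep the document's own file name, or a placeholder if the URL has none.
    filename = unquote(PurePosixPath(urlsplit(document_url).path).name) or "unnamed"
    return PurePosixPath(root, page.netloc, *parts, filename)
```

With these assumptions, a report.doc linked from
https://example.com/about%20us/default.aspx would be stored as
downloads/example.com/about us/report.doc.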
The website I am targeting is a secured site to which I have fully
authorised access, so the program must be able to perform secure login.
The website I am targeting is a dynamic site (aspx) that is very large
and that branches off to other sites that I would also like to
selectively download (same principle, only downloadable documents).
I must be able to download only parts of the site at a time and resume
the downloading later on, so the program must keep track of the pages
it has already downloaded and resume from where it left off.
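To make the "resume where it left off" requirement concrete, here is a
minimal Python sketch of the bookkeeping involved. The class name
CrawlState and the JSON state file are assumptions for illustration
only, not how any particular product does it.

```python
import json
from pathlib import Path


class CrawlState:
    """Remembers which URLs are finished so an interrupted run can resume."""

    def __init__(self, state_file):
        self.path = Path(state_file)
        # Load the URLs recorded by an earlier run, if there was one.
        if self.path.exists():
            self.done = set(json.loads(self.path.read_text()))
        else:
            self.done = set()

    def is_done(self, url):
        return url in self.done

    def mark_done(self, url):
        self.done.add(url)
        # Rewrite the file after every URL so that an interruption
        # loses at most the URL that was in progress.
        self.path.write_text(json.dumps(sorted(self.done)))
```

A crawler built this way would call is_done() before fetching a URL
and mark_done() after saving it, so a second run with the same state
file skips everything already fetched.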
I am basically looking for software that will crawl and back up a
website in a way that is very similar to what Google does, but on a
smaller scale.
For the moment I have searched download.com and found some "offline
browsing" utilities that are not exactly what I need.
I looked at the GA thread you suggested. I basically have the same
needs, except that for me it's going to be a one-time deal
(download the site before it goes offline). My only extra need is
authentication, but this feature is included in most solutions anyway.
To respond to your comments, I am not looking at creating my own
solution, so programming in Perl is out of scope.
wget and wgetgui are good solutions technically but they lack in
ease of use, and since the person that will actually have to do the
job is not very computer literate, I am looking mostly at packaged
solutions.
I wasn't able to try out websnake because there is no downloadable trial version.
I am currently trying out a program called Blackwidow that seems to
be a good compromise for what I need. My only doubt about BW is whether
it will be able to handle the size of the site.
Although you may not want to store the pages, the pages must at least
be downloaded so that the pages may be crawled for the documents you
are interested in.
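As a rough illustration of why the pages have to be fetched, the
Python sketch below reads one page's HTML and separates links to
documents from links to further pages. The class name LinkCollector
and the extension list are assumptions of mine, and it only sees plain
links; a "click here to download" button driven by a script or a form
post would not be found this way.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumed list of document types, taken from the kinds named in the question.
DOCUMENT_EXTENSIONS = (".doc", ".xls", ".ppt", ".pdf", ".zip", ".arj")


class LinkCollector(HTMLParser):
    """Collects document links and page links found in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.documents = []  # URLs to download and keep
        self.pages = []      # URLs to crawl next

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they appear on.
        url = urljoin(self.base_url, href)
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            return  # skip mailto:, javascript: and similar links
        if parts.path.lower().endswith(DOCUMENT_EXTENSIONS):
            self.documents.append(url)
        else:
            self.pages.append(url)
```

After feeding a page's HTML to collector.feed(...), collector.documents
holds the files to save and collector.pages holds the pages still to
visit.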
I would suggest using wget (with SSL support) and wgetGUI as a
frontend. wget can do the job but it is very hard to use all the
options from the command line, which is why you will need a front end.
It will preserve the directory structure. However, this will leave you
with the pages being stored. You can always delete the pages later.
You can choose which files to accept downloading or reject downloading
(except for the HTML pages themselves).
wget (Windows)
http://xoomer.virgilio.it/hherold/
wgetGUI
http://www.jensroesner.de/wgetgui/
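To give an idea of what the front end has to assemble, here is a small
Python sketch that builds a wget command line for this kind of job.
The function name build_wget_command and its parameters are made up
for illustration, the long options are those of current GNU wget
(older builds may spell some of them differently), and the HTTP
user/password options only help if the site uses HTTP authentication
rather than a login form.

```python
def build_wget_command(start_url, user, password, extensions, output_dir):
    """Build a wget argument list for a recursive, filtered download."""
    return [
        "wget",
        "--recursive",                            # follow links from the start page
        "--level=inf",                            # no limit on link depth
        "--no-parent",                            # stay below the start URL
        "--no-clobber",                           # do not overwrite existing files
        "--accept=" + ",".join(extensions),       # file suffixes to accept
        "--http-user=" + user,                    # HTTP authentication only
        "--http-password=" + password,
        "--directory-prefix=" + output_dir,       # where the site tree is saved
        start_url,
    ]
```

The resulting list can be passed to subprocess.run(), or joined with
spaces to see the command a GUI would run on your behalf.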
If you really don't want the pages stored at least temporarily, then I
would suggest crawling using Xenu
(http://home.snafu.de/tilman/xenulink.html) to find out
which files to download, then exporting the list of URLs as a
tab-delimited file (with only the file types you are interested in) and
using Mass Downloader or wget to download the files you are interested
in.
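If the exported list still contains pages as well as documents, a few
lines of Python could trim it down before it is handed to the
downloader. This is a sketch under assumptions: I am assuming the URL
sits in the first column of the tab-delimited export (check the actual
file), and the function name document_urls is made up.

```python
import csv
import io
from urllib.parse import urlsplit


def document_urls(tsv_text, extensions, url_column=0):
    """Return the URLs in a tab-delimited export that end in a wanted extension."""
    # Normalise "doc" or ".DOC" to ".doc" so matching is case-insensitive.
    wanted = tuple("." + ext.lower().lstrip(".") for ext in extensions)
    urls = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) <= url_column:
            continue  # skip blank or short lines
        url = row[url_column].strip()
        # Compare only the path, so a query string does not hide the extension.
        if urlsplit(url).path.lower().endswith(wanted):
            urls.append(url)
    return urls
```

Writing the returned URLs one per line into a text file gives a list
that wget can read with its --input-file option.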
Also you may want to read the following Google Answers thread and
check out the product mentioned:
http://answers.google.com/answers/threadview?id=32044
I just got back from a couple of days of vacation. I am going to look
at your suggestions this evening and get back to you within the next
couple of days. In the meantime, thank you very much for your
comments.
I should note that all of chapter 20 in the Perl Cookbook is about web automation.
Hmm, I don't know if there are generalized solutions available, but I
have worked on some web scrapers at work. It's not too difficult to
write in Perl, given the very nice HTML package. Check out sections
20.18 - 20.21 in the Perl Cookbook by Christiansen & Torkington for
some very relevant examples of website scraping. If you are not a
programmer, a Perl consultant could probably do this for you in a day
or two. I'll take a look around freshmeat and the like to see if
something generic and free is out there.
Have you considered WebSnake?
http://www.websnake.com/
In the end I used Blackwidow, but the information I found here was
very useful, so I consider the question to be answered. Thank you very
much.
Sorry for the multiple posts, but what you want to search for on Google
to get relevant results is "web scraper". I took a look and there are
many commercial ones available. Freshmeat has some also,
www.freshmeat.net.
I've tried to match your needs with the 'function' listed on this
site. There is a free 30-day trial, so it might be worth looking at.
http://www.metaproducts.com/mp/mpProducts_Detail.asp?id=3
Mass Downloader is more like a download manager: you have to tell it
what to download, then it takes care of queuing everything up and
managing the download sequence. Mass Downloader doesn't really include
crawling capabilities.
The idea is really to be able to feed the program my URL, the username
and password and watch it crawl through the site and download all the
documents.