ashley wrote:
slothrop wrote:
Trying to figure out a non-sucky way of making a scalable web-crawler that works with grim dynamically generated sites. Which basically means flipping between two designs (neither of which would work) for five minutes and then going back to the internet.
Eh?
wget has a spider mode; you can use it to scrape the links on a website and then pump those into another process that indexes the content in a database, for example. Store the URLs in a database table with a unique index on the URL itself to filter out any duplicates.
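Roughly this sort of thing, as a sketch (the crawl.db file and the urls table are just placeholder names; it reads URLs on stdin, however you harvested them):

# Sketch of the "unique index on the URL" idea: read URLs on stdin
# (e.g. pulled out of a `wget --spider -r` log) and insert them into
# SQLite, letting the unique index drop the duplicates.
import sqlite3
import sys

conn = sqlite3.connect("crawl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS urls (
        url        TEXT NOT NULL,
        first_seen TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_urls_url ON urls (url)")

for line in sys.stdin:
    url = line.strip()
    if url:
        # INSERT OR IGNORE silently skips rows that would violate the unique index
        conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))

conn.commit()
conn.close()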
Nah, it has to deal in a reasonably sensible way with sites that generate content dynamically, have silly numbers of internal links, and might (for instance) keep giving you the same (or similar) content at an arbitrary number of different URLs. It also needs to start losing stuff that no longer exists in a reasonably timely fashion, preferably without having to completely re-crawl the site. AIUI, Google Analytics believes that one of the sites we're talking about has about six billion individual links...
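One sketch of how you might attack both of those (not claiming it's what I actually settled on): hash the fetched body so duplicate content at different URLs collapses onto one row, and stamp everything with a last_seen time so stale stuff can be expired without a full re-crawl. Table and file names are made up:

# Sketch only: dedup by content hash, expire by last_seen timestamp.
import hashlib
import sqlite3
import time

conn = sqlite3.connect("crawl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url          TEXT PRIMARY KEY,
        content_hash TEXT NOT NULL,
        last_seen    REAL NOT NULL
    )
""")

def record_fetch(url, body):
    """Upsert a fetched page. `body` is the raw response bytes;
    duplicate bodies at different URLs end up sharing a content_hash."""
    digest = hashlib.sha256(body).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, content_hash, last_seen) VALUES (?, ?, ?)",
        (url, digest, time.time()),
    )

def expire_older_than(days):
    """Drop anything that hasn't been seen for `days` days,
    so dead pages fall out without a complete re-crawl."""
    cutoff = time.time() - days * 86400
    conn.execute("DELETE FROM pages WHERE last_seen < ?", (cutoff,))
    conn.commit()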
Edit - I mean, the core of using wget (or Python urllib) and a parser (Beautiful Soup) is fine, but getting something that performs well while handling sites with phenomenal amounts of duplication and a fairly high turnover is what's interesting. Think it's sorted now, though.
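For reference, the "core" bit is basically this sort of thing (a minimal urllib + Beautiful Soup sketch; the seed URL is just an example, and there's no politeness, retry or error handling):

# Fetch a page with urllib and pull out its links with Beautiful Soup.
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

def extract_links(url):
    """Return the absolute URLs of all <a href> links on the page."""
    with urlopen(url) as resp:
        soup = BeautifulSoup(resp.read(), "html.parser")
    return {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}

if __name__ == "__main__":
    for link in sorted(extract_links("https://example.com/")):
        print(link)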