ashley wrote:
slothrop wrote:
Trying to figure out a non-sucky way of making a scalable web-crawler that works with grim dynamically generated sites. Which basically means flipping between two designs (neither of which would work) for five minutes and then going back to the internet.
Eh?
wget has a spider mode; you can use it to scrape the links on a website and then pump those into another process that indexes the content in a database, for example. Store the URLs in a database table with a unique index on the URL itself to filter out any duplicates.
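Roughly this sort of thing, as a sketch (the crawl.db file and the urls table are just placeholder names; it reads URLs on stdin, however you harvested them):

# Sketch of the "unique index on the URL" idea: read URLs on stdin
# (e.g. pulled out of a `wget --spider -r` log) and insert them into
# SQLite, letting the unique index drop the duplicates.
import sqlite3
import sys

conn = sqlite3.connect("crawl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS urls (
        url        TEXT NOT NULL,
        first_seen TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_urls_url ON urls (url)")

for line in sys.stdin:
    url = line.strip()
    if url:
        # INSERT OR IGNORE silently skips rows that would violate the unique index
        conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))

conn.commit()
conn.close()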
Nah, it has to deal in a reasonably sensible way with sites that generate content dynamically, have silly numbers of internal links, and might (for instance) keep giving you the same (or similar) content at an arbitrary number of different URLs. It also needs to start losing stuff that no longer exists in a reasonably timely fashion, preferably without having to completely re-crawl the site. AIUI, Google Analytics believes that one of the sites we're talking about has about six billion individual links...
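One sketch of how you might attack both of those (not claiming it's what I actually settled on): hash the fetched body so duplicate content at different URLs collapses onto one row, and stamp everything with a last_seen time so stale stuff can be expired without a full re-crawl. Table and file names are made up:

# Sketch only: dedup by content hash, expire by last_seen timestamp.
import hashlib
import sqlite3
import time

conn = sqlite3.connect("crawl.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url          TEXT PRIMARY KEY,
        content_hash TEXT NOT NULL,
        last_seen    REAL NOT NULL
    )
""")

def record_fetch(url, body):
    """Upsert a fetched page. `body` is the raw response bytes;
    duplicate bodies at different URLs end up sharing a content_hash."""
    digest = hashlib.sha256(body).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, content_hash, last_seen) VALUES (?, ?, ?)",
        (url, digest, time.time()),
    )

def expire_older_than(days):
    """Drop anything that hasn't been seen for `days` days,
    so dead pages fall out without a complete re-crawl."""
    cutoff = time.time() - days * 86400
    conn.execute("DELETE FROM pages WHERE last_seen < ?", (cutoff,))
    conn.commit()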
Edit - I mean, the core of using wget (or Python urllib) and a parser (Beautiful Soup) is fine, but getting something that performs well while handling sites with phenomenal amounts of duplication and a fairly high turnover is what's interesting. Think it's sorted now, though.
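For reference, the "core" bit is basically this sort of thing (a minimal urllib + Beautiful Soup sketch; the seed URL is just an example, and there's no politeness, retry or error handling):

# Fetch a page with urllib and pull out its links with Beautiful Soup.
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

def extract_links(url):
    """Return the absolute URLs of all <a href> links on the page."""
    with urlopen(url) as resp:
        soup = BeautifulSoup(resp.read(), "html.parser")
    return {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}

if __name__ == "__main__":
    for link in sorted(extract_links("https://example.com/")):
        print(link)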