[H-GEN] Squid addon

Thu Apr 29 08:10:49 EDT 1999

(Note reply-to: being general at humbug.org.au vs Steve Thorne <sjthorne at ozemail.com.au>)

On Thu, 29 Apr 1999, Craig Eldershaw wrote:

> (Note reply-to: being general at humbug.org.au vs Craig Eldershaw <ce at comlab.ox.ac.uk>)
> 
> >> of course you'd have to base it on both the name and file size, and make a
> >> lower limit, there is a remote chance that index.html might be 3,212bytes
> >> at server X+Y
> >
> >Again, as mentioned above, this approach would be unsound, since there may
> >well be a fair probability of two completely different files on the same
> >server having the exact same size and filename. 
> 
> Agreed, think of the odds:  take the number of web servers in the world
> then multiply by number of directories to give an estimate (order of
> magnitude only) of files called index.html.  That's an awful lot.
> 
> >If such a Squid addon were to be written, I'd imagine it'd cache the file,
> >and generate a unique checksum for comparisons, therefore making the
> >possibility of "mistakes" much lower.  However generating checksums
> >on-the-fly would have to be kept relatively quick, so as to not bog
> >down the caching software overly -- I doubt this'd be a problem since
> >squid is probably I/O bound anyhow. 
> 
> In my limited knowledge (any number theorists on the list ?), using
> name, size and an MD5-like checksum would be almost certain to be
> safe.  The problem is, how does squid's sister get the checksum of
> file.x at Y without actually downloading it ?  Of course if the HTTP
> protocol was modified, and all servers upgraded to provide such
> checksums on demand, then squid's sister could simply ask for the
> checksum (in the way that one can request the last modified by date).
> But that's a very long-term kind of thing.
> 
> 
I think my point has been missed - there is an extremely high chance that
index.html might be the same size on two different servers, as with many
small(ish) index or image files with common names.

Considering the goal of the squid addon is to decrease unnecessary
downloading of large files, it would be logical to make it work something
along the lines of -

do not download file.x from server Y IF
file.x previously downloaded from server X is
exactly the same name
exactly the same size
larger than 500kb
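
As a rough sketch only, the rule above could be written like this in
Python. This is not Squid code: the CachedFile record, the function name
and the reading of "500kb" as 500 * 1024 bytes are all assumptions made
for illustration, and it assumes the size of file.x on server Y can be
learned without downloading it (for example from a Content-Length header).

```python
from dataclasses import dataclass

# Assumed threshold: "500kb" from the rule, read here as 500 * 1024 bytes.
MIN_SIZE_BYTES = 500 * 1024


@dataclass(frozen=True)
class CachedFile:
    """Hypothetical record of a file already held in the cache."""
    name: str    # file name only, e.g. "foo-1.0.tar.gz"
    size: int    # size in bytes
    server: str  # server the cached copy was fetched from


def is_probable_mirror_copy(cached: CachedFile, name: str, size: int,
                            server: str) -> bool:
    """Return True if the cached copy may stand in for `name` on `server`.

    Applies only the three tests from the rule: same name, same size,
    and larger than the threshold. No checksum is involved, so this is
    a heuristic and can still be wrong.
    """
    return (
        cached.server != server      # only relevant for a different server
        and cached.name == name      # exactly the same name
        and cached.size == size      # exactly the same size
        and size > MIN_SIZE_BYTES    # larger than 500kb
    )
```

With this, a small index.html of 3,212 bytes is never matched across
servers, while a multi-megabyte tarball with the same name and size is.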

What's the chance that there will be two tarballs of exactly the same name
and size, that wouldn't be the same file?

Steve.

--
This is list (humbug) general handled by majordomo at lists.humbug.org.au .
Postings only from subscribed addresses of lists general or general-post.