libwww and avoiding download of binary/unknown files

Hi. I'm writing my first app based on libwww. It aims to do something similar to webbot, but I'm facing a problem that I can't solve because of my limited knowledge of the libwww architecture.

When a web site is scanned recursively using anchors and requests, all the files are downloaded, including binary files. For these, the user is prompted for a file name to save to (my app and webbot behave the same way), but I don't want binary files to be downloaded at all. If I define the following callback, the user is no longer prompted, but the file is still transferred over the network into the black hole, generating useless traffic:

    HTMIME_setSaveStream(HTBlackHoleConverter);

So my question is: can I detect the content type of a file (presumably letting libwww read just part of it) and then decide not to download it? How?

Thanks!
Silvan
--
mambaSoft di Silvan Calarco - http://www.mambasoft.it
Re: libwww and avoiding download of binary/unknown files

On Friday 8 September 2006 at 01:17, Adam Mlodzinski wrote:
> How do you accomplish the recursive scanning? Is this a feature of
> libwww, or have you written your own code to do this?

I have defined a link callback function which receives every link from the top page I request. In this function I issue a request for each internal link found, then I wait for the event loop to end, and I have all the pages.

> > all the files are downloaded including binary files.
>
> This is always tricky. What is a binary file? Is an image file binary?
> Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy
> enough. But what about a PDF file, or files with no extension at all?
> Everyone has their own ideas of what makes a binary file binary, and not
> text/ASCII.

By default libwww prompts to save any file it doesn't recognize or considers binary. That's enough for me for now, since libwww already does it, but maybe I need to understand better how it does it... I just want to get HTML pages and avoid downloading any other file.

> You have two options: use a HEAD request for each file during the
> recursive scan (although I don't think all servers support HEAD requests
> properly) instead of a GET - then decide whether you want the file
> based on its MIME type (probably set up a filter to do that); OR, decide
> if you want the file based solely on the file name and/or extension
> (essentially what MIME does, only instead of asking the server, you
> decide for yourself).
>
> Keep in mind that file extensions don't always give away the file
> contents - it's a nice convention used 99.9% of the time, but there's
> nothing preventing anyone from naming a file, ASCII or binary, with any
> extension they feel like. I know of (vaguely) a Perl script that can
> tell you if a file is ASCII or binary by reading the first few bytes of
> the file - but that requires the file to be present, an option you don't
> have in your case.
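The extension-based option quoted above could be sketched roughly as follows in plain C. This is only an illustration, not libwww code: looks_like_html_path is a hypothetical helper, the list of extensions is an assumption, and treating a path with no extension as a probable page (for example "/docs/") is a guess that may not suit every site.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters of a and b. */
static int ext_equal_ci(const char *a, const char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Hypothetical helper (not part of libwww): returns 1 if the URL path
 * looks like an HTML page judging by its extension alone, 0 otherwise.
 * Assumption: a last segment without any extension is treated as a page. */
int looks_like_html_path(const char *path)
{
    static const char *const html_ext[] = { "html", "htm", "xhtml" };
    size_t end, start, dot, i, n;

    if (path == NULL)
        return 0;

    /* Ignore any query string or fragment. */
    end = strcspn(path, "?#");

    /* Find the start of the last path segment. */
    start = end;
    while (start > 0 && path[start - 1] != '/')
        start--;

    /* Find the character after the last '.' in that segment. */
    dot = end;
    while (dot > start && path[dot - 1] != '.')
        dot--;

    if (dot == start)
        return 1; /* no '.' in the segment: no extension */

    for (i = 0; i < sizeof html_ext / sizeof html_ext[0]; i++) {
        n = strlen(html_ext[i]);
        if (end - dot == n && ext_equal_ci(path + dot, html_ext[i], n))
            return 1;
    }
    return 0;
}
```

A link callback could call such a check before issuing a request and simply skip links for which it returns 0. As the quoted message warns, the extension is only a convention, so this filter can be fooled in both directions.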
I suppose the HEAD request will read only the beginning of a non-HTML file and report either that the header is not recognized or that it is recognized with a MIME type. If I can do that, it's enough. I'll try what you suggest and let you know.

Thanks.
Bye,
Silvan
--
mambaSoft di Silvan Calarco - http://www.mambasoft.it
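For the MIME-type route discussed in this thread: a HEAD response carries the same headers a GET would (including Content-Type) but no body, so the decision reduces to a string check on the media type. Below is a minimal sketch in plain C, assuming the Content-Type value is already available as a string. is_html_content_type is a hypothetical helper, not a libwww function, and the wiring into libwww's request or filter machinery is deliberately left out.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters of a and b. */
static int ci_equal_n(const char *a, const char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Hypothetical helper (not part of libwww): returns 1 if a Content-Type
 * header value names an HTML media type, 0 otherwise. Parameters such as
 * "; charset=UTF-8" are ignored. */
int is_html_content_type(const char *value)
{
    static const char *const wanted[] = { "text/html", "application/xhtml+xml" };
    size_t i, n, len;

    if (value == NULL)
        return 0;

    /* Skip leading whitespace. */
    while (*value == ' ' || *value == '\t')
        value++;

    /* The media type ends at ';', whitespace, or the end of the string. */
    len = strcspn(value, "; \t\r\n");

    for (i = 0; i < sizeof wanted / sizeof wanted[0]; i++) {
        n = strlen(wanted[i]);
        if (len == n && ci_equal_n(value, wanted[i], n))
            return 1;
    }
    return 0;
}
```

With a check like this, a crawler would issue the GET only when the function returns 1. A missing Content-Type is treated here as "not HTML", which is a conservative assumption and may skip pages on servers that omit the header.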