Hi forum!
I have a question that has been on my mind.
How can I parse a web page to find specific files to download, for example PDF files?
So if the page links to 40 files, download all of them.
I'm thinking of parsing the source code of the page and working from there.
Any ideas?
Thanks so much
Assuming you mean parsing an HTML document for links, this gets quite complicated. You will find regex examples in many places, most of which don't really work, and they fail to handle many cases (not all of which apply to .pdf links, clearly).
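For the simple case of plain <a href="..."> links to PDF files, an HTML parser tends to hold up better than a regex. Here is a minimal sketch using only the Python standard library. The names PdfLinkParser and find_pdf_links are made up for illustration, and it assumes the links are ordinary anchors in the static HTML, not generated by script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url  # URL the HTML was fetched from
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Turn relative links into absolute ones before testing them.
        absolute = urljoin(self.page_url, href)
        # Test only the path, so "file.pdf?x=1" still counts as a PDF.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, page_url):
    """Hypothetical helper: return every PDF link found in html."""
    parser = PdfLinkParser(page_url)
    parser.feed(html)
    return parser.pdf_links
```

Once you have the list, each URL can be fetched with something like urllib.request.urlretrieve(url, filename). Anything the page builds with script at load time will not show up this way.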
If you are trying to follow further HTML links, you bump up against things like "document.write("<FRAME SRC=");". You have to deal with <base href="something"> and <a href="#" onClick="MM_openBrWindow(something" and "window.location.href = " and an endless list of clever JavaScript. Forward slashes, backward slashes, "/./" and suchlike. Then there are absolute and relative file links - as we used to say in the army doing route marches, "where the f are we?".
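The static part of that mess (base href, relative paths, "./" and "../" segments, stray backslashes) can be sketched roughly as below, again with just the Python standard library. LinkCollector and resolve_links are hypothetical names. Treating a backslash as a forward slash is an assumption that mimics what browsers usually do, and links produced by document.write, onClick handlers or window.location are simply skipped or missed, since a static parser never runs the script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Record the first <base href> and every raw <a href> value."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href  # only the first <base> counts
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(html, page_url):
    """Hypothetical helper: return absolute URLs for the page's <a> links."""
    collector = LinkCollector()
    collector.feed(html)
    # A <base href> overrides the page URL as the reference point.
    base = urljoin(page_url, collector.base_href or "")
    resolved = []
    for href in collector.hrefs:
        href = href.strip()
        # Skip fragment-only and script pseudo-links; they are not files.
        if href.startswith("#") or href.lower().startswith("javascript:"):
            continue
        # Assumption: treat backslashes as forward slashes.
        href = href.replace("\\", "/")
        # urljoin resolves relative paths, including "./" and "../".
        resolved.append(urljoin(base, href))
    return resolved
```

That answers "where are we?" for straightforward pages. It does nothing for links that only exist after the script has run.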
Over a decade ago I wrote a very messy program to do this, and it sort of works in an ugly way, but unless the pages you are trying to analyse have fairly straightforward links, it gets a bit hairy.