I work for a sailing school that maintains a database of exam questions.
The questions are obtained from the official examination test centers as PDF documents and then added to our database.
There are different examination centers across the country, and normally, every time a new test takes place in their city, each organization uploads the exam to its own website and makes it public.
In order to know when a new PDF file has been uploaded to one of those websites, I was wondering whether there is a way to monitor them: for example, checking for link updates (assuming the links will always be in the same place on the website), or comparing the website's structure using timestamps and triggering an alarm when those timestamps don't match (assuming a mismatch means there has been an update).
I don't know whether I have made myself clear, or whether this is doable at all.
There are many ways to monitor websites for content changes, and you will
probably need to use several of them in combination, since you want to monitor
multiple websites operated by completely different organisations, which will
each be managing their websites in quite different ways.
Here are a few suggestions for what you could monitor; I'm sure other people
can come up with more:
- Rather blunt and inefficient, but "lynx -dump http://web.site | md5sum" will
  reduce the rendered page to a single checksum.
- "wget -S -q http://web.site | grep ETag" will work for some servers but not all.
- If a website has a "last updated" comment on it (and it's reliable) then
  something like "lynx -dump http://web.site/path/to/exams.html | grep 'last
  updated'" will pull that line out (rough sketches of all three follow this list).
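Here is a minimal sketch of those three probes as shell functions, assuming
lynx, wget and md5sum are installed. "http://web.site", the paths and the grep
patterns are placeholders to adapt per site, and not every server sends an
ETag or a "last updated" line at all:

    #!/bin/sh
    # Three ways of boiling a page down to one small, comparable value.
    # "http://web.site" and the grep patterns are placeholders.

    # 1. Checksum of the rendered page text (blunt, but works anywhere).
    page_checksum() {
        lynx -dump "$1" | md5sum | cut -d' ' -f1
    }

    # 2. The ETag header, if the server sends one (wget prints the
    #    response headers on stderr, hence the 2>&1).
    page_etag() {
        wget -S --spider "$1" 2>&1 | grep -i 'etag:'
    }

    # 3. A "last updated" line, if the page carries one.
    page_last_updated() {
        lynx -dump "$1" | grep -i 'last updated'
    }

    page_checksum "http://web.site/path/to/exams.html"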
In all cases you will get some value which needs comparing with the previously
obtained value to indicate whether there's been a change.
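For instance, a minimal cron-friendly sketch of that comparison using the
checksum probe (the URL and the state-file location are made up; substitute
whichever probe suits a given site):

    #!/bin/sh
    # Compare the current checksum of a page with the one recorded last
    # time, and say so when they differ.  URL and paths are placeholders.
    URL="http://web.site/path/to/exams.html"
    STATE="$HOME/.exam-watch/$(echo "$URL" | md5sum | cut -d' ' -f1)"

    mkdir -p "$(dirname "$STATE")"
    new=$(lynx -dump "$URL" | md5sum | cut -d' ' -f1)
    old=$(cat "$STATE" 2>/dev/null)

    if [ "$new" != "$old" ]; then
        echo "$URL appears to have changed"
        echo "$new" > "$STATE"
    fi

Run from cron every hour or two, it only produces output when something has
changed, so cron's usual habit of mailing a job's output to you (assuming
local mail delivery is set up) acts as the alarm.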
Some sites (still?) provide RSS feeds of updated content; you might be able
to use this to get a "push notification" of an update, instead of having to
scrape their website at regular intervals as above.
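Where a feed does exist, the same store-and-compare idea applies. A naive
sketch (the feed URL is a placeholder, and grepping XML like this assumes a
simple single-line RSS layout; a real feed reader would be more robust):

    #!/bin/sh
    # Remember the newest <pubDate> seen in a feed; report when it changes.
    FEED="http://web.site/exams.rss"
    STATE="$HOME/.exam-watch/feed-date"

    mkdir -p "$(dirname "$STATE")"
    latest=$(wget -q -O - "$FEED" | grep -o '<pubDate>[^<]*</pubDate>' | head -n 1)
    previous=$(cat "$STATE" 2>/dev/null)

    if [ -n "$latest" ] && [ "$latest" != "$previous" ]; then
        echo "New item in $FEED: $latest"
        echo "$latest" > "$STATE"
    fi

Strictly speaking this still polls, but a feed is far lighter and more
predictable to parse than the rendered page.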