Monitor websites for changes

Hello,

I work for a sailing school that owns a database with questions.

The questions are obtained from the official examination test centers as PDF documents and then passed on to our database.

There are different examination centers across the country, and normally every time a new test takes place in their city, each organization uploads the exam to its own website and makes it public.

To find out when a new PDF file has been uploaded to one of those websites, I was wondering if there is a way to monitor them, for example by checking for link updates (assuming the links will always be in the same place on the website), or by comparing timestamps of the website structure and triggering an alarm when those timestamps don’t match (assuming a mismatch means there has been an update).

I don’t know if I made myself understood or if it is doable at all.

Many thanks for your patience.

There are many ways to monitor websites for content changes, and you will
probably need to use several of them in combination, since you want to monitor
multiple websites being operated by completely different organisations, and
therefore they will be managing their websites in quite different ways.

Here are a few suggestions for what you could monitor; I’m sure other people
can come up with more:

  1. Rather blunt and inefficient, but “lynx -dump http://web.site | md5sum”

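  2. “wget -S -q -O /dev/null http://web.site 2>&1 | grep -i ETag” will work for
    some but not all sites, since not every server sends an ETag header (wget
    prints the headers on stderr, hence the “2>&1”)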

  3. “wget -q -O - http://web.site/path/to/pdf/exam.pdf | md5sum” (without
    “-O -” wget saves the file to disk and md5sum gets no input)

  4. If a website has a “last updated” comment on it (and it’s reliable) then
    something like “lynx -dump http://web.site/path/to/exams.html | grep -i 'last
    updated'”

In all cases you will get some value which needs comparing with the previously
obtained value to indicate whether there’s been a change.
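
The comparison step could be sketched roughly like this. It is only a minimal
sketch, not a finished tool: the function name check_changed, the state
directory and the URL in the usage note are all assumptions made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: remember the last value seen for each site and
# report whether the newly obtained value differs from it.
STATE_DIR="${STATE_DIR:-$HOME/.site-monitor}"   # assumed location

# usage: check_changed NAME VALUE
# Prints "first-run", "unchanged" or "changed", then stores VALUE
# so that the next call compares against it.
check_changed() {
    name=$1
    value=$2
    mkdir -p "$STATE_DIR" || return 1
    file="$STATE_DIR/$name"
    if [ ! -f "$file" ]; then
        result="first-run"
    elif [ "$(cat "$file")" = "$value" ]; then
        result="unchanged"
    else
        result="changed"
    fi
    printf '%s\n' "$value" > "$file"
    printf '%s\n' "$result"
}
```

From cron you might then call something like
check_changed exams "$(wget -q -O - http://web.site/path/to/exams.html | md5sum)"
and send yourself a mail whenever it prints “changed”.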

  5. Some sites (still?) provide RSS feeds of updated content; you might be able
    to use this to get a “push notification” of an update, instead of having to
    scrape their website at regular intervals as above.
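
If a site does offer a feed, picking out the new entries might look roughly
like the sketch below. It assumes an RSS 2.0 style feed where each entry has a
plain <link>URL</link> element (Atom feeds are laid out differently), it uses
“grep -o”, which GNU and BSD grep support but POSIX does not require, and the
function name new_feed_links and the file location are made up for
illustration. A real XML parser would be more robust than grep and sed.

```shell
#!/bin/sh
# Hypothetical sketch: print feed links that have not been seen before.
SEEN_FILE="${SEEN_FILE:-$HOME/.site-monitor-seen}"   # assumed location

# usage: new_feed_links < feed.xml
# Reads RSS XML on stdin, prints every <link> URL that is not yet listed
# in SEEN_FILE, and appends it there so it is only reported once.
new_feed_links() {
    touch "$SEEN_FILE" || return 1
    grep -o '<link>[^<]*</link>' |
        sed -e 's/^<link>//' -e 's#</link>$##' |
        while IFS= read -r url; do
            if ! grep -qxF -- "$url" "$SEEN_FILE"; then
                printf '%s\n' "$url" >> "$SEEN_FILE"
                printf '%s\n' "$url"
            fi
        done
}
```

You would feed it with something like
wget -q -O - http://web.site/feed.xml | new_feed_links
(the feed URL is hypothetical). Note that on the first run every link in the
feed, including the channel’s own, is reported as new.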

Good luck,

Antony.
