A scraping incident

One of my mashups fell prey to the dreaded scrape rot - a complete overhaul of the target site that invalidated all of my scraping rules.  The pages in question are from globalincidentmap.com, which previously powered my internationalincident mashup (see Sri Lanka Incident Mashup).  The change was catastrophic - queryable content from the site is no longer free, but requires a paid membership and a login process.  One option would be to pay for a membership, but besides the steep price, I doubt that the license terms allow republishing of the data.  I could support the mashup only for paid users of the service, collecting credentials and forwarding them on, but that is both questionably secure (a user would have to trust that I didn’t abuse the credentials temporarily in my possession) and unrealistic, since few if any of my audience would spring for the cost of membership.  In effect, my mashup has been totaled.

This illustrates one reason why scraping should be used only as a last resort, when no more stable forms of content, such as feeds or Web services, are available.  When content and presentation are mixed, a change in the presentation looks to the scraper just like a change in the content.  Although the scraping features of the WSO2 Mashup Server are popular, I like to think of them as a stop-gap while publishers find cost-effective ways of serving up presentation-free content, such as delivering simple services using the Mashup Server ;-).  Ideally, more and more publishers will recognize the value of raw content, and the need for Web scraping will diminish.  Gonna take a while though…
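To make the content-versus-presentation point concrete, here is a minimal Python sketch (not Mashup Server code).  The page markup, the feed, and the function names are all made up for illustration; the idea is just that the same content survives a redesign in a feed, while a scraping rule tied to the old markup silently comes back empty.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup for the same page before and after a redesign.
# The content is identical; only the presentation changed.
OLD_PAGE = '<table><tr><td class="incident">Road closure, Colombo</td></tr></table>'
NEW_PAGE = '<div class="list"><span data-kind="incident">Road closure, Colombo</span></div>'

# Hypothetical feed carrying the same content with no presentation at all.
FEED = (
    '<rss version="2.0"><channel><title>Incidents</title>'
    '<item><title>Road closure, Colombo</title></item>'
    '</channel></rss>'
)


def scrape_incidents(html):
    """Scraping rule bound to the presentation: a <td class="incident"> cell."""
    return re.findall(r'<td class="incident">(.*?)</td>', html)


def feed_incidents(xml_text):
    """Read the same content from a feed; depends only on the feed structure."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]


if __name__ == "__main__":
    print(scrape_incidents(OLD_PAGE))  # rule matches the old layout
    print(scrape_incidents(NEW_PAGE))  # empty list: the rule has rotted
    print(feed_incidents(FEED))        # unaffected by the redesign
```

Note that the scraper doesn't fail loudly after the redesign; it just returns nothing, which is part of what makes scrape rot so easy to miss.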

Even without scraping, one of the deep problems with mashups and distributed programming remains: services disappear, are altered, or change their usage terms, breaking their dependent mashups in the process.  A lot has been written on this, which can generally be summed up as "this is a hard problem."

One thing we plan to do in the future to make sure that service changes don’t harm downstream dependents is to use more of the advanced functionality of the WSO2 Registry upon which the WSO2 Mashup Server is built - namely, versioning.  Today, each time a service changes, an old copy is retained in the database, but it is no longer "alive" as a service.  Some future version will have a simple interface for keeping the old versions online and helping users lock into one of these previous versions.  Some cool dependency management features on the drawing board for the Registry will also help find and record dependencies and notify dependents of changes.
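As a rough illustration of the idea (and emphatically not the WSO2 Registry API, which I'm not showing here), a toy in-memory registry in Python might keep every published version callable, let a consumer lock into a version number, and report which dependents should hear about a change.  The class and method names below are hypothetical.

```python
class ServiceRegistry:
    """Toy registry: every published version stays alive and callable."""

    def __init__(self):
        self._versions = {}    # service name -> list of handlers (index = version - 1)
        self._dependents = {}  # service name -> names of mashups that depend on it

    def depend(self, dependent, service):
        """Record that `dependent` relies on `service`."""
        self._dependents.setdefault(service, []).append(dependent)

    def publish(self, service, handler):
        """Add a new version; return its number and notices for dependents."""
        versions = self._versions.setdefault(service, [])
        versions.append(handler)
        number = len(versions)
        notices = [
            f"{dependent}: {service} is now at version {number}"
            for dependent in self._dependents.get(service, [])
        ]
        return number, notices

    def call(self, service, *args, version=None):
        """Call the latest version, or a pinned one if `version` is given."""
        versions = self._versions[service]
        if version is None:
            handler = versions[-1]
        elif 1 <= version <= len(versions):
            handler = versions[version - 1]
        else:
            raise LookupError(f"{service} has no version {version}")
        return handler(*args)


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.publish("incidents", lambda: ["Road closure"])
    registry.depend("my-mashup", "incidents")

    # The publisher changes the response shape in version 2.
    _, notices = registry.publish("incidents", lambda: {"items": ["Road closure"]})
    print(notices)

    print(registry.call("incidents", version=1))  # pinned mashup keeps working
    print(registry.call("incidents"))             # unpinned callers see the new shape
```

A mashup that pins `version=1` keeps getting the old response shape, while the notice gives its author a chance to migrate on their own schedule.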

But would these help in the case of the internationalincident service?  This is a case where there is a deliberate change which prevents "unauthorized" access.  The solution in this case was to mark the service as obsolete, and go out and find a whole new source of data.  The new srilankanincident service is a result - though the data is slightly different, perhaps a result of focusing narrowly on Colombo, it was a fairly short task to reprogram it, and even improve it, once I had found a new source of data.  The speed of fixing catastrophic failures is my current best hope against scrape rot.

Mashup Server Webinar May 13th

I’ve got another free Webinar coming up - again an Introduction to the WSO2 Mashup Server and to mooshup.com on 13 May from 9-10AM PST.  Join me if you:

  • Are curious about mashups and Mashup Server products, and want an overview of the capabilities of WSO2’s offering.
  • Have services, web pages, or other information sources available and want smarter ways to use them.
  • Have always wanted an application to do (x) on the Web, but found it too costly to develop.
  • Know JavaScript and want to see what it can do outside the browser.

    Register now!

Am I the last to know we’re cool?!

Seems the WSO2 crew has been blogging about WSO2 appearing among the "cool 5" companies in a recent Gartner report (paid subscribers only).  What intrigued them about the WSO2 Mashup Server was support for the hitherto paradoxical "lightweight but enterprise-oriented" services.

And here I am a couple of days late.  I guess for breaking news and the real skinny on "cool" you would do well to add the feeds of Paul, Sanjiva, Daniel, Glen, and Keith to your blogroll.