This project has been superseded by an XML-RPC version.

Smugmug RSS Generator

The scraper (source) is pretty simple. The process (sketched in code after the list) is:

  1. Retrieve the gallery that was requested
  2. Run the HTML through xmllint to make it XHTML
  3. Process the XHTML through a SAX parser to pull out the salient information
  4. Output RSS
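
Here is a minimal sketch of that pipeline, assuming Python 3 and an xmllint binary on the PATH; the URL handling and the handler object are placeholders rather than the real implementation.

    # Rough sketch of the fetch / clean / parse steps, assuming Python 3
    # and xmllint available on the PATH. URL and handler are placeholders.
    import subprocess
    import urllib.request
    import xml.sax

    def fetch_gallery(url):
        # 1. Retrieve the requested gallery page
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def to_xhtml(html_bytes):
        # 2. Let xmllint turn the tag soup into well-formed XHTML
        #    (parse warnings go to stderr; the cleaned markup is on stdout)
        result = subprocess.run(
            ["xmllint", "--html", "--xmlout", "-"],
            input=html_bytes, capture_output=True)
        return result.stdout

    def scrape(url, handler):
        # 3. Feed the XHTML through a SAX parser; the handler is where the
        #    state machine described below lives, and it emits the RSS (step 4)
        xml.sax.parseString(to_xhtml(fetch_gallery(url)), handler)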

That second step took quite a bit of examination of SmugMug's pages. Currently all the pages I am interested in working with use the elegant style, so that's the only one I worked with. Things learned:

My information is limited, but as best I can tell, breaking things up by tables should work pretty well. I think I'll write the program as a state machine. The transitions will be a combination of tag and nesting depth, written [tag]:[depth], where the depth counts only elements of that same tag. So table:2 is a transition on a table that is nested inside another table; the actual depth in the tree might be something like 5 once you count things like <html>, <body> and <td>, but the 2 means it is a second-level table. Additionally, there are arbitrarily many ignored tags: if a transition is not defined for an input, the input is just ignored.
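
As a rough illustration of the tag-plus-depth idea, here is a sketch of a SAX handler with a transition table keyed that way; the state names and transition entries are made up, not the real ones for the elegant style.

    # Sketch of depth-keyed transitions in a SAX handler. The states and
    # TRANSITIONS entries below are hypothetical examples.
    import xml.sax

    class GalleryHandler(xml.sax.ContentHandler):
        TRANSITIONS = {
            ("start", "table:2"): "gallery",   # hypothetical
            ("gallery", "td:1"):  "cell",      # hypothetical
        }

        def __init__(self):
            super().__init__()
            self.state = "start"
            self.open = {}    # tag -> how many of that tag are currently open

        def startElement(self, name, attrs):
            # Depth counts only elements of the same tag, so the first
            # table is table:1 and a table nested inside it is table:2
            depth = self.open.get(name, 0) + 1
            self.open[name] = depth
            key = (self.state, "%s:%d" % (name, depth))
            # Inputs with no defined transition are simply ignored
            self.state = self.TRANSITIONS.get(key, self.state)

        def endElement(self, name):
            self.open[name] = self.open.get(name, 0) - 1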

Page Flowchart

I'm pretty sure this will work, and each section is atomic: if I am buffering the output from the CDATA sections, I can clear the buffer before entering each state, and on exit I won't have collected any extra garbage.
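
For example, the handler could keep one buffer that is wiped on every state change; this is just a sketch of the idea, and the emit() hook is a placeholder for building the RSS items.

    # Clear-on-entry buffering: each state only ever sees its own text.
    import xml.sax

    class BufferingHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.state = "ignore"
            self.buffer = ""

        def enter_state(self, new_state):
            # Flush whatever the old state collected, then start the new
            # state with an empty buffer so no earlier garbage leaks in
            self.emit(self.state, self.buffer.strip())
            self.state = new_state
            self.buffer = ""

        def characters(self, content):
            self.buffer += content

        def emit(self, state, text):
            if state != "ignore" and text:
                print(state, text)   # placeholder: build an RSS item here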


The program is working for the main page, it seems. Unfortunately I've already blown a day on this project and I've just discovered an XML-RPC interface to the site. Dammit. Oh well, I'll continue later…