The scraper (source) is pretty simple. The process is:
That second step took quite a bit of examination of SmugMug's pages. Currently all the pages I'm interested in working with use the elegant style, so that's the only one I worked with. Things learned:
Table Number | Function
---|---
1 | The banner across the top including the search and shopping icons |
1:2 | Layout for the banner |
3 | Header text "Marc Valentin's galleries" |
4 | Page intro with main thumbnail and description |
5 | Empty |
6 | Empty |
7 | Holder for galleries |
7:8 | Header for all the galleries |
7:9 | Left gallery in first row (Christmas) |
7:9:10 | Gallery description
7:11 | Right gallery in first row (Halloween) |
7:11:12 | Gallery description |
7:13, 7:17, 7:21, 7:25, 7:29 | Left gallery in the remaining rows
7:15, 7:19, 7:23, 7:27 | Right gallery in the remaining rows
31 | Empty |
32 | Photos by keyword |
32:33 | Photos by keyword header |
34 | Empty |
35 | Empty |
36 | Bottom navbar |
My information is limited, but as best I can tell, breaking things up by tables should work pretty well. I think I'll write the program as a state machine, with transitions keyed on a combination of tag and depth. The notation is [tag]:[class depth], so `table:2` is a transition on a table that is nested inside another table. The actual depth in the tree might be something like 5 once you count things like `<html>`, `<body>` and `<td>`; the 2 just means it is a second-level table. There are also arbitrarily many ignored tags: if a transition is not defined for an input, the input is simply ignored.
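Roughly, the idea is something like this. This is only a sketch, assuming Python's standard `html.parser`; the state names and the entries in the transition table are placeholders for illustration, not the real mapping:

```python
from html.parser import HTMLParser

class SmugScraper(HTMLParser):
    """Walk the page and fire state transitions keyed on tag:class-depth."""

    # Hypothetical transition table: (current state, "tag:depth") -> new state.
    # Anything not listed here is simply ignored.
    TRANSITIONS = {
        ("start",     "table:1"): "banner",
        ("banner",    "table:1"): "galleries",   # illustrative only
        ("galleries", "table:2"): "gallery",     # a table inside another table
    }

    def __init__(self):
        super().__init__()
        self.state = "start"
        self.depth = {}          # per-tag nesting depth, e.g. depth["table"]

    def handle_starttag(self, tag, attrs):
        self.depth[tag] = self.depth.get(tag, 0) + 1
        key = f"{tag}:{self.depth[tag]}"          # e.g. "table:2"
        new_state = self.TRANSITIONS.get((self.state, key))
        if new_state is not None:                 # undefined input: ignore it
            self.state = new_state

    def handle_endtag(self, tag):
        if self.depth.get(tag, 0) > 0:
            self.depth[tag] -= 1
```

The depth counter only counts tags of the same name, which is what makes `table:2` mean "second-level table" regardless of how many `<html>`, `<body>` or `<td>` elements sit in between.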
I'm pretty sure this will work, and each section is atomic: if I'm buffering the output from the CDATA sections, I can clear the buffer before entering each state, and on exit I won't have collected any extra garbage.
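The buffer bookkeeping might look roughly like this (again just a sketch; the class and method names are invented, and in practice this would hang off the parser above, with `handle_data()` calling `text()` and every state change going through `enter()`):

```python
class SectionBuffer:
    """Sketch of buffer-per-state: flush and clear on every state change."""

    def __init__(self):
        self.state = "start"
        self.buffer = []     # CDATA/text collected for the current state
        self.sections = {}   # finished text, keyed by the state it belonged to

    def text(self, data):
        self.buffer.append(data)        # called from the parser's handle_data()

    def enter(self, new_state):
        # Save whatever the old state collected, then start the new state
        # with an empty buffer so it can't pick up stray text.
        collected = "".join(self.buffer).strip()
        if collected:
            self.sections.setdefault(self.state, []).append(collected)
        self.buffer = []
        self.state = new_state
```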
The program is working for the main page, it seems. Unfortunately, I've already blown a day on this project and I've just discovered an XML-RPC interface to the site. Dammit. Oh well, I'll continue later…