With Serverless Framework, OpenWhisk, and Cloudant
RSS is long dead. Yes, there is still Feedly, but I found myself going there less and less often. One reason is that the interesting content usually ends up being shared or recommended to me on Facebook, Twitter, Medium, or Pocket anyway. When I do go to Feedly (I have more than a hundred feeds there), I mostly just end up marking everything as read… Not a good way to kill time.
On the other hand, for a small amount of less popular content (but must-read for me)[1], I do use the IFTTT RSS Feed applet to automatically save new content to my Pocket. It works perfectly for me, except … I have to create one applet per feed and manage them all manually. I could use a feed aggregator (there are many options available), but most of them do not allow updating the mixed feed once it is created.
Well, that is a good opportunity (excuse?) for me to play with something new, more specifically:
Event-based architecture
Serverless Framework with OpenWhisk
Event-based architecture
The architecture is data-centric, with various events emitted to trigger different actions. In my case, events are either data changes or time-based schedules.
The first area is the feeds database, which simply stores the URLs of the feeds I’m interested in. Simple wrapper actions are created for adding feeds to and deleting them from this database.
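As a rough illustration (not the project’s exact code), a put action could be as small as the sketch below; the action name, database name, and the cloudantUrl parameter are assumptions based on the description above.

```js
// put.js - sketch of a wrapper action that registers a feed URL.
// Assumes a promise-returning Cloudant client (e.g. @cloudant/cloudant) and that
// the Cloudant URL is supplied as a default action parameter.
const Cloudant = require('@cloudant/cloudant');

function main(params) {
  const cloudant = Cloudant({ url: params.cloudantUrl });
  const feeds = cloudant.db.use('feeds');

  // The feed URL itself is used as the document _id (see the Cloudant hacks below),
  // so registering the same feed twice is effectively a no-op.
  return feeds
    .insert({ _id: params.url, addedAt: new Date().toISOString() })
    .then(() => ({ ok: true, url: params.url }))
    .catch(err => ({ ok: false, error: err.message }));
}

exports.main = main;
```

The delete action would be the mirror image: look up the document by its URL and remove it.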
The second area is feed crawling. The feed aggregator needs to constantly check each feed to see whether any new items have been published. A cron action is triggered every hour, iterating through all the registered feeds. It invokes the crawl action to find new feed items and write them to the items database.
The third area is the notification of new feed items. This is triggered by the OpenWhisk Cloudant feed whenever a new item is written to the items database. I’m taking advantage of IFTTT to trigger the saving to my Pocket.
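A rough sketch of the hourly cron action mentioned above follows; the parameter names, the Cloudant client, and the way crawl is invoked are illustrative assumptions, not the project’s exact code.

```js
// cron.js - sketch: list all registered feed URLs and invoke the crawl action for them.
const Cloudant = require('@cloudant/cloudant');
const openwhisk = require('openwhisk');

function main(params) {
  const cloudant = Cloudant({ url: params.cloudantUrl });
  const feeds = cloudant.db.use('feeds');
  const ow = openwhisk(); // picks up OpenWhisk credentials from the action environment

  return feeds.list().then(result => {
    const urls = result.rows.map(row => row.id); // the feed URL doubles as the _id
    // One non-blocking crawl invocation per feed; see the batching hack further below.
    return Promise.all(
      urls.map(url => ow.actions.invoke({ name: 'crawl', params: { url } }))
    ).then(() => ({ dispatched: urls.length }));
  });
}

exports.main = main;
```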
The event-based flow makes actions feel more like “glue”, which is a good thing. Most actions are small and mostly isolated from one another. Each focuses on one and only one “thing”. The put and delete actions only deal with saving/deleting feed records. The cron action only fetches the list of feed URLs from the database and passes them to the crawl action. The crawl action only fetches feeds, parses the items, and writes them to the database. The ifttt action is triggered whenever a new item is written to the items database and only notifies IFTTT with the received feed item URL.
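For example, the ifttt action could be as small as the following sketch. The IFTTT Maker webhook URL format is the documented one, but the event name (new_feed_item), the key parameter, and the exact shape of the Cloudant trigger payload are assumptions.

```js
// ifttt.js - sketch: fired by the Cloudant changes feed; forward the new item's
// URL (which is its _id) to an IFTTT Maker webhook so IFTTT saves it to Pocket.
const request = require('request-promise'); // assumed to be available in the runtime

function main(params) {
  const itemUrl = params.id; // the changes-feed payload carries the document id
  return request({
    method: 'POST',
    uri: `https://maker.ifttt.com/trigger/new_feed_item/with/key/${params.iftttKey}`,
    json: true,
    body: { value1: itemUrl }
  }).then(() => ({ notified: itemUrl }));
}

exports.main = main;
```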
This isolation makes extending the system easy and safe. I later added a web action that returns an RSS feed of the latest 20 items: a simple Cloudant query index returns the 20 items, and the web action composes an RSS XML document from them. Nothing that was already there had to be touched.
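A minimal sketch of that web action, assuming a Cloudant (Mango) query index on a published timestamp field and hand-rolled RSS XML; the field names and index are illustrative:

```js
// feed.js - web action sketch: return the latest 20 items as an RSS feed.
const Cloudant = require('@cloudant/cloudant');

function main(params) {
  const cloudant = Cloudant({ url: params.cloudantUrl });
  const items = cloudant.db.use('items');

  // Relies on a query index over the "published" field (created separately).
  return items.find({
    selector: { published: { $gt: null } },
    sort: [{ published: 'desc' }],
    limit: 20
  }).then(result => {
    const entries = result.docs
      .map(doc => `<item><title>${doc.title}</title><link>${doc._id}</link></item>`)
      .join('');
    const rss = '<?xml version="1.0"?><rss version="2.0"><channel>' +
                '<title>feed-aggregator</title>' + entries + '</channel></rss>';
    // Web action response: raw XML body with the proper content type.
    return { headers: { 'Content-Type': 'application/rss+xml' }, body: rss };
  });
}

exports.main = main;
```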
This is pretty trivial in functionality, but it does capture the spirit of a serverless architecture.
A few hacks related to Cloudant make the action code simpler (and more efficient). One is that I use the feed and feed item URLs as _id (the Cloudant document unique identifier). After a feed is parsed, all its items are simply written to the items database with a single bulk API call. I didn’t bother to check whether each feed item already exists (that would require 2 API calls per item). Since the feed item URL is used as _id, Cloudant simply skips the items that already exist. After all, I only ever want one notification per feed item. BTW, the notification from the OpenWhisk Cloudant feed only contains _id. Because the feed item URL is the _id, a second GET request to Cloudant is saved.
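In code, that trick is just one bulk call; a sketch, assuming the nano-style bulk API of the Cloudant npm module and hypothetical field names:

```js
// Write all items parsed from one feed in a single bulk request. Each item's URL
// is used as _id, so Cloudant reports a conflict for documents that already exist
// instead of creating duplicates; no per-item existence check (and no extra GETs) needed.
function saveItems(itemsDb, feedUrl, parsedItems) {
  const docs = parsedItems.map(item => ({
    _id: item.link, // feed item URL as the document id
    title: item.title,
    feed: feedUrl
  }));
  return itemsDb.bulk({ docs });
}
```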
Another hack: in Cloudant, deleted documents are only marked as deleted by setting a special field _deleted to true. If another document with the same _id is added later, the original document is reused (with _deleted cleared and a new _rev). To avoid duplicate notifications, the action only notifies when _rev starts with “1-”, which means the document is being added for the very first time.
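Inside the ifttt action, that guard might look like the sketch below; the payload field names follow the Cloudant changes-feed format but are assumptions as far as this project is concerned.

```js
// Only notify for brand-new documents: a _rev starting with "1-" is the very first
// revision. A previously deleted item that gets re-added carries a higher revision
// number, so it is silently skipped.
function isFirstRevision(change) {
  const rev = change.changes && change.changes[0] && change.changes[0].rev;
  return typeof rev === 'string' && rev.startsWith('1-');
}
```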
One more hack is that the crawl action actually does batch processing (multiple feeds at a time). Initially, the cron action simply dispatched one feed per crawl invocation. Most of what the crawl action does is wait for the feed retrieval to return, which typically takes about 2~3 seconds. Batching multiple feeds into a single crawl invocation does not increase its processing time. Remember, OpenWhisk is billed by actual usage (time and memory). With a batch size of 20, the cost drops to roughly 1/20. That’s a significant saving.
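A sketch of that change in the cron action, assuming a batch size of 20 and that the crawl action accepts a urls array instead of a single url:

```js
// Split the registered feed URLs into batches of 20 and fire one non-blocking
// crawl invocation per batch instead of one per feed.
const BATCH_SIZE = 20;

function dispatchInBatches(ow, urls) {
  const batches = [];
  for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    batches.push(urls.slice(i, i + BATCH_SIZE));
  }
  return Promise.all(
    batches.map(batch => ow.actions.invoke({ name: 'crawl', params: { urls: batch } }))
  );
}
```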
You might notice that I use the Cloudant npm module directly in my actions, instead of OpenWhisk’s pre-installed Cloudant package.
Latency is not the reason, as my Cloudant usage is at most one level deep (as I found out previously). The main reason is simply the lack of functionality. For example, the pre-installed Cloudant package only supports read/write by document ID. I can’t use bulk operations with it, nor can I use views.
Using a single cron action to iterate through all feeds is fine for my personal usage. After all, I only have about 10 feeds registered. What if you have a million feeds?
With a batch size of 20, that means 50 thousand crawl invocations running in parallel. I think it should still scale, because FaaS is supposed to handle this well.
Okay, we might want to be nicer to the OpenWhisk platform so that it isn’t crashed or slowed down by our unnecessary workload. Or maybe we just want to be … fancier, like using a different crawl frequency for each feed. We could actually create one cron per feed, using a per-feed trigger/action:
> wsk trigger create feed-aggregator-dev-crawl-trigger-0001 \
--feed /whisk.system/alarms/alarm \
-p cron "0 * * * *" \
-p trigger_payload "{\"url\":\"https://bryantsai.com/feed\"}"
> wsk rule create feed-aggregator-dev-crawl-trigger-0001-rule \
feed-aggregator-dev-crawl-trigger-0001 \
feed-aggregator-dev-crawl
This needs to be performed in the put action. When a new feed is registered, a cron trigger and a rule are created. When the trigger is … triggered, the payload is passed as the parameter to the associated action. The delete action needs to delete the corresponding trigger and rule. The benefit is that we can now spread the workload across the hour (random minute/second). We can also adjust the schedule based on each feed’s publishing frequency.
The downside of this, of course, is the extra complexity of managing all these per-feed triggers and rules. It is really cumbersome to set them up with the openwhisk npm module, as 3 API calls are needed for creation and another 3 for deletion. The error handling is just … unmanageable, as a manual rollback is necessary when things go wrong in the middle. The other downside is cost, as we are now back to one feed per invocation.
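For reference, the 3 creation calls in the put action would look roughly like this with the openwhisk npm module (names follow the wsk commands above; error handling and rollback are omitted, which is exactly the pain point):

```js
// Sketch: create a per-feed alarm trigger, attach the alarms feed to it, and add
// the rule that binds the trigger to the crawl action.
const openwhisk = require('openwhisk');

function createPerFeedCron(feedUrl, suffix) {
  const ow = openwhisk();
  const trigger = `feed-aggregator-dev-crawl-trigger-${suffix}`;
  return ow.triggers.create({ name: trigger })
    .then(() => ow.feeds.create({
      name: '/whisk.system/alarms/alarm',
      trigger,
      params: { cron: '0 * * * *', trigger_payload: JSON.stringify({ url: feedUrl }) }
    }))
    .then(() => ow.rules.create({
      name: `${trigger}-rule`,
      action: 'feed-aggregator-dev-crawl',
      trigger
    }));
}
```

Deletion mirrors this with rules.delete, feeds.delete, and triggers.delete, and a failure in the middle leaves orphaned triggers or rules to clean up by hand.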
I’m happy with a single cron action.
Serverless Framework with OpenWhisk
Basically, Serverless Framework is a tool that makes it much easier to manage your FaaS “projects”. If you have more than just a few actions, using it saves you the trouble of writing lots of shell scripts.
Everything except code is declared in one file, serverless.yml:
Action default parameters.
Feeds: for now, you can define /whisk.system/alarms, /whisk.system/cloudant, and /whisk.system/messaging directly as events for actions, and Serverless Framework automatically creates/manages the corresponding OpenWhisk triggers and rules for you.
Triggers and their parameters: for other OpenWhisk feeds, you can still define custom triggers, and Serverless Framework can still automatically create/manage the corresponding OpenWhisk triggers and rules for you.
The association of actions and triggers: this is done by defining the actions’ events. Again, Serverless Framework manages the corresponding OpenWhisk rules for you.
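To make that concrete, a stripped-down serverless.yml along those lines might look like the following; this is only an illustration of the structure, and the exact keys in the real project may differ:

```yaml
# Sketch of a serverless.yml for the feed aggregator (not the project's actual file).
service: feed-aggregator
provider:
  name: openwhisk
plugins:
  - serverless-openwhisk

functions:
  cron:
    handler: cron.main
    parameters:                        # action default parameters
      cloudantUrl: ${env:CLOUDANT_URL}
    events:
      - schedule: cron(0 * * * *)      # /whisk.system/alarms feed behind the scenes
  ifttt:
    handler: ifttt.main
    events:
      - trigger: items-changes-trigger # custom trigger, fed by /whisk.system/cloudant
```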
I can simply issue the command serverless deploy and everything is set up appropriately on OpenWhisk. There is no longer a need to manage all those feeds/triggers/rules individually and manually. Managing all the related artifacts has never been easier. Automation is king.
That’s it! It even supports packaging actions as Node.js modules (see the crawl action defined in serverless.yml). There’s really nothing else special about using Serverless Framework with OpenWhisk.
Using Serverless Framework with OpenWhisk is a no-brainer to me.
Note that at this point, I’ve encountered a few issues in using Serverless with OpenWhisk:
It does not support the management of package bindings yet, so I have to use a shell script, create-cloudant-binding.sh, which needs to be run first.
It does not support the new API gateway yet, so the http event does not work properly for me. I have to set up the API on OpenWhisk manually for my web action feed.
All action names are prefixed with the Serverless Framework project name (crawl becomes feed-aggregator-dev-crawl in OpenWhisk). Automatically created rules are also named after their associated actions/triggers. This helps a lot with OpenWhisk’s otherwise cluttered namespace. Except … custom trigger names are not managed in the same way.
Here’s the project, have fun!
bryantsai/feed-aggregator