I wrote myself a feed aggregator for my front page. And… voila! I’m finally satisfied with it to post it.

Update: I've now published this as a complete standalone rails app on github/sbwoodside/portal. The important bits are app/controllers/portal_controller.rb and config/config.yml.

For me I run this as a standalone rails app, separately from my weblog. You could do that (and redirect requests to / or /index.html with Apache or nginx/etc. Or you could integrate it into your own app. Up to you.

Features:

  • Will aggregate ANY feed, no matter how badly mangled by the creators, using FeedTools (I also tried feed_normalizer and simple rss but they're not as good)
  • Deals with slowness of downloading feeds, RSS, etc., and REXML by caching
  • Deals with need to recache using elegant http/cron periodic system
  • Display the feeds in a facebook-like news feed format, sorted by dated.
  • You can easily re-label the feeds, add and renew feeds (in the code)
  • Only 35 lines of controller code!

The heart of it is the controller, obviously. The best thing? It’s only one page of code! Ruby rocks!

require 'feed_tools'

class PortalController < ApplicationController
layout 'site'
# Instructions: 1. Change @@secret. 2. Add a cron job to regularly call /?recache=yes&secret=XXXXXXX
# This is a feed aggregator that uses FeedTools because it handles practically any feed.
# But FeedTools is super slow in every way so this aggregator stops using it as soon as possible.
# TODO add XML feed output

@@secret = "change_this" # change this to protect your site from DoS attack
# The array of feeds you want to aggregate. If you change this then manually delete the whole cache.
@@uris = ['http://simonwoodside.com:8080/posts/rss', 'http://simonwoodside.com/comments/rss',
'http://semacode.com/posts/rss',
'http://api.flickr.com/services/feeds/photos_public.gne?id=20938094@N00&lang=en-us&format=rss_200',
'http://api.flickr.com/services/feeds/activity.gne?user_id=20938094@N00']
# A map between the "official" feed titles in the XML, and the titles you want to show when rendered.
@@title_map = { "Simon Says" => "Simon Says:", "Simon Says: Comments" => "Simon Says comment:",
"Uploads from sbwoodside" => "Flickr picture:", "Semacode" => "Semacode blog post:",
'Comments on your photostream and/or sets' => 'Flickr comment:' }

def index
if params[:recache] and @@secret == params[:secret]
cache_feeds
expire_fragment(:controller => 'portal', :action => 'index') # next load of index will re-fragment cache
render :text => "Done recaching feeds"
else
@aggregate = read_cache unless read_fragment({})
end
end

private
# This will replace cached feeds in the DB that have the same URI. Be careful not to tie up the DB connection.
def cache_feeds
puts "Caching feeds... (can be slow)"
feeds = @@uris.map do |uri|
feed = FeedTools::Feed.open( uri )
{ :uri => uri, :title => feed.title,
:items => feed.items.map { |item| {:title => item.title, :published => item.published, :link => item.link} } }
end
feeds.each { |feed|
new = CachedFeed.find_or_initialize_by_uri( feed[:uri] )
new.parsed_feed = feed
new.save!
}
end
# Make an array of hashes, each hash is { :title, :feed_item }
def read_cache
@@uris.map { |uri|
feed = CachedFeed.find_by_uri( uri ).parsed_feed
feed[:items].map { |item| {:feed_title => @@title_map[feed[:title]] || feed[:title], :feed_item => item} }
} .flatten .sort_by { |item| item[:feed_item][:published] } .reverse
end
end

It’s actually pretty simple but it took me a while to get the balance just right. What you need to do is set up a cron job or other repetitive task that does an HTTP load on http://mywebsite.com/?recache=yes&secret=XXXXXXXX … every once in a while. You can use wget or curl, or whatever. You might want to recache every minute, five minutes, hour, whatever. Since it’s done as a part of the controller there’s no nonsense about running backgroundRB, RubyCron and all the other nonsense at HowToRunBackgroundJobsInRails. Yay!

Here’s the view:


<div id="feed-stream">
<% cache do %>
<%
lastday = -1
@aggregate.each do |item| %>
<div class="item">
<%
mydate = item[:feed_item][:published].getlocal
if mydate.yday != lastday
%><div class="item_details"><p style="text-align:right"><%= mydate.strftime('%A, %B %e') %></p></div><%
lastday = mydate.yday
end
%>
<div class="item_content">
<%= item[:feed_title] %>
<a href="<%= item[:feed_item][:link] %>"><%= item[:feed_item][:title] %></a>
</div>
</div>
<% end %>
<% end %>
</div>

My cache is all Hashes. I don’t cache the FeedTools object because I discovered that even after FeedTools has parsed your feed, accessing the supposedly “final” data is incredibly slow (like maybe 10x or 100x slower than a hash).

Here’s the model:


require 'feed_tools'
class CachedFeed < ActiveRecord::Base
validates_presence_of :uri, :parsed_feed
validates_uniqueness_of :uri
serialize :parsed_feed, Hash # note that if this exceeds a certain KB size, it will likely fail (thinking it's a String)
end

And the migration:


class CreateCachedFeeds < ActiveRecord::Migration
def self.up
create_table :cached_feeds do |t|
t.column :uri, :string, :limit => 2048
t.column :parsed_feed, :text, :limit => 128.kilobytes # use for serialized object
t.timestamps
end
end

def self.down
drop_table :cached_feeds
end
end

Well, that’s all you need. When I started out to make this I thought I’d find a simple example out there but there wasn’t anything. It turns out that there’s a number of interesting challenges – picking a parser to deal with difficult feeds, XML, and malformatted XML… to deal with caching … to deal with background processing. Took me a while to get it all just right.

It powers my own front page … consider to be under standard ruby open source license. As the vending machine says: Share And Enjoy!