Ruby on Rails Feed/RSS Aggregator (35 lines)
Posted by Simon on December 07, 2008 at 03:17 AM
Categories: code, rails, ruby
I wrote myself a feed aggregator for my front page. And... voila! I'm finally satisfied enough with it to post it.
Update: I've now published this as a complete standalone rails app on github/sbwoodside/portal. The important bits are app/controllers/portal_controller.rb and config/config.yml.
I run this as a standalone Rails app, separately from my weblog. You could do that (and redirect requests for / or /index.html to it with Apache, nginx, etc.). Or you could integrate it into your own app. Up to you.
Features:
- Will aggregate ANY feed, no matter how badly mangled by the creators, using FeedTools (I also tried feed_normalizer and simple rss but they're not as good)
- Deals with slowness of downloading feeds, RSS, etc., and REXML by caching
- Deals with need to recache using elegant http/cron periodic system
- Displays the feeds in a Facebook-like news feed format, sorted by date.
- You can easily re-label the feeds, add and renew feeds (in the code)
- Only 35 lines of controller code!
The heart of it is the controller, obviously. The best thing? It's only one page of code! Ruby rocks!
require 'feed_tools'
class PortalController < ApplicationController
  layout 'site'
  # Instructions: 1. Change @@secret. 2. Add a cron job to regularly call /?recache=yes&secret=XXXXXXX
  # This is a feed aggregator that uses FeedTools because it handles practically any feed.
  # But FeedTools is super slow in every way so this aggregator stops using it as soon as possible.
  # TODO add XML feed output
  @@secret = "change_this" # change this to protect your site from DoS attack
  # The array of feeds you want to aggregate. If you change this then manually delete the whole cache.
  @@uris = ['http://simonwoodside.com:8080/posts/rss', 'http://simonwoodside.com/comments/rss',
            'http://semacode.com/posts/rss',
            'http://api.flickr.com/services/feeds/photos_public.gne?id=20938094@N00&lang=en-us&format=rss_200',
            'http://api.flickr.com/services/feeds/activity.gne?user_id=20938094@N00']
  # A map between the "official" feed titles in the XML, and the titles you want to show when rendered.
  @@title_map = { "Simon Says" => "Simon Says:", "Simon Says: Comments" => "Simon Says comment:",
                  "Uploads from sbwoodside" => "Flickr picture:", "Semacode" => "Semacode blog post:",
                  'Comments on your photostream and/or sets' => 'Flickr comment:' }
  def index
    if params[:recache] and @@secret == params[:secret]
      cache_feeds
      expire_fragment(:controller => 'portal', :action => 'index') # next load of index will re-fragment cache
      render :text => "Done recaching feeds"
    else
      @aggregate = read_cache unless read_fragment({})
    end
  end
  private
  # This will replace cached feeds in the DB that have the same URI. Be careful not to tie up the DB connection.
  def cache_feeds
    puts "Caching feeds... (can be slow)"
    feeds = @@uris.map do |uri|
      feed = FeedTools::Feed.open( uri )
      { :uri => uri, :title => feed.title,
        :items => feed.items.map { |item| {:title => item.title, :published => item.published, :link => item.link} } }
    end
    feeds.each { |feed|
      new = CachedFeed.find_or_initialize_by_uri( feed[:uri] )
      new.parsed_feed = feed
      new.save!
    }
  end
  # Make an array of hashes, each hash is { :feed_title, :feed_item }
  def read_cache
    @@uris.map { |uri|
      feed = CachedFeed.find_by_uri( uri ).parsed_feed
      feed[:items].map { |item| {:feed_title => @@title_map[feed[:title]] || feed[:title], :feed_item => item} }
    } .flatten .sort_by { |item| item[:feed_item][:published] } .reverse
  end
end
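If you want to play with the merge-and-sort step of read_cache outside of Rails, here is a minimal sketch in plain Ruby. It is not part of the app: the `aggregate` method, the `TITLE_MAP` constant and the sample data are hypothetical stand-ins for `read_cache`, `@@title_map` and the cached hashes. It also assumes you would rather skip a feed that has not been cached yet (a `nil` entry) and skip items with no publication date, since `sort_by` cannot compare `nil` with a `Time`.

```ruby
# Hypothetical stand-in for @@title_map in the controller.
TITLE_MAP = { 'Simon Says' => 'Simon Says:' }

# feeds: array of hashes shaped like the cached ones,
#        { :title => String, :items => [ { :title, :published, :link } ] },
#        or nil for a feed that has not been cached yet.
# Returns an array of { :feed_title, :feed_item } hashes, newest first.
def aggregate(feeds, title_map)
  entries = feeds.compact.flat_map do |feed|
    label = title_map[feed[:title]] || feed[:title]
    feed[:items].map { |item| { :feed_title => label, :feed_item => item } }
  end
  # Assumption: undated items are dropped rather than crashing the sort.
  dated = entries.reject { |entry| entry[:feed_item][:published].nil? }
  dated.sort_by { |entry| entry[:feed_item][:published] }.reverse
end

# Made-up sample data for illustration only.
sample_feeds = [
  { :title => 'Simon Says',
    :items => [
      { :title => 'Old post', :published => Time.utc(2008, 12, 1), :link => 'http://example.com/1' },
      { :title => 'Undated',  :published => nil,                   :link => 'http://example.com/2' }
    ] },
  nil, # a feed that is listed but not cached yet
  { :title => 'Semacode',
    :items => [
      { :title => 'New post', :published => Time.utc(2008, 12, 6), :link => 'http://example.com/3' }
    ] }
]

aggregate(sample_feeds, TITLE_MAP).each do |entry|
  puts "#{entry[:feed_title]} #{entry[:feed_item][:title]}"
end
```

Whether to skip or to fail loudly on a missing feed is a judgment call; skipping keeps the front page up while one feed is broken.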
It's actually pretty simple, but it took me a while to get the balance just right. What you need to do is set up a cron job or other repeating task that does an HTTP load on http://mywebsite.com/?recache=yes&secret=XXXXXXXX at a regular interval. You can use wget, curl, or anything else that makes an HTTP request. You might want to recache every minute, every five minutes, or every hour. Since it's done as part of the controller, there's no need to run BackgrounDRb, RubyCron, or any of the other machinery listed at HowToRunBackgroundJobsInRails. Yay!
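If you would rather have cron run a small Ruby script than wget or curl, something like the following sketch should do. The method names `recache_uri` and `recache!`, the script name and the 15-minute schedule in the comment are my own assumptions; only the `recache=yes` and `secret` parameters come from the controller above.

```ruby
require 'net/http'
require 'uri'

# Build the recache URL for a given site root and secret.
# The secret is form-encoded so odd characters survive the query string.
def recache_uri(base_url, secret)
  uri = URI.parse(base_url)
  uri.query = URI.encode_www_form('recache' => 'yes', 'secret' => secret)
  uri
end

# Perform the GET and report whether the server answered with a 2xx status.
def recache!(base_url, secret)
  response = Net::HTTP.get_response(recache_uri(base_url, secret))
  response.is_a?(Net::HTTPSuccess)
end

# Only touch the network when run as a script with both arguments, e.g. from
# a (hypothetical) crontab entry:
#   */15 * * * * ruby /path/to/recache.rb http://mywebsite.com/ XXXXXXXX
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  ok = recache!(ARGV[0], ARGV[1])
  warn 'Recache request failed' unless ok
  exit(ok ? 0 : 1)
end
```

A non-zero exit status lets cron mail you when a recache fails, assuming your cron is set up to send mail on errors.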
Here's the view:
<div id="feed-stream">
  <% cache do %>
    <%
    lastday = -1
    @aggregate.each do |item| %>
      <div class="item">
        <%
        mydate = item[:feed_item][:published].getlocal
        if mydate.yday != lastday
          %><div class="item_details"><p style="text-align:right"><%= mydate.strftime('%A, %B %e') %></p></div><%
          lastday = mydate.yday
        end
        %>
        <div class="item_content">
          <%= item[:feed_title] %>
          <a href="<%= item[:feed_item][:link] %>"><%= item[:feed_item][:title] %></a>
        </div>
      </div>
    <% end %>
  <% end %>
</div>
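The only real logic in the view is the day header: a date line is printed whenever the day changes from one item to the next. Here is that idea as a plain Ruby sketch you can run without Rails. The `with_day_headers` method and the sample times are hypothetical. One deliberate difference from the view: it compares year and day-of-year together, on the assumption that two adjacent items could fall on the same day-of-year in different years, which `yday` alone would treat as the same day. It also skips the `getlocal` call so the output does not depend on the machine's time zone.

```ruby
# Turn a newest-first list of Time objects into rows for rendering:
# [:header, "Sunday, December  7"] whenever the day changes,
# followed by [:item, time] for each entry.
def with_day_headers(times)
  last_day = nil
  times.flat_map do |time|
    day = [time.year, time.yday] # year included so different years never collide
    rows = []
    if day != last_day
      rows << [:header, time.strftime('%A, %B %e')] # same format string as the view
      last_day = day
    end
    rows << [:item, time]
    rows
  end
end

# Made-up sample times, already sorted newest first.
sample_times = [
  Time.utc(2008, 12, 7, 3, 17),
  Time.utc(2008, 12, 7, 1, 0),
  Time.utc(2008, 12, 6, 22, 0)
]

with_day_headers(sample_times).each do |kind, value|
  puts(kind == :header ? "== #{value} ==" : "   #{value}")
end
```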
My cache is all Hashes. I don't cache the FeedTools object because I discovered that even after FeedTools has parsed your feed, accessing the supposedly "final" data is incredibly slow (like maybe 10x or 100x slower than a hash).
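To make the "all Hashes" point concrete, here is a small sketch of copying a parsed feed into a plain hash and round-tripping it through YAML, which, as far as I know, is what the Rails `serialize` macro uses by default. `ParsedFeed` and `ParsedItem` are hypothetical Structs standing in for the FeedTools objects (only the `title`, `items`, `published` and `link` readers used in the controller are assumed), and `to_plain_hash` and `load_plain` are illustrative names, not part of the app.

```ruby
require 'yaml'

# Hypothetical stand-ins for the objects FeedTools hands back.
ParsedItem = Struct.new(:title, :published, :link)
ParsedFeed = Struct.new(:title, :items)

# Copy only the fields the front page needs into plain hashes,
# mirroring the shape built in cache_feeds.
def to_plain_hash(uri, feed)
  { :uri => uri, :title => feed.title,
    :items => feed.items.map { |item|
      { :title => item.title, :published => item.published, :link => item.link }
    } }
end

# Load the YAML back, allowing only the two non-basic classes the hash contains.
def load_plain(yaml)
  YAML.safe_load(yaml, permitted_classes: [Symbol, Time])
end

feed = ParsedFeed.new('Simon Says',
                      [ParsedItem.new('Hello', Time.utc(2008, 12, 7, 3, 17), 'http://example.com/1')])
plain = to_plain_hash('http://example.com/rss', feed)
restored = load_plain(YAML.dump(plain))
puts restored[:items].first[:title]
```

After the round trip everything is a Hash, Array, String, Symbol or Time, so reading the cache never touches the feed parser.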
Here's the model:
require 'feed_tools'
class CachedFeed < ActiveRecord::Base
  validates_presence_of :uri, :parsed_feed
  validates_uniqueness_of :uri
  serialize :parsed_feed, Hash # note that if this exceeds a certain KB size, it will likely fail (thinking it's a String)
end
And the migration:
class CreateCachedFeeds < ActiveRecord::Migration
  def self.up
    create_table :cached_feeds do |t|
      t.column :uri, :string, :limit => 2048
      t.column :parsed_feed, :text, :limit => 128.kilobytes # use for serialized object
      t.timestamps
    end
  end
  def self.down
    drop_table :cached_feeds
  end
end
Well, that's all you need. When I started out to make this I thought I'd find a simple example out there, but there wasn't one. It turns out there are a number of interesting challenges: picking a parser that copes with difficult feeds and malformed XML, caching, and background processing. It took me a while to get it all just right.
It powers my own front page. Consider it to be under the standard Ruby open source license. As the vending machine says: Share And Enjoy!
Comments
There are 9 comments on this post.
This was great, and exactly what I was looking for as an introduction to FeedTools. Thanks!
Why didn't you use the feed tools cache, rather than create your own?
@Derrick: I feel certain there was something I didn't like about it. Let me have a look... Just looking quickly at what people have written about it, I suspect that I thought it looked a little more complex than what I wanted, or possibly I already had a cache implemented. Does it matter?
Basically, the problem is that (at least in the recent past) rails did not do well with threads, so the only time I could check to see if a feed was out of date would be while the user was waiting to get their response. That's very non-optimal when dealing with many feeds and/or feeds that are slow to respond. So, my solution was not to check, and take the re-caching offline with a cron-job.
Maybe with the new improved threads support things have changed but I haven't checked that out.
hii
i have installed your application but no feeds are coming in the content, why..
please help its urgent
thanks
rahul
Rahul can you be more specific? Are you sure the feeds are valid?
Rahul in your email you said "MissingSourceFile (no such file to load -- uuidtools)"
I would try:
sudo gem install uuidtools
hii
thanks for reply
i have run your app and want to integrate it in my app, but the listing of feeds is not coming up..
in my controller ...
@aggregate = read_cache unless read_fragment({})
render :file => "#{Rails.root}/app/views/public/feed.html.erb"
and declare the function below this
.......read_cache..
and your index.html.erb in my feed.html.erb...
want to solve this problem..
thanks
rahul
Tough to diagnose without seeing your full source. The best I can say is, install it as written and get it to work that way, then transfer it into your app without any changes. If that works, then alter it inside your app until it works the way you want.