MapReduce for Ruby: Ridiculously Easy Distributed Programming

Google's MapReduce technique is now available for Ruby through Starfish (install it with gem install starfish). MapReduce is what Google uses to run monstrous distributed computations over 30-terabyte files.

Here is the basic code that will get you up and running with MapReduce in Starfish.

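    # item.rb
    # Assumes the activerecord gem is installed; the requires are
    # harmless if Starfish has already loaded ActiveRecord.
    require "rubygems"
    require "active_record"

    # Connect to the database that holds the items table.
    ActiveRecord::Base.establish_connection(
      :adapter  => "mysql",
      :host     => "localhost",
      :username => "root",
      :password => "",
      :database => "some_database"
    )

    class Item < ActiveRecord::Base; end

    # Server side: tell Starfish which model supplies the collection.
    server do |map_reduce|
      map_reduce.type = Item
    end

    # Client side: this block runs once for each item a client receives.
    client do |item|
      logger.info item.id
    end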

Now just run:

    starfish item.rb

and Starfish takes care of the rest. The code above does the following:

  • The server grabs all the items via: Item.find(:all)
  • Each client grabs an item from the collection
  • When there are no more items to be grabbed, everything shuts down
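
To make the three steps above concrete, here is a minimal single-process sketch of the same pattern: a shared collection, several clients pulling from it, and a shutdown once it is empty. This is not how Starfish is implemented. It assumes plain threads and Ruby's standard Queue in place of networked clients, and the Item struct and the distribute helper are hypothetical names made up for this illustration.

```ruby
require "thread"

# Hypothetical stand-in for the ActiveRecord model (assumption: only id matters).
Item = Struct.new(:id)

# Hands every item to one of client_count worker threads and collects
# whatever the block returns for each item. Threads stand in for the
# separate client processes; this is an illustration, not Starfish's code.
def distribute(items, client_count, &block)
  # "Server": load the whole collection into a thread-safe queue.
  queue = Queue.new
  items.each { |item| queue << item }

  results = Queue.new

  # "Clients": each one keeps grabbing items until none are left.
  workers = Array.new(client_count) do
    Thread.new do
      loop do
        item = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break # nothing left to grab, so this client shuts down
        end
        results << block.call(item)
      end
    end
  end

  # Everything shuts down once every client has stopped.
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

items = (1..10).map { |id| Item.new(id) }
processed = distribute(items, 3) { |item| item.id }
puts processed.sort.inspect
```

Each item is handed to exactly one client, but the order in which clients finish is not deterministic, which is why the example sorts the ids before printing them.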

Just add REST (it comes by default with Edge Rails) and you'll have your own S3 for free ;)