Ruby DSL – Fetching a Table

by Iain on February 3rd, 2010

Why Ruby Rocks

Ruby is a great programming language for writing DSLs (domain-specific languages). It allows core classes to be modified and, notably, lets blocks of code be passed into methods. This, along with a loose and intuitive syntax, makes it easier to write a clean and easy-to-use interface for a class.
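
To make the point about blocks concrete, here is a minimal sketch of the pattern. The names Recipe, step and build are made up for illustration and are not part of the table parser; the only real mechanism shown is instance_eval, which runs a block with the receiver as self.

```ruby
# A hypothetical mini-DSL; Recipe, step and build are illustrative names only.
class Recipe
  attr_reader :steps

  def initialize
    @steps = []
  end

  # Called from inside the block; self is the Recipe because of instance_eval.
  def step(text)
    @steps << text
  end

  # Build a recipe by evaluating the block in the context of a new instance.
  def self.build(&block)
    recipe = new
    recipe.instance_eval(&block)
    recipe
  end
end

recipe = Recipe.build do
  step 'boil water'
  step 'add tea'
end
```

Because the block is evaluated against the object, the caller can write bare method calls such as step without naming a receiver, which is what gives a DSL its declarative feel.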

The problem with Tables

Tables are often used to show large amounts of information in HTML, but they aren’t easily searchable, combinable, or queryable. Tables in a database are all of the above. If a table online could easily be converted to a database table, it would be easier to analyze and to track over time. Copying the table by hand is impractical, but tables can be parsed by a number of libraries.

Why Nokogiri Rocks

Nokogiri is a Ruby library that parses XML and HTML quite well.
It supports both XPath expressions and CSS selectors. For this implementation of a table parser, I chose XPath. It is more expressive than CSS selectors and allows more complex queries, such as selecting a row by the cells it contains, which makes dealing with unmarked and messy HTML easier. If you don’t know XPath, a good walk-through can be found at W3Schools.

So, gimme the code

Here’s the library code for the parser:

#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'time'
require 'pp'
 
class TableParser
  attr_accessor :rows
  def initialize(doc)
    @doc = doc
    @cond = []
  end
  def go(&block)
    self.instance_eval(&block)
    @output = get_rows
    run_after_hook
    @output
  end
  def to_s
    @output
  end
  def save!(*args)
    @db = has_db(*args)
  end
  def run_after_hook
    @db.insert(@output) if @db
  end
  def has_db(db, collection) 
    require 'mongo'
    Mongo::Connection.new.db(db)[collection]
  end
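  def get_rows
    @rows.collect do |row|
      row_levels = {}
      skip = false
      @cond.each do |name, xpath, block|
        column = nil
        # Read the cell text, unless the column has no xpath and is
        # produced entirely by its converter.
        unless (xpath.nil? || xpath.empty?) && block
          column = row.at_xpath(xpath + '/text()').to_s.strip
          skip = true if column.empty?
        end
        column = convert_column(block, column) if block
        row_levels[name] = column
      end
      # Drop rows in which any fetched cell was empty.
      next if skip
      row_levels
    end.compact # compact! would return nil when no row was dropped
  end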
  def convert_column(block, column)
    if block.is_a? String
      # Evaluate the string as code with the column value as self.
      column.instance_eval(block)
    elsif block.is_a? Proc
      block.call column
    else
      case block
      when :int
        column.gsub!(/[^0-9]/, '')
        column.to_i
      when :float
        column.gsub!(/[^0-9\.]/, '')
        column.to_f
      else
        column
      end
    end
  end
  def using_table(xpath)
    @rows = @doc.xpath(xpath)
  end
  def fetch(name, location, block = nil)
    @cond.push [name, location, block]
  end
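  def reject(xpath, no = true)
    # With no = true, drop rows that do NOT match xpath;
    # with no = false, drop rows that do. Call after using_table.
    @rows = @rows.reject { |tr| no ? !tr.at(xpath) : !!tr.at(xpath) }
  end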
end
 
class TableFetcher
  attr_accessor :table, :doc
  def get_page(uri)
    @table = []
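    # On Ruby 2.5 and later, open-uri is used through URI.open;
    # Kernel#open no longer fetches URLs on Ruby 3.
    page = URI.open(uri)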
    @doc = Nokogiri::HTML(page)
  end
  def get_table(&block)
    table_parser = TableParser.new(@doc)
    @table.push table_parser
    table_parser.go(&block)
  end
end
 
if __FILE__ == $0
  # Guard against a missing argument as well as a missing file.
  if ARGV[0] && File.exist?(ARGV[0])
    load ARGV[0]
  else
    puts 'Usage: Use another file to specify rules.'
    puts 'You can use an argument to include a file.'
  end
end

The library takes your instructions and provides an interface to Nokogiri that is suited to HTML tables and MongoDB.

Sample Usage:

Say I needed to capture an HTML table into a MongoDB database for quick searching. I could use the following code:

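# Assumes get-table.rb sits next to this file; Ruby 1.9 and later
# no longer search the current directory on a plain require.
require_relative 'get-table'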
 
fetcher = TableFetcher.new
fetcher.get_page('http://www.science.co.il/PTelements.asp')
elements = fetcher.get_table do
  fetch :number, 'td[1]', :int
  fetch :weight, 'td[3]', :float
  fetch :name, 'td[4]'
  fetch :symbol, 'td[5]'
  fetch :electron_configuration, 'td[12]'
  fetch :ionization_energy, 'td[13]', :float
  using_table '//table[@class="tabint8"]/tr[td[13]]'
  save! 'chemistry', 'periodic_elements'
end
pp elements

What does this do?

The first line requires the aforementioned get-table.rb library.
The next two lines create a fetcher and grab the web page, and the get_table call runs the actual parser.
The lines inside the get_table do block specify the table rows in XPath, name each column, and provide types for conversion.

There is some hidden power in the third argument of fetch. If you pass the symbol :int or :float, the value is converted to that type in the returned hash and in the database document. If it’s a string, it is evaluated in the context of the column value. Therefore, passing 'chomp' results in column.chomp, which removes a trailing newline. (The parser already strips leading and trailing whitespace from every cell.) If you would rather use a block, pass a lambda { |column| Time.parse(column) } or a Proc.new { |col| col.split(',') } as the third argument to format that column.
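
The rules above can be tried in isolation. The sketch below is a standalone approximation of the library's conversion step, not the library code itself; the method name convert and the sample values are assumptions made for this example.

```ruby
require 'time'

# Hypothetical standalone helper that mirrors the conversion rules described above.
def convert(converter, column)
  case converter
  when String then column.instance_eval(converter) # run the string with column as self
  when Proc   then converter.call(column)          # lambdas and procs get the raw text
  when :int   then column.gsub(/[^0-9]/, '').to_i  # keep digits only
  when :float then column.gsub(/[^0-9.]/, '').to_f # keep digits and the decimal point
  else column                                      # unknown converter: leave as-is
  end
end

as_int    = convert(:int, '1,008')
as_float  = convert(:float, '$15.99')
as_string = convert('upcase', 'he')
as_list   = convert(Proc.new { |col| col.split(',') }, 'a,b')
as_time   = convert(lambda { |column| Time.parse(column) }, '2010-02-03')
```

Inside a get_table block, the same converters would go in the third position, for example fetch :symbol, 'td[5]', 'upcase'.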

And that’s all! I’ll be posting a more detailed overview of how to write a DSL in Ruby at a later date. I hope you liked it.
