Ruby DSL – Fetching a Table
Why Ruby Rocks
Ruby is a great programming language for writing DSLs (domain-specific languages). It lets you reopen and modify core classes and, notably, lets you pass blocks of code into methods. This, along with a loose and intuitive syntax, makes it easier to write a clean and easy-to-use interface for a class.
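To make the block-passing idea concrete, here is a minimal sketch of the pattern. The RuleSet class and its define method are hypothetical names made up for this illustration; they are not part of the parser library shown later, although its go method relies on the same instance_eval trick.

```ruby
# Hypothetical RuleSet class, for illustration only.
class RuleSet
  attr_reader :rules

  def initialize
    @rules = []
  end

  # Each DSL call simply records its arguments for later use.
  def fetch(name, location)
    @rules << [name, location]
  end

  # instance_eval runs the block with self set to this object,
  # so a bare `fetch` inside the block calls RuleSet#fetch.
  def define(&block)
    instance_eval(&block)
    self
  end
end

rule_set = RuleSet.new.define do
  fetch :name, 'td[4]'
  fetch :symbol, 'td[5]'
end

p rule_set.rules
```

Because the block is evaluated against the object, the caller never has to write rule_set.fetch, and that is what makes the interface read like a small language of its own.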
The Problem with Tables
Tables are often used to show large amounts of information in HTML, but they aren’t easily searchable, combinable, or queryable. Tables in a database are all of the above. If a table online could easily be converted to a database table, it would be easier to analyze and to track over time. Copying the table by hand is impractical, but tables can be parsed by a number of libraries.
Why Nokogiri Rocks
Nokogiri is a Ruby library that parses XML and HTML quite well.
It supports both XPath expressions and CSS selectors. For this implementation of a table parser, I chose to use XPath. It is more expressive than CSS selectors (for example, it can select a row based on the cells it contains), which makes dealing with unmarked and messy HTML easier. If you don’t know XPath, a good walk-through can be found at W3Schools.
So, gimme the code
Here’s the library code for the parser:
    #!/usr/bin/ruby
    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'
    require 'time'
    require 'pp'

    class TableParser
      attr_accessor :rows

      def initialize(doc)
        @doc = doc
        @cond = []
      end

      # Evaluate the DSL block against this parser, then extract the rows.
      def go(&block)
        instance_eval(&block)
        @output = get_rows
        run_after_hook
        @output
      end

      def to_s
        @output.to_s
      end

      def save!(*args)
        @db = has_db(*args)
      end

      def run_after_hook
        @db.insert(@output) if @db
      end

      # Uses the Mongo::Connection API of the 1.x mongo gem.
      def has_db(db, collection)
        require 'mongo'
        Mongo::Connection.new.db(db)[collection]
      end

      # Build one hash per row; rows with an empty fetched cell are skipped.
      def get_rows
        @rows.collect do |row|
          row_levels = {}
          skip = false
          @cond.each do |name, xpath, block|
            column = nil
            unless (xpath.nil? || xpath.empty?) && block
              column = row.at_xpath(xpath + '/text()').to_s.strip
              skip = true if column.empty?
            end
            column = convert_column(block, column) if block
            row_levels[name] = column
          end
          next if skip
          row_levels
        end.compact # compact! would return nil when no row was skipped
      end

      # block may be a String (evaluated on the column), a Proc, :int or :float.
      def convert_column(block, column)
        case block
        when String then column.instance_eval(block)
        when Proc   then block.call(column)
        when :int   then column.gsub(/[^0-9]/, '').to_i
        when :float then column.gsub(/[^0-9.]/, '').to_f
        else column
        end
      end

      def using_table(xpath)
        @rows = @doc.xpath(xpath)
      end

      def fetch(name, location, block = nil)
        @cond.push [name, location, block]
      end

      # With no = true (the default), rows that do not match xpath are dropped;
      # with no = false, rows that match are dropped.
      def reject(xpath, no = true)
        @rows = @rows.reject { |tr| no ? !tr.at(xpath) : !!tr.at(xpath) }
      end
    end

    class TableFetcher
      attr_accessor :table, :doc

      def get_page(uri)
        @table = []
        page = URI.open(uri) # Kernel#open no longer opens URLs in Ruby 3
        @doc = Nokogiri::HTML(page)
      end

      def get_table(&block)
        table_parser = TableParser.new(@doc)
        @table.push table_parser
        table_parser.go(&block)
      end
    end

    if __FILE__ == $0
      if ARGV[0] && File.exist?(ARGV[0])
        load ARGV[0]
      else
        puts 'Usage: Use another file to specify rules.'
        puts 'You can use an argument to include a file.'
      end
    end
It’s the library that takes your instructions and provides an interface to Nokogiri suited to MongoDB and HTML tables.
Sample Usage:
Say I needed to capture an HTML table into a MongoDB database for quick searching. I could use the following code:
    require_relative 'get-table'

    fetcher = TableFetcher.new
    fetcher.get_page('http://www.science.co.il/PTelements.asp')

    elements = fetcher.get_table do
      # Column name, cell location relative to the row, optional conversion.
      fetch :number, 'td[1]', :int
      fetch :weight, 'td[3]', :float
      fetch :name, 'td[4]'
      fetch :symbol, 'td[5]'
      fetch :electron_configuration, 'td[12]'
      fetch :ionization_energy, 'td[13]', :float

      # Only rows that have a 13th cell are treated as data rows.
      using_table '//table[@class="tabint8"]/tr[td[13]]'

      # Database name, then collection name.
      save! 'chemistry', 'periodic_elements'
    end

    pp elements
What does this do?
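The first line requires the aforementioned get-table.rb library.
The second line creates a TableFetcher, the third fetches the web page, and the remaining lines call the actual parser.
The lines inside the get_table do block name the columns to fetch, give each cell’s location in XPath, provide types for conversion, select the table rows with using_table, and choose where to save the result.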
There is some hidden power in the third argument. If you pass the symbol :float or :int, the column is converted to that numeric type in the database row and in your returned hash. If it’s a string, that string is evaluated in the context of the column value. Therefore, passing 'chomp' results in column.chomp, which removes a trailing line break (leading and trailing whitespace is already stripped by the parser). If you want to use a block instead, pass a lambda { |column| Time.parse(column) } or a Proc.new { |col| col.split(',') } as the third argument to format that column.
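The three kinds of converter can be tried out without fetching a page. The sketch below uses a hypothetical stand-alone convert helper that mirrors what the library's convert_column does; the helper name and the sample values are made up for illustration.

```ruby
require 'time'

# Hypothetical helper mirroring TableParser#convert_column.
def convert(value, converter = nil)
  case converter
  when :int   then value.gsub(/[^0-9]/, '').to_i    # keep digits only
  when :float then value.gsub(/[^0-9.]/, '').to_f   # keep digits and dots
  when String then value.instance_eval(converter)   # e.g. 'capitalize'
  when Proc   then converter.call(value)            # lambda or Proc.new
  else value                                        # no conversion requested
  end
end

p convert('1,008 u', :int)
p convert('12.011 u', :float)
p convert('helium', 'capitalize')
p convert('2, 8, 1', lambda { |col| col.split(', ') })
p convert('2010-03-01', Proc.new { |col| Time.parse(col).year })
```

Note that the symbol converters simply throw away every character that is not a digit (or a dot, for :float), so a minus sign or an exponent would be lost. For cells like that, a lambda is the safer choice.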
And that’s all! I’ll be posting a more detailed overview of how to write a DSL in Ruby at a later date. I hope you liked it.