There is a screencast that accompanies this article. If you are interested in the process behind many of the thoughts and code in this article, watch the screencast. If you just want the facts, read the article.
Data scraping is the process of extracting data from output that was originally intended for humans. A web page is an example of output originally intended for humans in contrast to an API intended for use by other programs.
Nokogiri is a Ruby Gem that extracts data from web pages using CSS selectors. Additionally, it provides methods to help parse (make sense of) the results. The use of CSS selectors allows you to easily target the data you wish to extract from a URL.
Let’s look at an example:
    <meta charset="UTF-8" />
    <div id="price">$32.11</div>
    <div id="time">in 6 hours</div>
    <div id="stock">in stock</div>
In this page, there are three bits of interesting information: price, time, and inventory status. The CSS selectors for these are the ones we would use if we were trying to style their divs: ‘#price’, ‘#time’, and ‘#stock’. Let’s look at a sample Nokogiri script that will extract this information:
    # nokogiri is our scraping/parsing library
    # you will need to install it with "gem install nokogiri"
    require 'nokogiri'

    # open-uri is part of the standard library and allows you to
    # download a webpage
    require 'open-uri'

    # I am hosting interesting.html on a local server. This is the URL.
    url = "http://localhost:4567/interesting.html"

    # Here we load the URL into Nokogiri for parsing, downloading the page in
    # the process. On Ruby 3.0 and later, open-uri must be called as URI.open;
    # a bare open(url) no longer fetches URLs.
    data = Nokogiri::HTML(URI.open(url))

    # We can now target data in the page using css selectors. The at_css method
    # returns the first element that matches the selector.
    puts data.at_css("#price").text.strip

    # The text method returns the text from inside the element.
    puts data.at_css("#time").text.strip

    # The strip method is a standard ruby method for strings and removes
    # extraneous whitespace from the output
    puts data.at_css("#stock").text.strip
The above example is very simple, but hopefully it gets the point across about how you can target content for extraction using CSS selectors. Now, for a real-world example: the 9:30 Club is a music venue in DC. Let’s figure out how to scrape concert information from the concerts listing page. We will try to target the name of the headlining band, the date they are playing, the time of the show, the price of a ticket, and whether or not the show is sold out.
    <div class="concert_listing" id="event_107773"><!-- START CONCERT LISTING -->
      .....
      <h2 class="event">Balkan Beat Box</h2>
      .....
      <div class="buy">
        <a href="http://www.ticketfly.com/purchase/event/107773">Buy Tix</a>
        <div class="price">$22.00</div>
      </div>
      <div class="date">FRI 6/15</div>
      <div class="doors">8pm Doors</div>
      .............
Using the same technique as the previous example, we find the name of the band in ‘.event’, the date of the show in ‘.date’, the time of the show in ‘.doors’, and the price of the show in ‘.price’. If a show is sold out, the div with class ‘price’ is missing. This is good. However, our current method, at_css, is only going to return information for the first concert.
What we need is a way to target all concerts. Nokogiri provides such a method with .css(‘selector’). This method returns a Nokogiri::XML::NodeSet, an enumerable object that holds every element matching the provided selector. In the 9:30 Club markup, each concert has its own div with class ‘concert_listing’. We can now use css(‘.concert_listing’) combined with an each iterator to extract information about “each” concert.
    require 'nokogiri'
    require 'open-uri'

    url = "http://www.930.com/concerts/#/930/"
    # URI.open is how open-uri is called on current Ruby versions
    data = Nokogiri::HTML(URI.open(url))

    # Here is where we use the new method to create an object that holds all the
    # concert listings. Think of it as an array that we can loop through. It's
    # not an array, but it does respond very similarly.
    concerts = data.css('.concert_listing')

    concerts.each do |concert|
      # name of the show
      puts concert.at_css('.event').text

      # date of the show
      puts concert.at_css('.date').text

      # time of the show
      puts concert.at_css('.doors').text

      # show price or sold out
      # Remember, when a show is sold out, there is no div with the selector .price
      # What we are doing here is setting price to the result of that selector. We
      # then test whether it is nil, which lets us know if the show is SOLD OUT.
      price = concert.at_css('.price')
      if !price.nil?
        puts price.text
      else
        puts "SOLD OUT"
      end

      # blank line to make results prettier
      puts ""
    end
    hunter@i7:code_samples ruby concerts_930.rb
    Balkan Beat Box
    FRI 6/15
    8pm Doors
    $22.00

    Destroyer
    SAT 6/16
    8pm Doors
    $20.00

    Santigold
    MON 6/18
    7pm Doors
    SOLD OUT

    ...
Let’s take it a step further and turn our script into a full Sinatra application. The main thing we are going to do is separate our “business logic” from our “display logic”. Business logic will stay in our application as part of a “route”, and display logic will move into a view. This split isn’t entirely accurate, but it does capture the gist of what we are after. If we weren’t moving so fast, it would be nice (read: necessary) to add a layer with a custom class to separate the two.
    # The primary requirement of a Sinatra application is the sinatra gem.
    # If you haven't already, install the gem with 'gem install sinatra'
    require 'sinatra'
    require 'nokogiri'
    require 'open-uri'

    # sinatra allows us to respond to route requests with code. Here we are
    # responding to requests for the root document - the naked domain.
    get '/' do
      # the first two lines are lifted from our previous script
      url = "http://www.930.com/concerts/#/930/"
      # URI.open is how open-uri is called on current Ruby versions
      data = Nokogiri::HTML(URI.open(url))

      # this line has only been adjusted slightly, with the addition of an at
      # sign (@) before concerts. This creates an instance variable that can be
      # referenced in our display logic (view).
      @concerts = data.css('.concert_listing')

      # this tells sinatra to render the Embedded Ruby template /views/shows.erb
      erb :shows
    end
For the view, I added a bit of HTML and linked to a hosted Bootstrap stylesheet. The rest of the code should look very familiar. The only new thing here is ERB syntax, which allows us to evaluate Ruby inside our HTML document.
The two basic tags are <%= %> and <% %>. The first one, with the equal sign, evaluates its contents and renders the return value into the HTML. The second evaluates its contents without rendering anything, which makes it the right tag for control flow such as loops and conditionals.
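To see the two tags in isolation, here is a minimal sketch that uses the erb library from Ruby's standard library directly, outside of Sinatra. The template and the shows variable are made up for the demonstration.

```ruby
require 'erb'

# A tiny template using both tags. <% %> runs the loop, <%= %> prints a value.
template = <<~TEMPLATE
  <ul>
  <% shows.each do |show| %>
    <li><%= show %></li>
  <% end %>
  </ul>
TEMPLATE

# Hypothetical data; in the Sinatra app this role is played by @concerts.
shows = ['Balkan Beat Box', 'Destroyer']

# result(binding) evaluates the template with access to our local variables.
html = ERB.new(template).result(binding)
puts html
```

The loop tags leave no trace in the output apart from whitespace, while each <%= %> tag is replaced by the value it evaluates to. The view for our application uses both in the same way.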
    <!DOCTYPE HTML>
    <html lang="en-US">
    <head>
      <meta charset="UTF-8">
      <title>9:30 Show</title>
      <link rel="stylesheet" href="http://current.bootstrapcdn.com/bootstrap-v204/css/bootstrap-combined.min.css">
    </head>
    <body>
      <div class="span8">
        <!-- This table layout is pulled directly from twitter bootstrap -->
        <table class="table table-striped">
          <thead>
            <tr>
              <th>Date</th>
              <th>Event</th>
              <th>Time</th>
              <th>Price</th>
            </tr>
          </thead>
          <tbody>
            <% @concerts.each do |concert| %>
              <tr>
                <% price = concert.at_css('.price') %>
                <% if !price.nil? %>
                  <td><%= concert.at_css('.date').text %></td>
                  <td>
                    <!-- This next line targeting the :href is new. -->
                    <!-- The first part should seem familiar. We are targeting -->
                    <!-- the first anchor link inside an element with the class -->
                    <!-- 'buy'. The next bit, [:href], tells Nokogiri to extract -->
                    <!-- the href value from the anchor link. Our ERB tag then -->
                    <!-- outputs that value into the page. See if you can figure -->
                    <!-- out why we are extracting this link by reviewing 930.html -->
                    <a href="<%= concert.at_css('.buy a')[:href] %>">
                      <%= concert.at_css('.event').text %>
                    </a>
                  </td>
                  <td><%= concert.at_css('.doors').text %></td>
                  <td><%= price.text %></td>
                <% else %>
                  <td><%= concert.at_css('.date').text %></td>
                  <td><del><%= concert.at_css('.event').text %></del></td>
                  <td><%= concert.at_css('.doors').text %></td>
                  <td> SOLD OUT </td>
                <% end %>
              </tr>
            <% end %>
          </tbody>
        </table>
      </div>
    </body>
    </html>
Let’s take this a step further and deploy to Heroku. All we need are three new files and a bit of version control. We only have to create two of the three files ourselves, a Gemfile and a config.ru. Bundler will take care of the third.
    # The Gemfile tells bundler which gems our app is using.

    # Where the gems are from
    source 'https://rubygems.org'

    # Which gems are needed
    # You might note the omission of open-uri. This is because it is part of the
    # Ruby standard library. The remaining two are simply copied from the
    # require statements in app.rb
    gem 'sinatra'
    gem 'nokogiri'
    # tell Heroku what to load
    require './app'

    # tell Heroku what to do
    run Sinatra::Application
After creating the two files, run ‘bundle’ in the application folder. You will need to have Bundler installed.
    hunter@i7:code bundle
    Fetching gem metadata from http://rubygems.org/.......
    Using nokogiri (1.5.4)
    Using rack (1.4.1)
    Using rack-protection (1.2.0)
    Using tilt (1.3.3)
    Using sinatra (1.3.2)
    Using bundler (1.1.4)
    Your bundle is complete! Use `bundle show [gemname]` to see...
The bundle command creates our third file, Gemfile.lock. Next we will put our application under version control with git (‘git init’, ‘git add .’, and ‘git commit’), then create the Heroku app and push our code to it. You will need to have git installed.
    heroku create
    git push heroku master
Our site is live ( http://nine30.heroku.com ). If some of the steps seem “glossed over”, watch the screencast. I do everything in real time, solving many of the problems above for the first time. Still have something to say? Shoot me 140 characters @TheHunter.
Complete Source Available on Github