June 1, 2017
Scraping web with Ruby Tutorial

Scraping web with Ruby

Web scraping is one of the most useful skills a developer can have. Sometimes we want to use data that exists on someone else’s website to build our own application. For example, if we are building a movie rating website, we don’t want to update the movie listings by hand every time a new movie is released. Instead, we want a piece of code that runs in the background, visits well-known sites that already publish this information, and takes the data from them. This makes the application much cheaper to run, because work that a person would otherwise do manually is done by a scraper.

There are also better options than scraping. Most big companies, such as Facebook, Google, and Instagram, provide their data to users through an API. But not every website has an API, either because its owners lack the technical knowledge or because they don’t need one. Before you start scraping web pages, check whether the website has a developer section. If it does, the API is the right way to go. If not, scraping is what you should do. In this tutorial we are going to scrape a web page using Ruby, with the help of the nokogiri gem.



Nokogiri is distributed as a Ruby gem, so on most systems the following command works. If you need to install some dependencies first, you can refer to the official nokogiri installation documentation.


$ gem install nokogiri

Don’t get confused: nokogiri is a parser. But what is a parser? A parser is something that turns some kind of data, usually an HTML string, into another type of data structure. Nokogiri is one of the most popular Ruby gems. Some examples of where you can use nokogiri are

  • Collect search links from search engines
  • Collect prices from websites
  • Collect web data


CSS Selectors

You have probably written CSS before. Each CSS rule contains a selector that says which elements the rule applies to. You can use the same selectors to target HTML elements in a web page. Let’s take an example: the Stack Overflow questions page, which contains a list of questions. If we want to get the questions, our CSS selector expression should be

div.summary > h3 > a

This means we have a div tag with the class summary, directly under it an h3 tag, and inside that the question itself, which is a link in an anchor tag.


xPath Expressions

xPath expressions are more powerful than basic CSS selectors. Both CSS selectors and xPath expressions help to target an element, but xPath is the one I recommend. If you don’t already know xPath, I am going to guide you through the basics. Let’s convert the above CSS selector expression into an xPath expression and see what it looks like.

//div[@class='summary']/h3/a

Let’s break this expression down

// => start matching anywhere in the document

div => get the div tag

[@class='summary'] => where the class attribute is exactly summary

h3 => get the h3 tag that is a child of that div

a => get the a tag that is a child of the h3 tag


You can also use a single / to start searching, but then you need to spell out the full path from the root of the document, in the following way

/html/body/div/h3… and so on


First nokogiri program

You can parse a web page in two ways. One is to download the page first and then run the parser on it. The other is to give the link to the page we want to parse, store the fetched content in a variable, and run the parser on that. Let’s write a simple program to scrape Stack Overflow questions.

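require 'nokogiri'
require 'open-uri'

# Fetch the page over HTTP; put the URL of the page to scrape between the quotes.
# URI.open replaces the bare open call, which no longer accepts URLs in Ruby 3.
html_data = URI.open('').read

# Parse the HTML string into a nokogiri document.
nokogiri_object = Nokogiri::HTML(html_data)

# Select every question link and print its text.
my_elements = nokogiri_object.xpath("//div[@class='summary']/h3/a")
my_elements.each do |my_element|
  puts my_element.text
end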

Now run this and it prints the text of every matching link: all the questions listed on Stack Overflow at the moment the parser runs. The open-uri library, which ships with Ruby, is used when we don’t want to download the page first and want to run the parser directly on content fetched from the web.

If you want to use CSS selectors instead of xPath, you can use the following call

nokogiri_object.css("div.summary > h3 > a")


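Some other methods you can call on each element are shown below

my_elements.each do |my_element|
  puts my_element.parent            # the enclosing node
  puts my_element.children          # the nodes directly inside this one
  puts my_element.next_sibling      # the node that follows at the same level
  puts my_element.previous_sibling  # the node that precedes at the same level
end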




Now you have the power of xPath and nokogiri to scrape any website you want. You can always refer to the nokogiri documentation if anything is unclear.