
Scraping web with Ruby Tutorial


Scraping web with Ruby

Web scraping is one of the most important skills a good developer can have. Sometimes we want to use data that exists on someone else’s website to build our own application. For example, if we build a movie rating website, we do not want to update the movie listings by hand every time a new movie is released. We want a piece of code that runs in the background, visits other well-known sites that already publish this information, and takes the data from them. This makes the application much cheaper to run, because work a person would otherwise do manually is done by the scraper.

There are also better options than scrapers. Most big companies, such as Facebook, Google, and Instagram, provide data to users through an API. But sometimes the site we need has no API. Not every website has one, either because its owners lack the technical knowledge or because they don’t need it. Before you scrape a site, check whether it has a developer section. If it does, the API is the right way to go. If not, scraping is what you should do. In this tutorial we are going to scrape a web page using Ruby, with the help of the nokogiri gem.

 

Installation

Nokogiri is distributed as a Ruby gem, so on most systems the following command works. If you need to install extra dependencies first, refer to the official Nokogiri installation documentation.

 

$ gem install nokogiri

Don’t get confused: nokogiri is a parser. But what is a parser? A parser turns one kind of data, usually a string such as HTML, into another kind of data structure that a program can work with. Nokogiri is one of the most popular Ruby gems. Some examples of where you can use it:

  • Collect search links from search engines
  • Collect prices from websites
  • Collect web data

 

CSS Selectors

You have probably written CSS before. Each CSS rule starts with a selector that says which elements the rule applies to. You can use the same selectors to target HTML elements in a web page when scraping. As an example, go to Stackoverflow.com. Its home page contains a list of questions. If we want to get the questions, our CSS selector expression should be

div.summary > h3 > a

This means: find a div tag with the class summary, then an h3 tag directly under it, then the anchor tag directly under that, which holds the question as a link.

 

xPath Expressions

XPath expressions are more powerful than basic CSS selectors. Both help you target elements, but XPath can express conditions that CSS selectors cannot, such as matching on an element’s text or walking back up to a parent, which is why it is recommended here. If you don’t already know XPath, I am going to guide you through the basics. Let’s convert the CSS selector expression above into an XPath expression and see what it looks like.

//div[@class='summary']/h3/a

 

Let’s break this expression down

// => start matching anywhere in the document

div => get the div tag

[@class='summary'] => where the class attribute is exactly summary

h3 => get the h3 tag that is a child of the div

a => get the a tag that is a child of the h3 tag

 

You can also start the expression with a single /. In that case the path must begin at the root of the document and name every step on the way down, in the following way

/html/body/div/h3… and so on

 

First nokogiri program

You can parse a web page in two ways. One is to download the page first and then run the parser on the saved file. The other is to give the URL of the page we want to parse, store the response in a variable, and run the parser on that. Let’s write a simple program to scrape Stack Overflow questions.

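require 'nokogiri'
require 'open-uri'

# Download the page. URI.open comes from open-uri; since Ruby 3.0 a plain open no longer accepts URLs.
html_data = URI.open('https://stackoverflow.com/').read

# Parse the HTML string into a searchable document
nokogiri_object = Nokogiri::HTML(html_data)

# Select every question link and print its text
my_elements = nokogiri_object.xpath("//div[@class='summary']/h3/a")
my_elements.each do |my_element|
  puts my_element.text
end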

 

Now run this and you will see the following results. The open-uri library, which ships with Ruby, is used when we don’t want to save the page to disk first and prefer to feed it to the parser directly from the web.

Dot (‘.’) property accessor in PHP?

how can we rename _id field returned from groupby query in mongodb

HTTP Error 503. The service is unavailable while hosting VB.Net application

In YAML, how do I break a string over multiple lines?

Python-Continue script only after bat file is finished

How to get a persistent python session from command line?

Python error updating a SQL DB

Unable to select SSL certificate for Site binding in IIS programmatically using Microsoft.Web.Administration

Replacing/removing square brackets in a string

-Wcast-qual: cast discards ‘__attribute__((const))’ qualifier from pointer target type

Button not working on iOS,but does work on Android

Geolocation/visualization of people in certain municipalities (DK)

Spring ResponseEntity return custom Status Code

Why partition is needed in Shopify Sarama consumer to consume messages

Bluebird Promise : why returned value is not as expected?

Set Xcode archivePath other then project setting for OS X in Jenkins

Bound service gets stuck – Android

Adding scopes to parent model in Laravel

Angular2 datepicker input data-binding

Android gridView loading images using adapter and Glide

How can I build array dynamically for 21 fields per element?

fastest way to write spark dataframe to phoenix table

Pytest : Use Different XML Output File in Same TestSuite?

When docker run, an error occurs. “ValueError: Unable to configure handler ‘watchtower’: You must specify a region.”

How can I launch this Agora WebRTC sample node.js server?

Scrolling is not smooth as native app for ionic 2 in iOS platform.

Leaflet Error: Invalid LatLng object: (, NaN)

Get loop controller actuel counter value

Drupal 8: Custom Module Block default configuration not working

Creating an Umbraco widget that contains JavaScript

What is wrong with this statement?

Django migration fails using FileField with dynamic upload path

What is the difference between test and include in Webpack 2?

Extract SVO triples from preprocessed text

C# Way to enforce all classes have a method of a given name

Client certificate is not sent by .Net App using TLS1.2

sparklyr can’t see databases created in Hive and vice versa

Integrate Office 365 authorization API in Asp Net Mvc application

Label with button addTarget event not fire

Apache Spark – Parquet to Hive/Impala DDL

How to combine lists with respect to precedence

How can CodeMirror display an ‘javascript object’ string with object structure

V8 evaluate command truncates my strings

How to include mapping details in search result?


These are all the questions listed on the Stack Overflow home page at the moment I ran the parser.

If you want to use CSS selectors instead of XPath, you can call the css method with the selector we wrote earlier

nokogiri_object.css("div.summary > h3 > a")

 

Nokogiri also provides methods for moving around the document from a matched element. Some of them are given below

my_elements.each do |my_element|

 puts my_element.parent

 puts my_element.children

 puts my_element.next_sibling

 puts my_element.previous_sibling

end

 

Conclusion

Now you have the power of XPath and Nokogiri to scrape the websites you need. You can always refer to the Nokogiri documentation in case of any confusion.
