Scraping the web with Ruby
Web scraping is one of the most useful skills a developer can have. Sometimes we want to use data that lives on someone else’s website to build our own application. For example, if we are building a movie rating website, we do not want to update the movie listings by hand every time a new movie is released. Instead, we want a piece of code that runs in the background, visits well-known sites that already publish this information, and collects the data from them. This makes the application much cheaper to run, because the scraper does work that a person would otherwise have to do manually.
There are also better options than scraping. Most big companies, such as Facebook, Google, and Instagram, provide data to users through an API. But not every website has one, either because its owners lack the technical knowledge or because they don’t need it. Before you scrape a site, check whether it has a developer section. If it does, the API is the right way to go. If not, scraping is what you should do. In this tutorial we are going to scrape a web page using Ruby and the nokogiri gem.
Nokogiri is distributed as a Ruby gem, so on most systems the following command works. If you need to install additional dependencies, refer to the official Nokogiri installation documentation.
|$ gem install nokogiri|
Don’t get confused: nokogiri is a parser. But what is a parser? A parser turns some kind of data, usually an HTML string, into another type of data structure. Nokogiri is one of the most popular Ruby gems. Some examples of where you can use nokogiri are:
- Collect search links from search engines
- Collect prices from websites
- Collect web data
You have probably written CSS before. A CSS rule contains selectors that say which elements the rule applies to. You can use the same selectors to target HTML elements in a web page. Let’s take an example: go to Stackoverflow.com, which shows a list of questions. If we want to get the questions, our CSS selector expression should be
|div.summary > h3 > a|
This means we look for a div tag with the class summary, an h3 tag directly under it, and inside that an anchor (a) tag whose link text is the question.
XPath expressions are more powerful than basic CSS selectors. Both CSS selectors and XPath expressions help to target an element, but XPath is the recommended option here. If you don’t already know XPath, I am going to guide you through the basics. Let’s convert the above CSS selector expression into an XPath expression and see what it looks like.
|//div[@class='summary']/h3/a|
Let’s break this expression down:
|// => start searching anywhere in the document
div => get the div tag
[@class='summary'] => where the class attribute is summary
h3 => get the h3 tag that is a child of the div
a => get the a tag that is a child of the h3 tag|
You can also start the expression with a single /. In that case you need to specify the full path from the root of the document, in the following way:
|/html/body/div/h3… and so on|
First nokogiri program
You can parse a web page in two ways. One is to download the page and then run the parser on the saved file. The other is to give the link to the page we want to parse, store the fetched HTML in a variable, and run the parser on it. Let’s write a simple program to scrape Stack Overflow questions.
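require 'nokogiri'
require 'open-uri'
html_data = URI.open("https://stackoverflow.com/").read  # fetch the page HTML
nokogiri_object = Nokogiri::HTML(html_data)
my_elements = nokogiri_object.xpath("//div[@class='summary']/h3/a")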
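The open-uri library, which is part of Ruby’s standard library, is used when we don’t want to download the page first and want to run the parser directly on content fetched from the web. Print the text of each matched element and you will see results like the following.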
|Dot (‘.’) property accessor in PHP?
how can we rename _id field returned from groupby query in mongodb
HTTP Error 503. The service is unavailable while hosting VB.Net application
In YAML, how do I break a string over multiple lines?
Python-Continue script only after bat file is finished
How to get a persistent python session from command line?
Python error updating a SQL DB
Unable to select SSL certificate for Site binding in IIS programmatically using Microsoft.Web.Administration
Replacing/removing square brackets in a string
-Wcast-qual: cast discards ‘__attribute__((const))’ qualifier from pointer target type
Button not working on iOS,but does work on Android
Geolocation/visualization of people in certain municipalities (DK)
Spring ResponseEntity return custom Status Code
Why partition is needed in Shopify Sarama consumer to consume messages
Bluebird Promise : why returned value is not as expected?
Set Xcode archivePath other then project setting for OS X in Jenkins
Bound service gets stuck – Android
Adding scopes to parent model in Laravel
Angular2 datepicker input data-binding
Android gridView loading images using adapter and Glide
How can I build array dynamically for 21 fields per element?
fastest way to write spark dataframe to phoenix table
Pytest : Use Different XML Output File in Same TestSuite?
When docker run, an error occurs. “ValueError: Unable to configure handler ‘watchtower’: You must specify a region.”
How can I launch this Agora WebRTC sample node.js server?
Scrolling is not smooth as native app for ionic 2 in iOS platform.
Leaflet Error: Invalid LatLng object: (, NaN)
Get loop controller actuel counter value
Drupal 8: Custom Module Block default configuration not working
What is wrong with this statement?
Django migration fails using FileField with dynamic upload path
What is the difference between test and include in Webpack 2?
Extract SVO triples from preprocessed text
C# Way to enforce all classes have a method of a given name
Client certificate is not sent by .Net App using TLS1.2
sparklyr can’t see databases created in Hive and vice versa
Integrate Office 365 authorization API in Asp Net Mvc application
Label with button addTarget event not fire
Apache Spark – Parquet to Hive/Impala DDL
How to combine lists with respect to precedence
V8 evaluate command truncates my strings
How to include mapping details in search result?|
These are the questions that were listed on Stack Overflow at the time I ran the parser.
If you want to use CSS selectors instead of XPath, you can call the css method with the selector from earlier:
|nokogiri_object.css("div.summary > h3 > a")|
Once you have the matched elements, you can iterate over them and call other nokogiri methods on each one, such as text:
|my_elements.each do |my_element|
  puts my_element.text
end|
Now you have the power of XPath and Nokogiri to scrape any website you want. You can always refer to the Nokogiri documentation if anything is unclear.