Web Scraping with JS

Stock prices, product details, company information, sports stats: you name it, and it is published on a website somewhere.
If you wanted to access this information, you’d either have to use whatever format the website uses or copy-paste the information manually into a new document.
That process is tedious and time-consuming, and you still have to reformat the result to get what you want. This is where web scraping can help.
"Web scraping", or simply "scraping", is a term you will hear often in the web world, and there are a number of tools for doing it.
What is web scraping?
It is the process of extracting content and data from a website. The information is collected and then exported into a format that is more useful to the user, such as JSON, CSV, or an API.
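As a rough illustration of the export step, the sketch below turns a couple of scraped records into JSON and CSV text. The `records` array and the `toCsv` and `csvCell` helpers are made up for this example, and the code assumes flat objects that all share the same keys.

```javascript
// Hypothetical records, shaped like what a scraper might collect.
const records = [
  { name: 'Widget', price: 9.99 },
  { name: 'Gadget, large', price: 24.5 },
];

// Quote a value for CSV: wrap it in double quotes and double any inner quotes.
function csvCell(value) {
  return '"' + String(value).replace(/"/g, '""') + '"';
}

// Turn an array of flat objects into CSV text.
// The keys of the first record become the header row.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const columns = Object.keys(rows[0]);
  const header = columns.map(csvCell).join(',');
  const lines = rows.map((row) =>
    columns.map((col) => csvCell(row[col])).join(',')
  );
  return [header, ...lines].join('\n');
}

const asJson = JSON.stringify(records, null, 2);
const asCsv = toCsv(records);

console.log(asJson);
console.log(asCsv);
```

Quoting every cell keeps values that contain commas, such as "Gadget, large", in a single column.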
Although scraping can be done manually, that is tedious work, so automated tools are preferred: they cost less and give much faster access to the information.
In most cases, though, web scraping is not a simple task. Websites come in many shapes and forms, and as a result web scrapers vary in functionality and features.
There are a lot of tools for scraping. In this article, I will be discussing web scraping with JavaScript.
There are a bunch of libraries for scraping:
Puppeteer (Headless Chrome Browser for Automation)
Cheerio (Not a browser)
Osmosis (The Parser)
Apify SDK (The Complete Web Scraping Framework)
The most common and widely used is Puppeteer. Most scraping tutorials on the web use it, and when you are developing on your local machine you are unlikely to face many problems with it. Deployment is a different story.
If you are not familiar with concepts like cloud functions, serverless architecture, Docker, and virtual machines, or you would rather not pay for a virtual machine to host the project, you will run into problems. There is no way around learning those topics, but doing so pulls you away from your core goal, which is scraping.
Each tool has its own benefits and drawbacks, but this article focuses on Cheerio and Puppeteer and compares them with each other.
Cheerio
Fast, flexible & lean implementation of core jQuery designed specifically for the server.
Features
Blazingly fast
Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript, which single-page applications (SPAs) commonly depend on.
If your use case requires any of this functionality, you should consider browser automation software like Puppeteer.
Incredibly flexible
Cheerio wraps around the parse5 parser. Cheerio can parse nearly any HTML or XML document. Cheerio works in both browser and Node environments.
Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API. If you are familiar with jQuery, Cheerio will be easy to pick up: it has a similar syntax and is extremely fast.
Drawbacks
Limited scope: Cheerio only parses and manipulates markup. It does not provide all the features of a full-fledged web scraping library, such as support for JavaScript execution or handling of redirects and cookies.
No XPath: Cheerio only works with CSS selectors and not other types of selectors such as XPath.
Snippet
const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello world</h2>');
$('h2.title').text('Hello there!');
$('h2').addClass('welcome');
$.html();
//=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>
Puppeteer
Puppeteer is a Node.js library that offers a simple but efficient API that enables you to control Google’s Chrome or Chromium browser. It is developed and maintained by Google.
Features
Control headless Chrome or Chromium
Puppeteer can run Chrome or Chromium in headless mode (useful for running browsers on servers), sending and receiving requests without the need for a user interface.
Automate form submissions and UI testing
Puppeteer allows you to automate form submissions, which can be useful for testing or for automating repetitive tasks. It also makes it easy to run UI tests by interacting with web pages just like a user would, which is why it is widely used for UI testing.
Interact with web pages via the DevTools Protocol
Puppeteer uses the Chrome DevTools Protocol to interact with web pages, which allows you to perform a wide range of actions such as clicking elements, filling out forms, and manipulating the DOM. This makes it easy to automate browser tasks.
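The clicking and form-filling described above could look roughly like the sketch below. `submitSearch` is a hypothetical helper and both selectors are assumptions about the target page's markup. The Puppeteer calls it relies on are `page.type`, `page.click`, `page.waitForNavigation`, and `page.url`.

```javascript
// Fills in and submits a search form on an already-open Puppeteer page.
// The selectors are hypothetical; adjust them to the target page's markup.
async function submitSearch(page, query) {
  // Type the query into the search box, one key press at a time.
  await page.type('input[name="q"]', query);

  // Start waiting for the navigation before clicking,
  // so the navigation event is not missed.
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
  ]);

  // Return the URL of the page we landed on.
  return page.url();
}
```

The function takes a page created with `browser.newPage()` after `page.goto(...)` has loaded the form.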
Bypass CORS and capture network traffic
Puppeteer drives the browser from Node.js instead of running as a script inside another site's page, so it can reach content that CORS would block a browser-side script from reading. It also lets you capture network traffic and intercept the requests a page makes, which is useful for monitoring and debugging.
Fun fact: some YouTube downloaders use network interception to catch the requests made by the page and download the video.
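To make the traffic-capture idea concrete, here is a minimal sketch. `captureRequests` and `filterByExtension` are hypothetical helper names. The only Puppeteer calls assumed are `page.on('request', ...)`, `request.method()`, and `request.url()`.

```javascript
// Records the method and URL of every request a page issues.
// `page` is expected to be a Puppeteer Page; only page.on('request', ...) is used.
function captureRequests(page) {
  const captured = [];
  page.on('request', (request) => {
    captured.push({ method: request.method(), url: request.url() });
  });
  // The array keeps filling up as the page makes more requests.
  return captured;
}

// Keeps only the captured requests whose URL path ends with one of the
// given extensions, for example ['.mp4'] to look for video files.
function filterByExtension(captured, extensions) {
  return captured.filter(({ url }) => {
    const path = new URL(url).pathname.toLowerCase();
    return extensions.some((ext) => path.endsWith(ext));
  });
}
```

With a real page, call `captureRequests(page)` right after `browser.newPage()` and before `page.goto(...)`, so that no early request is missed.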
Drawbacks
The main con of Puppeteer as a JavaScript scraping library is that it requires more technical knowledge than other libraries.
Moreover, Puppeteer needs more expensive infrastructure because it has to launch a full browser, unlike hosted services such as ZenRows, which run the browser for you.
Snippet
const puppeteer = require('puppeteer');
(async () => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Load the page and save a screenshot of it
  await page.goto('https://testpage.com');
  await page.screenshot({ path: 'hello.png' });
  await browser.close();
})();
Testing
I tested both tools against the Codekavya website and recorded the response time of each.
Scraping the team members' names and roles from codekavya.com took approximately:
Cheerio: ~1 second
Puppeteer: ~3 seconds
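One rough way to reproduce this kind of measurement is sketched below. It is not necessarily how the numbers above were obtained. `timeTask` and `fakeScraper` are hypothetical names, and `fakeScraper` is only a stand-in for the real Cheerio or Puppeteer scraping function.

```javascript
const { performance } = require('perf_hooks');

// Runs an async task once and returns its result together with
// the elapsed wall-clock time in milliseconds.
async function timeTask(task) {
  const start = performance.now();
  const result = await task();
  return { result, elapsedMs: performance.now() - start };
}

// Stand-in for a scraper call (hypothetical): resolves after a short delay.
// Replace it with the function that actually scrapes the page.
const fakeScraper = () =>
  new Promise((resolve) => setTimeout(() => resolve(['Alice', 'Bob']), 50));

timeTask(fakeScraper).then(({ result, elapsedMs }) => {
  console.log(`scraped ${result.length} items in ${elapsedMs.toFixed(0)} ms`);
});
```

A single run is noisy, so averaging several runs of each scraper gives a fairer comparison.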

The source code and the test file are available in the repo (Project Repository). Give it a try yourself.
Conclusion
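Web scraping is set to grow as time progresses. As scraping applications multiply, demand for JavaScript scraping libraries will grow with them. Every library has its pros and cons, and the right choice depends on what you need: Cheerio is fast when the data is already in the HTML, while Puppeteer is the option when the page needs a real browser to render it.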





