How to scrape a website using JavaScript?

Web scraping is a technique for extracting valuable data from websites. The collected data has many potential uses: datasets for machine learning, monitoring continuous customer feedback, SEO monitoring, and lots of other use cases. This is what we cover in this blog, with a use case built in JavaScript.


What is Web Scraping❓

  • Web scraping, web harvesting, or web data extraction is data scraping used to extract data from websites.
  • Web scraping may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser (a short sketch follows this list).
  • While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.
  • It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
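As a toy illustration (using the axios and cheerio packages we will install later in this post), here is a minimal sketch of the idea: fetch a page over HTTP, then pull one piece of data out of the HTML:

const axios = require("axios");
const cheerio = require("cheerio");

// Fetch a page over HTTP, then extract data from its HTML.
axios("https://example.com").then((response) => {
  const $ = cheerio.load(response.data);
  console.log($("title").text()); // prints "Example Domain"
});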

Implementation⚒️

Let’s break the tasks into smaller pieces and then perform each of them:

  1. Setting up the environment
  2. Installing required packages
  3. Working on scraping sites
  4. Displaying the results

Setting up the environment

We can use the codedamn playground without any local installation, or use a local Node.js development environment if we already have one.

The following steps assume that you have Node.js installed; this is not necessary if you are using the codedamn playground, since it comes pre-installed. Open the terminal and create a new directory with any name of your choice using the command:

mkdir <folder-name>

Then move to the newly created folder using the command:

cd <folder-name>

Type this command to initialize a new Node.js project at the root:

npm init

Press Enter to accept all the fields, and you will see a package.json file generated.
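If you accept the defaults, the generated package.json looks something along these lines (the name follows your folder name, so yours may differ):

{
  "name": "web-scraping-js",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}

Now with that set, let's move on to installing the required packages.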

Installing required packages

The required packages are:

  • Express – Back-end framework for Node.js
  • Axios – Makes HTTP requests from Node.js
  • Cheerio – Implements a subset of jQuery for fast HTML parsing
  • CORS – Express middleware that enables cross-origin requests

We can install all these packages using a single command:

npm i express axios cheerio cors

After successful installation, we can see a small addition to the package.json file:

"dependencies": {
    "axios": "^1.1.3",
    "cheerio": "^1.0.0-rc.12",
    "cors": "^2.8.5",
    "express": "^4.18.2"
}

It lists all the installed packages as dependencies, so whenever others want to run the same project, they can simply type this command to get the script running in their environment:

npm install

There is a small addition that needs to be made to the package.json file for ease of running the code. Note that nodemon is not among the packages we installed above; install it as a dev dependency with npm i -D nodemon (or use "start": "node index.js" instead). Then add this command inside the scripts object:

"start": "nodemon index.js"Code language: JSON / JSON with Comments (json)

This lets us run the project and listen for changes with just one short command:

npm start

Now, in the root directory, we create a file named index.js, where the heart of the program lies. With everything set up, let's work on the code to scrape websites.

Working on scraping sites

Here we are going to scrape the codedamn courses page to get a list of all the courses with their respective URLs. This is how our final index.js file in the root will look:

const PORT = 8000;
const baseURL = "https://codedamn.com";
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");
const cors = require("cors");
const app = express();
app.use(cors());

const url = "https://codedamn.com/courses";

app.get("/", function (req, res) {
  res.json({ message: "Web Scraping Codedamn Courses" });
});

app.get("/courses", (req, res) => {
  axios(url)
    .then((response) => {
      const html = response.data;
      const $ = cheerio.load(html);
      const courses = [];

      // Every course card on the page carries the "relative" class.
      $(".relative", html).each(function () {
        const title = $(this).find("h2").text();
        const url = baseURL + $(this).find("a").attr("href");

        courses.push({
          title,
          url,
        });
      });
      res.json(courses);
    })
    .catch((err) => console.log(err));
});

app.listen(PORT, () => console.log(`Server running on PORT ${PORT}`));

Initially, we import all the installed packages and set the PORT and baseURL values. Then we create the Express app using this snippet:

const app = express();

app.get("/", function (req, res) {
  res.json({ message: "Web Scraping Codedamn Courses" });
});

app.listen(PORT, () => console.log(`Server running on PORT ${PORT}`));

After saving the file, run the npm start command in the terminal. If there are no errors, head over to http://localhost:8000/ and you will find this output:

[Image by author: output from the base URL]

If you get this, hurray! We can proceed further. If not, no worries; go through the steps again and you will catch the problem.

Then we create a new endpoint, /courses, which we will use to display the results fetched by scraping the website. So we create a variable named url holding the website that needs to be scraped. Now head over to the website, right-click on any of the course names, and click Inspect. We can then view the HTML and CSS related to that element. This is how it looks:

[Image by author: inspect view of the codedamn courses page]

Now we have to scrape that same content, Ultimate React.js Design Patterns, and all the other courses with their respective course links, for which we will make use of Axios and Cheerio.
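For reference, judging from the selectors used in the code, the markup we are targeting looks roughly like this (a simplified assumption with a hypothetical href; the real class names on codedamn may differ and can change over time):

<div class="relative ...">
  <a href="/learn/ultimate-reactjs-design-patterns">
    <h2>Ultimate React.js Design Patterns</h2>
  </a>
</div>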

Now we will create a new GET endpoint /courses using express:

app.get("/courses", (req, res) => {
  // add code for scraping
});

Inside this endpoint, we use Axios to request the URL and read the HTML from the response. Then, with the help of Cheerio, we look only for the specific class attribute that wraps the course title and the course link; in our case it is relative (note that this class name depends on the site's current markup and may change over time). For every element with the relative class, we gather the title and URL and store them in the courses array. Then we return the response as JSON. Any errors are logged to the console so that they can be rectified. The code is:

axios(url)
  .then((response) => {
    const html = response.data;
    const $ = cheerio.load(html);
    const courses = [];

    $(".relative", html).each(function () {
      const title = $(this).find("h2").text();
      const url = baseURL + $(this).find("a").attr("href");

      courses.push({
        title,
        url,
      });
    });
    res.json(courses);
  })
  .catch((err) => console.log(err));

Note: This chunk of code replaces the comment // add code for scraping.
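As a purely stylistic alternative (not part of the original code), the same handler can be written with async/await, which makes the error handling a little more explicit:

app.get("/courses", async (req, res) => {
  try {
    const response = await axios(url);
    const $ = cheerio.load(response.data);
    const courses = [];

    $(".relative").each(function () {
      courses.push({
        title: $(this).find("h2").text(),
        url: baseURL + $(this).find("a").attr("href"),
      });
    });

    res.json(courses);
  } catch (err) {
    // Surface the failure to the client instead of only logging it.
    console.log(err);
    res.status(500).json({ error: "Scraping failed" });
  }
});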

That's it! Now we can spin up the server using the command npm start, head to http://localhost:8000/courses, and see the response in the form of JSON. The output is shown here:

[Image by author: the /courses endpoint response in Postman]
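If you prefer the terminal over an API client, the same check works with curl while the server is running:

curl http://localhost:8000/courses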

Note: This output was tested using Postman, a platform for building, testing, and using APIs. Now, in the index.js file, below the initialization of the app, add this line so that we can use the fetched data on our front end:

app.use(cors());
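By default, cors() allows requests from any origin. If you want to lock this down, the cors package also accepts an options object; here is a sketch restricting it to a hypothetical front-end origin:

// Allow only our front-end origin instead of every origin.
app.use(cors({ origin: "http://localhost:3000" }));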

Displaying the results

We are going to display the scraped data on our front end without much styling. This is where the cors package comes into use. First, create an index.html file in the root. Then create a new folder inside the root and name it src; it contains all the code required for the front end. Inside the src folder, create two files named app.js and styles.css:

touch index.html
mkdir src
cd src
touch app.js styles.css

Let's add this markup to index.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>List of Courses Available in Codedamn</title>
    <link rel="stylesheet" href="src/styles.css">
</head>
<body>
    <div id="courses"></div>

    <script src="src/app.js"></script>
</body>
</html>

Move to app.js and add the following content:

const courseDisplay = document.querySelector("#courses");

fetch("http://localhost:8000/courses")
  .then((response) => response.json())
  .then((data) => {
    data.forEach((course) => {
      // Render each course as a linked heading.
      // Use a class (not an id), since this block repeats for every course.
      const title = `
        <div class="course">
          <h5><a href="${course.url}" target="_blank">${course.title}</a></h5>
        </div>
        <br>`;
      courseDisplay.insertAdjacentHTML("beforeend", title);
    });
  })
  .catch((err) => console.log(err));

Then add a minimal amount of styling to make it look cleaner:

#courses {
    margin: 20px 20px;
}

.course {
    border: 10px ridge rgba(68, 69, 53, 0.6);
}

If the server isn't running, run the start script again:

npm start

Head to http://localhost:3000/index.html (the port where the codedamn playground serves static files; locally, use any static file server) and you will see the output, with each course name linked to its respective URL as an anchor.

[Image by author: the rendered list of scraped courses]
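If you don't have a static file server handy, one option (a small sketch, not part of the original setup) is to let the same Express app serve the front end; adding this line to index.js makes the page available at http://localhost:8000/index.html:

// Serve index.html and the src folder from the project root.
app.use(express.static(__dirname));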

Conclusion

Web scraping has great potential, and it is permissible as long as it doesn't violate the organization's terms and conditions (when in doubt, check the site's robots.txt and terms of service). So make use of it wisely. Test it on different sites and URLs, and play with it.

If you get stuck on any of the steps, you can use the code, which is available on GitHub.

Fork the repository, clone it locally, head to the project folder, install the required dependencies and start the server:

git clone https://github.com/<YOUR-USERNAME>/web-scraping-js.git
cd web-scraping-js
npm i
npm start

Alternatively, head to this codedamn playground:

Then type just two commands to get the output:

npm i
npm start

If you are new to Node.js, the NPM Basics course from codedamn will be of great use.
