A Classy Spider in Scrapy is a Python class that defines how to crawl and extract data from a specific website. Writing Spiders as classes is the recommended approach in Scrapy because it keeps the crawling logic organized and easier to maintain.
To create a Classy Spider in Scrapy, you subclass scrapy.Spider and define the following attributes and methods:
- name: A string that identifies the Spider. It must be unique within a Scrapy project.
- start_urls: A list of URLs to start crawling from.
- allowed_domains: A list of domain names that the Spider is allowed to crawl.
- start_requests(): A method that generates the initial requests to start crawling.
- parse(): A method that is called for each response to the initial requests, and to subsequent requests made by following links found in a response.

Here's an example of a Classy Spider in Scrapy:
```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']
    allowed_domains = ['example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract data from the web page
        data = {
            'title': response.css('title::text').get(),
            'body': response.css('body::text').get(),
        }
        yield data

        # Follow links to other pages
        for link in response.css('a'):
            yield response.follow(link, callback=self.parse)
```
In this example, we define a Classy Spider named MySpider that starts with a single URL (http://example.com). We also specify the domain name that the Spider is allowed to crawl using the allowed_domains attribute.
The start_requests() method generates the initial requests to start crawling. In this example, we simply yield a scrapy.Request object for each URL in the start_urls list. This matches Scrapy's default behaviour, so you only need to override start_requests() when you want to customize the initial requests, for example to add headers or cookies.
The parse() method is called with the response for each web page the Spider visits. It extracts the title and body text from the page using CSS selectors, and yields the extracted data as a Python dictionary.
The Spider also follows any links found on the web page using the response.follow() method. This method takes a link (an <a> element, a relative URL, or an absolute URL), resolves it against the current page's URL, and schedules a request for the linked page, calling the given callback, here parse(), with its response.
You can run a Classy Spider in Scrapy using the scrapy crawl command in your terminal. Note that scrapy crawl looks the Spider up by its name attribute, not by filename, and must be run from inside a Scrapy project: save the code above in a file such as myspider.py inside your project's spiders directory. (For a standalone file outside a project, use scrapy runspider myspider.py instead.) Then run:
```
scrapy crawl myspider -o data.json
```
This command will run the Spider named myspider and save the extracted data as a JSON file named data.json. (In Scrapy 2.0 and later, -o appends to an existing file; use -O to overwrite it.)
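With the -o data.json feed export, Scrapy writes the items the Spider yielded as a JSON array of objects, so the file can be loaded back with the standard library. The sample content below is made up for illustration, not real crawl output:

```python
import json

# A file produced by `-o data.json` contains a JSON array of the
# yielded items, e.g. (illustrative sample):
sample = '[{"title": "Example Domain", "body": "Example body text"}]'

items = json.loads(sample)
print(items[0]['title'])  # Example Domain
```

In a real script you would read the file itself, e.g. json.load(open('data.json')), and iterate over the resulting list of dictionaries.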