Crawling refers to the process of systematically visiting web pages and extracting data from them. It is a fundamental concept in web scraping and typically involves the following steps: fetching a page, extracting the data you need from it, discovering links to further pages, and repeating the process for each discovered link.
In Scrapy, you can implement crawling behavior using Spiders. A Spider is a Python class that defines how to perform the crawling and extraction process for a specific website. Here's an example of a simple Spider in Scrapy:
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract data from the web page
        data = {
            'title': response.css('title::text').get(),
            'body': response.css('body::text').get(),
        }
        yield data

        # Follow links to other pages
        for link in response.css('a'):
            yield response.follow(link, callback=self.parse)
In this example, we define a Spider named MySpider that starts with a single URL (http://example.com). The parse() method is called for each page the Spider visits; it extracts the title and body text using CSS selectors and yields the extracted data as a Python dictionary with the yield keyword.
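If you want to try out selectors like these before putting them in a Spider, the Scrapy shell is a convenient sandbox. The session below is a sketch against http://example.com; the outputs shown reflect that page's current content and will differ for other sites.

$ scrapy shell 'http://example.com'
>>> response.css('title::text').get()        # the page title as plain text
'Example Domain'
>>> response.css('a::attr(href)').getall()   # every link URL on the page
['https://www.iana.org/domains/example']
>>> exit()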
The Spider also follows any links found on the page using the response.follow() method. This method takes a link (an <a> element or a URL) as an argument and schedules a request for the linked page, calling the parse() method on that page's response as well.
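In practice you rarely want to follow every link on a page. The sketch below shows a more selective parse() method using response.follow_all() (available in Scrapy 2.0 and later); the a.next selector is hypothetical and should be adjusted to the markup of the site you are crawling.

    def parse(self, response):
        # Extract data exactly as before
        yield {
            'title': response.css('title::text').get(),
        }
        # Only follow pagination links instead of every <a> on the page;
        # 'a.next' is a placeholder selector for the site's "next page" links.
        yield from response.follow_all(css='a.next', callback=self.parse)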
You can run a Scrapy Spider using the scrapy runspider command in your terminal. For example, if you save the above code in a file named myspider.py, you can run it with the following command:
scrapy runspider myspider.py -o data.json
This command runs the Spider defined in myspider.py and saves the extracted data to a JSON file named data.json.
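Once the crawl finishes, data.json contains a JSON array with one object per item the Spider yielded. Here is a minimal sketch of reading it back in Python, assuming the file exists in the current directory:

import json

# Load the items exported by the crawl above
with open('data.json') as f:
    items = json.load(f)

print(f'Scraped {len(items)} items')
if items:
    print(items[0]['title'])

Note that with -o Scrapy appends to an existing file, which can produce invalid JSON if you run the Spider twice; in Scrapy 2.0 and later you can use -O instead to overwrite the file on each run.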