Text extraction is the process of retrieving textual data from HTML or XML documents. It is one of the most common scraping tasks: it isolates the content of interest on a web page so you can use it for further analysis or processing.
In Scrapy, you extract the text content of an HTML or XML element by selecting its text nodes with the ::text pseudo-element and then calling an extraction method such as extract_first() or extract(). Here's an example:

    from scrapy import Selector

    html = '<div><p>First paragraph.</p><p>Second paragraph.</p></div>'
    selector = Selector(text=html)

    # Extract the text content of the first p element inside a div element
    text = selector.css('div p:first-of-type::text').extract_first()

In this example, the selector div p:first-of-type::text matches the text node of the first p element inside the div element, and extract_first() returns the first match as a string. The resulting text variable contains the string "First paragraph.". Without ::text, the same call would return the element's markup, '<p>First paragraph.</p>'.
If you want to extract all text content from an element, select every text node beneath it and use the extract() method instead, which returns a list of strings:

    from scrapy import Selector

    html = '<div><p>First paragraph.</p><p>Second paragraph.</p></div>'
    selector = Selector(text=html)

    # Extract all text content inside a div element
    parts = selector.css('div ::text').extract()
    text = ' '.join(parts)

In this example, the selector div ::text (note the space before ::text) matches every text node inside the div element, including those of nested elements. The extract() method returns them as the list ['First paragraph.', 'Second paragraph.'], and joining the list with a space gives a text variable containing the string "First paragraph. Second paragraph.".
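In addition to extract() and extract_first(), selectors provide the get() and getall() methods: get() returns the first match as a string, like extract_first(), and getall() returns every match as a list, like extract(). Here, get() extracts the text content of an element as a string: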
    from scrapy import Selector

    html = '<div><p>First paragraph.</p><p>Second paragraph.</p></div>'
    selector = Selector(text=html)

    # Extract the text content of the first p element inside a div element
    text = selector.css('div p:first-of-type::text').get()
In this example, we use the get() method to extract the text content of the first p element inside a div element. The resulting text variable would contain the string "First paragraph.".
Text selected with ::text is returned as plain strings: the result contains no HTML tags, and character entities such as &amp; are decoded into the characters they represent. If you need the HTML tags along with the text content, omit ::text and call extract() or get() on the element selection itself, which returns the element's markup.