Scrapy is a web scraping framework that lets you extract data from web pages in a structured way. It is an easy-to-use, Python-based tool that provides modules for many different scraping tasks.
AutoThrottle
The AutoThrottle extension is a powerful feature built into Scrapy that adjusts how many concurrent requests the crawler sends. It lets you cap the download delay between requests and respects the concurrency limits set per domain or per IP, improving both the performance and the politeness of your crawls.
You can also control whether failed downloads are retried (RETRY_ENABLED in the settings) and how hard the crawler hits each server by tuning AUTOTHROTTLE_TARGET_CONCURRENCY together with CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP. Raising the target concurrency increases throughput, but it also increases the load you place on remote servers, so treat it as a trade-off rather than a free speed-up.
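A minimal sketch of how these settings fit together in a project's settings.py (the values are illustrative, not recommendations):

    # settings.py -- enable and tune AutoThrottle (illustrative values)
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60.0          # hard cap on the delay between requests
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per server
    CONCURRENT_REQUESTS_PER_DOMAIN = 8     # hard per-domain concurrency limit
    CONCURRENT_REQUESTS_PER_IP = 0         # 0 disables the per-IP limit
    RETRY_ENABLED = True                   # retry failed downloads (the default)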
This extension is useful for crawling websites that are prone to congestion or whose capacity varies: it automatically adjusts the number of concurrent requests Scrapy sends based on the traffic it observes.
During a crawl, the spider makes HTTP requests to its start URLs and hands each response to a callback (parse by default). The callback extracts items from the response and can yield further requests to follow the links it finds on the page.
If a start URL cannot be fetched, the spider does not stop; it simply moves on to the next URL in the list.
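A minimal spider sketch, modeled on the example site used by the official Scrapy tutorial (quotes.toscrape.com); the selectors assume that site's markup:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # each spider's name must be unique within the project
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # yield one item per quote block on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # follow the pagination link, reusing this callback
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)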
You can create multiple spiders in your project, each defining the URLs it will scrape. It is important to give every spider a unique name so that Scrapy can distinguish them from one another.
To run a specific spider, pass its name to the crawl command (scrapy crawl <name>). You can also point a spider at a single URL with the parse command to display the items it would scrape from that page.
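For example, assuming the QuotesSpider sketched above is registered in the project:

    # run the spider named "quotes" and append the items to a JSON file
    scrapy crawl quotes -o quotes.json

    # fetch one URL and display the items the spider's callback extracts
    scrapy parse "https://quotes.toscrape.com/" --spider=quotes --callback=parse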
The Scrapy shell is a useful feature for testing your XPath expressions and CSS selectors. It gives you a fast, interactive environment in which to try out extraction code without having to run the whole scraper just to get feedback.
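A short session against the same tutorial site (output abridged):

    $ scrapy shell "https://quotes.toscrape.com/"
    >>> response.css("small.author::text").get()
    'Albert Einstein'
    >>> response.xpath("//div[@class='quote']/span[@class='text']/text()").get()
    '"The world as we have created it is a process of our thinking. ..."'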
The AutoThrottle extension is based on a simple algorithm: it measures the latency of each response and derives a target download delay of latency / AUTOTHROTTLE_TARGET_CONCURRENCY, then nudges the current delay toward that target. When responses come back quickly the delay shrinks and more requests are in flight; when the server slows down the delay grows, up to AUTOTHROTTLE_MAX_DELAY.
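A simplified sketch of that rule in plain Python; this is illustrative arithmetic, not Scrapy's actual implementation:

    # Next delay moves halfway toward latency / target_concurrency,
    # clamped between the minimum and maximum delays.
    def next_delay(current, latency, target_concurrency=2.0,
                   min_delay=0.0, max_delay=60.0):
        target = latency / target_concurrency
        proposed = (current + target) / 2.0
        return min(max(proposed, min_delay), max_delay)

    # Fast 0.5 s responses pull a 5 s starting delay down toward 0.25 s:
    print(next_delay(5.0, 0.5))    # 2.625
    print(next_delay(2.625, 0.5))  # 1.4375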
This makes the extension a great tool when you need to crawl large websites in a short amount of time: the number of concurrent requests rises when servers respond quickly, yet it never exceeds the limits you specify.
You can use AutoThrottle's debug mode to watch the throttling parameters being adjusted in real time. It logs stats for every response received, which can help you work out why a particular crawl was unable to reach its target concurrency.
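Enabling it is a one-line change next to the other AutoThrottle settings:

    # settings.py -- log throttling stats for every response received
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_DEBUG = True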