programming · python · web-scraping

Simplified Web Scraping Using Python Requests & Scraper API


Disclosure: if you purchase a paid version of Scraper API using links in this post, I may earn an affiliate commission.

In the past I’ve tried out some different web scraping techniques, experimenting with different Python libraries such as BeautifulSoup, Scrapy, and Selenium.

While I was able to extract the data I was looking for from the specific webpages my code was crawling, the internet is a vast place and not all websites are created equal. When doing web scraping at scale, sometimes it’s difficult to scrape sites that are protected by anti-bots like Distil, Akamai, or Cloudflare.

I recently came across a service called Scraper API that simplifies the web scraping process and handles these kinds of challenges for you. Scraper API acts as a proxy solution for web scraping, and its REST API can be consumed from any programming language. For this post I’m going to demonstrate it using the Python Requests library.

To start using the service, you need to create an account with Scraper API. You’ll receive an API key to use in your code. Scraper API provides 1,000 free API calls per month (with up to 5 concurrent requests) so you can test out the platform.

My simple code example demonstrates the primary benefit of web scraping with Scraper API: a different rotating proxy IP address each time you establish a client connection. The service takes your request, routes it through its proxy network, and handles proxy rotation, CAPTCHAs, and retries for you.
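A minimal sketch of that workflow is below. It assumes you’ve signed up and have an API key (the `API_KEY` placeholder is mine, not a real credential), and it uses httpbin.org/ip, which simply echoes back the IP address it sees, to make the proxy rotation visible:

```python
import requests

# Scraper API's documented entry point: a simple GET endpoint that takes
# your key and the target URL as query-string parameters.
API_ENDPOINT = "http://api.scraperapi.com/"


def build_params(api_key, url, **extra):
    """Assemble the query parameters Scraper API expects."""
    params = {"api_key": api_key, "url": url}
    params.update(extra)  # optional feature flags, e.g. render="true"
    return params


def scrape(api_key, url, **extra):
    """Fetch `url` through Scraper API's rotating proxy pool."""
    return requests.get(API_ENDPOINT, params=build_params(api_key, url, **extra))


if __name__ == "__main__":
    API_KEY = "YOUR_API_KEY"  # replace with the key from your Scraper API account
    # httpbin.org/ip echoes the IP it sees; repeated calls should show a
    # different proxy address each time, demonstrating the rotation.
    for _ in range(3):
        print(scrape(API_KEY, "http://httpbin.org/ip").text)
```

Running the loop a few times should print a different origin IP on each call, confirming that each request went out through a different proxy.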

When executing heavy volumes of web scraping, you may eventually hit a wall if the target site blocks your IP address. This can be expensive to overcome, but fortunately Scraper API maintains standard proxy pools containing millions of IPs, optimized to use the cleanest IPs for each specific target website. For a few particularly difficult-to-scrape sites, a private internal pool of residential and mobile IPs is also maintained.

Beyond letting you bypass the most complex anti-bot systems, Scraper API can also handle pages that require JavaScript to render: these pages can be fetched through a headless browser. If you’d like to test this and other features, you can do so by simply adding a flag to your request:
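For example, the `render=true` flag (documented by Scraper API) asks the service to run the page through its headless browser before returning the HTML. A short sketch, again assuming a placeholder `API_KEY` and an illustrative target URL:

```python
import requests

API_KEY = "YOUR_API_KEY"  # from your Scraper API dashboard

# Extra features are enabled with additional query-string flags;
# render=true tells Scraper API to execute the page's JavaScript in a
# headless browser so dynamically generated content appears in the response.
params = {
    "api_key": API_KEY,
    "url": "http://example.com/",  # in practice, a JavaScript-heavy page
    "render": "true",
}

if __name__ == "__main__":
    response = requests.get("http://api.scraperapi.com/", params=params)
    print(response.text)  # fully rendered HTML rather than the bare source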

Scraper API is a powerful, extremely easy-to-use addition to your scraping toolbox. If the free tier does not meet your needs, this promo code - ANDREW10 - will get you a 10% discount on any of the paid subscription plans when scaling up for more API calls, additional concurrency, and more features.


About Me

I'm a data leader working to advance data-driven cultures by wrangling disparate data sources and empowering end users to uncover key insights that tell a bigger story.
