How to use the Scrapy Python crawler framework to fetch HTML that Beautiful Soup parses, on a Mac, for Machine Learning visualizations
Overview
This is a step-by-step hands-on tutorial explaining how to scrape websites for information.
NOTE: Content here reflects my personal opinions and is not intended to represent any employer (past or present). “PROTIP:” items highlight information I haven’t seen elsewhere on the internet because it consists of hard-won, little-known but significant facts based on my personal research and experience.
PROTIP: If an API is not available, scrape (extract/mine) specific information by parsing HTML from websites using the Scrapy web scraping (Spider) framework. See blog.
- Work inside a virtual environment.
- Install by running pip install Scrapy
- Verify by running scrapy with no parameters. The response:
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Notice that more commands become available when the command is run inside a Scrapy project folder.
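As an extra sanity check, the installed version can also be confirmed from within Python. This is a minimal sketch; it simply imports the package installed above:

# Minimal check that the Scrapy package imports and reports its version.
import scrapy

print(scrapy.__version__)   # e.g. 1.8.0, matching the CLI banner above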
- Manually verify that the practice websites provided by the Scrapy framework developers (such as quotes.toscrape.com, used below) still operate.
- Download a sample project using Spyder, assembled from a video tutorial from Pluralsight:
git clone https://github.com/wilsonmar/scrapy.git
cd scrapy
ls
The repo contains several projects (books-export, quoting).
PROTIP: The __pycache__ (cache) folders are created by the Python 3 interpreter to make subsequent executions a little faster. In each folder, a .pyc file contains the bytecode for each module imported by the code. The folders are specified in .gitignore for the repo so they don’t get stored in GitHub.
PROTIP: On a Mac, hide all such folders with this command:
find . -name '__pycache__' -exec chflags hidden {} \;
On Windows, a for loop is needed because attrib does not read piped input:
for /f "delims=" %d in ('dir __pycache__ /s /b /ad') do attrib +h +s "%d"
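If you prefer a single cross-platform approach, a short Python script can locate the same folders. This is only a sketch: it prints each path so you can feed it to chflags (Mac) or attrib (Windows) yourself, rather than hiding anything itself:

# Walk the current directory tree and print every __pycache__ folder found,
# so the paths can be passed to chflags (Mac) or attrib (Windows).
import os

for root, dirs, _files in os.walk("."):
    for d in dirs:
        if d == "__pycache__":
            print(os.path.join(root, d))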
- See what commands are available when in an active project folder:
cd books-export
scrapy
Additional commands are:
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
- List what crawlers Scrapy recognizes:
scrapy list
- Still in the books-export folder, run the crawler defined in the spiders subfolder below it:
scrapy crawl BookCrawler
The output from the command is a series of console messages ending with something like this:
2019-12-25 14:22:53 [scrapy.extensions.feedexport] INFO: Stored json feed (1807 items) in: books.json
2019-12-25 14:22:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 47252,
 'downloader/request_count': 145,
 'downloader/request_method_count/GET': 145,
 'downloader/response_bytes': 786302,
 'downloader/response_count': 145,
 'downloader/response_status_count/200': 144,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 7372,
 'elapsed_time_seconds': 23.466027,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 25, 21, 22, 53, 201722),
 'item_dropped_count': 453,
 'item_dropped_reasons_count/DropItem': 453,
 'item_scraped_count': 1807,
 'log_count/DEBUG': 1953,
 'log_count/INFO': 11,
 'log_count/WARNING': 453,
 'memusage/max': 52436992,
 'memusage/startup': 52436992,
 'request_depth_max': 51,
 'response_received_count': 145,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 144,
 'scheduler/dequeued/memory': 144,
 'scheduler/enqueued': 144,
 'scheduler/enqueued/memory': 144,
 'start_time': datetime.datetime(2019, 12, 25, 21, 22, 29, 735695)}
2019-12-25 14:22:53 [scrapy.core.engine] INFO: Spider closed (finished)
- Switch to a text editor to see books.json.
This contains each book’s title, price, imageurl, bookurl.
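Since the point of scraping here is to feed Machine Learning visualizations, a quick way to sanity-check the feed is to load it with Python’s json module. This sketch assumes books.json is a JSON array of objects with the fields listed above, and that price values carry a leading currency symbol (an assumption about this particular dataset):

# Load the exported feed and print a quick summary as a starting point
# for visualization. Field names (title, price) follow the list above;
# the leading currency symbol on price is an assumption.
import json

with open("books.json") as f:
    books = json.load(f)

print(len(books), "books scraped")
prices = [float(b["price"].lstrip("£$")) for b in books if b.get("price")]
if prices:
    print("average price:", round(sum(prices) / len(prices), 2))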
- View the BookCrawler.py file in the spiders folder.
Functions (from the bottom up) are: parsepage, extractData, writeTxt.
These are the result of edits after a template was generated.
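The repo’s file is the source of truth, but a hypothetical sketch of the same shape (a spider whose callback parses each page, extracts item fields, and follows pagination) might look something like this. The selectors assume the books.toscrape.com page layout, the function names are illustrative rather than the repo’s exact code, and writing the text file is left to Scrapy’s feed export:

# Hypothetical sketch of a book-crawling spider, not the repo's exact code.
import scrapy


class BookCrawler(scrapy.Spider):
    name = "BookCrawler"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        # Each book on books.toscrape.com sits in an <article class="product_pod">.
        for book in response.css("article.product_pod"):
            yield self.extract_data(book, response)
        # Follow the "next" pagination link, if present.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_page)

    def extract_data(self, book, response):
        return {
            "title": book.css("h3 a::attr(title)").get(),
            "price": book.css("p.price_color::text").get(),
            "imageurl": response.urljoin(book.css("img::attr(src)").get()),
            "bookurl": response.urljoin(book.css("h3 a::attr(href)").get()),
        }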
Scrape Quotes with exports
- Run:
cd quoting
scrapy crawl QuoteCrawler
The console messages end with something like this:
2019-12-25 04:09:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 34936,
 'downloader/request_count': 122,
 'downloader/request_method_count/GET': 122,
 'downloader/response_bytes': 176221,
 'downloader/response_count': 122,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1897,
 'elapsed_time_seconds': 6.066887,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 25, 11, 9, 38, 225122),
 'log_count/DEBUG': 123,
 'log_count/INFO': 10,
 'memusage/max': 52887552,
 'memusage/startup': 52887552,
 'request_depth_max': 4,
 'response_received_count': 122,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 121,
 'scheduler/dequeued/memory': 121,
 'scheduler/enqueued': 121,
 'scheduler/enqueued/memory': 121,
 'start_time': datetime.datetime(2019, 12, 25, 11, 9, 32, 158235)}
2019-12-25 04:09:38 [scrapy.core.engine] INFO: Spider closed (finished)
- Switch to a text editor to view the file created: “quotes.toscrape.txt”.
Generate a Scrapy project
Running the pre-built repo above avoided having to generate the project with these commands:
scrapy startproject quotes
cd quotes
scrapy genspider QuoteSpider quotes.toscrape.com
The response:
Created spider 'QuoteSpider' using template 'basic' in module: quotes.spiders.QuoteSpider
… and then edit the generated code.
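The genspider command starts from Scrapy’s “basic” template, so the generated file looks roughly like this before editing (the exact class name and field values can vary by Scrapy version):

# Roughly what the 'basic' template generates for QuoteSpider (contents
# vary by Scrapy version); the parse() body is what you then edit.
import scrapy


class QuotespiderSpider(scrapy.Spider):
    name = 'QuoteSpider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass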
Scrapy Python coding
Now let’s examine the Python code.
- Scrapy uses the Twisted Python networking engine to visit multiple URLs asynchronously (processing each request in a non-blocking way, without waiting for one request to finish before sending another).
- Scrapy can set and rotate proxies, User-Agent strings, and other HTTP headers dynamically.
- Scrapy automatically handles cookies passed between the crawler and the server.
- Scrapy Spiders yield “items” (attributes scraped from a website) into a pipeline for processing, such as pushing data to a Neo4j or MySQL database (a sketch of a simple pipeline appears after this list).
- Scrapy selectors use lxml, which is faster than the Python Beautiful Soup (BS4) library, to parse data from inside HTML and XML markup scraped from websites.
- Scrapy can export data in various formats (CSV, JSON, JSON Lines, XML).
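As a concrete illustration of the item pipeline bullet above, here is a minimal hypothetical pipeline that drops items missing a price; the books_export module name in the settings comment is an assumption, and a real pipeline could instead write each item to Neo4j or MySQL:

# Hypothetical pipeline (e.g. in the project's pipelines.py): drop items
# that lack a price; a real pipeline could push each item to a database.
from scrapy.exceptions import DropItem


class PriceCheckPipeline:
    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("missing price in %r" % item)
        return item

# Enable it in settings.py (the "books_export" module name is an assumption):
# ITEM_PIPELINES = {"books_export.pipelines.PriceCheckPipeline": 300}

The number 300 only sets this pipeline’s order relative to any other enabled pipelines (lower runs first). The item_dropped_count and DropItem entries in the BookCrawler stats shown earlier come from exactly this kind of DropItem exception.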
References
https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
“Automate the Boring Stuff” (free at https://inventwithpython.com) has been among the most popular of all tech books. Its author, Al Sweigart (@AlSweigart), in the VIDEO “Automating Your Browser and Desktop Apps” [deck], shows Selenium automating web browsers. In another VIDEO he shows pyautogui (pip install pyautogui), open-sourced on GitHub, automating MS Paint, Calc on Windows, and Flash apps (non-browser apps). Moving the mouse to the top-left corner (0,0) raises the FailSafeException to stop the running script, since there is no hotkey recognition yet.
More about Python
This is one of a series about Python:
- Python install on MacOS
- Python install on MacOS using Pyenv
- Python tutorials
- Python Examples
- Python coding notes
- Pulumi controls cloud using Python, etc.
- Test Python using Pytest BDD Selenium framework
- Test Python using Robot testing framework
- Python REST API programming using the Flask library
- Python coding for AWS Lambda Serverless programming
- Streamlit visualization framework powered by Python
- Web scraping using Scrapy, powered by Python
- Neo4j graph databases accessed from Python