How to use the Scrapy Python crawler framework to fetch HTML that Beautiful Soup parses, on a Mac, for Machine Learning visualizations
Overview
This is a step-by-step hands-on tutorial explaining how to scrape websites for information.
NOTE: Content here reflects my personal opinions and is not intended to represent any employer (past or present). “PROTIP:” items highlight information I haven’t seen elsewhere on the internet because it consists of hard-won, little-known but significant facts based on my personal research and experience.
PROTIP: If an API is not available, scrape (extract/mine) specific information by parsing HTML from websites using the Scrapy web scraping (Spider) framework. See blog.
- Work inside a virtual environment.
- Install by running pip install Scrapy
- Verify by running scrapy with no parameters. The response:
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Notice that more commands become available when the command is run inside a Scrapy project folder.
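As an extra sanity check, the installed version can also be confirmed from within Python. This is a minimal sketch; it simply imports the package installed above:

# Minimal check that the Scrapy package imports and reports its version.
import scrapy

print(scrapy.__version__)   # e.g. 1.8.0, matching the CLI banner above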
- Manually verify that the practice websites provided by the Scrapy framework developers (such as quotes.toscrape.com, used below) still operate.
- Download a sample project using Spyder, assembled from a video tutorial from Pluralsight:
git clone https://github.com/wilsonmar/scrapy.git
cd scrapy
ls
The repo contains several projects (books-export, quoting).
PROTIP: The __pycache__ (cache) folders are created by the Python 3 interpreter to make subsequent executions a little faster. In each folder, a .pyc file contains the bytecode for each module imported by the code. The folders are specified in .gitignore for the repo so they don’t get stored in GitHub.
PROTIP: On a Mac, hide all such folders with this command:
find . -name '__pycache__' -exec chflags hidden {} \;
On Windows, a for loop is needed because attrib does not read piped input:
for /f "delims=" %d in ('dir __pycache__ /s /b /ad') do attrib +h +s "%d"
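If you prefer a single cross-platform approach, a short Python script can locate the same folders. This is only a sketch: it prints each path so you can feed it to chflags (Mac) or attrib (Windows) yourself, rather than hiding anything itself:

# Walk the current directory tree and print every __pycache__ folder found,
# so the paths can be passed to chflags (Mac) or attrib (Windows).
import os

for root, dirs, _files in os.walk("."):
    for d in dirs:
        if d == "__pycache__":
            print(os.path.join(root, d))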
- See what commands are available when in an active project folder:
cd books-export
scrapy
Additional commands are:
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
- List what crawlers Scrapy recognizes:
scrapy list
- Still in the books-export folder, run the crawler defined in the spiders subfolder below it:
scrapy crawl BookCrawler
The output from the command is a series of console messages ending with something like this:
2019-12-25 14:22:53 [scrapy.extensions.feedexport] INFO: Stored json feed (1807 items) in: books.json
2019-12-25 14:22:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 47252,
 'downloader/request_count': 145,
 'downloader/request_method_count/GET': 145,
 'downloader/response_bytes': 786302,
 'downloader/response_count': 145,
 'downloader/response_status_count/200': 144,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 7372,
 'elapsed_time_seconds': 23.466027,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 25, 21, 22, 53, 201722),
 'item_dropped_count': 453,
 'item_dropped_reasons_count/DropItem': 453,
 'item_scraped_count': 1807,
 'log_count/DEBUG': 1953,
 'log_count/INFO': 11,
 'log_count/WARNING': 453,
 'memusage/max': 52436992,
 'memusage/startup': 52436992,
 'request_depth_max': 51,
 'response_received_count': 145,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 144,
 'scheduler/dequeued/memory': 144,
 'scheduler/enqueued': 144,
 'scheduler/enqueued/memory': 144,
 'start_time': datetime.datetime(2019, 12, 25, 21, 22, 29, 735695)}
2019-12-25 14:22:53 [scrapy.core.engine] INFO: Spider closed (finished)
- Switch to a text editor to see books.json.
This contains each book’s title, price, imageurl, bookurl.
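Since the point of scraping here is to feed Machine Learning visualizations, a quick way to sanity-check the feed is to load it with Python’s json module. This sketch assumes books.json is a JSON array of objects with the fields listed above, and that price values carry a leading currency symbol (an assumption about this particular dataset):

# Load the exported feed and print a quick summary as a starting point
# for visualization. Field names (title, price) follow the list above;
# the leading currency symbol on price is an assumption.
import json

with open("books.json") as f:
    books = json.load(f)

print(len(books), "books scraped")
prices = [float(b["price"].lstrip("£$")) for b in books if b.get("price")]
if prices:
    print("average price:", round(sum(prices) / len(prices), 2))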
- View the BookCrawler.py file in the spiders folder.
Functions (from the bottom up) are: parsepage, extractData, writeTxt.
These are the result of edits after a template was generated.
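The repo’s file is the source of truth, but a hypothetical sketch of the same shape (a spider whose callback parses each page, extracts item fields, and follows pagination) might look something like this. The selectors assume the books.toscrape.com page layout, the function names are illustrative rather than the repo’s exact code, and writing the text file is left to Scrapy’s feed export:

# Hypothetical sketch of a book-crawling spider, not the repo's exact code.
import scrapy


class BookCrawler(scrapy.Spider):
    name = "BookCrawler"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        # Each book on books.toscrape.com sits in an <article class="product_pod">.
        for book in response.css("article.product_pod"):
            yield self.extract_data(book, response)
        # Follow the "next" pagination link, if present.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_page)

    def extract_data(self, book, response):
        return {
            "title": book.css("h3 a::attr(title)").get(),
            "price": book.css("p.price_color::text").get(),
            "imageurl": response.urljoin(book.css("img::attr(src)").get()),
            "bookurl": response.urljoin(book.css("h3 a::attr(href)").get()),
        }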
Scrape Quotes with exports
- Run:
cd quoting
scrapy crawl QuoteCrawler
The console messages end with something like this:
2019-12-25 04:09:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 34936,
 'downloader/request_count': 122,
 'downloader/request_method_count/GET': 122,
 'downloader/response_bytes': 176221,
 'downloader/response_count': 122,
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1897,
 'elapsed_time_seconds': 6.066887,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 25, 11, 9, 38, 225122),
 'log_count/DEBUG': 123,
 'log_count/INFO': 10,
 'memusage/max': 52887552,
 'memusage/startup': 52887552,
 'request_depth_max': 4,
 'response_received_count': 122,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 121,
 'scheduler/dequeued/memory': 121,
 'scheduler/enqueued': 121,
 'scheduler/enqueued/memory': 121,
 'start_time': datetime.datetime(2019, 12, 25, 11, 9, 32, 158235)}
2019-12-25 04:09:38 [scrapy.core.engine] INFO: Spider closed (finished)
- Switch to a text editor to view the file created: “quotes.toscrape.txt”.
Generate a Scrapy project
Running the pre-built repo above avoided having to generate the project with these commands:
scrapy startproject quotes
cd quotes
scrapy genspider QuoteSpider quotes.toscrape.com
The response:
Created spider 'QuoteSpider' using template 'basic' in module: quotes.spiders.QuoteSpider
… and then edit the generated code.
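The genspider command starts from Scrapy’s “basic” template, so the generated file looks roughly like this before editing (the exact class name and field values can vary by Scrapy version):

# Roughly what the 'basic' template generates for QuoteSpider (contents
# vary by Scrapy version); the parse() body is what you then edit.
import scrapy


class QuotespiderSpider(scrapy.Spider):
    name = 'QuoteSpider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass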
Scrapy Python coding
Now let’s examine the Python code.
- Scrapy uses the Twisted Python networking engine to visit multiple URLs asynchronously (processing each request in a non-blocking way, without waiting for one request to finish before sending another).
- Scrapy can set and rotate proxies, User-Agent strings, and other HTTP headers dynamically.
- Scrapy automatically handles cookies passed between the crawler and the server.
- Scrapy Spiders yield “items” (attributes scraped from a website) into a pipeline for processing, such as pushing data to a Neo4j or MySQL database (a sketch of a simple pipeline appears after this list).
- Scrapy selectors use lxml, which is faster than the Python Beautiful Soup (BS4) library, to parse data from inside HTML and XML markup scraped from websites.
- Scrapy can export data in various formats (CSV, JSON, JSON Lines, XML).
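As a concrete illustration of the item pipeline bullet above, here is a minimal hypothetical pipeline that drops items missing a price; the books_export module name in the settings comment is an assumption, and a real pipeline could instead write each item to Neo4j or MySQL:

# Hypothetical pipeline (e.g. in the project's pipelines.py): drop items
# that lack a price; a real pipeline could push each item to a database.
from scrapy.exceptions import DropItem


class PriceCheckPipeline:
    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("missing price in %r" % item)
        return item

# Enable it in settings.py (the "books_export" module name is an assumption):
# ITEM_PIPELINES = {"books_export.pipelines.PriceCheckPipeline": 300}

The number 300 only sets this pipeline’s order relative to any other enabled pipelines (lower runs first). The item_dropped_count and DropItem entries in the BookCrawler stats shown earlier come from exactly this kind of DropItem exception.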
References
https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
“Automate the Boring Stuff” (free at https://inventwithpython.com) has been among the most popular of all tech books. Its author, Al Sweigart (@AlSweigart), in the VIDEO “Automating Your Browser and Desktop Apps” [deck], shows Selenium automating web browsers. In another VIDEO he shows pyautogui (pip install pyautogui), open-sourced on GitHub, automating MS Paint, Calc on Windows, and Flash apps (non-browser apps). Moving the mouse to the top-left corner (0,0) raises the FailSafeException to stop the running script, since there is no hotkey recognition yet.
More about Python
This is one of a series about Python:
- Python install on MacOS
- Python install on MacOS using Pyenv
- Python tutorials
- Python Examples
- Python coding notes
- Pulumi controls cloud using Python, etc.
- Test Python using Pytest BDD Selenium framework
- Test Python using Robot testing framework
- Python REST API programming using the Flask library
- Python coding for AWS Lambda Serverless programming
- Streamlit visualization framework powered by Python
- Web scraping using Scrapy, powered by Python
- Neo4j graph databases accessed from Python