How to use the Scrapy Python crawler framework to fetch HTML which Beautiful Soup parsers, on a Mac for Machine Learning visualizations


Overview

This is a step-by-step hands-on tutorial explaining how to scrape websites for information.

NOTE: Content here reflects my personal opinions and is not intended to represent any employer (past or present). “PROTIP:” items highlight hard-won, little-known but significant facts, based on my personal research and experience, that I haven’t seen elsewhere on the internet.

PROTIP: If an API is not available, scrape (extract/mine) specific information by parsing HTML from websites using the Scrapy web scraping (Spider) framework. See blog.

  1. Work inside a Python virtual environment.
  2. Install Scrapy:

    pip install Scrapy

  3. Verify the installation by running scrapy with no parameters. The response:

    Scrapy 1.8.0 - no active project
     
    Usage:
      scrapy <command> [options] [args]
     
    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
     
      [ more ]      More commands available when run from project directory
     
    Use "scrapy <command> -h" to see more info about a command
    

    Notice that there are more commands available when the command is run inside a Scrapy project folder.

  4. Manually verify that the websites provided by Scrapy framework developers still operate:

    https://quotes.toscrape.com

    https://books.toscrape.com
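
    To check both practice sites from Python rather than a browser, here is a minimal sketch, assuming the requests package is installed (it is not part of Scrapy):

    # Sketch: confirm both practice sites respond with HTTP 200 before crawling.
    import requests

    for url in ("https://quotes.toscrape.com", "https://books.toscrape.com"):
        print(url, requests.get(url, timeout=10).status_code)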

  5. Download a sample Scrapy project, assembled from a Pluralsight video tutorial:

    git clone https://github.com/wilsonmar/scrapy.git
    cd scrapy
    ls

    The repo contains several projects (books-export, quoting).

    PROTIP: __pycache__ folders are created by the Python 3 interpreter so that subsequent runs start a little faster. Each .pyc file in those folders contains the compiled bytecode for a module imported by the code. The folders are listed in .gitignore for the repo so they don’t get stored in GitHub.

    PROTIP: On a Mac, hide all such folders with this command:

    find . -name '__pycache__' -exec chflags hidden {} \;

    On Windows:

    for /d /r . %d in (__pycache__) do @if exist "%d" attrib +h "%d"

    (In a batch file, double the percent signs: %%d.)
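
    A cross-platform alternative is a short Python script (not part of the repo) that lists, or optionally deletes, the cache folders:

    # Sketch: walk the tree with the standard library and report __pycache__ folders.
    from pathlib import Path
    # import shutil

    for cache_dir in Path(".").rglob("__pycache__"):
        print(cache_dir)
        # shutil.rmtree(cache_dir)  # uncomment to delete instead of hide
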
  6. See what commands are available when in an active project folder:

    cd books-export
    scrapy
    

    Additional commands are:

      check         Check spider contracts
      crawl         Run a spider
      edit          Edit spider
      list          List available spiders
      parse         Parse URL (using its spider) and print the results
    
  7. List what crawlers Scrapy recognizes:

    scrapy list
    
  8. Still in folder books-export, run the crawler defined in the spiders subfolder:

    scrapy crawl BookCrawler
    

    The output from the command is a stream of console messages ending with something like this:

    2019-12-25 14:22:53 [scrapy.extensions.feedexport] INFO: Stored json feed (1807 items) in: books.json
    2019-12-25 14:22:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 47252,
     'downloader/request_count': 145,
     'downloader/request_method_count/GET': 145,
     'downloader/response_bytes': 786302,
     'downloader/response_count': 145,
     'downloader/response_status_count/200': 144,
     'downloader/response_status_count/404': 1,
     'dupefilter/filtered': 7372,
     'elapsed_time_seconds': 23.466027,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 12, 25, 21, 22, 53, 201722),
     'item_dropped_count': 453,
     'item_dropped_reasons_count/DropItem': 453,
     'item_scraped_count': 1807,
     'log_count/DEBUG': 1953,
     'log_count/INFO': 11,
     'log_count/WARNING': 453,
     'memusage/max': 52436992,
     'memusage/startup': 52436992,
     'request_depth_max': 51,
     'response_received_count': 145,
     'robotstxt/request_count': 1,
     'robotstxt/response_count': 1,
     'robotstxt/response_status_count/404': 1,
     'scheduler/dequeued': 144,
     'scheduler/dequeued/memory': 144,
     'scheduler/enqueued': 144,
     'scheduler/enqueued/memory': 144,
     'start_time': datetime.datetime(2019, 12, 25, 21, 22, 29, 735695)}
    2019-12-25 14:22:53 [scrapy.core.engine] INFO: Spider closed (finished)
  9. Switch to a text editor to see books.json.

    This contains each book’s title, price, imageurl, bookurl.
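
    Since the point of scraping here is to feed Machine Learning visualizations, below is a minimal sketch for loading that feed back into Python. The field names (title, price) are assumed from the description above, and prices on books.toscrape.com carry a leading £ sign:

    # Sketch: load the books.json feed exported by the crawl and summarize it.
    import json

    with open("books.json") as f:
        books = json.load(f)

    prices = [float(str(b["price"]).lstrip("£$")) for b in books if b.get("price")]
    if prices:
        print(f"{len(books)} books scraped, average price {sum(prices)/len(prices):.2f}")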

  10. View the BookCrawler.py file in the spiders folder.

    Functions (from the bottom up) are: parsepage, extractData, writeTxt.

    These are the result of edits after a template was generated.
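
    For reference, here is a minimal sketch of a spider along those lines. It is a simplified stand-in, not a copy of the repo's BookCrawler.py, and the CSS selectors assume the current markup of books.toscrape.com:

    # Sketch: a pared-down spider for books.toscrape.com.
    import scrapy

    class BookCrawler(scrapy.Spider):
        name = "BookCrawler"
        start_urls = ["https://books.toscrape.com"]

        def parse(self, response):
            # Each book on a catalog page is an <article class="product_pod">.
            for book in response.css("article.product_pod"):
                yield {
                    "title": book.css("h3 a::attr(title)").get(),
                    "price": book.css("p.price_color::text").get(),
                    "bookurl": response.urljoin(book.css("h3 a::attr(href)").get()),
                }
            # Follow the "next" pagination link, if any.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)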

    Scrape Quotes with exports

  11. Run:

    cd quoting
    scrapy crawl QuoteCrawler
    
    2019-12-25 04:09:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 34936,
     'downloader/request_count': 122,
     'downloader/request_method_count/GET': 122,
     'downloader/response_bytes': 176221,
     'downloader/response_count': 122,
     'downloader/response_status_count/200': 121,
     'downloader/response_status_count/404': 1,
     'dupefilter/filtered': 1897,
     'elapsed_time_seconds': 6.066887,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 12, 25, 11, 9, 38, 225122),
     'log_count/DEBUG': 123,
     'log_count/INFO': 10,
     'memusage/max': 52887552,
     'memusage/startup': 52887552,
     'request_depth_max': 4,
     'response_received_count': 122,
     'robotstxt/request_count': 1,
     'robotstxt/response_count': 1,
     'robotstxt/response_status_count/404': 1,
     'scheduler/dequeued': 121,
     'scheduler/dequeued/memory': 121,
     'scheduler/enqueued': 121,
     'scheduler/enqueued/memory': 121,
     'start_time': datetime.datetime(2019, 12, 25, 11, 9, 32, 158235)}
    2019-12-25 04:09:38 [scrapy.core.engine] INFO: Spider closed (finished)
    
  12. Switch to a text editor to view the file created: “quotes.toscrape.txt”.

    Generating a scraper project

    Running the pre-built project above avoids having to generate it yourself with these commands:

    scrapy startproject quotes
    cd quotes
    scrapy genspider QuoteSpider quotes.toscrape.com
    

    The response:

     Created spider 'QuoteSpider' using template 'basic' in module:
      quotes.spiders.QuoteSpider

    … and then edit the generated code.
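
    For example, the generated spider might be edited into something like this sketch (the selectors assume the markup of quotes.toscrape.com; the repo's QuoteCrawler differs):

    # Sketch: an edited version of the generated "basic" spider template.
    import scrapy

    class QuoteSpider(scrapy.Spider):
        name = "QuoteSpider"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            # Each quote sits in a <div class="quote"> element.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }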


Scrapy Python coding

Now let’s examine the Python code.

  • Scrapy uses the Twisted Python networking engine to visit multiple URLs asynchronously (processing each request in a non-blocking way, without waiting for one request to finish before sending another).

  • Scrapy can set and rotate proxies, User-Agent strings, and other HTTP headers dynamically.

  • Scrapy automatically handles cookies sent by servers and passes them back on subsequent requests.

  • Scrapy’s Spiders extract “items” (attributes of a website) into a pipeline for processing, such as pushing data to a Neo4j or MySQL database (see the pipeline sketch after this list).

  • Scrapy selectors use lxml, which is faster than the Python Beautiful Soup (BS4) library, to parse data out of the HTML and XML markup scraped from websites.

  • Scrapy can export data in various formats (CSV, JSON, jsonlines, XML).
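
A minimal sketch of an item pipeline: the class and field names here are assumptions for illustration, but the DropItem counts in the BookCrawler stats above come from this exception being raised. A pipeline is enabled by adding it to ITEM_PIPELINES in the project’s settings.py.

    # Sketch: drop items that arrive without a price.
    from scrapy.exceptions import DropItem

    class RequirePricePipeline:
        def process_item(self, item, spider):
            if not item.get("price"):
                raise DropItem("missing price")  # counted as DropItem in the crawl stats
            return item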

References

https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3

“Automate the Boring Stuff” (free at https://inventwithpython.com) was among the most popular of all tech books. Its author, Al Sweigart (@AlSweigart), in VIDEO: “Automating Your Browser and Desktop Apps” [deck], shows Selenium for web browsers. In another VIDEO he shows his pyautogui library (pip install pyautogui), open-sourced on GitHub, automating MS Paint and Calc on Windows, as well as Flash apps (non-browser apps). Moving the mouse to the top-left corner (0,0) raises the FailSafeException to stop the running script, since there is no hotkey recognition yet.
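
A tiny pyautogui sketch (assuming pip install pyautogui has been run) showing the fail-safe behavior just described:

    # Sketch: basic pyautogui calls. The fail-safe is on by default, so flinging
    # the mouse into the top-left corner (0,0) raises FailSafeException and
    # stops a runaway script.
    import pyautogui

    pyautogui.FAILSAFE = True                    # explicit, though True by default
    print(pyautogui.size())                      # current screen resolution
    pyautogui.moveTo(200, 200, duration=0.5)     # glide the pointer
    pyautogui.write("hello", interval=0.1)       # type into the focused window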

More about Python

This is one of a series about Python:

  1. Python install on MacOS
  2. Python install on MacOS using Pyenv
  3. Python install on Raspberry Pi for IoT

  4. Python tutorials
  5. Python Examples
  6. Python coding notes
  7. Pulumi controls cloud using Python, etc.
  8. Jupyter Notebooks provide commentary to Python

  9. Python certifications

  10. Test Python using Pytest BDD Selenium framework
  11. Test Python using Robot testing framework
  12. Testing AI uses Python code

  13. Microsoft Azure Machine Learning makes use of Python

  14. Python REST API programming using the Flask library
  15. Python coding for AWS Lambda Serverless programming
  16. Streamlit visualization framework powered by Python
  17. Web scraping using Scrapy, powered by Python
  18. Neo4j graph databases accessed from Python