Jump in and drown in all the data
Overview
Here is a list of data available on the internet.
“Data is the crude oil of the 21st Century, and analytics is the combustion engine.” –Gartner
I’d like to see how different people work on the same set of data:
Images
unsplash.com
Natural Language Processing datasets at PapersWithCode.com
Computer Vision datasets at PapersWithCode.com
Speech / Voice
Speech datasets at PapersWithCode.com
Audio datasets at PapersWithCode.com
Music
Music datasets at PapersWithCode.com
Spotify’s API was used to identify the saddest Radiohead song.
Lyrics
Pandora music?
Amazon music?
Computer Code
Computer Code datasets at PapersWithCode.com
Medical
Medical datasets at PapersWithCode.com
Robots
Robots datasets at PapersWithCode.com
Microsoft’s COCO
VIDEO: Joseph Redmon’s YOLO (You Only Look Once) algorithm recognizes over 80 categories of objects in real time in videos, on a laptop. It is based on the University of Washington’s open-source Darknet system.
UCF YouTube Action Data Set
http://crcv.ucf.edu/data/UCF_YouTube_Action.php 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horse back riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog.
MNIST Number Images
Instead of downloading these yourself, note that Floydhub.com already has these image datasets on its servers for use by Machine Learning code:
http://yann.lecun.com/exdb/mnist
MNIST, hosted on the website of Yann LeCun (the “Godfather of ML”),
is the “hello world” of deep learning:
60,000 training images and 10,000 test images, each 28x28 pixels, of hand-written numbers (from 0 through 9).
Each image is labeled with the number written in the image.
The “NIST” in “MNIST” is for the US National Institute of Standards and Technology.
- This lists methods by their error rate.
- MNIST using a “flashlight” visualization by Tensorboard by Dandelion at the TensorFlow Dev Summit Feb. 2017.
- The MNIST dataset comes pre-loaded in Keras, in the form of a set of four Numpy arrays, loaded using this code that references two sets of data: the training set and the testing set.
    from keras.datasets import mnist
    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
The “shape” of an array is the number of items and pixel height and width:
    train_images.shape
    (60000, 28, 28)
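A common next step with arrays of this shape is to flatten each 28x28 image into a row of 784 values and scale the pixels to the range 0 to 1. The following is a minimal sketch, not the Keras tutorial's own code: `prepare_images` is a hypothetical helper, and `fake_images` is a random stand-in for `train_images` so the example runs without downloading MNIST.

```python
import numpy as np

def prepare_images(images):
    """Flatten (n, 28, 28) uint8 images to (n, 784) floats scaled to [0, 1]."""
    n = images.shape[0]
    flat = images.reshape(n, -1)           # -1 lets NumPy infer 28 * 28 = 784
    return flat.astype("float32") / 255.0  # pixel values range from 0 to 255

# Stand-in for train_images (an assumption): random pixels, MNIST-like shape and dtype.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)

prepared = prepare_images(fake_images)
print(prepared.shape)  # (100, 784)
print(prepared.dtype)  # float32
```

With the real data, the same helper would be called as `prepare_images(train_images)`.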
http://mscoco.org/dataset/#download
COCO is a new image recognition, segmentation, and captioning dataset.
It has 300,000 images containing multiple objects per image.
80 object categories.
http://www.robots.ox.ac.uk/~vgg/research/very_deep
Imagenet VGG Very Deep 19
19 weight layers pre-trained Convnet model
http://www.vision.caltech.edu/Image_Datasets/Caltech101
CALTECH 101/256
contains pictures of objects belonging to 101/256 categories
http://www.cs.utoronto.ca/~kriz/cifar.html
CIFAR 10/100
Subsets of the 80 Million Tiny Images dataset (cats, horses, airplanes, etc.)
https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition
Cats vs Dogs Redux: Kernels Edition
Dataset for Kaggle’s famous Dogs vs Cats competition
https://archive.ics.uci.edu/ml/datasets/Iris
Iris Data
http://konect.uni-koblenz.de
KONECT (the Koblenz Network Collection)
from the Institute of Web Science and Technologies at the University of Koblenz–Landau
collects large network datasets of all types in order to perform research in network science and related fields.
Words
Google digitized (scanned) books from the 20th century and turned them into n-grams at
https://books.google.com/ngrams/
with counts of how often each word occurred across those books.
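To make the idea of n-gram counts concrete, here is a small sketch that counts word n-grams in a sentence. It is not Google's pipeline; `ngram_counts` and the sample sentence are made up for illustration, and the tokenization is a plain whitespace split.

```python
from collections import Counter

def ngram_counts(text, n=1):
    """Count runs of n consecutive words (n-grams) in a text."""
    words = text.lower().split()
    # Zip the word list against shifted copies of itself to form n-word tuples.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

sample = "the cat sat on the mat and the cat slept"
unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)

print(unigrams["the"])     # 3
print(bigrams["the cat"])  # 2
```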
WordNet defines affect scores: a mood score for each word.
Data
data.gov is home to the U.S. Government’s open data. Its QA stats by department provide data quality metrics such as:
* % Valid metadata
* % Working Download URLs
* % Correct Format
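Percentages like these can be computed from per-dataset check results. The sketch below assumes a made-up record layout (the keys `valid_metadata`, `url_works`, and `correct_format` are hypothetical, not data.gov's schema) and invented sample values.

```python
def percent(records, key):
    """Return the percentage of records where the given check passed."""
    if not records:
        return 0.0
    passed = sum(1 for record in records if record.get(key))
    return 100.0 * passed / len(records)

# Hypothetical check results for four datasets from one department.
datasets = [
    {"valid_metadata": True,  "url_works": True,  "correct_format": True},
    {"valid_metadata": True,  "url_works": False, "correct_format": True},
    {"valid_metadata": False, "url_works": True,  "correct_format": True},
    {"valid_metadata": True,  "url_works": False, "correct_format": True},
]

for key in ("valid_metadata", "url_works", "correct_format"):
    print(f"% {key}: {percent(datasets, key):.0f}")
```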
Visibly missing is data from the VA and EPA, plus most other agencies.
us_budget has dollar outlays of each bureau within every agency (branch) of the US government, by year from 1962 to 2021.
COVID-19 tracking data is exposed by an API, with hourly updates shown on CovidTracking.com.
https://parking.api.smgov.net has Santa Monica parking meter API data, analyzed by Sam Abrahams (http://www.memdump.io/about/), course instructor for Metis’s Deep Learning with Tensorflow.
http://www.makeovermonday.co.uk/data posts one visualization makeover every week (52 a year).
IEX (Investors Exchange) has real-time stock exchange data.
archive.ics.uci.edu/ml/datasets.html
Amazon Cloud
Azure - Community content is in the Cortana Gallery.
Google Big Data
GitHub
Wikipedia
IMDB
Kaggle
Allen Institute (ai2) - http://allenai.org/data.html
OpenSecrets.org provides datasets related to US political campaign finance.
News
US Census
http://news.google.com/archivesearch has 200 years of archives
http://www.ibiblio.org/slanews/internet/archives.html
http://www.ibiblio.org/slanews/internet/intarchives.htm has links to global archives
http://searches.rootsweb.ancestry.com/ssdi.html Roots web
http://search.ancestry.com/search/db.aspx?dbid=3693 US Social Security Death Masterfile Index goes from 1935-2014
http://www.worldcat.org/default.jsp “lets you search the collections of libraries in your community and thousands more around the world.”
Maps of Geography
https://waymo.com/open from Alphabet’s (Google’s) self-driving car company Waymo has data collected by Waymo self-driving cars. As of this writing, it had 1,950 segments of 20s each, collected by high-resolution lidar sensors and cameras at 10Hz (1,950 × 20 s × 10 Hz = 390,000 frames) in diverse geographies and conditions. Their code is at https://github.com/waymo-research/waymo-open-dataset
Country codes
City
Street Names
Zip codes by state, latitude, longitude
Waypoints
Weather
[*] TimeAndDate.com provides a webpage you can personalize with your favorite cities, with weather information and local time.
[?] OpenWeatherMap.org API is free and based on 40,000 crowd-sourced weather stations.
[x] Weatherbit API
[x] Weather2020 API provides a 12-week forecast.
[x] ClimaCell Microweather API
[x] Weatherbit uses Machine Learning to predict weather.
[*] Metrogroup specializes in nautical data around the UK.
[*] Weatherstack in the UK.
- Dark Sky API closed down, thanks to Apple.
World Meteorological Organization at https://worldweather.wmo.int provides weather throughout the world, but mostly for cities.
[*] National Weather Service (weather.gov)
[*] Weather Channel (weather.com) (an IBM business)
Domains
First names registered in each state, by year, in the US from Google Big Data
Musicbase from a game
Using data
- Cleaning
- Transformation
- Reduction (generalize synonyms)
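The three steps above can be sketched on a single line of text. This is only an illustration under stated assumptions: the function names, the `SYNONYMS` table, and the sample sentence are all hypothetical, and real pipelines would use richer rules for each step.

```python
import re

# Hypothetical synonym table: each variant maps to one general term.
SYNONYMS = {"automobile": "car", "auto": "car"}

def clean(text):
    """Cleaning: drop punctuation and surrounding whitespace."""
    return re.sub(r"[^\w\s]", "", text).strip()

def transform(text):
    """Transformation: lowercase the text and split it into tokens."""
    return text.lower().split()

def reduce_synonyms(tokens, synonyms=SYNONYMS):
    """Reduction: replace each synonym with its general term."""
    return [synonyms.get(token, token) for token in tokens]

raw = "  The Automobile passed an auto, then a car!  "
tokens = reduce_synonyms(transform(clean(raw)))
print(tokens)
# ['the', 'car', 'passed', 'an', 'car', 'then', 'a', 'car']
```

After reduction, the three different spellings count as the same item, so `tokens.count("car")` is 3.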
More
This is one of a series on AI, Machine Learning, Deep Learning, Robotics, and Analytics:
- AI Ecosystem
- Machine Learning
- Microsoft’s AI
- Microsoft’s Azure Machine Learning Algorithms
- Microsoft’s Azure Machine Learning tutorial
- Python installation
- Image Processing
- Tesseract OCR using OpenCV
- Multiple Regression calculation and visualization using Excel and Machine Learning
- Tableau Data Visualization