Wilson Mar


Jump in and drown in all the data

Overview

Here is a list of available data.

“Data is the crude oil of the 21st Century, and analytics is the combustion engine.” –Gartner

I’d like to see how different people work on the same set of data:

watson visualizations

Images

Microsoft’s COCO

VIDEO (https://www.ted.com/talks/joseph_redmon_how_a_computer_learns_to_recognize_objects_instantly#t-286801): Joseph Redmon’s YOLO (You Only Look Once) algorithm recognizes over 80 categories of objects in real time in videos, on a laptop. It is based on the University of Washington’s open-source Darknet system.

MNIST Number Images

Instead of downloading the data yourself, note that FloydHub (Floydhub.com) already hosts these image datasets on its servers for use by Machine Learning code:

http://yann.lecun.com/exdb/mnist
On the website of the “Godfather of ML”, Yann LeCun, MNIST is the “hello world” of deep learning: 70,000 28x28 pixel images of hand-written digits (0 through 9), split into 60,000 training images and 10,000 test images. Each image is labeled with the digit written in it. The “NIST” in “MNIST” stands for the US National Institute of Standards and Technology; the “M” stands for “Modified”.

  • The page lists classification methods by their error rate.

  • A “flashlight” visualization of MNIST in TensorBoard was presented by Dandelion Mané at the TensorFlow Dev Summit in February 2017.

  • The MNIST dataset comes pre-loaded in Keras as a set of four NumPy arrays. The code below loads them as two sets of data, the training set and the test set:

from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
   

The “shape” of the image array gives the number of images, followed by the pixel height and width of each image:

train_images.shape
(60000, 28, 28)
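The same shape is also recorded in the header of the raw IDX files distributed on the MNIST site. The following is a minimal sketch of reading that header with only the Python standard library. The function name `read_idx_shape` and the synthetic `demo_header` are illustrative assumptions, not part of Keras or of the original site; the header layout (two zero bytes, a type code, a dimension count, then big-endian 32-bit sizes) follows the format description on the MNIST page.

```python
import struct

def read_idx_shape(header: bytes) -> tuple:
    """Return the array shape encoded in an IDX file header (hypothetical helper)."""
    # Bytes 0-1 are always zero, byte 2 is the data type code,
    # byte 3 is the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", header[:4])
    if zero != 0:
        raise ValueError("not an IDX header")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    return struct.unpack(">" + "I" * ndim, header[4:4 + 4 * ndim])

# Synthetic header built to look like the MNIST training image file:
# type code 0x08 (unsigned byte), 3 dimensions, 60000 x 28 x 28.
demo_header = struct.pack(">HBBIII", 0, 0x08, 3, 60000, 28, 28)
shape = read_idx_shape(demo_header)
print(shape)
```

With a real download, the header would be the first bytes of the decompressed image file rather than the synthetic bytes used here.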

http://mscoco.org/dataset/#download
COCO is an image recognition, segmentation, and captioning dataset. It has 300,000 images containing multiple objects per image, in 80 object categories.

http://www.robots.ox.ac.uk/~vgg/research/very_deep
VGG Very Deep 19: a ConvNet model with 19 weight layers, pre-trained on ImageNet

http://www.vision.caltech.edu/Image_Datasets/Caltech101
Caltech 101 and Caltech 256 contain pictures of objects belonging to 101 and 256 categories, respectively

http://www.cs.utoronto.ca/~kriz/cifar.html
CIFAR-10 and CIFAR-100 are labeled subsets of the 80 Million Tiny Images dataset (cats, horses, airplanes, etc.)

https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition
Dogs vs. Cats Redux: Kernels Edition is the dataset for Kaggle’s famous Dogs vs. Cats competition

State Farm

https://archive.ics.uci.edu/ml/datasets/Iris
Iris Data

http://konect.uni-koblenz.de
KONECT (the Koblenz Network Collection) from the Institute of Web Science and Technologies at the University of Koblenz–Landau collects large network datasets of all types in order to perform research in network science and related fields.

Words

Google digitized (scanned) books from the 20th century and turned them into n-grams at
https://books.google.com/ngrams/ with counts of how often each word or phrase occurred across those books.
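To make the idea of n-gram counts concrete, here is a minimal sketch that counts word n-grams in a short string. The function `ngram_counts` and the sample sentence are illustrative assumptions; this is not how Google built its corpus, only the same counting idea at toy scale.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive words in the text (hypothetical helper)."""
    words = text.lower().split()
    # zip over n shifted copies of the word list to form sliding windows.
    windows = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)

sample = "the cat sat on the mat and the cat slept"
unigrams = ngram_counts(sample, 1)   # single words
bigrams = ngram_counts(sample, 2)    # pairs of adjacent words

print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

A 1-gram count is a plain word frequency; larger values of n capture phrases.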

WordNet defines affect scores (a mood score) for words.

Data

https://parking.api.smgov.net has Santa Monica parking meter API data, analyzed by Sam Abrahams (http://www.memdump.io/about/), instructor of Metis’s Deep Learning with TensorFlow course.

http://www.makeovermonday.co.uk/data has one visualization makeover dataset every week (52 per year).

IEX (Investors Exchange) has real-time stock exchange data.

archive.ics.uci.edu/ml/datasets.html

data.gov

Amazon Cloud

Azure - Community content is in the Cortana Gallery.

Google Big Data

GitHub

Wikipedia

IMDB

us_budget has the dollar outlays of each bureau within every agency (branch) of the US government, by year from 1962 to 2021

Kaggle

Allen Institute (ai2)

http://allenai.org/data.html

News

http://news.google.com/archivesearch has 200 years of archives

http://www.ibiblio.org/slanews/internet/archives.html

http://www.ibiblio.org/slanews/internet/intarchives.htm has links to global archives

http://searches.rootsweb.ancestry.com/ssdi.html RootsWeb

http://search.ancestry.com/search/db.aspx?dbid=3693 The US Social Security Death Master File index covers 1935 to 2014

http://www.worldcat.org/default.jsp “lets you search the collections of libraries in your community and thousands more around the world.”

Geography

Street Names

Zip codes by state, latitude, longitude

Weather

Music

Pandora music

Spotify’s API was used to identify the saddest Radiohead song.

Domains

First names registered in each state, by year, in the US from Google Big Data

Musicbase from a game

Using data

  1. Cleaning
  2. Transformation
  3. Reduction (generalize synonyms)
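The three steps above can be sketched on a toy list of text values. Everything in this example is an assumption made for illustration: the function names, the `SYNONYMS` map, and the sample records do not come from any particular dataset or library.

```python
# Hypothetical synonym map used for the reduction step.
SYNONYMS = {"automobile": "car", "auto": "car", "kitty": "cat"}

def clean(records):
    """Cleaning: strip whitespace and drop missing or empty values."""
    return [r.strip() for r in records if r is not None and r.strip()]

def transform(records):
    """Transformation: put every value into one consistent form (lowercase)."""
    return [r.lower() for r in records]

def reduce_synonyms(records, synonyms=SYNONYMS):
    """Reduction: generalize synonyms to a single canonical term."""
    return [synonyms.get(r, r) for r in records]

raw = ["  Car ", "automobile", None, "", "Kitty", "cat", "AUTO"]
result = reduce_synonyms(transform(clean(raw)))
print(result)
```

Running the steps in this order matters here: the synonym lookup only matches after values have been trimmed and lowercased.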