Commit f8e51c1e by Danyel Fisher

added ipython notebooks for IMDB example, and apologies

parent bcb7dc47
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"In this notebook, we'll load the files from last time into a pandas DataFrame, then use Altair (a Python API for Vega-Lite) to explore a little.\n",
"\n",
"To install altair:\n",
"\n",
" $ conda install altair --channel conda-forge\n",
"\n",
"4/26: for some reason, this is really incredibly slow. And there's a LOT of crappy movies here no one cares about\n",
"Worth noting: there are 279304 movies here. We could probably drop one or two without high cost"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from altair import *\n",
"from math import *\n",
"import pandas as pd\n",
"import os"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"file = r\"F:\\danyelf\\OneDrive - Microsoft\\Projects\\2016 Making Sense of Data\\datasets\\cleanratings.list\"\n",
"data = pd.read_csv( file , sep='\\t', encoding = 'Windows-1252')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Looking at the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"Chart(data).mark_bar().encode(\n",
" x = X('raters', bin = Bin( maxbins = 20)),\n",
" y = 'count(*):Q',\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like almost every movie has a very small number of raters. What does the real curve look like? Can we bin on a log scale?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data['lograters'] = data['raters'].map( lambda x : log(x))\n",
"Chart(data).mark_bar().encode(\n",
" x = X('lograters', bin = Bin( maxbins = 20)),\n",
" y = 'count(*):Q',\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"medianraters = data['raters'].median()\n",
"topquarter = data['raters'].quantile ( q= 0.75)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"Chart(data).mark_bar().encode(\n",
" x = X('score', bin = Bin( maxbins = 100)),\n",
" y = 'count(*):Q',\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sanity check: is there a correlation between score and number of raters? That would suggest a bias in the ratings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"Chart(data).mark_circle().encode( \n",
" x = X('score', bin = Bin(maxbins = 10)),\n",
" y = Y('lograters', bin = Bin(maxbins = 10)),\n",
" size = 'count(*):Q'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# keep only the movies whose rater count is above the 75th percentile\n",
"smalldata = data[data['raters'] > topquarter]"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
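The notebook above cannot be rerun without the original ratings file, so here is a minimal sketch of its numeric steps (log of rater counts, median and 75th-percentile cutoffs, and the score-vs-raters sanity check) in plain Python. The `rows` values are made up for illustration and are not IMDB data. The names `rows`, `pearson`, and `popular` are hypothetical. Filtering on rater count is an assumption about what the top-quartile cutoff is meant for.

```python
from math import log, sqrt
from statistics import median, quantiles

# Made-up (score, raters) pairs standing in for the ratings file; NOT real IMDB data.
rows = [
    (6.1, 5), (7.4, 12), (5.0, 8), (8.2, 950),
    (7.9, 20000), (4.3, 6), (6.8, 150), (8.8, 410000),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


scores = [score for score, _ in rows]
raters = [count for _, count in rows]

# Natural log of the rater counts; assumes every movie has at least one rater,
# since log(0) raises ValueError.
lograters = [log(count) for count in raters]

median_raters = median(raters)
# Third quartile cut point, using linear interpolation between data points.
top_quarter = quantiles(raters, n=4, method="inclusive")[2]

# Keep only the movies with more raters than the 75th-percentile cutoff.
popular = [(score, count) for score, count in rows if count > top_quarter]

# Sanity check: do scores rise with (log) rater count?
r_value = pearson(scores, lograters)
print(median_raters, top_quarter, len(popular), round(r_value, 3))
```

On the real data the same steps would run on the `raters` and `score` columns of the DataFrame. A coefficient far from zero would support the notebook's worry that rater count and score are not independent.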
The IPython notebooks in this section are based on the version of the IMDB data that was available by FTP as of mid-July 2017.
Since then, IMDB has changed to distributing this data via Amazon Web Services. I have not yet had an opportunity to update the notebooks. IMDB's license does not allow for redistribution of their data. My apologies for the rotten bits. -DAF
http://www.imdb.com/interfaces/
https://s3-ap-southeast-2.amazonaws.com/scico-labs/docs/lab-jupyter-aws.pdf
@@ -18,7 +18,7 @@
 { "name": "static", "value": true,
 "bind": {"input": "checkbox"} },
 { "description": "throttles the number of nodes rendered; 77 covers 'all'",
-"name": "numNodes", "value": 77 },
+"name": "numNodes", "value": 200 },
 {
 "description": "State variable for active node fix status.",
 "name": "fix", "value": 0,