Commit bfd7894b authored by Armen Donigian

build v3.

parent 2a09154b
This diff is collapsed.
@@ -369,7 +369,8 @@
 "import requests, json, numpy as np\n",
 "print(\"Model predict for a batch of instances via Python requests POST request...\")\n",
 "headers = {\"Content-type\": \"application/json\"}\n",
-"requests.post(\"http://localhost:1337/xgboost-airlines/predict\", headers=headers, data=json.dumps({\"input_batch\": get_test_points(0,2)})).json()"
+"requests.post(\"http://localhost:1337/xgboost-airlines/predict\", headers=headers, \n",
+"              data=json.dumps({\"input_batch\": get_test_points(0,2)})).json()"
This diff is collapsed.
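The notebook hunk above posts a JSON body with an `input_batch` key to the `/xgboost-airlines/predict` endpoint. A minimal sketch of how that request can be assembled is shown below. The helper name `build_predict_request` and the sample instances are hypothetical (the notebook gets its instances from `get_test_points`, whose output format is not visible in this diff), and the sketch only builds the request so it can be inspected without a running server.

```python
import json


def build_predict_request(instances,
                          host="http://localhost:1337",
                          model="xgboost-airlines"):
    """Return (url, headers, body) for a batch predict call.

    `instances` is assumed to be a JSON-serializable list of feature rows.
    """
    url = f"{host}/{model}/predict"
    headers = {"Content-type": "application/json"}
    body = json.dumps({"input_batch": instances})
    return url, headers, body


# Hypothetical two-row batch; the real rows come from get_test_points(0, 2).
url, headers, body = build_predict_request([[1.0, 2.0], [3.0, 4.0]])
print(url)
print(body)

# With a server listening, the request could then be sent like this:
#   import requests
#   response = requests.post(url, headers=headers, data=body, timeout=10)
#   print(response.json())
```

Keeping the payload construction separate from the network call makes it easy to check the body before sending it.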
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"## Underfitting vs. Overfitting\n",
"The exercise below illustrates the problems of underfitting and overfitting, and shows how linear regression with polynomial features can approximate nonlinear functions. \n",
"The plot shows the function that we want to approximate, which is part of the cosine function. It also displays the samples drawn from the true function and the approximations produced by the different models. \n",
"**Determine three degrees** which lead to **underfitted**, **correctly fitted** and **overfitted** models for the given data.\n",
"**Hint 1**: Is a linear function (polynomial with degree 1) sufficient to fit the training samples?\n",
"**Hint 2**: Does a polynomial of degree 3 approximate the true function correctly? Can you think of a better one?\n",
"**Hint 3**: Which degrees higher than 3 would best show a model that overfits the training data, i.e. one that learns the noise in the training data?\n",
"We evaluate **overfitting** / **underfitting** quantitatively by using cross-validation. We calculate the mean squared error (MSE) on the validation set: the higher the MSE, the less likely the model generalizes correctly from the training data."
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# fill in 3 values, first to show underfitting, second to show correctly fitted, third to show overfitting\n",
"degrees = [?, ?, ?]"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"def true_fun(X):\n",
"    return np.cos(1.5 * np.pi * X)\n",
"n_samples = 30\n",
"X = np.sort(np.random.rand(n_samples))\n",
"y = true_fun(X) + np.random.randn(n_samples) * 0.1\n",
"plt.figure(figsize=(14, 5))\n",
"for i in range(len(degrees)):\n",
"    ax = plt.subplot(1, len(degrees), i + 1)\n",
"    plt.setp(ax, xticks=(), yticks=())\n",
"    polynomial_features = PolynomialFeatures(degree=degrees[i],\n",
"                                             include_bias=False)\n",
"    linear_regression = LinearRegression()\n",
"    pipeline = Pipeline([(\"polynomial_features\", polynomial_features),\n",
"                         (\"linear_regression\", linear_regression)])\n",
"    pipeline.fit(X[:, np.newaxis], y)\n",
"    # Evaluate the models using cross-validation\n",
"    scores = cross_val_score(pipeline, X[:, np.newaxis], y,\n",
"                             scoring=\"neg_mean_squared_error\", cv=10)\n",
"    X_test = np.linspace(0, 1, 100)\n",
"    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label=\"Model\")\n",
"    plt.plot(X_test, true_fun(X_test), label=\"True function\")\n",
"    plt.scatter(X, y, edgecolor='b', s=20, label=\"Samples\")\n",
"    plt.xlabel(\"x\")\n",
"    plt.ylabel(\"y\")\n",
"    plt.xlim((0, 1))\n",
"    plt.ylim((-2, 2))\n",
"    plt.legend(loc=\"best\")\n",
"    plt.title(\"Degree {}\\nMSE = {:.2e}(+/- {:.2e})\".format(\n",
"        degrees[i], -scores.mean(), scores.std()))\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"For additional ways to diagnose underfitting and overfitting, look at plotting a [validation curve]( and a [learning curve]("
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"nbformat": 4,
"nbformat_minor": 2
This diff is collapsed.
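The notebook above scores each polynomial degree by its cross-validated MSE (scikit-learn reports it negated as `neg_mean_squared_error`, which is why the title uses `-scores.mean()`). The sketch below shows the same idea using only the standard library, so the fold logic is visible. It is a stand-in under stated assumptions, not the notebook's pipeline: the fit uses plain normal equations (only reasonable for low degrees), the folds are contiguous and unshuffled, and the helper names `solve`, `fit_poly`, `predict` and `cv_mse` are hypothetical.

```python
import math
import random


def true_fun(x):
    return math.cos(1.5 * math.pi * x)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    k = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(a, b)


def predict(coefs, x):
    return sum(c * x ** i for i, c in enumerate(coefs))


def cv_mse(xs, ys, degree, n_folds=10):
    """Mean validation MSE over contiguous, unshuffled folds."""
    n = len(xs)
    fold_mses = []
    for f in range(n_folds):
        lo, hi = f * n // n_folds, (f + 1) * n // n_folds
        train_x, train_y = xs[:lo] + xs[hi:], ys[:lo] + ys[hi:]
        coefs = fit_poly(train_x, train_y, degree)
        errs = [(predict(coefs, x) - y) ** 2
                for x, y in zip(xs[lo:hi], ys[lo:hi])]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / n_folds


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = sorted(rng.random() for _ in range(30))
ys = [true_fun(x) + rng.gauss(0, 0.1) for x in xs]

results = {d: cv_mse(xs, ys, d) for d in (1, 2, 3, 4, 5, 6)}
for d, mse in results.items():
    print(f"degree {d}: CV MSE = {mse:.3e}")
```

Sweeping a range of degrees and comparing the validation MSE is one way to narrow down the three values the exercise asks for. Higher degrees are better explored with the scikit-learn pipeline, which is numerically more robust than normal equations.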
#!/usr/bin/env python3
import argparse
import csv
from sys import stdin, stdout, stderr, exit
import itertools
def main():
    parser = argparse.ArgumentParser(
        epilog="""If both --classes and --auto-relabel are omitted,
        label values are left as-is. By default, features with value 0 are not
        printed. This can be overridden with --null""",
        usage="""%(prog)s [OPTION]... [FILE]

        Convert CSV to Vowpal Wabbit input format.

        # Leave label values as is:
        $ csv2vw spam.csv --label target

        # Relabel values 'ham' to 0 and 'spam' to 1:
        $ csv2vw spam.csv --label target --classes ham,spam

        # Relabel values 'ham' to -1 and 'spam' to +1 (needed for logistic loss):
        $ csv2vw spam.csv --label target --classes ham,spam --minus-plus-one

        # Relabel first label value to 0, second to 1, and ignore the rest:
        $ csv2vw iris.csv -lspecies --auto-relabel --ignore-extra-classes

        # Relabel first label value to 1, second to 2, and so on:
        $ <iris.csv csv2vw -lspecies --multiclass --auto-relabel

        # Relabel 'versicolor' to 1, 'virginica' to 2, and 'setosa' to 3
        $ <iris.csv csv2vw -lspecies --multiclass -cversicolor,virginica,setosa""")

    parser.add_argument("file", nargs="?", type=argparse.FileType("r"),
                        help="""Input CSV file. If omitted,
                        read from standard input.""",
                        default=stdin)
    parser.add_argument("-d", "--delimiter",
                        help="""Delimiting character of the input CSV file
                        (default: ,).""",
                        default=",")
    parser.add_argument("-l", "--label",
                        help="""Name of column that contains the class
                        labels.""")
    parser.add_argument("-c", "--classes",
                        help="""Ordered, comma-separated list of possible
                        class labels to relabel. If not specifying all possible
                        class labels, use --auto-relabel.""")
    parser.add_argument("-n", "--null",
                        help="""Comma-separated list of null values (default:
                        '0').""",
                        nargs="?", default="0")
    parser.add_argument("-a", "--auto-relabel",
                        help="""Automatically relabel class labels in the order
                        in which they appear in the CSV file.""",
                        action="store_true")
    parser.add_argument("-m", "--multiclass",
                        help="""Indicates more than two classes; will start
                        counting at 1 instead of 0.""",
                        action="store_true")
    parser.add_argument("-+", "--minus-plus-one",
                        help="""Instead of relabeling to integers, relabel to
                        '-1' and '+1'. Needed when using VW with logistic or
                        hinge loss.""", action="store_true")
    parser.add_argument("-i", "--ignore-extra-classes",
                        help="""If there are more than two classes found, when
                        not using --multiclass, include the example with no
                        label instead of skipping it.""",
                        action="store_true")
    parser.add_argument("-t", "--tag",
                        help="""Name of column that contains the tags.""")
    args = parser.parse_args()

    auto_relabel = args.auto_relabel
    label_column = args.label
    tag_column = args.tag
    null_values = args.null.split(",")
    multiclass = args.multiclass
    minus_plus_one = args.minus_plus_one

    # Pick the sequence of replacement labels.
    if minus_plus_one:
        new_classes = iter(["-1", "+1"])
    elif multiclass:
        new_classes = (str(i) for i in itertools.count(1))
    elif args.classes or auto_relabel:
        new_classes = iter(["0", "1"])
    else:
        new_classes = None

    # Map the explicitly listed classes to the new labels, in order.
    if args.classes:
        old_classes = args.classes.split(",")
        relabel = dict(zip(old_classes, new_classes))
    else:
        relabel = dict()

    try:
        reader = csv.DictReader(args.file, delimiter=args.delimiter)
        for row in reader:
            label = row.pop(label_column, "")
            tag = row.pop(tag_column, "")
            if auto_relabel or new_classes:
                if auto_relabel:
                    if label not in relabel:
                        try:
                            relabel[label] = next(new_classes)
                        except StopIteration:
                            if args.ignore_extra_classes:
                                relabel[label] = ""
                            else:
                                stderr.write("Found too many different classes;"
                                             " skipping example. Use "
                                             "--multiclass or "
                                             "--ignore-extra-classes.\n")
                                continue
                label = relabel[label]
            # Emit the remaining columns as name:value features.
            features = " ".join([k + ":" + v for k, v in sorted(row.items())
                                 if v not in null_values])
            line = label + " " + tag + "| " + features + "\n"
            stdout.write(line)
            stdout.flush()
    except (IOError, KeyboardInterrupt, BrokenPipeError):
        # Handler body is not visible in this diff; assumed here: exit
        # quietly when the output pipe closes or the user interrupts.
        stderr.close()


if __name__ == "__main__":
    exit(main())
This diff is collapsed.
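The core of the `csv2vw` script above is the step that turns one CSV row into a Vowpal Wabbit line of the form `label | name:value ...`, dropping features whose value is in the null list. A minimal sketch of that step, isolated from argument parsing, is given below. The function name `to_vw_line` and the sample data are hypothetical, and tags are left out for brevity.

```python
import csv
import io


def to_vw_line(row, label_column, null_values=("0",), relabel=None):
    """Format one CSV row (a dict of strings) as a VW input line."""
    row = dict(row)                       # do not mutate the caller's row
    label = row.pop(label_column, "")
    if relabel is not None:
        label = relabel[label]            # map the original label to its new value
    features = " ".join(f"{k}:{v}" for k, v in sorted(row.items())
                        if v not in null_values)
    return f"{label} | {features}"


# Hypothetical two-row sample in the spirit of the iris examples above.
data = "sepal_length,sepal_width,species\n5.1,0,setosa\n7.0,3.2,versicolor\n"
relabel = {"setosa": "1", "versicolor": "2"}

lines = [to_vw_line(r, "species", relabel=relabel)
         for r in csv.DictReader(io.StringIO(data))]
for line in lines:
    print(line)
# 1 | sepal_length:5.1
# 2 | sepal_length:7.0 sepal_width:3.2
```

Note that the first row loses `sepal_width` because its value `0` is treated as null, which matches the script's default behaviour.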
cmd: python ../src/models/
deps:
- md5: 7c91a8ec4c1f13dd7f4e34511bf1392c
  path: ../src/models/
- md5: a7a55040700152425ebf759d80c5bc9d
  path: ../data/external/allyears2k.csv
md5: 6e48d56b20a6f87dedc0e5e236c6e2cf
outs:
- cache: true
  md5: bd86fd31ede4d494851fab78e8f340a0.dir
  path: ../data/processed