
Yelp Sentiment Notebook Demo

yelp.ipynb

We're trying to create a neural network that detects sentiment: given any of the 5 million Yelp reviews, predict whether it's positive or negative. Reviews are made of words, so maybe we can look at the set of words in a review and make a guess from that?

Well, let's see. How frequently does each word occur? Let's grab a sample of 100k positive and 100k negative reviews, 200k total.
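Counting is straightforward. Here's a minimal sketch, assuming the samples have already been loaded into two lists of review strings (`positive_reviews` and `negative_reviews` are hypothetical names, not from the notebook):

```python
from collections import Counter

positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

# Tally how often each word appears in each class of review.
for review in positive_reviews:   # 100k positive samples
    for word in review.split(' '):
        positive_counts[word] += 1
        total_counts[word] += 1

for review in negative_reviews:   # 100k negative samples
    for word in review.split(' '):
        negative_counts[word] += 1
        total_counts[word] += 1
```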

Lots of noise

The chart on the left shows that most words across the 200,000 reviews are indistinguishable en masse, which is why naive approaches can take a long time to converge. One fix is to look only at words that occur more frequently. When we cut off words that occur fewer than 50 times, we get the chart on the right. We see the beginning of a distribution, but it's weak. We have to clean up the signal for our network using a hypothesis of what's important.
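With the counters above, the cutoff is a one-liner (continuing the sketch; the threshold of 50 is the one from the text):

```python
# Keep only words that appear at least 50 times across all 200k reviews.
frequent_words = [word for word, count in total_counts.items() if count >= 50]
```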

Signal

So we calculate a ratio: for each word, how often does it appear in positive vs. negative reviews? A value near 0 means it appears almost only in negative reviews, 1.0 means an even split between positive and negative, and greater than 1.0 means more positive than negative. Neural networks like values between -1 and 1, centered at 0, so take the log! log(1) = 0, the log of a ratio below 1 is negative, and the log of even a large ratio stays small. Then let's cut out the words that don't help much: the ones whose log-ratio sits around 0. Voila. A signal centered at 0, on the right! Now let's train a simple network on that.
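Here's roughly what that looks like in code, continuing the sketches above (the +1 smoothing and the cutoff value are assumptions, not from the notebook):

```python
import math

# Log of the positive-to-negative ratio for each frequent word. The +1 on
# both counts is smoothing to avoid division by zero and log(0).
pos_neg_ratios = {
    word: math.log((positive_counts[word] + 1.0) / (negative_counts[word] + 1.0))
    for word in frequent_words
}

# Cut out words whose log-ratio sits near 0: they carry little sentiment.
polarity_cutoff = 0.5  # hypothetical threshold
sentiment_words = {
    word: ratio for word, ratio in pos_neg_ratios.items()
    if abs(ratio) >= polarity_cutoff
}
```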

Results

The network converges in a few minutes with 94% accuracy. The weights on the input nodes form a vector representation of semantics, so vectors that are close together represent similar words. Here we see words that are somewhat similar, with some noise, as we project from the 64-dimensional space down to 2 dimensions. Neat!
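For concreteness, here is a minimal sketch of such a network and the 2-D projection, assuming a Keras dense model and t-SNE (the notebook's actual architecture and projection method may differ):

```python
from sklearn.manifold import TSNE
from tensorflow import keras

vocab_size = len(sentiment_words)  # words kept after the log-ratio cutoff

# A single 64-unit hidden layer over bag-of-words inputs: its input weight
# matrix doubles as the 64-dimensional word vectors discussed above.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(vocab_size,)),
    keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# ... after model.fit(...) on the review vectors ...
word_vectors = model.layers[0].get_weights()[0]   # shape: (vocab_size, 64)

# Project the learned vectors down to 2 dimensions; nearby points should be
# semantically similar words.
coords_2d = TSNE(n_components=2).fit_transform(word_vectors)
```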