2013-05-10-Real-World


1 Datamining IRL    slide

2 General vs. Practical    slide

  • "the language in which you'll spend most of your working life hasn't been invented yet, so we can't teach it to you. Instead we have to give you the skills you need to learn new languages as they appear."
  • Brian Harvey, Why SICP Matters

2.1 Important Parts    notes

  • Important: understanding your domain, asking interesting questions, answering them with data, fitting questions to mathematical concepts
  • Not as important: scipy linregress, d3

3 But    slide

  • Who is doing interviews?
  • Who is starting their own company?
  • Who wants to build their portfolio?

3.1 Real World    notes

  • I realize many of you may be needing to apply this stuff very soon
  • So here's the lecture where I try to tell you what I would do when practicing data mining

4 Product or Company Ideas    slide

  • Understand exponential growth
  • Get the timing right
  • Execute, execute, execute

4.1 Idea required but not sufficient    notes

  • Wide variation of thoughts on ideas
  • One of the biggest roadblocks for wannabe entrepreneurs, but also the most derided
  • Timing is mostly luck: is the world ready for your idea?
  • I think the vision is important, it's what drives the company, keeps people working
  • The plan for the idea less so
  • The concrete product most important
  • Let's talk about these elements

5 Toys    slide

  • "the next big thing always starts out being dismissed as a “toy.”"
  • Chris Dixon, Blog

img/cdixon.jpg

5.1 Why?    notes

5.2 Exponential Improvements    slide two_col

  • The future is changing at a faster rate than ever before
  • Every field is being quantified, instrumented
  • "You're not analyzing your data?"

img/cost_per_megabase.jpg

5.2.1 Change to Data    notes

  • You think your parents are bad at operating devices? Your habits will become out of date more than twice as fast.
  • So what happens when we combine exponential improvements and instrumented fields? Overwhelming amounts of data
  • Winners will be the ones
  • Obviously web recommendations: data, but also

5.3 Exercise    slide two_col

img/strava.jpg

img/strava-ride.png

5.3.1 Strava    notes

  • No device, just data

5.4 Thermostats    slide center

img/nest.jpg

5.4.1 Nest    notes

  • OK, they have a device… but what separates them is data

5.5 Biology    slide center

img/cost_per_megabase.jpg

5.5.1 Gene sequencing    notes

6 Execute    slide

  • Users are hiring you to do a job: what is it?
  • "Institutions will try to preserve the problem to which they are the solution." – Clay Shirky
  • Make your product so easy to use, people do it by accident.

6.1 Do the job    notes

  • All of these company examples, you're typically not thinking of them as "data processors"… they are solving a specific problem for you
  • Strava isn't doing any crazy SVM analytics (at least on the consumer facing side): they're showing you min/max, avg speed. Simple, but effective, stuff.
  • Disruption most often comes from using established technologies in new ways or areas
  • Can disrupt with radically simplified, often crappy at first, solutions to an even more fundamental problem
  • Dell did a great job selling cheap computers, then more expensive computers
  • But now Amazon is saying: "you don't even need to own computers!" (Cloud)
  • More info: Clayton Christensen
  • Focus on that one thing that is important and do it very, very well
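
To make the "simple, but effective" point concrete, here is a minimal sketch of the kind of summary a ride tracker could show. The function name, the sample values, and the assumption that speed samples are evenly spaced in time are all hypothetical; this is not Strava's implementation.

```python
from statistics import mean


def summarize_speeds(speeds_kmh):
    """Return min, max, and average speed for one ride."""
    if not speeds_kmh:
        raise ValueError("need at least one speed sample")
    return {
        "min": min(speeds_kmh),
        "max": max(speeds_kmh),
        # A plain mean only equals the ride's average speed when the
        # samples are evenly spaced in time (assumed here).
        "avg": mean(speeds_kmh),
    }


# Hypothetical speed samples (km/h) from a single ride.
ride = [18.0, 22.5, 30.0, 25.5, 24.0]
summary = summarize_speeds(ride)
print(summary)
```

Nothing here needs a model: three aggregates already answer "how did my ride go?"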

7 Specifics (The Joel Test)    slide

  • Do you use source control?
  • Can you make a build in one step?
  • Do you make daily builds?
  • Do you have a bug database?
  • Do you fix bugs before writing new code?
  • Do you have an up-to-date schedule?
  • Do you have a spec?
  • Do programmers have quiet working conditions?
  • Do you use the best tools money can buy?
  • Do you have testers?
  • Do new candidates write code during their interview?
  • Do you do hallway usability testing?

7.1 Joel on Software    notes

  • When developing software, please follow as many of these as reasonable
  • Joel Spolsky wrote this in 2000! Still a great guide!
  • This is what I'd suggest to quickly get moving on the right foot
  • If you're managing a team, make sure these are happening

7.2 Source Control    slide center

img/git.png

7.2.1 Surprised?    notes

  • Github will solve a few problems on this list, just use it, even if you're developing alone

7.3 One step build    slide

  • Data mining exploration often involves manual commands
  • Don't do that in production
  • Should have scripts which extract features, build model, verify, deploy
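
A one-step build for a data product might look like the sketch below: a single entry point that runs extract, build, verify, and deploy in order. Every name here is hypothetical, the "model" is a stand-in threshold, and "deploy" only writes a file; a real pipeline would swap in its own steps and verify on held-out data.

```python
"""One-step build sketch: extract features, build a model, verify, deploy."""
import json
import tempfile
from pathlib import Path


def extract_features(raw_rows):
    # Turn raw records into (feature, label) pairs.
    return [(float(r["value"]), int(r["label"])) for r in raw_rows]


def build_model(examples):
    # Stand-in "model": a threshold halfway between the two class means.
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return {"threshold": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2}


def verify(model, examples, min_accuracy=0.9):
    # Refuse to deploy a model that misses the accuracy bar.
    # For brevity this checks the training examples; use held-out data in practice.
    correct = sum((x > model["threshold"]) == bool(y) for x, y in examples)
    accuracy = correct / len(examples)
    if accuracy < min_accuracy:
        raise RuntimeError(f"accuracy {accuracy:.2f} below {min_accuracy}")
    return accuracy


def deploy(model, target_dir):
    # "Deploy" here just writes the model file where a server could load it.
    path = Path(target_dir) / "model.json"
    path.write_text(json.dumps(model))
    return path


def build(raw_rows, target_dir):
    # The single entry point: every step runs in order, no manual commands.
    examples = extract_features(raw_rows)
    model = build_model(examples)
    accuracy = verify(model, examples)
    return deploy(model, target_dir), accuracy


if __name__ == "__main__":
    rows = [
        {"value": "1.0", "label": "0"},
        {"value": "2.0", "label": "0"},
        {"value": "8.0", "label": "1"},
        {"value": "9.0", "label": "1"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path, accuracy = build(rows, tmp)
        print(path.name, accuracy)
```

The point of the design is that `verify` sits between building and deploying, so a bad model stops the build instead of reaching production.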

7.3.1 Area for Improvement    notes

  • This is actually a big area for new solutions
  • Deploying websites has solutions like Heroku, but there is no equivalent for storing, processing, serving data

7.4 Bug Database    slide

  • Easy to lose track of problems
  • Also good way to prioritize issues
  • Use Github Issues

7.4.1 Managing Up    notes

  • Good defense

7.5 Write a Spec    slide

  • Alternatively, write the press release
  • Don't write a novel
  • Disagreements can be solved with code, but after talking

7.5.1 Bad rap    notes

  • Developers don't like writing them much
  • But it helps nail down issues
  • Yelp uses CEP process
  • If you get to the "agree to disagree" point, data or code can solve differences

7.6 Testers    slide

  • Use unit tests to test code (e.g., unittest2 in Python)
  • Use cross-validation to test models
  • Very easy to skip, will bite you within 6 months
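
Both kinds of testing can be sketched in a few lines. The example below hand-rolls k-fold cross-validation and unit tests the fold logic with the standard library's `unittest` module (unittest2 is a backport of the same API to older Pythons). The function names and the `fit`/`predict` callables are assumptions for illustration, not any particular library's interface.

```python
import unittest


def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k folds; assumes n_items >= k."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validate(fit, predict, xs, ys, k=5):
    """Average accuracy of a model over k held-out folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


class KFoldTest(unittest.TestCase):
    def test_every_item_is_tested_exactly_once(self):
        tested = [i for _, test in k_fold_indices(10, 3) for i in test]
        self.assertEqual(sorted(tested), list(range(10)))

    def test_train_and_test_do_not_overlap(self):
        for train, test in k_fold_indices(10, 3):
            self.assertFalse(set(train) & set(test))
```

Saved as a module, the tests run with `python -m unittest`. The unit tests guard the code; `cross_validate` guards the model, since each score comes from data the model did not see during fitting.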

7.7 Differences    notes

  • Joel suggests having and paying testers
  • I don't think this is best use of resources for small companies
  • Economics change when developers can effectively write tests
  • Must allocate time to this
  • Add tests when you fix bugs
  • Helps if developers use product daily
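
"Add tests when you fix bugs" can be as small as the sketch below. The function and the bug (a crash on zero-duration rides) are invented for illustration; the pattern is what matters: the fix and a test that pins it land together.

```python
import unittest


def average_speed(distance_km, hours):
    """Average speed in km/h; a zero-duration ride has speed 0."""
    if hours == 0:
        # Bug fix (hypothetical): this used to raise ZeroDivisionError.
        return 0.0
    return distance_km / hours


class AverageSpeedRegressionTest(unittest.TestCase):
    def test_zero_duration_ride_does_not_crash(self):
        # Pins the fixed bug so it cannot silently return.
        self.assertEqual(average_speed(0.0, 0), 0.0)

    def test_normal_ride(self):
        self.assertEqual(average_speed(30.0, 1.5), 20.0)
```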

7.8 Tools    slide

  • Right tool for the job
  • Text Editor: Use vim or emacs
  • virtualenv (Python); RVM (Ruby)
  • Learn the command line

7.8.1 Woodworker    notes

  • (slightly off topic from Joel's list)
  • Woodworkers don't hammer stuff in with their shoe
  • Make their own tools as first part of job
  • When a custom problem comes up, make a custom tool
  • These slides, written with mappings in vim
  • Text Editor
    • Syntax Highlighting
    • Macros
    • Interact with other tools
    • Find across files

8 How to Use Recommendations    slide two_col

  • Start with them as default
  • If you understand why something is better for your case, use it
  • Understand trade-offs

img/grain-of-salt.jpg

8.1 Trade-offs    notes

  • One of the themes of this course
  • Trying to provide you with a starting point
  • My point of view: user driven behavior, engineers implementing solutions

9 Data Storage    slide

  • S3 for unstructured data
  • PostgreSQL for structured
  • Hive on S3 for very large structured data

9.1 Data most important asset    notes

  • S3 is a pay-as-you go model, opens up many data processing possibilities
  • Don't have to worry about how to connect
  • PostgreSQL solid database, but also offers many improvements like storing geo data
  • Once you get beyond PostgreSQL limits, use Hive to structure data in S3

10 Exploration    slide two_col

  • Python
  • IPython Notebook, matplotlib

img/ipython-notebook.jpg

10.1 Py    notes

  • Main reason: it is convenient and practical to stay in the same language as production
  • Using production libraries, settings, to extract data
  • R, MATLAB/Octave, Tableau are typically not used in large production code
  • SAS also effective for exploration, can be used in production, but skill set not as transferable for smaller companies

11 Public Visualizations    slide

  • D3 for visualizations
  • HTML is sharable, universal
  • (Adventurous: Vega)

11.1 Visualization    notes

  • Vega more directly maps to grammar of graphics, but is very new library

12 Processing    slide two_col

  • Hadoop + mrjob
  • Elastic MapReduce
  • (Adventurous: Spark)

img/hadoop.png

12.1 Scaling    notes

  • Hadoop scales up and down fairly well, especially with mrjob
  • Constraints are going to be on your time, not necessary to eke out every bit of computing power
  • Spark is a new model out of Berkeley that does a better job of keeping data in memory, but doesn't have the maturity of Hadoop
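
For anyone new to the model, here is a plain-Python sketch of the map, shuffle, and reduce steps, using word count as the job. It does not use Hadoop or mrjob; `run_local` is a hypothetical stand-in for the grouping that the framework does between the two phases. Jobs written for mrjob are organized around mapper and reducer steps of a similar shape.

```python
from collections import defaultdict


def mapper(line):
    # Emit (word, 1) for every word on the line.
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    # Combine all counts emitted for one word.
    yield word, sum(counts)


def run_local(lines):
    # Local stand-in for the shuffle step between map and reduce:
    # group every emitted value under its key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


counts = run_local(["data beats opinion", "Data wins"])
print(counts)
```

Because the mapper sees one line at a time and the reducer sees one key at a time, the same two functions can run on a laptop or be spread across many machines.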

13 Models    slide

  • Text: Naive Bayes
  • Numeric Classification: SVMlight
  • General: sklearn/RandomForestClassifier

13.1 Even then    notes

  • Start with simple stats to understand your data
  • Next: use heuristics, they are easy to understand and change
  • Next: use third party models that you can drop in
  • Often heuristics with understanding of false positive/negative costs will get you far
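
One way to combine a heuristic with false positive/negative costs is to pick the score threshold that minimizes total cost on labeled examples. The sketch below assumes you already have a heuristic that outputs a score; the data, the costs, and the function names are hypothetical.

```python
def total_cost(threshold, scored, fp_cost, fn_cost):
    """Cost of flagging every item whose score is >= threshold."""
    cost = 0.0
    for score, is_positive in scored:
        flagged = score >= threshold
        if flagged and not is_positive:
            cost += fp_cost  # flagged something harmless
        elif not flagged and is_positive:
            cost += fn_cost  # missed a real positive
    return cost


def best_threshold(scored, fp_cost, fn_cost):
    """Pick the observed score with the lowest total cost as the threshold."""
    candidates = sorted({score for score, _ in scored})
    return min(candidates, key=lambda t: total_cost(t, scored, fp_cost, fn_cost))


# Hypothetical (score, true label) pairs, e.g. output of a spam heuristic.
scored = [(0.1, False), (0.4, False), (0.5, True), (0.6, False), (0.9, True)]

misses_are_expensive = best_threshold(scored, fp_cost=1, fn_cost=10)
false_alarms_are_expensive = best_threshold(scored, fp_cost=10, fn_cost=1)
print(misses_are_expensive, false_alarms_are_expensive)
```

The same heuristic ends up with a lower threshold when misses are expensive and a higher one when false alarms are expensive, which is the trade-off to be able to explain.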

14 Practice    slide

14.1 Other services    notes

  • Dataset challenge is open ended, so it lets you practice all elements
  • Kaggle has many great competitions
  • Collective Intelligence has many good examples
  • Keep in mind trade-offs: that's what interviewers will ask

15 Work    slide

15.1 Topic Change    notes

  • Jumping topics a bit, what if you'd like to work at a web company instead of build one?

16 Hiring    slide two_col

  • Learn about the company
  • Ask questions to learn about their problems
  • Provide solutions

img/briefcase.jpg

16.1 Experience    notes

  • Use experience to answer questions
  • Make sure you continue asking questions in the interview
  • Ramit Sethi calls this the Briefcase Technique
  • Know what's on your resume (Why is it applicable? Why is it interesting?)
  • Think of the "interview" as a conversation, what would you say if you met in a coffee shop?

17 Resume    slide

  • Use quantitative data
  • Describe the difference you made in a company/project, not what you did
  • Include your side projects!

17.1 Unique    notes

  • What makes you a unique candidate?
  • Your side projects set you apart. All students here have made a mobile page. How is yours different?

18 Resume is a Formality    slide

  • Be recognized independently of being in the resume pile
  • Present at meetup
  • Use their product in a cool way

18.1 Recognition    notes

  • Catch their attention, then start process
  • Also makes you think "Do I want to work for this company?"
  • Stories

19 Negotiation    slide

  • Always try to have > 2 offers on the table
  • Once a company decides, they've already sunk a lot of resources into you
  • "That would make me comfortable"

19.1 Timing    notes

  • Pace interviews so you can make the decision together

20 Do What it Takes    slide two_col

  • Most essential attribute: asking great questions
  • > 50% of the work will be finding, formatting data
  • Data product must be reliable to be effective
  • Learn about distributed computing, software engineering

img/scrumtshape.jpg

20.1 The Job    notes

  • As Gene said, the thing that can't be taught is to think creatively about all the cool stuff you can do with this data, frame it in a way that is specific, actionable
  • Most jobs require a combination of DM and coding skills
  • Companies don't need just "idea people", need "idea + execution"
  • Don't expect to just put on your DM lab coat and work with Kaggle-style data all day
  • Remember, biggest impact comes from putting together existing technology in a useful way

21 Managing upward    slide

  • Ideal email: "I've done the analysis below and recommend we do X. Sound good?"
  • If no one is in charge, you're in charge
  • Say "yes" but prioritize

21.1 Busy    notes

  • Your boss is busy, you do the work, make sure you're on the right track
  • You shouldn't take on everything, but also shouldn't just start rejecting things.
  • Be a positive person: yes, we can do that after X, Z

22 Engineering Career Paths    slide

Hacker
Very broad, up-to-date. Best suited in very early startups.
Individual Contributor
Reasonably skilled in areas of interest. Best suited in mid-sized to large companies.
Principal Engineer
Company or industry wide recognition for contributions in specific areas. Very strong T-shaped skills.
Manager
Ability and desire to solve people challenges, verify technical solutions.

22.1 Gross Simplification    notes

  • Hacker: just get things done long enough to find a business model
  • IC: majority of engineers, doing solid day-to-day work.
  • Principal: Can include CTO at some companies, "tech leads." Go-to person for leading projects. Must have a history of success.
  • Management: If you like working with people, coaching, growing a team. People are more complex than machines, so are solutions.
  • Big themes: ownership, focus, excellence
  • Joel's Ladder

23 Stay Sharp    slide two_col

  • Long term, expected to combine the best of both:
    • Skills
    • Wisdom
  • So keep building skills

img/stay-sharp.png

23.1 Dig    notes

24 Networking    slide

  • Ask questions
  • Learn from others
  • Help others
  • Don't skip stuff because you're lazy or scared

img/shy-connector.png

24.1 Skipping Stuff    notes

  • There are many good reasons not to go to an event, but being lazy is not one of them
  • Best opportunities are when you do stuff that pushes your boundaries

25 Just Do It    slide

  • Practice
  • Start with any idea
  • Make a website you're proud to show friends
  • Improve it

25.1 Doing is best for learning    notes

  • Employers look for engagement in these areas
  • Almost any area you want to focus on, your website can be your medium

26 Thank You!    slide

Date: 2013-05-10 01:11:00 PDT

Author: Jim Blomo
