
Search

Decentralization

  • Democratic
  • Resilient
  • Innovative
  • But where is everything?

Trade-offs

  • Put anything online without asking anyone…
  • But how does anyone discover it?

Hyperlinks

  • Link from one document to another
  • But where to start?


What's New?

  • Internet envisioned as moving through hyperlinks
  • A single web page that listed all the new websites
  • "Chess enthusiasts will be interested in the Internet Chess Library"

Yahoo!

  • "Jerry's guide to the world wide web"
  • Collection of hierarchical links
  • Added search, though not as a core competency


History

  • Started as Tim Berners-Lee imagined, just a series of hyperlinks
  • But quickly clear that search was helpful
  • But not clear it was a differentiator

Yahoo Search

  • 2000 - Used Google for search
  • 2002 - Inktomi purchased
  • 2003 - Overture (AltaVista) purchased
  • 2004 - Dropped Google, used in-house engine

Took time

  • Yahoo, like many others, didn't see Search as core to the product
  • Search as a business was non-obvious: how to make money, how to differentiate?

Search as Core Competency

  • 1993 - JumpStation
  • 1995 - AltaVista, Excite
  • 1996 - Inktomi (HotBot)
  • 1998 - Google, MSN Search (Bing)
  • 2008 - DuckDuckGo
  • 2010 - Blekko
  • 2013 - YaCy

Jump Back

  • JumpStation was a university project, but did not get funding
  • A couple of years later, many search engines sprang up, many out of universities
  • A lot of activity between 1998 and 2008, but mostly false starts and acquisitions, dot-com bust
  • DuckDuckGo: "Better search and no tracking"
  • Blekko: "Spam free search"
  • YaCy: p2p search

Essential Features

  • Automated loading and processing of web pages
  • Bulk association of words with web pages
  • "Run-time" association of query with web pages

Details

  • Web crawler, bot, spider: program that downloads web pages, follows links, downloads web pages, follows links…
  • Indexing: URL -> words => indexer => word -> URLs
  • Searching: query -> most relevant pages
  • Vocab: I may say pages or documents

Crawling

  • Conceptually easy: start with "seed" pages, download all their links
    • Download all those links
      • Download all those links

Seed Pages


Starting point

  • CERN, Mozilla home pages
  • trusted
  • have lots of links

Download


curl

  • using a program, download HTML of all seed links

Analyze Links


New pages

  • These are called the "crawling frontier"
  • Pages you know about, but haven't downloaded yet
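
A minimal sketch of the "analyze links" step, assuming pages are plain HTML and links live in `<a href="...">` tags. The names `LinkExtractor`, `extract_links`, and `update_frontier` are made up for illustration; only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in one downloaded page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def update_frontier(frontier, downloaded, links):
    """Add links that are neither downloaded nor already queued."""
    for link in links:
        if link not in downloaded and link not in frontier:
            frontier.append(link)
    return frontier
```

Here the frontier is just a list of URLs that are known but not yet downloaded. A real crawler would likely also normalize URLs before comparing them.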

Download Frontier


Continue

  • Download, analyze, download
  • Simple, right?

Completion


Done?

  • But when are you done?

Complications


Tracking

  • What happens when pages start linking to each other?
  • Start downloading again?
  • How do you prioritize pages you haven't seen yet?
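
One common answer to pages linking to each other is to remember every URL already seen. The sketch below runs the download/analyze loop over a hypothetical in-memory "web" (`TOY_WEB`) instead of the network, so the cycle handling is easy to follow. The first-in, first-out frontier is an assumption and is only one possible prioritization.

```python
from collections import deque

# Hypothetical miniature web: URL -> list of URLs it links to.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "seed"],  # links back to the seed: a cycle
    "b": ["c"],
    "c": ["a"],
}


def toy_fetch_links(url):
    """Stand-in for 'download the page and extract its links'."""
    return TOY_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl that never queues the same URL twice."""
    frontier = deque(seeds)  # known but not yet downloaded
    seen = set(seeds)        # everything ever queued
    order = []               # pages in the order they were downloaded
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

The `max_pages` budget is one crude answer to "when are you done?": the crawl stops when the frontier is empty or the budget is spent.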

Policies

  • Which pages to download?
  • When to refresh pages that may have changed?
  • How to run multiple crawlers?
  • Don't take down a site with your multiple crawlers
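
A sketch of one possible politeness policy: enforce a minimum delay between requests to the same host. The class name `PolitenessScheduler` and the one-second default are assumptions for illustration, not a standard. Time is passed in explicitly so the logic can be followed without real waiting.

```python
from urllib.parse import urlparse


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request is allowed."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed request time

    def wait_time(self, url, now):
        """Seconds a crawler should still wait before fetching url."""
        host = urlparse(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url's host was just contacted at time now."""
        host = urlparse(url).netloc
        self.next_allowed[host] = now + self.min_delay
```

With multiple crawlers, this state would have to be shared between them, or each host assigned to exactly one crawler.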

Practicalities

  • You can't download the internet on your laptop
  • But how many computers do you need?
  • How to coordinate them?
  • How much bandwidth?
  • Storing state of all these pages?
  • Storing content?

Spider trap


Traps

  • Spammers, or just mischievous people, can try to keep spiders around
  • Generate links on a page
  • Create a page for any URL
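
A hedged sketch of two simple defenses against traps: cap the number of pages fetched per host, and refuse URLs with very deep paths. The function name `should_fetch` and both limits are hypothetical values chosen for illustration. Real crawlers use more elaborate heuristics.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_PAGES_PER_HOST = 1000  # assumed budget, not a standard value
MAX_PATH_DEPTH = 10        # assumed limit, not a standard value


def should_fetch(url, pages_per_host,
                 max_pages=MAX_PAGES_PER_HOST, max_depth=MAX_PATH_DEPTH):
    """Return False for URLs that look like part of a spider trap."""
    parts = urlparse(url)
    # Count non-empty path segments: /a/b/c has depth 3.
    depth = len([segment for segment in parts.path.split("/") if segment])
    if depth > max_depth:
        return False
    if pages_per_host[parts.netloc] >= max_pages:
        return False
    return True
```

The caller is expected to increment `pages_per_host[host]` (a `Counter`) after each successful download.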

Indexing

  • Query for "Web Architecture 253"
  • Search strategy: check page A, check page B, check page C…
  • Will not scale to check all web pages for this phrase

Naïve

  • Just check every page for phrase
  • Instead, must do something more clever that scales with # of words
  • Which is greater: number of words people search for, number of internet pages?

Inverted Index

  • A: "ISchool teach Web tech."
  • B: "Web Architecture 253 is this semester."
  • C: "Internet Architecture is next semester."

Inverted Index

  • A: "ISchool teaches Web."
  • B: "He teaches Web Architecture 253."
  • C: "Internet Architecture is next."
  • ISchool: A
  • teaches: A B
  • Web: A B
  • He: B
  • Architecture: B C
  • 253: B
  • Internet: C
  • is: C
  • next: C

Inverse

  • Takes pages => words, makes words => pages
  • Now when we search, we can look up Web (A,B), Architecture (B,C), 253 (B)
  • Return B
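
The three-page example can be sketched in a few lines. This is a minimal illustration, assuming case-sensitive matching and a simple regular-expression tokenizer. The helper names `tokenize`, `build_index`, and `search` are made up here.

```python
import re
from collections import defaultdict

PAGES = {
    "A": "ISchool teaches Web.",
    "B": "He teaches Web Architecture 253.",
    "C": "Internet Architecture is next.",
}


def tokenize(text):
    """Split text into runs of letters and digits, dropping punctuation."""
    return re.findall(r"\w+", text)


def build_index(pages):
    """Turn page -> words into word -> set of pages."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Calling `search(build_index(PAGES), "Web Architecture 253")` intersects {A, B}, {B, C}, and {B}, which leaves only B.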

Challenges

  • Tokenizing
  • Scale
  • Locality
  • Context

Details

  • How to break apart Japanese to words?
  • How to have an index that can't fit on one computer?
    • Partition by words? Documents?
  • How to find phrases; words close together on page?
  • Is the word in a title? Body?
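
One way to find phrases is to store word positions in the index as well. The sketch below assumes whitespace tokenization and lowercasing, which is exactly the step that would not work for Japanese. The names `build_positional_index` and `phrase_search` are illustrative.

```python
from collections import defaultdict


def build_positional_index(pages):
    """Map word -> page -> list of positions where the word occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][page_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return pages where the words of phrase appear consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return set()
    hits = set()
    for page_id, starts in index[words[0]].items():
        for start in starts:
            # Word i of the phrase must sit at position start + i.
            if all(start + i in index[word].get(page_id, [])
                   for i, word in enumerate(words)):
                hits.add(page_id)
                break
    return hits
```

Storing positions makes the index larger, which feeds back into the scale question: the same structure could be partitioned by word or by document.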

Search

  • Index can be used to find matching pages
  • But how to find most relevant?
  • Words in title? meta keywords tag?
  • Links to other sites?

PageRank

  • Google's first innovation
  • Trustworthiness of a page varies with inbound links
  • Example of a "static" or "indexing time" feature

Current Use

  • PageRank-type algorithms used to score influence on Twitter, or groups of friends on Facebook
  • Ironically not used as much by Google any longer because of abuse
  • "Features" are qualities of a document that indicate its relevance
  • Static feature is independent of query
  • Dynamic or "query time" features depend on query
  • Static features ideal since you don't have to recalculate for new queries
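
A toy sketch of how static and dynamic features might be combined. The static scores below are made-up numbers standing in for something like a precomputed PageRank. The additive formula and the names `dynamic_score` and `rank` are assumptions for illustration only.

```python
# Made-up static scores, computed once at indexing time.
STATIC_SCORE = {"A": 0.2, "B": 0.5, "C": 0.3}

# Words on each page, lowercased.
PAGE_WORDS = {
    "A": {"ischool", "teaches", "web"},
    "B": {"he", "teaches", "web", "architecture", "253"},
    "C": {"internet", "architecture", "is", "next"},
}


def dynamic_score(page_id, query_words):
    """Query-time feature: how many query words appear on the page."""
    return sum(1 for word in query_words if word in PAGE_WORDS[page_id])


def rank(query, static_weight=1.0):
    """Order pages by dynamic score plus weighted static score."""
    query_words = query.lower().split()
    scored = [
        (dynamic_score(page_id, query_words)
         + static_weight * STATIC_SCORE[page_id], page_id)
        for page_id in PAGE_WORDS
    ]
    scored.sort(reverse=True)
    return [page_id for _, page_id in scored]
```

Only `dynamic_score` has to run per query. `STATIC_SCORE` is looked up, not recomputed, which is why static features are attractive.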

PageRank



Incoming Links

  • If we scale node size with importance, the links coming into some sites make them more important or trustworthy

Random Walk


Idea

  • Start on a random node
  • Randomly choose a link to follow
  • Randomly decide when you're done
  • What node are you likely to end up on?
  • Turns out it is related to how many and what type of inbound edges/links a node has: more inbound links make it easier to arrive
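
The random walk can be simulated directly. This sketch uses a small four-node link graph (A to D) chosen for illustration. The stop probability of 0.15 and the function names are assumptions, not values from the slides.

```python
import random
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "D"],
    "B": ["C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B"],
}


def random_walk(links, stop_probability=0.15, rng=random):
    """Start on a random node, follow random links, stop at random."""
    node = rng.choice(sorted(links))
    while rng.random() >= stop_probability:
        node = rng.choice(links[node])
    return node


def estimate_end_distribution(links, walks=10000, seed=0):
    """Estimate how often a walk ends on each node."""
    rng = random.Random(seed)
    counts = Counter(random_walk(links, rng=rng) for _ in range(walks))
    return {node: counts[node] / walks for node in sorted(links)}
```

In this graph C has a single inbound link (from B), while B and D each have three, so walks should end on C noticeably less often.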

Calculation

  • Start on a random node
A 0.25
B 0.25
C 0.25
D 0.25

Adjacency

  • Randomly choose a link to follow
      A     B     C     D
A     0     0    .3    .5
B    .5     0    .3    .5
C     0    .5     0     0
D    .5    .5    .3     0

Array Multiplication

v = page likelihood
A = Adjacency matrix

A*v = page likelihood after 1 click
A*A*v = page likelihood after 2 clicks
A*A*A*v = page likelihood after 3 clicks
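
The repeated multiplication can be sketched in plain Python. The matrix below follows the adjacency table, with one labeled assumption: the rounded .3 entries are written as 1/3 so that each column sums to 1. The names `M`, `step`, and `after_clicks` are illustrative.

```python
NODES = ["A", "B", "C", "D"]

# Column j holds the probabilities of leaving node j;
# row i holds the probabilities of arriving at node i.
# Assumption: the table's .3 is 1/3, rounded.
M = [
    [0.0, 0.0, 1 / 3, 0.5],
    [0.5, 0.0, 1 / 3, 0.5],
    [0.0, 0.5, 0.0, 0.0],
    [0.5, 0.5, 1 / 3, 0.0],
]


def step(matrix, v):
    """One click: multiply the matrix by the likelihood vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]


def after_clicks(matrix, v, clicks):
    """Apply the matrix to v the given number of times."""
    for _ in range(clicks):
        v = step(matrix, v)
    return v
```

Starting from 0.25 on every page, one click gives 5/24 for A, 1/3 for B, 1/8 for C, and 1/3 for D. After many clicks the vector settles near 4/19, 6/19, 3/19, 6/19, so B and D end up most likely and C least.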

Linear

  • This is why linear algebra matters

Calculation

Scale Mattered


  • Eric Brewer, UC Berkeley Professor
  • Started Inktomi, search engine that pioneered operating at scale
  • In order to search websites effectively, must build an effective website
  • Developed CAP theorem

Brewer

  • Rigorous understanding of how to build websites
  • Now at Google working on virtualized infrastructure

Scale Continues to Matter

  • Jeff Dean: instrumental in scaling Google's systems
  • Many technologies for scaling follow his examples
  • "Compilers don’t warn Jeff Dean. Jeff Dean warns compilers." -Jeff Dean Facts


Details

  • "Reading" is watching one of his early videos on Google
  • Scale can determine the ability and effectiveness of processing large amounts of data
  • Processing big data can lead to killer features
  • Like Chuck Norris facts, page has some geeky humor about Jeff Dean

