2013-04-12-PageRank

1 PageRank    slide center

img/PageRanks-Example.svg.png

1.1 Web page importance    notes

  • Also used to model importance of people, places… anything that has a reputation
  • inbound links are important, but scaled by the importance of the source
  • C is still important, even though it has only one inbound link

2 Random Walks    slide

  • Starting from a random page, what is the likelihood of winding up on a target page?
  • Starting point captured by initial constant
  • Stopping captured by "damping factor" (0.85)

img/pagerank.png

2.1 Original    notes

  • Original paper did not divide by N
  • This gives relative weights of pages, but not a formal probability distribution, because the scores will not sum to 1
  • Either way is fine for our purposes
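For reference, the two variants discussed above can be written out as follows. The notation is an assumption of this sketch and is not taken from the slide image: d is the damping factor, N the number of pages, M(p) the set of pages linking to p, and L(q) the number of outbound links on q.

```latex
% Normalized form: scores sum to 1 when every page has at least one outbound link
PR(p) = \frac{1 - d}{N} + d \sum_{q \in M(p)} \frac{PR(q)}{L(q)}

% Form without the division by N: relative weights only
PR(p) = (1 - d) + d \sum_{q \in M(p)} \frac{PR(q)}{L(q)}
```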

3 Example    slide

  • B, C, D, all link to A
  • B has PageRank of 0.5, 4 links
  • C has PageRank of 0.7, 5 links
  • D has PageRank of 0.2, 1 link

3.1 Calculations    notes

  • 0.15 + 0.85 * sum(pr/links for (pr, links) in pages)
  • 0.15 + 0.85 * sum([0.5/4, 0.7/5, 0.2/1])
  • 0.15 + 0.85 * 0.465
  • 0.54525
  • From Programming Collective Intelligence
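The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the names `pages` and `DAMPING` are illustrative, and C is taken to have 5 outbound links, which is what the intermediate sum of 0.465 implies.

```python
# Damping factor from the slides.
DAMPING = 0.85

# (PageRank, outbound link count) for each page that links to A: B, C, D.
# Assumption: C has 5 outbound links, consistent with the 0.465 intermediate sum.
pages = [(0.5, 4), (0.7, 5), (0.2, 1)]

# Each source passes on its own score divided by its number of outbound links.
contribution = sum(pr / links for pr, links in pages)

# Constant term plus the damped sum of contributions.
pagerank_a = (1 - DAMPING) + DAMPING * contribution

print(round(contribution, 5))
print(round(pagerank_a, 5))
```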

4 Other Pages    slide

  • But how did we know the PageRank of other pages?
  • Similar to SimRank, we calculate iteratively until convergence
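One way to picture "iterate until convergence" is the sketch below. It assumes a small hypothetical four-page graph, arbitrary starting scores of 1.0, and the version of the formula without the division by N; the function name `pagerank` and its parameters are illustrative, not from any library.

```python
DAMPING = 0.85

# Hypothetical link graph: page -> pages it links to.
links = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A", "D"],
    "D": ["A", "B"],
}

def pagerank(links, damping=DAMPING, tolerance=1e-10, max_iterations=1000):
    """Repeat PR = (1 - d) + d * sum(PR(q) / outlinks(q)) until scores settle."""
    ranks = {page: 1.0 for page in links}  # arbitrary starting scores
    for _ in range(max_iterations):
        new_ranks = {}
        for page in links:
            # Pages that link to this page.
            inbound = [q for q, targets in links.items() if page in targets]
            total = sum(ranks[q] / len(links[q]) for q in inbound)
            new_ranks[page] = (1 - damping) + damping * total
        # Largest change in any score during this pass.
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks
        if change < tolerance:
            break
    return ranks

ranks = pagerank(links)
for page, score in sorted(ranks.items()):
    print(page, round(score, 4))
```

The starting scores do not matter for the final answer; they only affect how many passes are needed.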

5 Representing Graphs    slide

Adjacency Matrix
Represent graph edges in a matrix
V  A  B  C  D
A  0  0  0  0
B  1  0  1  0
C  1  0  0  1
D  1  1  0  0

5.1 Diversion    notes

  • Take a step back so we can motivate how to express these calculations as linear algebra
  • Using linear algebra can help us translate graph concepts to fairly elegant code, as well as realize some optimizations
  • Draw
  • Symmetric? When?
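A minimal sketch of building the matrix from an edge list and testing for symmetry. The edge list is read off the slide's matrix (rows as sources, columns as targets); the helper names are illustrative. The matrix comes out symmetric exactly when every edge has a matching reverse edge, which is the undirected case.

```python
vertices = ["A", "B", "C", "D"]
# Directed edges (source, target), matching the matrix on the slide.
edges = [("B", "A"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "A"), ("D", "B")]

def adjacency_matrix(vertices, edges, directed=True):
    """Return a list-of-lists matrix with 1 where an edge exists."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
        if not directed:
            # An undirected edge is stored in both directions.
            matrix[index[target]][index[source]] = 1
    return matrix

def is_symmetric(matrix):
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))

directed_matrix = adjacency_matrix(vertices, edges)
undirected_matrix = adjacency_matrix(vertices, edges, directed=False)

print(is_symmetric(directed_matrix))
print(is_symmetric(undirected_matrix))
```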

6 Representing Graphs    slide

Adjacency List
For a vertex, list all connections
A []
B [A,C]
C [A,D]
D [A,B]

6.1 Diversion    notes

  • You can think of this as keys (vertex) and values (list of vertices)
  • When would thinking in key-values be useful? MapReduce
  • Back to matrix representation
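To make the key-value view concrete, here is a sketch that stores the adjacency list as a dictionary and counts inbound links in a map/reduce style. It is plain Python that imitates the two phases, not a real MapReduce framework; `map_phase` and `reduce_phase` are hypothetical names.

```python
from collections import defaultdict

# Adjacency list from the slide: key = vertex, value = list of linked vertices.
adjacency = {"A": [], "B": ["A", "C"], "C": ["A", "D"], "D": ["A", "B"]}

def map_phase(vertex, neighbors):
    """Emit a (target, 1) pair for every outbound link of one vertex."""
    return [(target, 1) for target in neighbors]

def reduce_phase(pairs):
    """Sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

emitted = []
for vertex, neighbors in adjacency.items():
    emitted.extend(map_phase(vertex, neighbors))

inbound_counts = reduce_phase(emitted)
print(inbound_counts)
```

Each vertex can be mapped independently of the others, which is why this representation suits distributed processing.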

7 Eigenvector    slide

  • PageRank formula divides by number of links
  • Modify the adjacency matrix to match: it is typically normalized so that each row sums to 1
  • PageRank scores are the entries of the dominant eigenvector (the one with the largest eigenvalue) of this matrix representation

img/pagerank-eigen.png
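The eigenvector view can be sketched with power iteration in plain Python. Several choices here are assumptions rather than anything stated on the slide: rows are treated as link sources, a page with no outbound links is given a uniform row, and the damped form with the division by N is used so the scores stay a probability vector. The function names are illustrative.

```python
# Adjacency matrix from the slides: rows are sources, columns are targets
# (vertex order A, B, C, D).
adjacency = [
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]

def row_normalize(matrix):
    """Scale each row to sum to 1, i.e. divide by the number of outbound links."""
    n = len(matrix)
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # Assumption: a page with no outbound links jumps to any page uniformly.
            normalized.append([1.0 / n] * n)
        else:
            normalized.append([value / total for value in row])
    return normalized

def step(rank, transition, damping=0.85):
    """Multiply the rank vector by the damped transition matrix once."""
    n = len(rank)
    return [
        (1 - damping) / n + damping * sum(rank[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]

def power_iteration(matrix, damping=0.85, tolerance=1e-12, max_iterations=1000):
    transition = row_normalize(matrix)
    rank = [1.0 / len(matrix)] * len(matrix)  # start from a uniform vector
    for _ in range(max_iterations):
        new_rank = step(rank, transition, damping)
        if max(abs(a - b) for a, b in zip(new_rank, rank)) < tolerance:
            return new_rank
        rank = new_rank
    return rank

rank = power_iteration(adjacency)
print([round(score, 4) for score in rank])
```

Once the vector stops changing, one more multiplication returns the same vector, which is the defining property of an eigenvector with eigenvalue 1.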

8 Eigenvector centrality    slide

  • Another measurement for graphs, using the simple adjacency matrix
  • Relative influence of a node (no normalization)
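A sketch of eigenvector centrality on the plain adjacency matrix. The undirected four-vertex graph is hypothetical, chosen because it contains a triangle and so the iteration settles; the rescaling inside the loop only keeps the numbers bounded and is not the per-link division that PageRank uses.

```python
import math

# Hypothetical undirected graph on vertices A, B, C, D
# with edges A-B, A-C, B-C, C-D (symmetric adjacency matrix).
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

def eigenvector_centrality(matrix, iterations=100):
    n = len(matrix)
    scores = [1.0] * n
    for _ in range(iterations):
        # New score = sum of the neighbors' scores (no division by link count).
        new_scores = [sum(matrix[i][j] * scores[j] for j in range(n)) for i in range(n)]
        # Rescale to unit length so the values neither overflow nor vanish.
        norm = math.sqrt(sum(s * s for s in new_scores))
        scores = [s / norm for s in new_scores]
    return scores

centrality = eigenvector_centrality(adjacency)
print([round(score, 3) for score in centrality])
```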

9 Adversarial    slide

  • Source does not want to be discovered
  • Patterns are purposefully hidden: so discover the patterns of hiding
  • If adversary knows your techniques, they can take advantage of weakness

9.1 Weakness    notes

  • Reading: paper discovering hiding patterns
  • Weakness of PageRank?
  • We assume that these links are legitimate.
  • What happens if the links are not conveying authority?

10 Google Bomb    slide center

  • Milder forms of adversarial work

img/Google_Bomb_Miserable_Failure.png

10.1 Link farms    notes

11 Hubs & Authorities    slide two_col

  • The early web had more structure
  • Hubs: collected links to different resources
  • Authorities: Gave out specific information
  • Score separately?

img/early-yahoo.jpg

11.1 Alternatives    notes

  • Some other interesting network analysis tools

12 HITS    slide

Authority score
sum(hub(i) for i in inbound_links)
Hub score
sum(authority(i) for i in outbound_links)
Normalize
to ensure convergence, divide each score by the square root of the sum of squared scores

12.1 Iterative    notes

  • Still iterative, but now using inbound and outbound links to judge
  • Hubs have outbound links to authoritative pages
  • Authorities have inbound links from good hubs
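The update rules can be sketched as below. The link graph and page names are hypothetical, the iteration count is arbitrary, and `hits` and `normalize` are illustrative names; the sketch also assumes at least one link exists, so the normalization never divides by zero.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2", "auth3"],
    "auth1": [],
    "auth2": [],
    "auth3": ["auth1"],
}

def normalize(scores):
    """Divide every score by the square root of the sum of squared scores."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    return {page: value / norm for page, value in scores.items()}

def hits(links, iterations=50):
    hubs = {page: 1.0 for page in links}
    authorities = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Authority score: sum of hub scores over inbound links.
        authorities = normalize({
            page: sum(hubs[source] for source in links if page in links[source])
            for page in links
        })
        # Hub score: sum of authority scores over outbound links.
        hubs = normalize({
            page: sum(authorities[target] for target in links[page])
            for page in links
        })
    return hubs, authorities

hubs, authorities = hits(links)
print(max(hubs, key=hubs.get), max(authorities, key=authorities.get))
```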

13 Connections    slide

Connected
there exists a path from one vertex to another
Connectivity
minimum number of vertices whose removal disconnects the remaining vertices
Clustering Coefficient
Measure of how densely connected the neighbors of a vertex (or a group of vertices) are

13.1 Robustness    notes

  • This is used to understand robustness of a system: if an earthquake damaged the Bay Bridge, could we still travel from one point to another?
  • What is the connectedness of Oakland and SF?
  • Closely related to min-cuts, which is discussed in the book
  • Network topology: what happens if a router fails?
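The router question can be explored with a breadth-first search. This is a sketch over a hypothetical five-router network; it checks only single-vertex failures by brute force and is not a general connectivity or min-cut algorithm.

```python
from collections import deque

# Hypothetical undirected network: router -> directly connected routers.
network = {
    "r1": {"r2", "r4"},
    "r2": {"r1", "r3"},
    "r3": {"r2", "r4", "r5"},
    "r4": {"r1", "r3"},
    "r5": {"r3"},
}

def is_connected(graph, removed=frozenset()):
    """Breadth-first search: can every remaining vertex be reached from one start?"""
    remaining = set(graph) - set(removed)
    if not remaining:
        return True
    start = next(iter(remaining))
    seen = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph[vertex]:
            if neighbor in remaining and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == remaining

# Routers whose individual failure splits the network.
critical = sorted(v for v in network if not is_connected(network, removed={v}))
print(critical)
```

Because one router is enough to split this network, its connectivity in the sense defined above is 1.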

14 Clustering Coefficient    slide animate

  • How many directed edges are possible between 3 vertices?
  • 4 vertices?
  • v*(v-1)
  • Undirected?
  • v*(v-1)/2
  • Clustering Coefficient: ratio of actual edges to possible edges among a vertex's neighbors

14.1 Reading    notes

  • Used in Reading this week
  • v*(v-1) connection to every other node but yourself
  • /2 undirected, don't double count connections
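The two counting formulas can be confirmed by brute-force enumeration. A small sketch using the standard `itertools` module; the function names are illustrative.

```python
from itertools import combinations, permutations

def possible_directed_edges(v):
    """Ordered pairs of distinct vertices: (a, b) and (b, a) count separately."""
    return len(list(permutations(range(v), 2)))

def possible_undirected_edges(v):
    """Unordered pairs of distinct vertices: each connection counts once."""
    return len(list(combinations(range(v), 2)))

for v in (3, 4):
    print(v, possible_directed_edges(v), possible_undirected_edges(v))
```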

14.2 Example    slide

img/Directed_acyclic_graph.png

Clustering Coefficient of vertex 1? Of vertex 4?

14.2.1 Answer    notes

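  • Neighbors of 1: 2, 5
  • Possible links among them: 2*(2-1) / 2 = 1
  • Actual links: 1, so the coefficient is 1/1 = 1
  • Neighbors of 4: 3, 5, 6
  • Possible links among them: 3*(3-1) / 2 = 3
  • Actual links: 0, so the coefficient is 0/3 = 0
  • If 3-5 were connected? 1/3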
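The answer can be reproduced in code. The edge list below is an assumption: it is an undirected graph chosen to match the neighbor sets given in the notes, since the figure itself is not reproduced here. Returning 0.0 for a vertex with fewer than two neighbors is also just a convention of this sketch.

```python
from itertools import combinations

# Assumed undirected edges, consistent with the neighbor sets in the notes.
edges = [(1, 2), (1, 5), (2, 3), (2, 5), (3, 4), (4, 5), (4, 6)]

def build_neighbors(edges):
    """Map each vertex to the set of vertices it shares an edge with."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors

def clustering_coefficient(vertex, neighbors):
    """Actual links among a vertex's neighbors divided by the possible links."""
    around = neighbors[vertex]
    k = len(around)
    if k < 2:
        return 0.0  # convention: no pairs of neighbors to connect
    possible = k * (k - 1) / 2
    actual = sum(1 for a, b in combinations(around, 2) if b in neighbors[a])
    return actual / possible

neighbors = build_neighbors(edges)
print(clustering_coefficient(1, neighbors))
print(clustering_coefficient(4, neighbors))

# The "what if" case: add an edge between 3 and 5.
with_extra_edge = build_neighbors(edges + [(3, 5)])
print(clustering_coefficient(4, with_extra_edge))
```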

15 Break    slide

Date: 2013-04-12 13:50:16 PDT

Author: Jim Blomo
