Ok, so this isn't really physics as such, but it's pretty fascinating. There's a very large online community called Reddit in which users submit links which interest them. These links come with two little arrows beside them, and the users can vote the link up or down. Here's a screenshot of how the website looks to me at the time of this writing:
As I visit on different days or on different times on the same day, the links and their order changes. This keeps the site fresh and news-y, at least if you like your news full of cat memes. It's pretty clear that the ordering of these links is both a function of when they were submitted by the users and of the votes they receive, but how exactly does this work?
The algorithm itself is explained in this very informative post by Amir Salihefendic. In short, every post is assigned a number given by the function:
$latex \displaystyle f(n, t) = \log_{10}(n) + \frac{t}{45000} &s=1$
Here n is the net number of upvotes. For 10 votes up and 0 votes down, n = 10. For 50 votes up and 40 votes down, n also equals 10. Next, t is the time in seconds after an arbitrary moment that happens to be in 2005. The choice of that arbitrary moment doesn't matter - what matters are the differences in scores. This function f(n, t) is calculated for each link, and they are sorted in order from the greatest to the least value of f. I've slightly simplified the equation by dropping a coefficient that makes no difference for positive n.
Ok, great. Now what does this all mean? Amir's post gives some examples, but I want to dig a little bit into the interpretation of this equation. In physics it's very often the case that an equation isn't just some abstract mathematical machine, but rather it's a natural statement which has an intuitive interpretation we can understand. For instance, $latex \nabla \cdot \mathbf{E} = \rho / \epsilon_0$ is an abstract vector calculus statement, but physicists see that equation and understand it intuitively as the idea that electric field lines diverge outward from sources of electric charge. That's a more useful way of thinking of it than "Ok, now we have to solve some horrible partial differential equation before we can know anything at all." Intuition gives us a qualitative picture, and from there we can do the hard work to get a numerical answer when required.
Since Reddit's equation is just used to generate an ordering, an overall multiplicative factor doesn't matter. If a score of 20 is ranked ahead of 15, then 200 will be ranked ahead of 150. So let's multiply Reddit's equation by 45000 seconds.
$latex \displaystyle f(n, t) = 45000 \log_{10}(n) + t &s=1$
Effectively this just means the posts are sorted in order by t, the time they were posted. Newer posts are higher. But there's that log(n) term - it moves the posts forward in time. Newer posts are listed first, and a post becomes even newer by getting votes. If n = 10, then log(10) = 1 and the post is moved forward 45000 seconds, or 12.5 hours. If n = 100, then log(100) = 2 and the post is moved forward 90000 seconds, or 25 hours. We can plot this for more and more net upvotes:
The returns are diminishing. Logarithms are slowly increasing functions, so each additional upvote moves the post forward in time by a smaller and smaller amount. Even with thousands of votes, a post has only moved about two days into the future, which is why posts never last more than a day or so on the front page. After that it gets overtaken by any new posts, even ones with few upvotes.
In politics we often hear that every vote counts. In Reddit, we can actually figure out how much each vote counts. If I upvote or downvote a post, how far does my individual vote move that post in time? For large n, it's a very accurate to approximate the change in log(n) (for each additional vote) by its derivative:
$latex \displaystyle \log_{10}(n+1) - \log_{10}(n) \approx \frac{0.434}{n}&s=1$
Well that 0.434 is a little annoying but hey, I didn't chose to use base 10 logarithms. (Had they used base e = 2.718... then it would just be 1/n.) What this means is that if a post has 10 votes, your upvote will add about 45000*0.434/10 = 1737.2 seconds, or about 29 minutes. A downvote would move it backwards by that same amount. If a post has 50 votes, your upvote (or downvote) will move it forward or backward by about 5.7 minutes. For a 4700 vote post like one of the ones in the screenshot above, each vote makes a mere 3.7 seconds difference.
This might suggest an improvement on the "subscribe" and "unsuscribe" system - if there's a subreddit you're interested in but not that interested in (/r/aww maybe?), you could give it a handicap by having Reddit subtract (say) a 6 hour penalty on every post from that subreddit. This would require a /r/aww post to get about 3 times as many votes to overtake an unpenalized post which was originally made at the same time. (Homework: given a h hour penalty, how many times more votes does the penalized post require to overtake a simultaneously-posted unpenalized post?) Correspondingly, you could give a bonus for subeddits you want to see more of. Unfortunately this is probably not a feasible suggestion. Separately sorting huge lists for millions of users would probably melt the servers. But it would be a nice feature.
All right, better wrap this one up. As far as user-vote-based ranking goes, Reddit's is unusually interesting from a mathematical standpoint. For what it's worth, I give it my upvote.
[Update: Fixed a mistake in the calculations.]
- Log in to post comments
Man, wish I could comprehend the math in this article. So basically, time travel occurs in this scenario on paper, or is it in actuality?
Jeffrey, the bit about time travel is mostly to get people interested in the article. Posts on Reddit are sorted by time, and the voting fudges the time up or down based on how many votes there are. The fudge factor becomes less significant the more time passes, which is why only the very most popular posts last more than a day or so.
The article is basically saying that for every upvote, time is added to a given link to make it the top link on the front page of reddit. As long as it keeps getting upvoted you will see it on the front page.
Jeffrey, time travel does not literally occur, it's a clever title used to express the fact that upvoting something gives it higher preference in the ordering algorithm than something posted in the future (potentially).
Basically, while time is a part of the equation, it's not a chronological order, he is just explaining it as such using a time travel metaphor.
The term time travel here is what i believe is confusing you, Jeffrey. There is no time travel really. All the author is pointing out is that reddit organizes posts be amount of time in seconds after a a particular second in 2005 they chose. The way up votes function is by increasing this value. So it could be said that its like your post went forward in time. Posts that receive a large number of votes are interpreted by the computer as having been posted from a point in time that is more seconds away from the start date then the current time.
Really interesting perception of how the system works. I've always wondered what was going on behind the scenes that put a post with a relatively small score ahead of a bunch of higher score posts.
It does open the open door to the question "what's the 'oldest' post on Reddit?". Kind of like using telescopes in an attempt to 'look' into the past for the beginning of the universe, finding the lowest ranked post would (in a sense) be like finding Reddit's big bang. Of course, it would most likely be some uninteresting article posted in the first few days of the site, but would be neat nonetheless to find.
That coefficient I left out seems to knock new posts with negative votes all the way to the back, so it might actually just be a bad post from today. Reddit has an API, so maybe an enterprising Redditor could find it out. (I'm not one. I have an account for browsing, but I've never posted any links or comments.)
@Alfred: the oldest publicly-accessible reddit submission is http://www.reddit.com/tb/87.
log(x*n) - h = log(n)
log(x*n) - log(n) = h
log(x*n/n) = h
log(x) = h
x = 10^h
Well, that was a waste of my time. All this stuff is quite obvious, but interesting enough to see the function behind it all. Too bad it had to be covered up by that stupid time travel thing (IT ISN'T EVEN VAGUELY RELATED TO TIME TRAVEL) and a whole bunch of over-explaining.
Sorry to disappoint, but time travel only exists in entertainment. This post is meant as entertainment - but you get to learn a little something about algorithms and math along the way.
If there was a "whole bunch of over-explaining", Espen, then kalvin would have understood it. But he didn't.
Kalvin, you'd have to receive increasingly more upvotes to stay on top. A steady amount of upvotes gets overcome by the time function.
Actually Joshua, Jeffery and Kalvin you're only half right and kinda wrong.
Yes, the more votes the longer a post stays up on the page. But, each new vote after the first doesn't add as much time as the vote before it. So, first vote let's the post stay there for an extra 29 minutes since it was first posted. The second vote adds 7 minutes ( so, 36 mins all up), the third vote might only be less than a minute. Once you're in the 1000's it's probably only adding fractions of a second. So the more votes a link gets the less amount of time each vote adds.
Basically, this means new content will rapidly overtake old content which biases the front page towards new posts. Keeping it fresh .
That's my understanding but I only did the soft sciences that didn't in love too much math.
"A downvote would move it backwards by that same amount."
Downvote moves more than upvote. Only when n large enough are down and up approximately the same.
actually, the top bestof submission of all time is a link to a post pointing this out (and it's by me hi mom)
http://www.reddit.com/r/circlebroke/comments/vqy9y/dear_circlebrokers_w…
but obviously you did a much more rigorous job laying it out
good post, man
. Matt - log(x^n/n) = log(x) is not a valid identity.
"Separately sorting huge lists for millions of users would probably melt the servers."
Depending on how the page is laid out at the moment, it might be something that could be done with javascript. That's run in the browser after the page loads so there's no extra load on the servers.
Were you not disciplined by every math teacher ever every time you didn't label your axes?
Any idea when does this sorting actually happens? and on how many records?