Friday, February 22, 2013

Array Sampling Without Repeating [Ruby]

I've often come across the problem of random sampling in situations where you don't want repeats.  For instance say you've got a list of the user's MP3 files and you are playing them randomly.  You've just played a song and you choose another random one.  It's easy to say "Don't pick the same one that you just picked."  And with a little work you could pick an N and say, "Don't repeat a song in the last N songs."

But you also don't want to say "Don't repeat any songs until all the songs are played", because if the user's got 500 songs in their library it'll be a long time before they hear their favorite song again.  What you really want is just something that says "Make it less likely to pick this song too too frequently."

So, here's the idea I finally came up with.  Say that you've got an array of files to work with, regardless of what is in them.  In Ruby you can just ask for file  = files.sample to get a random element out of the array.

Now what?  We go and do what we want with file, but now we want to assure that this is less likely to be called in the near future (and, preferably, absolutely guarantee that it is not called within the next N samples).

Start by snipping file out of filenames and moving it to the end of the array:

files <<= files.delete(file)

Try it:

(main)> files=["foo", "bar", "baz", "quux"]
=> ["foo", "bar", "baz", "quux"]
(main)> file = files.sample
=> "bar"
(main)> files <<= files.delete(file)
=> ["foo", "baz", "quux", "bar"]

So what?  The sample method is equally likely to pick any item in the array, right?

What if we tell it not to?  What if we tell it to only sample a portion of the array?

file = files[0..files.size>>2].sample

Here, we're saying to only get samples out of half the array.  So when a sample is chosen and moved to the back, we've established that there's no way this sample can be repeated until at least N/2 samples have been taken.

Depending on your N you might want to do something like N*0.90 or something (if N is a thousand maybe only 100 samples have to go by before it's ok to repeat).  

It's a little trick, but I think it's important.  I've got an app I'm working on that chooses random quotes, placing them on random images.  It bothers me when I know that I've got over 500 quotes in the database, and the user sees a repeated quote after 4 refreshes of the page.  So now they won't, now it'll take hundreds of refreshes before they see a repeat. Truthfully I'll probably switch that over to cap at 100.

No comments: