[Lazyweb] Statistics question
Mar. 16th, 2008 @ 11:53 pm
Say I have a set of continuously arriving outcomes that have 'goodness' ratings attached. I don't know the distribution of said ratings, but I have some idea of the range, and I have a 1-level deep history of past ratings for some of them. I want to know how many outcomes I should wait for before grabbing one and calling it good. In this case, the payoff for picking sooner is fairly high.
I know there is some sort of actual method for this. The problem is sort of akin to Amazon's old Gold Box Model: given a price, you pick it now or you let it go and wait, giving up a known gain for a potential better future deal.
Can anyone tell me what I'm trying to think of here? This is terribly hard to Google for :-p
-- Avani (proof positive that you can do years of ML without learning basic stats...)
Current Mood: puzzled
Current Music: Serj Tankian -=- Sky is Over
Date: March 17th, 2008 07:19 am (UTC)
Date: March 17th, 2008 07:30 am (UTC)
Ok, I wanna know where people come up with the NAMES for algorithms. Jeez... I mean, sure, it makes sense in the provided story, but then they become the canonical names for that 'type' of problem, and suddenly you can't find something because you don't know its origin story.
ETA: Not that I'VE ever been in a position where I've needed to track down an algorithm, mind you. Just seems high on the not-useful scale ;->

Edited at 2008-03-17 07:31 am (UTC)
This is better known as the secretary problem, when posed as the somewhat more chaste scenario of picking the best typist out of the secretary pool. Wikipedia can get you started on a few references for that. While this is plenty famous enough to have its own title to go along with the contrived narrative, there are more formalized titles for this class of problems as well. "Optimal stopping" and "on-line selection" are the ones that spring to mind.
Date: March 17th, 2008 05:17 pm (UTC)
Thank You!! 'Online-selection' is exactly the term that was escaping me.
Date: March 17th, 2008 12:28 pm (UTC)
Ask more Lazy Web questions! That was fun!
23 is always a good number to stop at.
And now that I've read the link to the sultan's daughters problem...
The strategy is interesting in the version where you have to pick the one with the biggest dowry, but after that they just talk about a strategy for getting one in the top 5 or 10 of beauty. But they're all sisters. Is their beauty going to vary that much? And even if it does, I think my strategy would be to marry the first one that I'm attracted to enough to marry a stranger. If there isn't one, then I'm fine with ending up single. If she's not the best looking one, who cares? She's already really hot because I'm marrying a stranger.
So were there 62 total things to choose from in your problem? Cuz then 23 is the right number to stop at.
Oh... continuously arriving... nevermind.
If the sequence is infinite, I need to know your patience level, or your discount rate. Supposing you're happy being in the top 5%, you can run a non-parametric quantile test to that effect... you pick as soon as it passes the test (i.e. when it's enough of an outlier).

Edited at 2008-03-17 04:37 pm (UTC)
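The quantile-test idea above can be sketched like this: observe a warm-up batch of ratings to build an empirical distribution, then commit to the first arrival that clears the estimated quantile. The function name, warm-up length, and fallback behaviour are all illustrative choices, not something the commenter specified.

```python
def pick_first_outlier(stream, quantile=0.95, warmup=20):
    """Sketch of a non-parametric stopping rule: watch `warmup`
    ratings to estimate the distribution, then pick the first
    later arrival above the empirical `quantile`."""
    seen = []
    for rating in stream:
        if len(seen) >= warmup:
            # empirical quantile of everything observed so far
            threshold = sorted(seen)[int(quantile * len(seen))]
            if rating > threshold:
                return rating
        seen.append(rating)
    return max(seen)  # stream ended without an outlier; take the best observed

# e.g. on a steadily increasing stream, the first post-warmup
# arrival always beats the warm-up maximum and is picked.
pick_first_outlier(iter(range(100)))
```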
Date: March 17th, 2008 05:20 pm (UTC)
I thought about that ... but I'd rather have it be independent of patience since this is implemented as a sort of peer selector on an arbitrarily sized network, and so I don't think my patience levels should be constant.
ok. For infinite patience, how about this algorithm:
* wait forever, recording the best one
* after that, when you beat the best one, pick that
Expected running time: 2*infinity
more seriously, if your patience is varying, then you want to vary the parameters of the quantile test accordingly.
Date: March 17th, 2008 05:54 pm (UTC)
It's just a level of complication I'd like to avoid: I think adding too many hooks that are dependent on the network size will adversely affect my scalability calculations later on. I'll try it, though, assuming I have time.
I said "patience", not "network size".
Date: March 17th, 2008 06:49 pm (UTC)
In my case the two aren't independent. As my network size grows, I'm probably willing to wait longer, but not necessarily scaled to the size of the network, since that will kill scalability at 100k nodes or so. This is something I need to tweak experimentally, again, given time. Stupid project is due much too soon :-)
Date: May 18th, 2008 06:09 am (UTC)
If it'll only affect it "later on" then chances are you can collect the data and process it both ways now, and when scalability becomes an issue, drop the one whose outcomes feel inferior. It'd seem like a premature optimisation to pick one way now, if you're not likely to be resource-bound for a little while. You can keep the other dataset invisible from users entirely, present a subset of users view B instead of view A and see how they feel, or offer a certainty on any result based on the difference between the answers from the two methods; all sorts of bonus metadata.
Date: March 17th, 2008 06:46 pm (UTC)
I know they went over this question in MIT's algorithms course. I just need to remember where it was.
Date: March 20th, 2008 03:46 am (UTC)
The Secretary Problem
I knew I'd seen it, but not as the Sultan's Dowry. The standard solution is to wait n/e (number of secretaries over e ≈ 2.718) and then stop at the first candidate that exceeds anything you've seen before, but this can be proven to maximize the probability you'll pick the best, not that you'll pick a "good" one.
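The n/e rule described above is easy to simulate; the sketch below is a minimal illustration, with the function name and the fallback to the last candidate (used when nothing beats the warm-up maximum) being my own choices.

```python
import math
import random

def secretary_pick(values):
    """Classic 1/e stopping rule: observe the first n/e candidates
    without committing, then take the first later candidate better
    than all of them (or the last candidate if none is)."""
    n = len(values)
    cutoff = int(n / math.e)
    best_seen = max(values[:cutoff]) if cutoff else float("-inf")
    for v in values[cutoff:]:
        if v > best_seen:
            return v
    return values[-1]

# For 62 candidates the cutoff is int(62 / e) = 22 rejected up front,
# matching the "23 is the right number to stop at" quip above.
# A quick simulation shows the success rate hovering near 1/e ~ 0.37:
random.seed(0)
trials = 2000
wins = sum(
    secretary_pick(random.sample(range(10_000), 62)) == 9_999
    if False else  # (comparing against each trial's own max below)
    0
    for _ in range(0)
)
wins = 0
for _ in range(trials):
    vals = random.sample(range(10_000), 62)
    if secretary_pick(vals) == max(vals):
        wins += 1
```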