5th Time's the Charm - F*cking with Clusters

5th Time's the Charm Feb. 13th, 2027 @ 09:41 am
Say you have around 500 numbers for adult human height that do not have units. Min is around 150, Max is around 350. Mean is about 170.

The units could be:
A) Centimeters, and the data sucks (4.9' to 11.5')
B) Mixed Units (See Also: "Completely Fucking Wrong").
C) HA-Ha
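
One rough way to tell option A from option B is to read the numbers under each candidate unit and see how many land in a believable range for adult height. The sketch below is only an illustration: the bounds in `PLAUSIBLE_CM`, the unit table, and both function names are assumptions made up for this example, not anything from the actual data.

```python
# Hypothetical plausibility bounds for adult height, in centimeters.
# These cutoffs are an assumption for illustration only.
PLAUSIBLE_CM = (120.0, 230.0)

# Conversion factors from each candidate unit to centimeters.
TO_CM = {"cm": 1.0, "in": 2.54, "ft": 30.48, "m": 100.0}


def plausible_fraction(values, unit):
    """Fraction of values that fall inside PLAUSIBLE_CM when read as `unit`."""
    lo, hi = PLAUSIBLE_CM
    factor = TO_CM[unit]
    hits = sum(1 for v in values if lo <= v * factor <= hi)
    return hits / len(values)


def implausible_values(values, unit):
    """Values that fall outside PLAUSIBLE_CM when read as `unit`."""
    lo, hi = PLAUSIBLE_CM
    factor = TO_CM[unit]
    return [v for v in values if not (lo <= v * factor <= hi)]


if __name__ == "__main__":
    sample = [150, 165, 170, 172, 180, 350]  # made-up numbers
    for unit in TO_CM:
        print(unit, plausible_fraction(sample, unit))
    print("suspect under cm:", implausible_values(sample, "cm"))
```

If nearly everything is plausible as centimeters and a handful of values are not, that points to bad entries rather than a second unit. A sizable cluster that only makes sense under another unit would point to mixed units instead.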

If this is data you are giving someone, would you:
A) Apologize, and give some assurance that the rest is ok
B) Indignantly mention that outliers are always meaningless, and it's a waste of his time to talk about them.
C) Hara-kiri
Current Mood: annoyed
Date:February 13th, 2007 07:23 pm (UTC)
D) Claim it's a new form of exponential unit.
Date:February 13th, 2007 07:58 pm (UTC)
It's like the Richter scale, but for people!
Date:February 13th, 2007 08:00 pm (UTC)
...what would they be mixed with?
Date:February 13th, 2007 08:15 pm (UTC)
I had some height data that was labeled as being entered in meters. I think there were still entries that were in feet, and some that didn't make sense (too short or tall for any scale).

If it's truly an outlier problem, then just mention there were apparent data entry errors.

I also worked on a problem that had a somewhat bimodal distribution in one attribute. The very small second mode was clearly a data entry error (they entered the wrong type of test; the values matched the mean for the other test). We ignored all data outside two standard deviations from the mean.
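
A minimal sketch of that kind of two-standard-deviation cut, using only Python's standard library. The function name `within_k_sd` is made up for this example, and the use of the sample standard deviation is an assumption, since the comment doesn't say which was used.

```python
import statistics


def within_k_sd(values, k=2.0):
    """Keep values that lie no more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed)
    return [v for v in values if abs(v - mean) <= k * sd]


if __name__ == "__main__":
    heights = [168, 170, 171, 169, 172, 170, 169, 171, 170, 350]  # made-up numbers
    print(within_k_sd(heights))
```

Note that the outliers themselves pull the mean and inflate the standard deviation, so a single pass like this can miss bad values when there are many of them or when they are extreme.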

As long as your objective isn't to find outliers, I think it's fairly common practice with real data to chop off all values that are either significant outliers or invalid.

I had a smoking category in some data that had several patients labeled as "<2years,>2years" -- I never did figure out what that was supposed to mean.
Date:February 14th, 2007 05:17 pm (UTC)
It's very common to have data problems. Sloppy typing or entering mixed types or possibly measuring/reporting by an idiot. I once had data that was supposed to be pay per week but that clearly had some hourly pay rates mixed in. I don't know whether the person at the keyboard or filling out the paper survey needed the smacking, but clearly something was wrong.

This happens so often that as long as you warn the recipient, I don't think you need to feel bad at all. Certainly don't do anything that is both ritualistic and messy.