There are many summary statistics available in R; this function provides the ones most useful for scale construction and item analysis in classic psychometrics. Range is most useful for the first pass in a data set, to check for coding errors.
describe(x, na.rm = TRUE, interp = FALSE, skew = TRUE, ranges = TRUE, trim = .1, type = 3, check = TRUE, fast = NULL, quant = NULL, IQR = FALSE)

describeData(x, head = 4, tail = 4)
| Argument | Description |
|---|---|
| x | A data frame or matrix |
| na.rm | The default (na.rm=TRUE) is to delete missing data variable by variable. na.rm=FALSE will delete the entire case if any value is missing. |
| interp | Should the median be standard or interpolated |
| skew | Should the skew and kurtosis be calculated? |
| ranges | Should the range be calculated? |
| trim | The fraction to trim when finding trimmed means. The default, trim=.1, drops the top and bottom 10% of the values. |
| type | Which estimate of skew and kurtosis should be used? (See details.) |
| check | Should we check for non-numeric variables? Slower but helpful. |
| fast | if TRUE, will do n, means, sds, ranges for an improvement in speed. If NULL, will switch to fast mode for large (ncol * nrow > 10^7) problems, otherwise defaults to fast = FALSE |
| quant | if not NULL, will find the specified quantiles (e.g. quant=c(.25,.75) will find the 25th and 75th percentiles) |
| IQR | If TRUE, show the interquartile range |
| head | show the first 1:head cases for each variable in describeData |
| tail | Show the last nobs-tail cases for each variable in describeData |
In basic data analysis it is vital to get basic descriptive statistics.
Procedures such as summary and Hmisc::describe do so. The describe function in the psych package is meant to produce the most frequently requested statistics in psychometric and psychology studies, and to produce them in an easy to read data.frame. The results from describe can be used in graphics functions (e.g., error.crosses).
The range statistics (min, max, range) are most useful for data checking to detect coding errors, and should be found in early analyses of the data.
Although describe will work on data frames as well as matrices, it is important to realize that for data frames, descriptive statistics will be reported only for those variables where this makes sense (i.e., not for alphanumeric data).
If the check option is TRUE, variables that are categorical or logical are converted to numeric and then described. These variables are marked with an * in the row name. This is somewhat slower. Note that in the case of categories or factors, the numerical ordering is not necessarily the one expected, because the levels are sorted alphabetically. For instance, if education is coded "high school", "some college", "finished college", then the default coding will lead to these as values of 2, 3, 1. Thus, statistics for those variables marked with * should be interpreted cautiously (if at all).
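The alphabetical recoding described here can be reproduced in base R. This is a minimal sketch that uses the education labels from the text; the variable names are illustrative, and it only mimics the factor-to-numeric conversion rather than calling describe.

```r
# Illustrative labels taken from the text above
education <- c("high school", "some college", "finished college")

# Default factor levels are sorted alphabetically, so the numeric codes
# do not follow the substantive ordering of the categories
codes_default <- as.numeric(factor(education))
codes_default
# 2 3 1

# Declaring the levels explicitly gives the intended ordering
education_ordered <- factor(
  education,
  levels = c("high school", "some college", "finished college")
)
codes_ordered <- as.numeric(education_ordered)
codes_ordered
# 1 2 3
```

Setting the levels before describing the data is one way to make the starred rows interpretable.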
In a typical study, one might read the data in from the clipboard (read.clipboard), show the splom plot of the correlations (pairs.panels), and then describe the data.
Setting na.rm=FALSE is equivalent to describe(na.omit(x)): any case with a missing value is dropped before the statistics are found.
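The difference between the two settings can be seen with base R alone. The sketch below uses a made-up data frame, `toy` (hypothetical, not part of the package): dropping missing values per variable keeps every available observation, whereas na.omit removes any row with a missing value before anything is computed.

```r
# Made-up data with one missing value in each column
toy <- data.frame(a = c(1, 2, NA, 4),
                  b = c(10, NA, 30, 40))

# Per-variable deletion: each column uses all of its non-missing values
n_per_variable <- colSums(!is.na(toy))
mean_per_variable <- colMeans(toy, na.rm = TRUE)

# Listwise deletion: rows with any NA are removed first
toy_complete <- na.omit(toy)
n_listwise <- nrow(toy_complete)
mean_listwise <- colMeans(toy_complete)

n_per_variable   # 3 valid values in each column
n_listwise       # only 2 complete rows remain
```

With listwise deletion every variable is described on the same cases, at the cost of discarding partially observed rows.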
When finding the skew and the kurtosis, there are three different options available. These match the choices available in skewness and kurtosis found in the e1071 package (see Joanes and Gill (1998) for the advantages of each one).
If we define \(m_r = [\sum(X - mx)^r]/n\), where \(mx\) is the mean of \(X\) and \(n\) is the number of valid cases, then
Type 1 finds skewness and kurtosis by \(g_1 = m_3/(m_2)^{3/2}\) and \(g_2 = m_4/(m_2)^2 - 3\).
Type 2 is \(G_1 = g_1 \sqrt{n(n-1)}/(n-2)\) and \(G_2 = (n-1)[(n+1) g_2 + 6]/((n-2)(n-3))\).
Type 3 is \(b_1 = [(n-1)/n]^{3/2} m_3/m_2^{3/2}\) and \(b_2 = [(n-1)/n]^{2} m_4/m_2^2 - 3\).
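The three estimators can be written out in a few lines of base R. This is a sketch under stated assumptions: `skew_kurt` is a hypothetical helper, not a psych function, and it follows the e1071 and Joanes and Gill definitions, in which Type 3 kurtosis is \(m_4/s^4 - 3\) with \(s\) the usual sample standard deviation.

```r
# Hypothetical helper (not part of psych): skew and kurtosis of types 1-3
skew_kurt <- function(x, type = 3) {
  x <- x[!is.na(x)]
  n <- length(x)
  mx <- mean(x)
  # Central moments with divisor n
  m2 <- sum((x - mx)^2) / n
  m3 <- sum((x - mx)^3) / n
  m4 <- sum((x - mx)^4) / n
  g1 <- m3 / m2^(3 / 2)
  g2 <- m4 / m2^2 - 3
  if (type == 1) {
    c(skew = g1, kurtosis = g2)
  } else if (type == 2) {
    c(skew = g1 * sqrt(n * (n - 1)) / (n - 2),
      kurtosis = (n - 1) * ((n + 1) * g2 + 6) / ((n - 2) * (n - 3)))
  } else {
    # Equivalent to dividing m3 and m4 by sd(x)^3 and sd(x)^4
    c(skew = ((n - 1) / n)^(3 / 2) * g1,
      kurtosis = ((n - 1) / n)^2 * (g2 + 3) - 3)
  }
}

x <- c(1, 2, 3, 4, 10)
type1 <- skew_kurt(x, type = 1)
type2 <- skew_kurt(x, type = 2)
type3 <- skew_kurt(x, type = 3)
```

For small samples such as this one the three types differ noticeably; the differences shrink as n grows.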
The additional helper function describeData just scans the data array and reports on whether the data are all numerical, logical/factorial, or categorical. This is a useful check to run before getting descriptive statistics on very large data sets where, to improve speed, the check option is set to FALSE.
The fast=TRUE option will lead to a speed up of about 50% for larger problems by not finding all of the statistics (see NOTE).
A data.frame of the relevant statistics: item name, item number, number of valid cases, mean, standard deviation, trimmed mean (with trim defaulting to .1), median (standard or interpolated), mad: median absolute deviation (from the median), minimum, maximum, skew, kurtosis, standard error.
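For a single variable, most of these columns have direct base R counterparts. The sketch below assumes that correspondence (for instance, that mad is the scaled median absolute deviation returned by stats::mad and that the standard error is sd/sqrt(n)); the vector `x` and the name `toy_stats` are made up for illustration.

```r
# Made-up scores with one extreme value
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 40)

toy_stats <- c(
  n       = length(x),
  mean    = mean(x),
  sd      = sd(x),
  median  = median(x),
  trimmed = mean(x, trim = 0.1),   # drops the lowest and highest 10%
  mad     = mad(x),                # median absolute deviation * 1.4826
  min     = min(x),
  max     = max(x),
  range   = max(x) - min(x),
  se      = sd(x) / sqrt(length(x))
)
round(toy_stats, 2)
```

Here the trimmed mean (8.25) and median (8) sit well below the mean (10.8), which is the kind of discrepancy that flags an outlier or a coding error.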
For very large data sets that are data.frames, describe can be rather slow. Converting the data to a matrix first is recommended. However, if the data are of different types, (factors or logical), this is not possible. If the data set includes columns of character data, it is also not possible. Thus, a quick pass with describeData is recommended.
For the greatest speed, at the cost of losing information, do not ask for ranges or for skew and turn off check. This is done automatically if the fast option is TRUE or for large data sets.
Note that by default, fast=NULL. But if the number of cases times the number of variables exceeds 10^7 (ncol * nrow > 10^7), fast will be set to TRUE. This will provide just n, mean, sd, min, max, range, and standard errors. To get all of the statistics (but at a cost of greater time) set fast=FALSE.
The problem seems to be a memory limitation, in that the time taken is an accelerating function of nvars * nobs. Thus, a largish problem (72,000 cases with 1680 variables) that might take 330 seconds can be done as two sets of 840 variables, which cuts the time down to 80 seconds.
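The splitting strategy can be sketched as follows. Both `describe_in_chunks` and `simple_stats` are hypothetical helpers: `simple_stats` is a base R stand-in so that the example runs without psych, and in practice one would pass a function such as describe instead.

```r
# Hypothetical helper: apply a describing function to blocks of columns
# and stack the per-variable results
describe_in_chunks <- function(x, describe_fun, chunk_size) {
  # Assign each column index to a block of at most chunk_size columns
  block_id <- ceiling(seq_len(ncol(x)) / chunk_size)
  blocks <- split(seq_len(ncol(x)), block_id)
  pieces <- lapply(blocks, function(cols) {
    describe_fun(x[, cols, drop = FALSE])
  })
  # unname() keeps rbind from prefixing row names with the block id
  do.call(rbind, unname(pieces))
}

# Stand-in for describe(): one row of statistics per variable
simple_stats <- function(x) {
  data.frame(n = colSums(!is.na(x)),
             mean = colMeans(x, na.rm = TRUE),
             row.names = colnames(x))
}

set.seed(1)
big <- as.data.frame(matrix(rnorm(60), nrow = 10, ncol = 6))
in_chunks <- describe_in_chunks(big, simple_stats, chunk_size = 4)
all_at_once <- simple_stats(big)
```

Because each variable is summarized independently, the stacked result matches a single pass. If the describing function numbers its variables, the numbering may restart within each block and need to be reset afterwards.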
The object returned is a data frame with the normal precision of R. However, to control the number of digits displayed, you can set digits in a print command, rather than losing precision at the descriptive stats level. See the last two examples. One just sets the number of digits; the other uses signif to make 'prettier' output where all numbers are displayed to the same number of significant digits.
Joanes, D.N. and Gill, C.A. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183-189.
describe.by, skew, kurtosi, interp.median, read.clipboard. Then, for graphic output, see error.crosses, pairs.panels, error.bars, error.bars.by, densityBy, or violinBy.
data(sat.act)
describe(sat.act)
#>           vars   n   mean     sd median trimmed    mad min max range  skew
#> gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
#> education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
#> age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64
#> ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66
#> SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64
#> SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59
#>           kurtosis   se
#> gender       -1.62 0.02
#> education    -0.07 0.05
#> age           2.42 0.36
#> ACT           0.53 0.18
#> SATV          0.33 4.27
#> SATQ         -0.02 4.41

describe(sat.act,skew=FALSE)
#>           vars   n   mean     sd min max range   se
#> gender       1 700   1.65   0.48   1   2     1 0.02
#> education    2 700   3.16   1.43   0   5     5 0.05
#> age          3 700  25.59   9.50  13  65    52 0.36
#> ACT          4 700  28.55   4.82   3  36    33 0.18
#> SATV         5 700 612.23 112.90 200 800   600 4.27
#> SATQ         6 687 610.22 115.64 200 800   600 4.41

describe(sat.act,IQR=TRUE) #show the interquartile Range
#>           vars   n   mean     sd median trimmed    mad min max range  skew
#> gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
#> education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
#> age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64
#> ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66
#> SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64
#> SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59
#>           kurtosis   se IQR
#> gender       -1.62 0.02   1
#> education    -0.07 0.05   1
#> age           2.42 0.36  10
#> ACT           0.53 0.18   7
#> SATV          0.33 4.27 150
#> SATQ         -0.02 4.41 170

describe(sat.act,quant=c(.1,.25,.5,.75,.90) ) #find the 10th, 25th, 50th, 75th and 90th percentiles
#>           vars   n   mean     sd median trimmed    mad min max range  skew
#> gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
#> education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
#> age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64
#> ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66
#> SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64
#> SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59
#>           kurtosis   se Q0.1 Q0.25 Q0.5 Q0.75 Q0.9
#> gender       -1.62 0.02    1     1    2     2    2
#> education    -0.07 0.05    1     3    3     4    5
#> age           2.42 0.36   18    19   22    29   39
#> ACT           0.53 0.18   22    25   29    32   35
#> SATV          0.33 4.27  450   550  620   700  750
#> SATQ         -0.02 4.41  450   530  620   700  750

describeData(sat.act) #the fast version
#> n.obs = 700 of which 687 are complete cases. Number of variables = 6 of which all are numeric TRUE
#>           variable # n.obs type  H1  H2  H3  H4  T1  T2  T3  T4
#> gender             1   700    1   2   2   2   1   1   2   1   1
#> education          2   700    1   3   3   3   4   4   3   4   5
#> age                3   700    1  19  23  20  27  40  24  35  25
#> ACT                4   700    1  24  35  21  26  27  31  32  25
#> SATV               5   700    1 500 600 480 550 613 700 700 600
#> SATQ               6   687    1 500 500 470 520 630 630 780 600

#now show how to adjust the displayed number of digits
des <- describe(sat.act) #find the descriptive statistics. Keep the original accuracy
des #show the normal output, which is rounded to 2 decimals
#>           vars   n   mean     sd median trimmed    mad min max range  skew
#> gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
#> education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
#> age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64
#> ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66
#> SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64
#> SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59
#>           kurtosis   se
#> gender       -1.62 0.02
#> education    -0.07 0.05
#> age           2.42 0.36
#> ACT           0.53 0.18
#> SATV          0.33 4.27
#> SATQ         -0.02 4.41

print(des,digits=3) #show the output, but round to 3 (trailing) digits
#>           vars   n    mean      sd median trimmed     mad min max range   skew
#> gender       1 700   1.647   0.478      2   1.684   0.000   1   2     1 -0.615
#> education    2 700   3.164   1.425      3   3.307   1.483   0   5     5 -0.681
#> age          3 700  25.594   9.499     22  23.863   5.930  13  65    52  1.643
#> ACT          4 700  28.547   4.824     29  28.843   4.448   3  36    33 -0.656
#> SATV         5 700 612.234 112.903    620 619.454 118.608 200 800   600 -0.644
#> SATQ         6 687 610.217 115.639    620 617.254 118.608 200 800   600 -0.593
#>           kurtosis    se
#> gender      -1.625 0.018
#> education   -0.075 0.054
#> age          2.424 0.359
#> ACT          0.535 0.182
#> SATV         0.325 4.267
#> SATQ        -0.018 4.412

print(des, signif=3) #round all numbers to the 3 significant digits
#>           vars   n   mean     sd median trimmed    mad min max range  skew
#> gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61
#> education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68
#> age          3 700  25.60   9.50     22   23.90   5.93  13  65    52  1.64
#> ACT          4 700  28.50   4.82     29   28.80   4.45   3  36    33 -0.66
#> SATV         5 700 612.00 113.00    620  619.00 119.00 200 800   600 -0.64
#> SATQ         6 687 610.00 116.00    620  617.00 119.00 200 800   600 -0.59
#>           kurtosis   se
#> gender       -1.62 0.02
#> education    -0.08 0.05
#> age           2.42 0.36
#> ACT           0.53 0.18
#> SATV          0.33 4.27
#> SATQ         -0.02 4.41