Here is an easy way to get descriptive statistics in R, similar to the basic statistics output in Excel. Most of the methods rely on R packages, so detailed descriptive statistics take only a line or two of code!
In this article, you will learn how to obtain descriptive statistics for the central tendency, variability, and distribution of continuous variables.
Sample data
We will use part of the Cars93 data set from the MASS package as sample data.
First, we select the "MPG.city", "Horsepower", and "Weight" columns from Cars93.
library(MASS)
mydata <- c("MPG.city", "Horsepower", "Weight")
The head() function displays only the first rows of the data (six by default), as shown below.
head(Cars93[mydata])
  MPG.city Horsepower Weight
1       25        140   2705
2       18        200   3560
3       20        172   3375
4       19        172   3405
5       22        208   3640
6       22        110   2880
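Before computing any statistics, it can help to confirm that the selected columns are numeric and complete. The following is a minimal sketch using base R only; the name cars_sub is hypothetical and does not appear elsewhere in this article.

```r
library(MASS)  # provides the Cars93 data set

mydata <- c("MPG.city", "Horsepower", "Weight")

# Hypothetical name: keep the three selected columns in their own data frame
cars_sub <- Cars93[mydata]

dim(cars_sub)             # number of rows and columns
sapply(cars_sub, class)   # storage class of each column
colSums(is.na(cars_sub))  # number of missing values per column
```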
summary() function
First is the summary() function.
summary(Cars93[mydata])
> summary(Cars93[mydata])
    MPG.city       Horsepower        Weight
 Min.   :15.00   Min.   : 55.0   Min.   :1695
 1st Qu.:18.00   1st Qu.:103.0   1st Qu.:2620
 Median :21.00   Median :140.0   Median :3040
 Mean   :22.37   Mean   :143.8   Mean   :3073
 3rd Qu.:25.00   3rd Qu.:170.0   3rd Qu.:3525
 Max.   :46.00   Max.   :300.0   Max.   :4105
The summary() function reports the following values:
- Minimum
- First quartile
- Median
- Mean (for numeric vectors), or frequencies (for factor and logical vectors)
- Third quartile
- Maximum
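To see where these values come from, the same numbers can be reproduced with individual base R functions. This is only a sketch for one column; the name six_values is hypothetical.

```r
library(MASS)  # provides the Cars93 data set

x <- Cars93$MPG.city

# Rebuild the six values that summary() prints for a numeric vector
six_values <- c(
  Min    = min(x),
  Q1     = unname(quantile(x, 0.25)),
  Median = median(x),
  Mean   = mean(x),
  Q3     = unname(quantile(x, 0.75)),
  Max    = max(x)
)
round(six_values, 2)

# For a factor, summary() returns the frequency of each level instead
summary(Cars93$Origin)
```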
describe(): psych package
Using the describe() function of the psych package, the following values can be obtained:
- Variable number (vars)
- Number of valid observations (n)
- Mean (mean)
- Standard deviation (sd)
- Median (median)
- Trimmed mean (trimmed): by default, 10% of the values are trimmed from each end.
- Median absolute deviation (mad)
- Minimum (min)
- Maximum (max)
- Range (range)
- Skewness (skew)
- Kurtosis (kurtosis)
- Standard error of the mean (se)
library(psych)
describe(Cars93[mydata])
> describe(Cars93[mydata])
           vars  n    mean     sd median trimmed    mad  min  max range
MPG.city      1 93   22.37   5.62     21   21.61   4.45   15   46    31
Horsepower    2 93  143.83  52.37    140  138.95  44.48   55  300   245
Weight        3 93 3072.90 589.90   3040 3080.60 704.24 1695 4105  2410
           skew kurtosis    se
MPG.city   1.65     3.58  0.58
Horsepower 0.92     0.90  5.43
Weight    -0.14    -0.92 61.17
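Some of the less familiar columns (trimmed, mad, se) can be cross-checked with base R. The sketch below assumes that describe() uses a 10% trim and the default scaling constant of stats::mad(); skew and kurtosis are left out because their exact formulas depend on the estimator type that psych uses.

```r
library(MASS)  # provides the Cars93 data set

x <- Cars93$MPG.city
n <- sum(!is.na(x))  # number of valid observations

trimmed_mean <- mean(x, trim = 0.1)  # drop 10% of the values from each end
mad_value    <- mad(x)               # stats::mad() scales by 1.4826 by default
se_mean      <- sd(x) / sqrt(n)      # standard error of the mean

round(c(trimmed = trimmed_mean, mad = mad_value, se = se_mean), 2)
```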
Note: Both the psych package and the Hmisc package (which is not covered here) provide a describe() function. I have omitted Hmisc's describe() because it is a bit confusing, but it also gives detailed descriptive statistics. So how does R decide which package's describe() to call?
The answer is very simple.
R gives priority to the most recently loaded package. In other words, if you load the Hmisc package after the psych package, Hmisc's describe() masks the psych version and is the one that runs when you call describe(), even though both packages are loaded.
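If you do not want the result to depend on loading order, you can name the package explicitly with the :: operator. The sketch below assumes psych may or may not be installed, so the call is guarded; psych_result is a hypothetical name.

```r
library(MASS)  # provides the Cars93 data set

mydata <- c("MPG.city", "Horsepower", "Weight")

# Call psych's describe() explicitly, without attaching the package,
# and only if psych is actually installed
if (requireNamespace("psych", quietly = TRUE)) {
  psych_result <- psych::describe(Cars93[mydata])
  print(psych_result)
}

# List every attached package or environment that defines describe()
find("describe")
```

With the package::function() form, it no longer matters which package was loaded last.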
stat.desc(): pastecs package
The pastecs package has a function called stat.desc(), which can also be used to obtain various descriptive statistics. It reports the following values:
- Number of values (nbr.val)
- Number of null values, that is, values equal to zero (nbr.null)
- Number of missing values (nbr.na)
- Minimum value (min)
- Maximum value (max)
- Range, max minus min (range)
- Sum of all non-missing values (sum)
- Median (median)
- Mean (mean)
- Standard error of the mean (SE.mean)
- Half-width of the 95% confidence interval of the mean (CI.mean.0.95): the interval is the mean plus or minus this value.
- Variance (var)
- Standard deviation (std.dev)
- Coefficient of variation, standard deviation divided by the mean (coef.var)
library(pastecs)
stat.desc(Cars93[mydata])
> stat.desc(Cars93[mydata])
MPG.city Horsepower Weight
nbr.val 93.0000000 9.300000e+01 9.300000e+01
nbr.null 0.0000000 0.000000e+00 0.000000e+00
nbr.na 0.0000000 0.000000e+00 0.000000e+00
min 15.0000000 5.500000e+01 1.695000e+03
max 46.0000000 3.000000e+02 4.105000e+03
range 31.0000000 2.450000e+02 2.410000e+03
sum 2080.0000000 1.337600e+04 2.857800e+05
median 21.0000000 1.400000e+02 3.040000e+03
mean 22.3655914 1.438280e+02 3.072903e+03
SE.mean 0.5827473 5.430973e+00 6.116942e+01
CI.mean.0.95 1.1573865 1.078638e+01 1.214877e+02
var 31.5822814 2.743079e+03 3.479779e+05
std.dev 5.6198115 5.237441e+01 5.898965e+02
coef.var 0.2512704 3.641462e-01 1.919672e-01
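The SE.mean, CI.mean.0.95, and coef.var rows can be verified by hand. This is a sketch in base R, assuming the confidence interval is based on the t distribution with n - 1 degrees of freedom, which agrees with the output above.

```r
library(MASS)  # provides the Cars93 data set

x <- Cars93$MPG.city
n <- sum(!is.na(x))

se_mean       <- sd(x) / sqrt(n)                  # SE.mean
ci_half_width <- qt(0.975, df = n - 1) * se_mean  # CI.mean.0.95
coef_var      <- sd(x) / mean(x)                  # coef.var

c(SE.mean = se_mean, CI.mean.0.95 = ci_half_width, coef.var = coef_var)

# Lower and upper limits of the 95% confidence interval of the mean
mean(x) + c(-1, 1) * ci_half_width
```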
Basic statistics by classification using the aggregate() function
Sometimes you want to compare descriptive statistics between groups (by classification). In that case, you need descriptive statistics for each group rather than for the entire sample.
The aggregate() function can be used to obtain descriptive statistics for each group.
First, the basic form of the aggregate() function is as follows:
aggregate(x, by, FUN)
- x: The columns to be aggregated
- by: The grouping variables that define the categories. They must be given as a list, that is, by = list(...).
- FUN: The function used to compute the statistic for each group
For example, here is a comparison of the "mydata" columns, grouped by Origin (USA or non-USA).
aggregate(Cars93[mydata], by = list(Origin = Cars93$Origin), mean)
> aggregate(Cars93[mydata], by = list(Origin = Cars93$Origin), mean)
   Origin MPG.city Horsepower   Weight
1     USA 20.95833   147.5208 3195.312
2 non-USA 23.86667   139.8889 2942.333
Here the data is grouped by by = list(Origin = Cars93$Origin).
To group by multiple variables, list them all in by as follows:
by=list(name1=groupvar1, name2=groupvar2, ... , groupvarN)
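As a concrete sketch of the multi-variable form, the call below groups by Origin and by DriveTrain, another column of Cars93. The choice of DriveTrain and the name by_origin_drive are only for illustration.

```r
library(MASS)  # provides the Cars93 data set

mydata <- c("MPG.city", "Horsepower", "Weight")

# One row per combination of Origin and DriveTrain that occurs in the data
by_origin_drive <- aggregate(
  Cars93[mydata],
  by  = list(Origin = Cars93$Origin, DriveTrain = Cars93$DriveTrain),
  FUN = mean
)
by_origin_drive
```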
On the other hand, aggregate() accepts only a single function, so a simple function such as mean returns just one statistic per call. In the example above, only the mean is obtained.
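One possible workaround, which goes beyond the basic usage shown in this article, is to pass a single function that itself returns several values. The helper mean_sd below is hypothetical. Note that the result is less tidy: each statistic ends up inside a matrix column of the returned data frame.

```r
library(MASS)  # provides the Cars93 data set

mydata <- c("MPG.city", "Horsepower", "Weight")

# Hypothetical helper: a single function that returns two statistics
mean_sd <- function(x) c(mean = mean(x), sd = sd(x))

by_origin <- aggregate(
  Cars93[mydata],
  by  = list(Origin = Cars93$Origin),
  FUN = mean_sd
)
by_origin

# Each aggregated column is a matrix with one column per statistic
by_origin$MPG.city[, "mean"]
```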
describeBy() function: basic statistics by classification using psych package
Using the describeBy() function of the psych package, you can easily obtain basic statistics by group.
library(psych)
describeBy(Cars93[mydata], Cars93$Origin)
Descriptive statistics by group
group: USA
           vars  n    mean     sd median trimmed    mad  min  max range  skew
MPG.city      1 48   20.96   3.99   20.0   20.60   4.45   15   31    16  0.80
Horsepower    2 48  147.52  54.45  143.5  141.45  49.67   63  300   237  1.10
Weight        3 48 3195.31 565.23 3282.5 3212.00 641.22 1845 4105  2260 -0.29
           kurtosis    se
MPG.city       0.09  0.58
Horsepower     1.22  7.86
Weight        -0.97 81.58
---------------------------------------------------------------
group: non-USA
           vars  n    mean     sd median trimmed    mad  min  max range skew
MPG.city      1 45   23.87   6.67     22   22.86   5.93   17   46    29 1.43
Horsepower    2 45  139.89  50.37    135  136.24  48.93   55  278   223 0.62
Weight        3 45 2942.33 593.75   2950 2940.54 704.24 1695 4100  2405 0.05
           kurtosis    se
MPG.city       1.89  0.99
Horsepower    -0.04  7.51
Weight        -0.85 88.51
As shown in the results above, we were able to obtain descriptive statistics separately for the USA and non-USA groups.
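If the psych package is not available, a subset of these per-group statistics can be approximated in base R by splitting the data first. This is a rough sketch, not a replacement for describeBy(); the names groups and group_stats are hypothetical.

```r
library(MASS)  # provides the Cars93 data set

mydata <- c("MPG.city", "Horsepower", "Weight")

# Split the selected columns into one data frame per Origin level
groups <- split(Cars93[mydata], Cars93$Origin)

# For each group, build a matrix with one row per variable
group_stats <- lapply(groups, function(df) {
  t(sapply(df, function(x) {
    c(n = length(x), mean = mean(x), sd = sd(x), se = sd(x) / sqrt(length(x)))
  }))
})

lapply(group_stats, round, 2)
```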
Summary
In this article, "Descriptive Statistics Made Easy with R!", I introduced how to easily obtain basic statistics from collected data. Since most of the methods are based on packages, they are very easy to put into practice, so please give them a try.