Descriptive statistics made easy in R: basic statistics with summary, describe, stat.desc, aggregate, describeBy

This article shows easy ways to get descriptive statistics in R, similar to the basic statistics output in Excel. Most of the functions come from R packages, so detailed descriptive statistics can be obtained with very little code.

The article covers descriptive statistics for the central tendency, variability, and distribution of continuous variables.

Sample data

We will use some data from Cars93 in the MASS package as sample data.

First, we select the "MPG.city", "Horsepower", and "Weight" columns of Cars93. Note that mydata holds only the column names; the data itself is extracted with Cars93[mydata].

library(MASS)
mydata <- c("MPG.city", "Horsepower", "Weight")

The head() function displays only the first six rows of the data, as shown below.

head(Cars93[mydata])
  MPG.city Horsepower Weight
1       25        140   2705
2       18        200   3560
3       20        172   3375
4       19        172   3405
5       22        208   3640
6       22        110   2880
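
Because mydata stores only column names, it can help to save the selection as its own data frame and check its shape. The following is a minimal sketch; the name cars_sub is an assumption made for this illustration and is not used elsewhere in the article.

```r
library(MASS)  # provides the Cars93 data set

mydata <- c("MPG.city", "Horsepower", "Weight")

# cars_sub is a hypothetical name used only in this sketch
cars_sub <- Cars93[mydata]

dim(cars_sub)                 # number of rows and columns in the subset
sapply(cars_sub, is.numeric)  # check that every selected column is numeric
```

Any function in this article that takes Cars93[mydata] should accept such a saved subset in the same way.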

summary() function

We start with the summary() function, which is part of base R.

summary(Cars93[mydata])
> summary(Cars93[mydata])
    MPG.city       Horsepower        Weight    
 Min.   :15.00   Min.   : 55.0   Min.   :1695  
 1st Qu.:18.00   1st Qu.:103.0   1st Qu.:2620  
 Median :21.00   Median :140.0   Median :3040  
 Mean   :22.37   Mean   :143.8   Mean   :3073  
 3rd Qu.:25.00   3rd Qu.:170.0   3rd Qu.:3525  
 Max.   :46.00   Max.   :300.0   Max.   :4105  

The summary() function reports the following for each variable:

  • Minimum value
  • 1st quartile
  • Median
  • Mean (for numeric vectors), or frequencies (for factor and logical vectors)
  • 3rd quartile
  • Maximum value
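
The difference between numeric, factor, and logical input can be seen with a short sketch. The 25 mpg cutoff below is an arbitrary value chosen only for illustration, and origin_counts is a hypothetical variable name.

```r
library(MASS)

# Factor: summary() counts the rows in each level
origin_counts <- summary(Cars93$Origin)
origin_counts

# Logical: summary() counts FALSE and TRUE
# (the 25 mpg cutoff is an arbitrary value chosen for illustration)
summary(Cars93$MPG.city > 25)

# Numeric: minimum, quartiles, median, mean, and maximum
summary(Cars93$MPG.city)
```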

describe(): psych package

Using the describe() function of the psych package, the following values can be obtained:

  • Variable number (vars)
  • Number of valid observations (n)
  • Mean
  • Standard deviation (sd)
  • Median
  • Trimmed mean (trimmed): by default, 10% of the values are trimmed from each end before averaging.
  • Median absolute deviation (mad): a robust measure of spread around the median.
  • Minimum (min)
  • Maximum (max)
  • range
  • skew
  • kurtosis
  • Standard error (se)

library(psych)
describe(Cars93[mydata])
> describe(Cars93[mydata])
           vars  n    mean     sd median trimmed    mad  min  max range
MPG.city      1 93   22.37   5.62     21   21.61   4.45   15   46    31
Horsepower    2 93  143.83  52.37    140  138.95  44.48   55  300   245
Weight        3 93 3072.90 589.90   3040 3080.60 704.24 1695 4105  2410
            skew kurtosis    se
MPG.city    1.65     3.58  0.58
Horsepower  0.92     0.90  5.43
Weight     -0.14    -0.92 61.17
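
Several of these columns can be cross-checked with base R, which also clarifies what they mean. The sketch below assumes that describe() trims 10% from each tail, uses the default scaling of mad(), and computes the standard error as sd / sqrt(n); under those assumptions the results should agree with the MPG.city row above after rounding.

```r
library(MASS)

x <- Cars93$MPG.city

trimmed <- mean(x, trim = 0.1)      # drop 10% of values from each end, then average
mad_x   <- mad(x)                   # median absolute deviation, scaled by 1.4826
se      <- sd(x) / sqrt(length(x))  # standard error of the mean

round(c(trimmed = trimmed, mad = mad_x, se = se), 2)
```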

Note: Both the psych package and the Hmisc package (not covered here) provide a function named describe(). I have left out Hmisc's describe() because it is a little confusing, but it also gives detailed descriptive statistics. So how does R decide which package's describe() to run?
The answer is simple.
R gives priority to the most recently loaded package. In other words, if you load Hmisc after psych, Hmisc's describe() masks the psych version and is the one that runs when you call describe(). To choose explicitly, prefix the function with the package name, as in psych::describe() or Hmisc::describe().

stat.desc(): pastecs package

The pastecs package has a function stat.desc(), which can also be used to obtain various descriptive statistics:

  • Number of values (nbr.val)
  • Number of null values, that is, values equal to zero (nbr.null)
  • Number of missing values (nbr.na)
  • Minimum value (min)
  • Maximum value (max)
  • Range (range): max - min
  • Sum of all non-missing values (sum)
  • Median
  • Mean
  • Standard error of the mean (SE.mean)
  • 95% confidence interval of the mean (CI.mean.0.95): reported as the half-width, so the interval is the mean plus or minus this value
  • Variance (var)
  • Standard deviation (std.dev)
  • Coefficient of variation (coef.var): standard deviation divided by the mean

library(pastecs)
stat.desc(Cars93[mydata])
> stat.desc(Cars93[mydata])
                 MPG.city   Horsepower       Weight
nbr.val        93.0000000 9.300000e+01 9.300000e+01
nbr.null        0.0000000 0.000000e+00 0.000000e+00
nbr.na          0.0000000 0.000000e+00 0.000000e+00
min            15.0000000 5.500000e+01 1.695000e+03
max            46.0000000 3.000000e+02 4.105000e+03
range          31.0000000 2.450000e+02 2.410000e+03
sum          2080.0000000 1.337600e+04 2.857800e+05
median         21.0000000 1.400000e+02 3.040000e+03
mean           22.3655914 1.438280e+02 3.072903e+03
SE.mean         0.5827473 5.430973e+00 6.116942e+01
CI.mean.0.95    1.1573865 1.078638e+01 1.214877e+02
var            31.5822814 2.743079e+03 3.479779e+05
std.dev         5.6198115 5.237441e+01 5.898965e+02
coef.var        0.2512704 3.641462e-01 1.919672e-01
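
The derived rows at the bottom of this table can be reproduced by hand. The sketch below is a cross-check under the assumption that CI.mean.0.95 is the half-width of a t-based confidence interval with n - 1 degrees of freedom; the variable names are hypothetical.

```r
library(MASS)

x <- Cars93$MPG.city
n <- length(x)

se_mean  <- sd(x) / sqrt(n)                  # SE.mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # CI.mean.0.95 (half-width)
coef_var <- sd(x) / mean(x)                  # coef.var

c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var)

# Lower and upper limits of the 95% confidence interval for the mean
mean(x) + c(-1, 1) * ci_half
```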

Basic statistics by classification using the aggregate() function

Sometimes you want to compare groups (classifications) rather than describe the whole sample. In that case you need descriptive statistics for each group separately.

This can be done using the aggregate() function to obtain descriptive statistics for each group.

First, the basic form of the aggregate() function is as follows:

aggregate(x, by, FUN)

  • x: Data frame (or columns) to be aggregated
  • by: List of grouping variables, written as by = list(...)
  • FUN: Function applied to each group

For example, here is a comparison of the "mydata" columns, classified by Origin (USA or non-USA).

aggregate(Cars93[mydata], by = list(Origin = Cars93$Origin), mean)
> aggregate(Cars93[mydata], by = list(Origin = Cars93$Origin), mean)
   Origin MPG.city Horsepower   Weight
1     USA 20.95833   147.5208 3195.312
2 non-USA 23.86667   139.8889 2942.333

The grouping is specified with by = list(Origin = Cars93$Origin). The name given inside list() becomes the header of the grouping column in the result.

To group by several variables, list them all:

by = list(name1 = groupvar1, name2 = groupvar2, ..., nameN = groupvarN)
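
As a concrete sketch of grouping by two variables, the example below adds DriveTrain (another factor column of Cars93) as a second grouping variable. The choice of DriveTrain and the name by_two are assumptions made for illustration.

```r
library(MASS)

mydata <- c("MPG.city", "Horsepower", "Weight")

# Group by Origin and DriveTrain at the same time
by_two <- aggregate(
  Cars93[mydata],
  by = list(Origin = Cars93$Origin, DriveTrain = Cars93$DriveTrain),
  FUN = mean
)
by_two
```

The result has one row for each combination of Origin and DriveTrain that actually occurs in the data.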

Note that aggregate() accepts only one function through FUN, so a built-in function such as mean returns a single statistic per group, as in the example above. Obtaining several statistics at once requires either repeated calls or a custom function that returns a vector.
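
One possible workaround, sketched here rather than taken from the original article, is to pass a custom function that returns a named vector. The names multi and multi_flat are hypothetical.

```r
library(MASS)

# FUN may return a named vector, giving several statistics per group
multi <- aggregate(
  Cars93["MPG.city"],
  by = list(Origin = Cars93$Origin),
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

# The statistics are stored as a matrix column; flatten it into ordinary columns
multi_flat <- do.call(data.frame, multi)
multi_flat
```

This still becomes awkward as the number of statistics grows, which is where a dedicated function for grouped summaries is more convenient.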

describeBy() function: basic statistics by classification using psych package

Using the describeBy() function of the psych package, you can easily obtain basic statistics by group.

library(psych)
describeBy(Cars93[mydata], Cars93$Origin)
 Descriptive statistics by group 
group: USA
           vars  n    mean     sd median trimmed    mad  min  max range  skew
MPG.city      1 48   20.96   3.99   20.0   20.60   4.45   15   31    16  0.80
Horsepower    2 48  147.52  54.45  143.5  141.45  49.67   63  300   237  1.10
Weight        3 48 3195.31 565.23 3282.5 3212.00 641.22 1845 4105  2260 -0.29
           kurtosis    se
MPG.city       0.09  0.58
Horsepower     1.22  7.86
Weight        -0.97 81.58
--------------------------------------------------------------- 
group: non-USA
           vars  n    mean     sd median trimmed    mad  min  max range skew
MPG.city      1 45   23.87   6.67     22   22.86   5.93   17   46    29 1.43
Horsepower    2 45  139.89  50.37    135  136.24  48.93   55  278   223 0.62
Weight        3 45 2942.33 593.75   2950 2940.54 704.24 1695 4100  2405 0.05
           kurtosis    se
MPG.city       1.89  0.99
Horsepower    -0.04  7.51
Weight        -0.85 88.51

As the results above show, describeBy() returns the full set of descriptive statistics separately for the USA and non-USA groups.
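
If the psych package is not available, the group sizes and means can be cross-checked with base R. This is only a rough sketch using split() and sapply(), not a replacement for describeBy(); the variable names are hypothetical.

```r
library(MASS)

mydata <- c("MPG.city", "Horsepower", "Weight")

# Split the selected columns by Origin, then summarise each piece
groups <- split(Cars93[mydata], Cars93$Origin)

group_n    <- sapply(groups, nrow)      # number of cars in each group
group_mean <- sapply(groups, colMeans)  # one column of means per group

group_n
round(group_mean, 2)
```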

Summary

In this article, "Descriptive Statistics Made Easy with R!", I introduced several ways to obtain basic statistics from collected data. Since most of the methods rely on packages, they take very little code to apply, so please give them a try.