Calculating Descriptive Statistics
Hello again and welcome back to Exploratory Data Analysis with R. In this module, we'll learn how to calculate descriptive statistics for our data. First, we'll begin with an introduction to descriptive statistics, that is, describing the characteristics of our data in meaningful ways. Next, we'll learn about the types of analysis that we can perform on our data. Then, we'll learn how to avoid making errors and invalid claims with our descriptive statistics. Finally, we'll see a demo where we'll put all these concepts together.
Descriptive statistics describe data in meaningful ways. In exploratory data analysis, we're typically trying to quickly assess the location, spread, shape, and interdependence of our data. These types of descriptive statistics are often referred to as summary statistics because they summarize the shape and feel of the data. For example, in this table, we have the summary statistics for a single variable, that is, the movie runtime variable in our movies data set. We have the Minimum, which is the lowest value in the column; the First Quartile, which is the value that cuts off the lowest 25% of the values; the Median, which is the value that separates the lower half from the upper half of the values; the Mean, which is the arithmetic average of all of the values in the column; the Third Quartile, which is the cutoff value for the 75th percentile of the values; and the Maximum, which is the highest value in the column. By looking at these summary statistics, we can quickly learn about the location and spread of our data. We don't want to go too deep into statistics for this course, but in order to discuss descriptive statistics, we need to understand a few basic terms from statistics. First, we have observations, which are essentially the rows in the table. They are referred to as observations because in statistics we're typically concerned with observations of some kind of physical phenomenon. For example, if we have a temperature sensor, each recorded temperature over time would correspond to an observation of the temperature. Similarly, we could have transactions, like pizza sales transactions, or entities, like feature-length films, as the phenomena we are observing. For our purposes, however, we'll refer to them all generically as observations. Next, we have variables, which are the columns in the table. They are called variables because their values vary from one observation to the next.
For example, in the table to the right, Date, Customer, Product, and Quantity are all variables that can change value across each row in the table. There are two types of variables. First, we have qualitative variables. Qualitative variables contain categorical values, for example, customers and products. In addition, they have no natural sense of order. We can, however, impose an arbitrary order upon them, like using the alphabetical sort order of their names. This is just an artificial, not a natural, means of sorting them. Qualitative variables are often referred to as nominal variables because they are named values. Second, we have quantitative variables. Quantitative variables contain numeric values, for example, the quantity of products sold. In addition, they do possess a natural sense of order, for example, 2 pizzas sold is more than 1 pizza sold. Quantitative variables can also be subdivided into either discrete values, that is, whole numbers represented as integers, or continuous values, that is, all possible points on the real number line, typically represented as decimal numbers. In addition, quantitative variables can also be subdivided into ordinal, interval, and ratio subtypes. However, these subdivisions, their differences, and statistical limitations are outside of the scope of this course. There are several types of statistical analysis we can do, and which one applies depends on two things: the number of variables, that is, whether we perform univariate (single-variable) or bivariate (two-variable) analysis, and the type of variables, that is, whether we have qualitative (categorical) or quantitative (numerical) values. We'll dig deeper into each of these types of analysis next.
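To make these terms a bit more concrete, here is a minimal sketch in base R. It assumes a small, made-up pizza sales table: the `sales` data frame, its column names, and all of its values are hypothetical and are not the course's data. It shows rows as observations, columns as variables, the six summary statistics for a quantitative variable, and simple counts for qualitative variables.

```r
# Hypothetical pizza sales data (all values invented for illustration).
# Each row is one observation; each column is one variable.
sales <- data.frame(
  Customer = c("Ann", "Bob", "Ann", "Cara", "Bob", "Ann"),
  Product  = c("Cheese", "Pepperoni", "Cheese", "Veggie", "Cheese", "Pepperoni"),
  Quantity = c(1L, 2L, 1L, 3L, 5L, 2L),
  stringsAsFactors = FALSE
)

n_observations <- nrow(sales)   # number of rows, i.e. observations
variable_names <- names(sales)  # column names, i.e. variables

# Univariate analysis of a quantitative variable:
# summary() reports the minimum, first quartile, median, mean,
# third quartile, and maximum.
quantity_summary <- summary(sales$Quantity)
print(quantity_summary)

# Univariate analysis of a qualitative variable:
# there is no natural order, so we count how often each category occurs.
product_counts <- table(sales$Product)
print(product_counts)

# Converting to a factor sorts the categories alphabetically by default.
# This is an imposed order, not a natural one.
sales$Product <- factor(sales$Product)
product_levels <- levels(sales$Product)

# Bivariate analysis of two qualitative variables:
# a two-way table counts each Customer/Product combination.
customer_by_product <- table(sales$Customer, sales$Product)
print(customer_by_product)
```

Note that `summary()` only gives the six-number view for the numeric `Quantity` column. For `Customer` and `Product`, a frequency table is usually the more meaningful starting point, which is one reason the type of variable drives the type of analysis.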