GROUP BY

This demo uses both the iris and dplyr::starwars datasets. The startwars dataset has list columns removed.

The GROUP BY statement allows data to be aggregated in subsets.

Example: What is the average petal length of each iris species?

sqldf("
  SELECT Species, AVG(`Petal.Length`) AS average_petal_length
  FROM iris
  GROUP BY Species;
")

##      Species average_petal_length
## 1     setosa                1.462
## 2 versicolor                4.260
## 3  virginica                5.552

Example: For the virginica and setosa species what is the median sepal length? List the longest first.

sqldf("
  SELECT MEDIAN(`Sepal.Width`) AS mean_sepal_width, Species FROM iris
  WHERE Species IN ('virginica','setosa')
  GROUP BY Species 
  ORDER BY mean_sepal_width DESC
")

##   mean_sepal_width   Species
## 1              3.4    setosa
## 2              3.0 virginica

Example: From the star wars dataset, what is the average character height from each homeworld? List the 7 with the shortest average.

sqldf("
  SELECT AVG(height) AS average_height, homeworld FROM sw  
  WHERE homeworld NOT NULL AND
  height NOT NULL
  GROUP BY homeworld
  ORDER BY average_height ASC
  LIMIT 7
")

##   average_height   homeworld
## 1             79 Aleen Minor
## 2             88       Endor
## 3             94     Vulpter
## 4            112   Malastare
## 5            122     Troiken
## 6            137    Toydaria
## 7            150   Chandrila

Example: What is the maximum height from species from each homeworld? List the top 3

sqldf("
  SELECT MAX(height) AS max_height, homeworld FROM sw
  WHERE homeworld NOT NULL AND
  height NOT NULL
  GROUP BY homeworld
  ORDER BY max_height DESC
  LIMIT 3
")

##   max_height homeworld
## 1        264   Quermia
## 2        234  Kashyyyk
## 3        229    Kamino

GROUP BY

Adam Caffrey

07/01/2021