Country populations over time (Complete)

Introduction

Goal

Our ultimate goal in this application exercise is to make the following data visualization.

Line plot of country populations for the United States, India, and China between 2000 and 2023.

Packages

We will use the tidyverse and scales packages for data wrangling and visualization.

library(tidyverse)
library(scales)

Data

These data come from The World Bank and reflect population counts for the years 2000 to 2023. The populations given are mid-year estimates.

population <- read_csv("https://data-science-with-r.github.io/data/population.csv")

Let’s take a look at the data.

population

# A tibble: 217 × 28
   series_name series_code country_name country_code `2000` `2001` `2002` `2003`
   <chr>       <chr>       <chr>        <chr>         <dbl>  <dbl>  <dbl>  <dbl>
 1 Population… SP.POP.TOTL Afghanistan  AFG          1.95e7 1.97e7 2.10e7 2.26e7
 2 Population… SP.POP.TOTL Albania      ALB          3.09e6 3.06e6 3.05e6 3.04e6
 3 Population… SP.POP.TOTL Algeria      DZA          3.08e7 3.12e7 3.16e7 3.21e7
 4 Population… SP.POP.TOTL American Sa… ASM          5.82e4 5.83e4 5.82e4 5.79e4
 5 Population… SP.POP.TOTL Andorra      AND          6.61e4 6.78e4 7.08e4 7.39e4
 6 Population… SP.POP.TOTL Angola       AGO          1.64e7 1.69e7 1.75e7 1.81e7
 7 Population… SP.POP.TOTL Antigua and… ATG          7.51e4 7.62e4 7.72e4 7.81e4
 8 Population… SP.POP.TOTL Argentina    ARG          3.71e7 3.75e7 3.79e7 3.83e7
 9 Population… SP.POP.TOTL Armenia      ARM          3.17e6 3.13e6 3.11e6 3.08e6
10 Population… SP.POP.TOTL Aruba        ABW          8.91e4 9.07e4 9.18e4 9.27e4
# ℹ 207 more rows
# ℹ 20 more variables: `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
#   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
#   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>,
#   `2018` <dbl>, `2019` <dbl>, `2020` <dbl>, `2021` <dbl>, `2022` <dbl>,
#   `2023` <dbl>

Analysis

Tidying

What are the aesthetic mappings in the plot shown above, i.e., what pieces of information do we need represented as columns (variables) in our data frame in order to be able to recreate this plot?

x: year

y: population

color, shape: country_name

Reshape the population data such that it can be used to recreate the plot above. Note: For now, you can keep all the countries in the dataset.

population |>
  pivot_longer(
    cols = `2000`:`2023`,
    names_to = "year",
    values_to = "population"
  )

# A tibble: 5,208 × 6
   series_name       series_code country_name country_code year  population
   <chr>             <chr>       <chr>        <chr>        <chr>      <dbl>
 1 Population, total SP.POP.TOTL Afghanistan  AFG          2000    19542982
 2 Population, total SP.POP.TOTL Afghanistan  AFG          2001    19688632
 3 Population, total SP.POP.TOTL Afghanistan  AFG          2002    21000256
 4 Population, total SP.POP.TOTL Afghanistan  AFG          2003    22645130
 5 Population, total SP.POP.TOTL Afghanistan  AFG          2004    23553551
 6 Population, total SP.POP.TOTL Afghanistan  AFG          2005    24411191
 7 Population, total SP.POP.TOTL Afghanistan  AFG          2006    25442944
 8 Population, total SP.POP.TOTL Afghanistan  AFG          2007    25903301
 9 Population, total SP.POP.TOTL Afghanistan  AFG          2008    26427199
10 Population, total SP.POP.TOTL Afghanistan  AFG          2009    27385307
# ℹ 5,198 more rows

What is the type of the year variable? Why? What should it be?

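Character, because pivot_longer() builds the year column from the original column names, and column names are always strings. It should be numeric.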
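
To see where the character type comes from, it can help to run the same pivot on a tiny data frame. The sketch below uses a hypothetical tibble called toy with invented values (not the World Bank data) and checks the type of the column names before pivoting and of year afterwards.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data with invented values, for illustration only
toy <- tibble(
  country_name = c("A", "B"),
  `2000` = c(10, 20),
  `2001` = c(11, 21)
)

# Column names are character strings, even when they look like numbers
class(names(toy))

toy_longer <- toy |>
  pivot_longer(
    cols = `2000`:`2001`,
    names_to = "year",
    values_to = "population"
  )

# year is built from those names, so it inherits their character type
class(toy_longer$year)
```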

Start over with pivoting, and this time also make sure year is a numerical variable in the resulting data frame. Save the resulting data frame as population_longer.

population_longer <- population |>
  pivot_longer(
    cols = `2000`:`2023`,
    names_to = "year",
    values_to = "population",
    names_transform = as.numeric
  )
population_longer

# A tibble: 5,208 × 6
   series_name       series_code country_name country_code  year population
   <chr>             <chr>       <chr>        <chr>        <dbl>      <dbl>
 1 Population, total SP.POP.TOTL Afghanistan  AFG           2000   19542982
 2 Population, total SP.POP.TOTL Afghanistan  AFG           2001   19688632
 3 Population, total SP.POP.TOTL Afghanistan  AFG           2002   21000256
 4 Population, total SP.POP.TOTL Afghanistan  AFG           2003   22645130
 5 Population, total SP.POP.TOTL Afghanistan  AFG           2004   23553551
 6 Population, total SP.POP.TOTL Afghanistan  AFG           2005   24411191
 7 Population, total SP.POP.TOTL Afghanistan  AFG           2006   25442944
 8 Population, total SP.POP.TOTL Afghanistan  AFG           2007   25903301
 9 Population, total SP.POP.TOTL Afghanistan  AFG           2008   26427199
10 Population, total SP.POP.TOTL Afghanistan  AFG           2009   27385307
# ℹ 5,198 more rows
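
One alternative, sketched here on a hypothetical toy tibble with invented values, is to pivot first and convert year afterwards with mutate(). Both routes should yield a numeric year column; names_transform just folds the conversion into the pivot step.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data with invented values, for illustration only
toy <- tibble(
  country_name = c("A", "B"),
  `2000` = c(10, 20),
  `2001` = c(11, 21)
)

# Option 1: convert the names while pivoting
via_transform <- toy |>
  pivot_longer(
    cols = `2000`:`2001`,
    names_to = "year",
    values_to = "population",
    names_transform = as.numeric
  )

# Option 2: pivot first, then convert the character column
via_mutate <- toy |>
  pivot_longer(
    cols = `2000`:`2001`,
    names_to = "year",
    values_to = "population"
  ) |>
  mutate(year = as.numeric(year))

# Compare the year columns produced by the two options
all.equal(via_transform$year, via_mutate$year)
```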

Visualization

Now we start making our plot, but let’s not get too fancy right away. Create a line plot of populations of the United States, India, and China over the years. Represent the data with points and lines.

population_longer |>
  filter(country_name %in% c("United States", "India", "China")) |>
  ggplot(aes(x = year, y = population, color = country_name)) +
  geom_point() +
  geom_line()
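
The filter above uses %in% rather than ==. The following sketch, on a hypothetical toy tibble, illustrates why: == compares element by element and recycles the shorter vector, so it can silently drop rows that %in% keeps.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, for illustration only
toy <- tibble(
  country_name = c("China", "India", "United States", "Brazil", "China", "India")
)
targets <- c("United States", "India", "China")

# %in% asks, for each row, whether the country is one of the targets
with_in <- toy |>
  filter(country_name %in% targets)

# == lines the targets up against the rows position by position
# (recycling them), so most matching rows are missed
with_eq <- toy |>
  filter(country_name == targets)

nrow(with_in)
nrow(with_eq)
```

In this toy example, %in% keeps the five rows for the three target countries, while == keeps only the one row that happens to line up with the recycled targets.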

What aspects of the plot need to be updated to go from the draft you created above to the goal plot at the beginning of this application exercise?

x- and y-axis scales

shapes and colors

theme

labels

placement of legend

Use different shapes for each country’s points.

population_longer |>
  filter(country_name %in% c("United States", "India", "China")) |>
  ggplot(aes(x = year, y = population, color = country_name, shape = country_name)) +
  geom_point() +
  geom_line()

Update the x-axis scale so that the years displayed run from 2000 to 2024 in increments of 4 years.

population_longer |>
  filter(country_name %in% c("United States", "India", "China")) |>
  ggplot(aes(x = year, y = population, color = country_name, shape = country_name)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2024), breaks = seq(2000, 2024, 4))
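
The breaks argument is just a numeric vector, so it can be inspected on its own before it goes into the plot. A minimal sketch follows; year_breaks is a name introduced here for illustration.

```r
# Every fourth year, starting at 2000 and ending at 2024
year_breaks <- seq(2000, 2024, 4)
year_breaks
# 2000 2004 2008 2012 2016 2020 2024
```

The last break, 2024, lies one year past the final observation (2023), which is presumably why the limits are extended to 2024: a break outside the axis limits would not be drawn.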

Update the y-axis so it’s scaled to millions and uses the same breaks as the goal plot.

population_longer |>
  filter(country_name %in% c("United States", "India", "China")) |>
  ggplot(aes(x = year, y = population, color = country_name, shape = country_name)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2024), breaks = seq(2000, 2024, 4)) +
  scale_y_continuous(
    breaks = seq(250000000, 1250000000, 250000000),
    labels = label_number(scale = 1/1000000, suffix = " mil")
  )
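
label_number() returns a function, so the labelling can be tried out on the break values directly, outside of any plot. The sketch below does that; pop_breaks and to_millions are names introduced here for illustration. No exact output is shown for the larger values because the thousands separator depends on the scales defaults in use.

```r
library(scales)

# The same breaks used on the y-axis
pop_breaks <- seq(250000000, 1250000000, 250000000)

# A labelling function: divide by one million and append " mil"
to_millions <- label_number(scale = 1/1000000, suffix = " mil")

# Apply it to the breaks to preview the axis labels
to_millions(pop_breaks)
```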

Update colors for each country using the following level / color assignments.
- “United States” = “#0A3161”
- “India” = “#FF671F”
- “China” = “#EE1C25”

population_longer |>
  filter(country_name %in% c("United States", "India", "China")) |>
  ggplot(aes(x = year, y = population, color = country_name, shape = country_name)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2024), breaks = seq(2000, 2024, 4)) +
  scale_y_continuous(
    breaks = seq(250000000, 1250000000, 250000000),
    labels = label_number(scale = 1/1000000, suffix = " mil")
  ) +
  scale_color_manual(
    values = c(
      "United States" = "#0A3161",
      "India" = "#FF671F",
      "China" = "#EE1C25"
    )
  )
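
Because values is a named vector, colors are matched to countries by name rather than by position. The sketch below checks this on a hypothetical one-year toy data frame with invented values, using layer_data() to read back the colors ggplot2 assigned; country_colors is a name introduced here for illustration.

```r
library(ggplot2)

# Named vector: names are the levels, values are the colors
country_colors <- c(
  "United States" = "#0A3161",
  "India" = "#FF671F",
  "China" = "#EE1C25"
)

# Hypothetical toy data with invented values, for illustration only
toy <- data.frame(
  country_name = c("China", "India", "United States"),
  year = c(2000, 2000, 2000),
  population = c(3, 2, 1)
)

p <- ggplot(toy, aes(x = year, y = population, color = country_name)) +
  geom_point() +
  scale_color_manual(values = country_colors)

# Inspect the colors computed for the point layer
built <- layer_data(p, 1)
built[, c("y", "colour")]
```

The order of the named vector (United States, India, China) differs from the alphabetical order of the levels, yet each country should still receive its own color because the match is made on names.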

Update the plot labels (title, subtitle, x, y, and caption) and use theme_minimal().

population_longer |>
  filter(country_name %in% c("United States", "India", "China")) |>
  ggplot(aes(x = year, y = population, color = country_name, shape = country_name)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2024), breaks = seq(2000, 2024, 4)) +
  scale_y_continuous(
    breaks = seq(250000000, 1250000000, 250000000),
    labels = label_number(scale = 1/1000000, suffix = " mil")
  ) +
  scale_color_manual(
    values = c(
      "United States" = "#0A3161",
      "India" = "#FF671F",
      "China" = "#EE1C25"
    )
  ) +
  theme_minimal() +
  labs(
    x = "Year",
    y = "Population (in millions)",
    title = "Country populations over the years",
    subtitle = "2000 to 2023",
    caption = "Data source: The World Bank"
  )

Finally, move the legend to the top of the plot and remove its title.

population_longer |>
  # Keep only the three countries shown in the goal plot
  filter(country_name %in% c("United States", "India", "China")) |>
  ggplot(aes(x = year, y = population, color = country_name, shape = country_name)) +
  geom_point() +
  geom_line() +
  # Label every fourth year from 2000 to 2024
  scale_x_continuous(limits = c(2000, 2024), breaks = seq(2000, 2024, 4)) +
  # Show the y-axis in millions
  scale_y_continuous(
    breaks = seq(250000000, 1250000000, 250000000),
    labels = label_number(scale = 1/1000000, suffix = " mil")
  ) +
  # Match colors to countries by name
  scale_color_manual(
    values = c(
      "United States" = "#0A3161",
      "India" = "#FF671F",
      "China" = "#EE1C25"
    )
  ) +
  # Setting color and shape to NULL removes the legend title
  labs(
    x = "Year",
    y = "Population (in millions)",
    color = NULL,
    shape = NULL,
    title = "Country populations over the years",
    subtitle = "2000 to 2023",
    caption = "Data source: The World Bank"
  ) +
  # Apply the complete theme first, then adjust the legend position
  theme_minimal() +
  theme(legend.position = "top")
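
Note that theme_minimal() comes before theme(legend.position = "top") in the code above. The order matters: a complete theme such as theme_minimal() replaces any theme settings added before it. The sketch below illustrates this on theme objects alone, without a plot; t_ok and t_reversed are names introduced here for illustration.

```r
library(ggplot2)

# Complete theme first, then the tweak: the tweak is kept
t_ok <- theme_minimal() + theme(legend.position = "top")
t_ok$legend.position

# Tweak first, then the complete theme: the tweak is replaced
t_reversed <- theme(legend.position = "top") + theme_minimal()
t_reversed$legend.position
```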