Welcome to our first session on spatial data analysis! By the end of this session, you will:


library(needs)

needs(sf, # for handling geometries,
      osmdata, # for adding elements from the OpenStreetMap database,
      tidyr,
      ggplot2,
      giscoR, # for Eurostat administrative information,
      haven,
      dplyr,
      geodata, # for data on climate,
      purrr,
      htmlwidgets,
      janitor # for text cleaning
      )

Understand what spatial (geographic) data are and why we should consider spatial data in the social sciences
Learn R’s sf package for storing and manipulating spatial vector data
Load and inspect spatial data formats (e.g., Shapefiles, GeoJSON)
Understand and apply coordinate reference systems (CRS) and projections
Perform basic geometric operations on vector data
Create basic spatial visualizations
Learn how to enrich spatial data with information from web‐based platforms.
Discuss the challenges of using location as a proxy for social phenomena.
Create advanced map visualizations in R.

Spatial thinking in the Social Sciences

There has been a steady growth of interest in spatial concepts and techniques within the social sciences. Much of this work builds on foundational research by geographers, but what distinguishes sociology (and related fields) is the application of spatial data, measures, and models to a wide range of substantive questions drawn from established intellectual traditions. Sociologists are less interested in spatial patterns for their own sake and more concerned with how those patterns reflect and shape social relations (Logan 2012).

Original map made by John Snow in 1854. Cholera cases are marked in black, with stacked rectangles indicating the clusters of cases in the London epidemic of 1854. The map was created to better understand the pattern of spread in the 1854 Broad Street cholera outbreak, which Snow used as an example of how cholera spread via the fecal-oral route through water systems, as opposed to the miasma theory of disease spread. The contaminated pump is located at the intersection of Broad Street and Cambridge Street (now Lexington Street), running into Little Windmill Street. The map marks an important part of the development of epidemiology as a field, and of disease mapping as a whole (Snow 1855).

Spatial data allow us to analyze how social processes, such as inequality, crime, or human behavior, vary across geographic contexts. By explicitly incorporating location, we can uncover patterns and relationships that non-spatial methods miss. As Abbott argues, “one cannot understand social life without understanding the arrangements of particular social actors in particular social times and places… Social facts are located” (Abbott 1997, 1152).

In spatial thinking we focus on four core concepts:

Distance: How far apart phenomena are in geographic space.
Proximity: The relative closeness of observations to one another or to key features.
Exposure: The degree to which a population encounters an environmental or social hazard.
Access: The ability to reach services, resources, or opportunities based on location.

All of these concepts depend on the underlying geometry of our data and our definition of space (Logan 2012).

Spatial thinking encompasses:

The arrangement of social phenomena in space (points, polygons, networks).
The causes of those locational patterns (e.g., economic forces, policy decisions).
The consequences for individuals and groups (e.g., unequal access, segregation).

Even when we work with areal units (neighborhoods, districts), we must grapple with questions of boundary definition and scale—issues that are substantive, not merely technical.

Incorporating spatial information into our models enables us to examine disparities in the distribution of resources and in the accessibility of services, and thereby the opportunities and restrictions that individuals face. This is critical for understanding and advocating for spatial justice and equity, and for designing workable urban policies that can inform effective interventions in areas such as urban safety, crime prevention, or equitable resource allocation.

Over the last few weeks we asked: Who is connected to whom, and how do these structures influence relationships and individual opinions, behavior and attitudes? Starting this week, we ask ourselves: Who is close to whom, and what does this mean?

Spatial thinking is strongly based on the idea that:

“Everything is related to everything else, but near things are more related than distant things.” (First law of geography - (Tobler 1970))

It thus refers to reasoning about location, distance, and spatial relationships and can for example be useful in answering questions such as:

Are healthcare facilities equally accessible to all neighborhoods?
Is crime concentrated in specific urban areas due to overlapping social conditions?
Do voting patterns show spatial clustering due to shared environments?

What is Spatial Data?

Spatial data is any data that has a geographical attribute. It contains information about the location and/or shape of physical features on Earth and can include much more than just the coordinates in a standard dataset (Pebesma and Bivand 2023).

For example: A dataset on schools might contain school names, addresses, and exact coordinates (latitude/longitude) — allowing us to map them.

Spatial data is often unstructured:

Formats vary (Shapefiles, GeoJSON, GPKG, KML)
Coordinate systems may be missing or mismatched
Boundaries and resolutions differ across datasets
Files may contain only geometry but no relevant attributes

Thus data must be cleaned, projected and joined to be useful for further analysis.

Coordinate Reference Systems

Spatial data is data characterized by coordinates in a coordinate system. Different coordinate systems can be used for this, and the most important difference is whether coordinates are defined over a 2 dimensional or 3 dimensional space referenced to orthogonal axes (Cartesian coordinates), or using distance and directions (polar coordinates, spherical and ellipsoidal coordinates).

A Coordinate Reference System is a framework used to uniquely define spatial positions on Earth. It acts as the interface between the coordinates of a geographic object and its real-world location.

A CRS consists of two main concepts:

Coordinate system: A set of mathematical rules that specify how coordinates are assigned to points
Datum: Parameters that define the origin, scale and orientation of the coordinate system.
- a geodetic datum is a datum that describes the relationship of a two- or three-dimensional coordinate system to the Earth

We use different datums because the Earth's shape is irregular. The geoid (the surface of constant gravitational potential that approximates mean sea level) is not a perfect sphere or ellipsoid. To approximate the geoid, ellipsoids of revolution (ellipsoids with two equal semi-axes) are used.

Fitting an ellipsoid to the Earth’s surface results in a datum. Different datums arise because ellipsoids can be fit globally (e.g., WGS84 (World Geodetic System 1984), used worldwide by GPS) or locally (e.g., ETRS89 (European Terrestrial Reference System 1989), fixed to the stable part of the Eurasian plate), resulting in varying levels of accuracy for different regions.

A projection converts geographical coordinates (longitude, latitude) into planar Cartesian coordinates (X, Y). Since it is impossible to provide an exact representation of a curved surface on a plane, specialized projections were developed for different regions of the world and different analytical applications.

For example, the images below show how identical circular areas appear at different points of the Earth under a Mercator projection (which preserves shapes), a Lambert equal-area projection (which preserves areas) and a Mollweide projection (preserving area proportions but distorting shape).

Mercator distortion
Mercator projection: preserves angles, distorts size near poles

Lambert distortion
Lambert projection: area-preserving, useful for mid-latitudes

Mollweide distortion
Mollweide projection: area-preserving, distorts shapes toward the edges, good for global maps

You can list the projections available in R with sf::sf_proj_info(type = "proj").

Most tools for spatial analysis (including the R-spatial packages) use PROJ, an open source C++ library that transforms coordinates from one CRS to another. A CRS can be described in many ways, including formalized “proj4 strings” such as +proj=longlat or identifying authority codes such as EPSG codes. The latter is more modern and the corresponding codes can be found on this website
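As a minimal sketch of how this looks in sf (assuming the sf package is installed): st_crs() looks up a CRS from its EPSG code, and st_transform() asks PROJ to convert coordinates from one CRS to another. The point coordinates (roughly the centre of Leipzig) and the choice of EPSG:3035 as target CRS are illustrative assumptions, not part of the course data.

```r
library(sf)

# Look up a CRS by its EPSG code
wgs84 <- st_crs(4326)
wgs84$epsg        # authority code
wgs84$proj4string # legacy proj4 representation of the same CRS

# A point in geographic coordinates (longitude, latitude); illustrative values
pt_wgs84 <- st_sfc(st_point(c(12.37, 51.34)), crs = 4326)

# Reproject to ETRS89-extended / LAEA Europe (EPSG:3035), whose units are metres
pt_laea <- st_transform(pt_wgs84, 3035)

st_is_longlat(pt_wgs84) # TRUE: coordinates are in degrees
st_is_longlat(pt_laea)  # FALSE: coordinates are projected
st_coordinates(pt_laea) # planar X/Y coordinates in metres
```

Checking st_is_longlat() before measuring distances or buffering is a cheap way to notice that data are still in degrees.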

GIS

Geographic information systems (GIS) were originally developed between the 1960s and the 1990s as standalone software for managing spatial data. Today, spatial data tools are increasingly integrated into data science languages like R and Python, bringing spatial data analysis into reproducible workflows.

Types of spatial data

In GIS, spatial data is typically represented in one of two formats, vector or raster (Moraga 2024).

While vector formats are great for discrete features, raster format can be more useful for the representation of continuous variables.

Vector data formats

Shapefile (.shp): legacy format, often comes with .dbf, .shx, etc.
GeoJSON (.json): used in web mapping, lightweight, human-readable.
Geopackage (.gpkg): modern format; compact, supports multiple layers in one file.
KML/KMZ: used by Google Earth.

Raster data formats

TIFF (.tif): supports georeferencing and multi-band data; widely used.
NetCDF (.nc): used in climate science and geosciences.
Other image files — JPEG, PNG (used rarely in analysis but common for display)

If needed we can also transform vector data to raster data and vice versa.

https://gsp.humboldt.edu/olm/Lessons/GIS/08%20Rasters/RasterToVector.html
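The conversion between the two representations can be sketched with the terra package, which is used for the raster plot later in this session but is not part of the needs() call above. This is a toy example under stated assumptions: the polygon, the 10 x 10 grid and the object names are invented for illustration, and no CRS is set.

```r
library(sf)
library(terra)

# A 6 x 6 square polygon with one attribute (illustrative toy data)
poly <- st_sf(
  id = 1,
  geometry = st_sfc(st_polygon(list(rbind(
    c(2, 2), c(8, 2), c(8, 8), c(2, 8), c(2, 2)
  ))))
)

# Empty 10 x 10 template grid with a cell size of 1 and no CRS
grid <- rast(xmin = 0, xmax = 10, ymin = 0, ymax = 10,
             resolution = 1, crs = "")

# Vector -> raster: cells whose centre lies inside the polygon get the value of `id`
poly_raster <- rasterize(vect(poly), grid, field = "id")

# Raster -> vector: neighbouring cells with equal values are merged into polygons
poly_back <- st_as_sf(as.polygons(poly_raster))
```

Because the polygon edges coincide with cell edges here, the round trip returns the same square. With real data the result depends on the cell size, so some detail is usually lost.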

For this lecture (and usually when using spatial data for the social sciences) we will focus on vector representation of data.

Raster data:

Representation of geography as a continuous grid of pixels (grid cells) with associated values. Rasters usually represent high-resolution features of the geography (like an image).

Vector data:

All geospatial vector data can be described by a set of geometric objects (so-called simple features).

The most common spatial types are points, lines and polygons:

Geometric Entity	Description	R	Example
Points	Discrete locations in space defined by a single pair of coordinates.	`st_point(c(2, 3))`	Location of survey respondents, cities, bus stops
Lines	Ordered sequences of points connected by straight segments.	`st_linestring(rbind(c(1, 2), c(3, 4), c(5, 6)))`	Roads, rivers, subway lines
Polygons	Closed sequences of points defining areas; first and last points must be the same.	`st_polygon(list(rbind(c(1, 1), c(2, 4), c(5, 3), c(1, 1))))`	Country borders, administrative districts, lakes

We can wrap one or more geometries into collections, allowing us to combine different objects and metadata:

Function	Purpose
`st_point()`	Creates a single geometry (e.g., a point)
`st_sfc()`	Creates a vector (collection) of geometries
`st_sf()`	Creates a full spatial data frame (with data + geometry)

point1 <- st_point(c(7.5, 1))
point2 <- st_point(c(4, pi))
point3 <- st_point(c(2, 2))

points_sfc <- st_sfc(point1, point2, point3)


points_sf <- st_sf(
  id = 1:3,
  label = c("A", "B", "C"),
  geometry = points_sfc
)


x_limits <- c(0, 12)
y_limits <- c(0, 12)


ggplot(data = points_sf) +
  geom_sf(color = "lavender", size = 4) +
  geom_sf_text(aes(label = label), nudge_y = 0.5, size = 2) +  # Adds labels above points
  coord_sf(xlim = x_limits, ylim = y_limits) +
  theme_minimal() +
  labs(x = "Longitude", y = "Latitude")




# Define the two lines as matrices
line1 <- st_linestring(matrix(c(6, 1,
                                2, 4,
                                6, 2), ncol = 2, byrow = TRUE))
line2 <- st_linestring(matrix(c(2, 5,
                                5, 4,
                                8, 5), ncol = 2, byrow = TRUE))


# Combine into an sfc with individual features
lines_sfc <- st_sfc(line1, line2)


lines_sf <- st_sf(
  label = c("Route A", "Route B"),
  geometry = lines_sfc
)


ggplot(data = lines_sf) +
  geom_sf(color = "purple", linewidth = 6) +  # linewidth controls line thickness (ggplot2 >= 3.4)
  geom_sf_text(aes(label = label), size = 2, nudge_y = 0.5) +
  coord_sf(xlim = c(0, 10), ylim = c(0, 10)) +
  theme_minimal() 



square1 <- st_polygon(list(matrix(c(1,8,
                                    4,1,
                                    3,4,
                                    7,3,
                                    1,8), ncol = 2, byrow = TRUE)))
square2 <- st_polygon(list(matrix(c(6,6,
                                    7,7,
                                    9,8,
                                    8,9,
                                    6,6), ncol = 2, byrow = TRUE)))

# Combine into sfc
polygons_sfc <- st_sfc(square1, square2)


polygons_sf <- st_sf(
  label = c("Area 1", "Area 2"),
  geometry = polygons_sfc
)


# Plot
ggplot(data = polygons_sf) +
  geom_sf(fill = "lavender", color = "navy") +
  geom_sf_text(aes(label = label), size = 4, nudge_y = 0.5) +
  coord_sf(xlim = c(0, 10), ylim = c(0, 10)) +
  theme_minimal()



ggplot() +
  # Polygons first (so they sit underneath)
  geom_sf(data = polygons_sf, fill = "lavender", color = "coral", linewidth = 0.8) +
  geom_sf(data = lines_sf, color = "lightblue", linewidth = 3, lineend = "round") +
  geom_sf(data = points_sf, color = "pink", size = 5) +
  theme_minimal()

The `sf` package

A feature is thought of as a thing, or an object in the real world, such as a building or a tree. As is the case with objects, they often consist of other objects. This is the case with features too: a set of features can form a single feature. A forest stand can be a feature, a forest can be a feature, a city can be a feature. A satellite image pixel can be a feature, a complete image can be a feature too.

Features have a geometry describing where on Earth the feature is located, and they have attributes, which describe other properties. The geometry of a tree can be the delineation of its crown, of its stem, or the point indicating its center. Other properties may include its height, color, diameter at breast height at a particular date, and so on.

sf (https://cran.r-project.org/web/packages/sf/index.html) is the state-of-the-art package for working with spatial vector data in R. It replaces older packages such as sp and rgdal with a clean, modern interface that aligns with the tidyverse (Pebesma n.d.).

Every spatial object in sf contains:

Geometry (shapes/coordinates)
Coordinate Reference System (CRS/projection)
Attributes (data about the object)

The most common geometry types supported by the sf package are:

Type	Description
`POINT`	Zero-dimensional geometry containing a single coordinate (e.g., a tree)
`LINESTRING`	One-dimensional sequence of points forming a path (e.g., a road or river)
`POLYGON`	Two-dimensional area enclosed by lines (e.g., a building footprint)
`MULTIPOINT`	Set of points (e.g., tree locations in a park)
`MULTILINESTRING`	Set of lines (e.g., transit routes in a city)
`MULTIPOLYGON`	Set of polygons (e.g., a country with multiple islands)
`GEOMETRYCOLLECTION`	Mixed set of geometry types (e.g., points, lines, polygons combined)

We can also define empty geometries (placeholders) for spatial objects, similar to an NA in a column of a data frame. This is useful for keeping the data structure consistent even if some observations lack a location. We can use st_is_empty() to detect and filter empty geometries before performing operations.
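A small sketch of what this looks like in practice, using three made-up observations of which one has no known location. The object names are illustrative only.

```r
library(sf)

# Three observations; the second has no known location (empty point)
places <- st_sf(
  name = c("A", "B", "C"),
  geometry = st_sfc(st_point(c(1, 2)), st_point(), st_point(c(3, 4)))
)

st_is_empty(places)                       # FALSE TRUE FALSE

# Keep only the rows that actually have a geometry
located <- places[!st_is_empty(places), ]
nrow(located)                             # 2
```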

# Load Leipzig polygons (with city districts) from a shapefile
leipzig <- sf::st_read(dsn = "Data/Leipzig/ot.shp") # dsn = data source name

leipzig

# Transform the shapefile to CRS EPSG:4326 (WGS 84)
leipzig <- st_transform(leipzig, 4326)

bb <- st_bbox(leipzig)

streets <- opq(bbox = bb) |>
  add_osm_feature(key = "highway") |>
  osmdata_sf()

# Load streets (from OSM)
leipzig_streets <- streets$osm_lines |>
  filter(highway %in% c("motorway", "primary", "secondary", "tertiary")) |>
  select(osm_id, name, highway, geometry) |>
  st_transform(4326) |>
  st_intersection(leipzig) # Clip to Leipzig boundaries

doeni <- st_sf(
  name = "Lieblingsdöner",
  geometry = st_sfc(st_point(c(12.3563, 51.3419)),
                    crs = 4326)
)


gwz <- st_sf(
  name = "GWZ",
  geometry = st_sfc(st_point(c(12.36845,51.3319)),
                    crs = 4326)
)


# Plot everything
ggplot() +
  geom_sf(data = leipzig, fill = "lavender", color = "lightgray") +
  geom_sf(data = leipzig_streets, color = "bisque3", linewidth = 0.3) +
  geom_sf(data = doeni, color = "pink", size = 3) +
  geom_sf_text(data = doeni, aes(label = name), nudge_y = 0.005, size = 3) +
  geom_sf(data = gwz, color = "pink", size = 3) +
  geom_sf_text(data = gwz, aes(label = name), nudge_y = 0.005, size = 3) +
  theme_minimal()

Geometry creation

We can define points, lines or polygons like above, using st_point(), st_linestring() or st_polygon().

line <- st_sf(
  name = "Connection",
  geometry = st_sfc(
    st_linestring(rbind(
      st_coordinates(doeni),
      st_coordinates(gwz)
    )),
    crs = 4326
  )
)

# Plot everything
ggplot() +
  geom_sf(data = leipzig, fill = "lavender", color = "lightgray") +
  geom_sf(data = leipzig_streets, color = "bisque3", linewidth = 0.3) +
  geom_sf(data = doeni, color = "pink", size = 3) +
  geom_sf_text(data = doeni, aes(label = name), nudge_y = 0.005, size = 3) +
  geom_sf(data = gwz, color = "pink", size = 3) +
  geom_sf_text(data = gwz, aes(label = name), nudge_y = 0.005, size = 3) +
  geom_sf(data = line, color = "lightblue", linewidth = 1) +
  theme_minimal()

Added Linestring between my work and my favorite Döner

Geometric confirmation

# Check if geometries are valid - meaning following the formal rules
st_is_valid(leipzig_streets)

# Check geometry type
st_geometry_type(leipzig_streets)

# Check if geometry is empty
st_is_empty(doeni)

# Get CRS
st_crs(leipzig)
st_crs(doeni)

# Check if point lies inside polygon
st_contains(leipzig, doeni)

Geometric operations

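# Buffer: create a 500 m buffer zone around a point.
# We buffer in a projected CRS (ETRS89 / UTM zone 33N, EPSG:25833) so that
# dist is measured in metres, then transform back to EPSG:4326 for plotting.
doeni_buffer <- doeni |>
  st_transform(25833) |>
  st_buffer(dist = 500) |>
  st_transform(4326)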


# Intersection: streets intersecting with the buffer
streets_near_doeni <- st_intersection(leipzig_streets, doeni_buffer) |> 
  select(osm_id, name, geometry)



# Difference: parts of streets outside buffer
streets_outside <- st_difference(leipzig_streets, doeni_buffer) |> 
  select(osm_id, name, geometry)

We can list all methods available for sf objects with methods(class = "sf").

Add data to simple features

Even though these plots already look pretty cool, we as sociologists are usually interested not just in where objects are placed, but in how their locations relate to further variables. An sf object behaves like a data.frame, which allows us to add further data to it and thus analyze the relations between different phenomena.
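A minimal sketch of this data.frame-like behaviour, using made-up toy data rather than the Leipzig files. The column names id, value and group are assumptions for illustration.

```r
library(sf)
library(dplyr)

# Toy sf object: three points with one attribute (illustrative data)
pts <- st_sf(
  id = 1:3,
  value = c(10, 20, 30),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), st_point(c(2, 2)))
)

# Ordinary (non-spatial) table that shares the key column `id`
extra <- data.frame(id = 1:2, group = c("a", "b"))

# dplyr verbs work as on a data.frame; the geometry column is kept automatically
joined <- pts |>
  left_join(extra, by = "id") |>
  filter(value > 10) |>
  select(id, group)

class(joined) # still an sf object
names(joined) # "id" "group" "geometry"
```

Note that the sf object has to be the first argument of the join. If the plain data frame comes first, the result is a plain data frame with a list column instead of an sf object.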

Imagine we were interested in analyzing whether the movement of actors in Leipzig depends on my Lieblingsdöner. For a descriptive (and visual) analysis, we load, clean and add data on inner-city moves within Leipzig.

raw <- read.csv("Data/Leipzig/Bevölkerungsbewegung_Wanderungen.csv")

names(raw)
unique(raw$Sachmerkmal)


zuzuege_innerstaedtisch_2024 <- raw |>
  filter(Sachmerkmal == "Innerstädtische Zuzüge") |>
  pivot_longer(
    cols = X2015:X2024,
    names_to = "Jahr",
    values_to = "Zuzuege"
  ) |>
  mutate(
    Jahr = gsub("^X", "", Jahr),
    Zuzuege = as.numeric(Zuzuege)
  ) |>
  filter(Jahr == "2024")



leipzig_data <- left_join(x = leipzig,                  # join the data by district (Name / Gebiet)
                          y = zuzuege_innerstaedtisch_2024,
                          by = c("Name" = "Gebiet"))

Leipzig sf with Information on Innerstädtische Zuzüge

And if we plot this now with my favorite Döner:

ggplot(leipzig_data) +
  geom_sf(aes(fill = Zuzuege)) +
  scale_fill_gradient(low = "lavender", high = "pink", na.value = "grey90") +
  geom_sf(data = doeni, color = "pink", size = 3) +
  geom_sf_text(data = doeni, aes(label = name), nudge_y = 0.005, size = 3) +
  # Buffer around the Döner location
  geom_sf(data = doeni_buffer, fill = "lightblue", color = NA, alpha = 0.6) +
  theme_minimal() +
  labs(title = "Innerstädtische Zuzüge 2024 und Leos Lieblingsdöner", fill = "Zuzüge")

Sources of data collection

Shapefiles and GeoJSONs can be obtained from a variety of sources, including:

Government agencies (e.g., US Census Bureau, Eurostat)
Open data portals (e.g., OpenStreetMap, Natural Earth)
Commercial providers (e.g., Esri, Mapbox)
Academic repositories (e.g., Harvard Geospatial Library)
International organizations (e.g., UN, World Bank)

and many more.

Enriching survey data with administrative boundaries

Until recently, most quantitative, standardized studies in the social sciences have relied on survey data. In Germany, one of the flagship representative surveys is the ALLBUS (German General Social Survey) (GESIS-Leibniz-Institut Für Sozialwissenschaften 2019), which records respondents’ federal state of residence in the variable land.

To map survey responses onto geographic boundaries, we can use the giscoR package—an R client for the GISCO (Geographic Information System of the European Commission) open-data repository. GISCO provides a variety of spatial layers, including country outlines, coastlines, labels, and NUTS regions, at multiple resolutions and in three common projections (EPSG:4326, 3035, and 3857) (Hernangómez 2020).

allbus_df <- read_dta("Data/ALLBUS/ZA5270_v2-0-0.dta")

#Federal state level
nuts1_de <- gisco_get_nuts(year = 2021, nuts_level = 1, country = "DE", resolution = "20")

allbus_df <- allbus_df |> 
  mutate(
    land = case_when(
      land == 10  ~ "Schleswig-Holstein",
      land == 20  ~ "Hamburg",
      land == 30  ~ "Niedersachsen",
      land == 40  ~ "Bremen",
      land == 50  ~ "Nordrhein-Westfalen",
      land == 60  ~ "Hessen",
      land == 70  ~ "Rheinland-Pfalz",
      land == 80  ~ "Baden-Württemberg",
      land == 90  ~ "Bayern",
      land == 100 ~ "Saarland",
      land %in% c(111, 112) ~ "Berlin",         # collapse former West/East Berlin
      land == 120 ~ "Brandenburg",
      land == 130 ~ "Mecklenburg-Vorpommern",
      land == 140 ~ "Sachsen",
      land == 150 ~ "Sachsen-Anhalt",
      land == 160 ~ "Thüringen",
      TRUE ~ NA_character_
    )
  )


happiness_by_land <- allbus_df |>
  filter(!is.na(land), !is.na(ls01)) |>
  group_by(land) |>
  summarise(mean_happiness = mean(ls01, na.rm = TRUE)) |>
  ungroup()

nuts1_de <- nuts1_de |>
  left_join(happiness_by_land, by = c("NAME_LATN" = "land"))


ggplot(nuts1_de) +
  geom_sf(aes(fill = mean_happiness), color = "white") +
  scale_fill_gradient(
    name     = "Avg Happiness\n(0–10)",
    low      = "lavender",
    high     = "pink",
    na.value = "grey90"
  ) +
  labs(
    title    = "Mean Self-Reported Happiness by Bundesland (ALLBUS)",
    caption  = "Data: ALLBUS ZA5270_v2-0-0; Boundaries: NUTS 1 (giscoR)"
  ) +
  theme_minimal()

NUTS-Regions

The Nomenclature of Territorial Units for Statistics (NUTS) is an EU standard for dividing member states into hierarchical regions:

NUTS 0: Countries
NUTS 1: Major socio-economic regions (German federal states like “Nordrhein-Westfalen”)
NUTS 2: Basic regions for the application of regional policies (e.g., individual Regierungsbezirke)
NUTS 3: Small regions for specific diagnoses (Kreise or kreisfreie Städte).


# Plot the regions
ggplot(nuts1_de) +
  geom_sf(fill = "lightblue", color = "white") +
  labs(title = "NUTS Level 1 Regions in Germany") +
  theme_minimal()

# Download NUTS level 2 regions for Germany
nuts2_de <- gisco_get_nuts(nuts_level = 2, country = "DE", resolution = 1)

# Plot the regions
ggplot(nuts2_de) +
  geom_sf(fill = "lightblue", color = "white") +
  labs(title = "NUTS Level 2 Regions in Germany") +
  theme_minimal()



# Download NUTS level 3 regions for Germany
nuts3_de <- gisco_get_nuts(nuts_level = 3, country = "DE", resolution = 1)

# Plot the regions
ggplot(nuts3_de) +
  geom_sf(fill = "lightblue", color = "white") +
  labs(title = "NUTS Level 3 Regions in Germany") +
  theme_minimal()


# Download NUTS level 0 (country) boundaries for Germany
nuts0_de <- gisco_get_nuts(nuts_level = 0, country = "DE", resolution = 1)

# Plot the regions
ggplot(nuts0_de) +
  geom_sf(fill = "lightblue", color = "white") +
  labs(title = "NUTS Level 0 Regions in Germany") +
  theme_minimal()

Because surveys must balance spatial precision with respondent confidentiality, location is typically reported only at NUTS 1 or NUTS 2 level. However, aggregated metadata—such as regional averages or counts—can sometimes be accessed at finer scales.

Importantly, joining survey data to shape files is primarily a descriptive exercise: it adds geographic context for visualization and exploratory analysis, even though it does not generate new information about individual respondents.

Let's try this for ourselves:

You can download a GeoJSON file with information on the boundaries of the world's countries, their names, ISO codes, and affiliated countries from here

Load the world boundaries as an sf object using st_read().
Plot the countries using different Coordinate Reference Systems (CRSs). Observe how these different CRSs change the appearance of the world.

From here you can download the free version of the simplemaps metadata on the world's countries. It contains population and economic information on each country.

Add the information from the .csv to your sf-object.
Find the five countries with the highest and lowest median age.
Visualize median age using a color gradient.
Identify which countries drive on the left.
Visualize the driving side on a map.

Climatic data

Environmental sociology examines the dynamic relationship between societies and their natural environments. Within this field, the study of climate impacts—and society’s influence on climate—is rapidly expanding. By leveraging new spatial data sources and geographic tools, we can begin to ask questions such as:

How does exposure to extreme weather influence survey responses on well-being or climate anxiety?
Are people in regions experiencing rapid temperature increases more likely to support ambitious climate policies?
Do variations in precipitation patterns correlate with reported trust in governmental climate initiatives?

For data collection, we can use the geodata package in R (Mandel, Barbosa, and Aniruddha Ghosh 2021), which offers programmatic access to a wide range of global raster and vector datasets, including:

Climate layers (e.g., temperature, precipitation via WorldClim)
Elevation and accessibility metrics
Land use, soil, and crop suitability maps
Species occurrence records
Administrative boundaries at multiple levels

With these tools, you can seamlessly integrate environmental variables into your social-science workflows—linking survey data, policy indicators, and demographic information to enrich your analyses and visualizations.

This is just one example; climate data can of course also be downloaded or collected from weather services.

d <- worldclim_country(country = "Germany",
                       res = 0.5,
                       var = "tmax",
                       path = tempdir()
                       )

terra::plot(mean(d), plg = list(title = "Max. temperature (C)"))

OpenStreetMap Data with `osmdata`

An API (Application Programming Interface) is a set of rules that allows different programs or computers to communicate with each other and exchange data.

In simple terms, an API tells one program how to ask for information and tells the other program how to send the answer back. This works even if the systems are very different, for example written in different programming languages or located on different computers.

Instead of downloading data manually from a website, we can ask for it directly from R.

Many services provide APIs, for example weather services, map providers, public transport platforms, government data portals, social media platforms, and research databases.

In this example, we use OpenStreetMap data through the Overpass API. OpenStreetMap is especially useful because it is open, community-built, and freely available. This makes it a great data source for learning GIS and working with real-world spatial data.

The R package osmdata acts as a bridge between R and OpenStreetMap. It translates R commands into requests to the Overpass API:

opq() defines where to search.
add_osm_feature() defines what to search for.
osmdata_sf() sends the request and returns the result as sf objects.
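To see the translation step described above, the following sketch builds a query without sending it and prints the Overpass text that osmdata would submit. The numeric bounding box is a rough, hand-picked assumption for the Leipzig area, given as c(xmin, ymin, xmax, ymax).

```r
library(osmdata)

# Where to search: rough illustrative bounding box around Leipzig
# What to search for: features tagged amenity = hospital
query <- opq(bbox = c(12.2, 51.2, 12.6, 51.5)) |>
  add_osm_feature(key = "amenity", value = "hospital")

# Inspect the Overpass query text; nothing is sent to the API yet
query_string <- opq_string(query)
cat(query_string)
```

Only osmdata_sf() actually contacts the Overpass API, so the query can be built and inspected offline.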

This means we can download real OpenStreetMap features, such as streets, hospitals, parks, or buildings, and directly map or analyze them in R.

OpenStreetMap (OSM) is a crowdsourced geographic database maintained by volunteers around the world. With the osmdata package (Padgham et al. 2023), you can pull features like roads, railway stations, schools, supermarkets, and more directly into R as sf objects. A full list of mappable features is on the OSM wiki (Moraga and Baker 2022).

You can inspect which feature keys and tags you might query:

available_features()

When we create an osmdata query, we start by defining the geographical area that we wish to include. This is done with a bounding box, which delimits an area by its bounding latitudes and longitudes. The bounding box for a given place name can be obtained with the getbb() function. For example, the bounding box of Leipzig can be obtained (and saved directly as a simple feature object) as follows:

placebb <- getbb("Leipzig",
                 format_out = "sf_polygon")

To retrieve the required features for the area defined by the bounding box, we start an Overpass query with opq(). Then, the add_osm_feature() function adds the required features to the query. Finally, osmdata_sf() runs the query and returns the result as simple feature objects.

# turn that polygon into a simple numeric bbox
bb <- st_bbox(placebb)

# run your query using that numeric bbox
hospitals <- opq(bbox = bb) |>
  add_osm_feature(key = "amenity",
                  value = "hospital") |>
  #add_osm_feature(key = "name") |>
  osmdata_sf()

#make sure to check if they share the same crs (in this case WGS 84)


# Extract OSM results
hospital_points <- hospitals$osm_points
hospital_polygons <- hospitals$osm_polygons

# Plot hospitals
ggplot() +
  geom_sf(data = placebb,
          fill = "lavender",
          color = "grey50") +
  geom_sf(data = hospital_points,
          color = "purple",
          size = 1,
          alpha = 0.7) +
  geom_sf(data = hospital_polygons,
          fill = "lavender",
          color = "purple",
          alpha = 0.4) +
  geom_sf_text(
    data = hospital_polygons,
    aes(label = name),
    size = 2,
    color = "navy",
    check_overlap = TRUE
  ) +
  coord_sf(expand = FALSE) +
  labs(
    title = "Hospitals in Leipzig",
    caption = "Source: OpenStreetMap"
  ) +
  theme_minimal()

Explore OpenStreetMap (OSM) data to identify tram stops (“Tramhaltestellen”) and tram connections in Leipzig. Create a map that plots tram stops as points and displays tram route lines.


bb_leipzig <- getbb("Leipzig, Germany")

# Query for public transport stops
pt_stops <- opq(bbox = bb_leipzig) |>
  add_osm_feature(key = "railway", value = "tram_stop") |>
  osmdata_sf()

# Extract points from the result
pt_stops_sf <- pt_stops$osm_points |>
  filter(!st_is_empty(geometry))


pt_lines <- opq(bbox = bb_leipzig) |>
  add_osm_feature(key = "route", value = c("tram")) |>
  osmdata_sf()


# Extract lines from the result
pt_lines_sf <- pt_lines$osm_lines |>
  filter(!st_is_empty(geometry))

unique(pt_lines_sf$name)

# bb_leipzig_bbox <- st_bbox(c(
#   xmin = bb_leipzig[1,1], ymin = bb_leipzig[2,1],
#   xmax = bb_leipzig[1,2], ymax = bb_leipzig[2,2]
# ), crs = 4326)  # WGS84 CRS
# 
# # Then convert bbox to simple feature polygon
# bb_sfc <- st_as_sfc(bb_leipzig_bbox)

# Plot
ggplot() +
  geom_sf(data = placebb, fill = "lavender", color = "grey50") +
  #geom_sf(data = bb_sfc, fill = "lightgrey", color = "grey50") +
  geom_sf(data = pt_stops_sf, color = "pink", size = 1, alpha = 0.7) +
  geom_sf(data = pt_lines_sf, color = "navy", linewidth = 0.8, alpha = 0.7) +  # linewidth sets line thickness in ggplot2 >= 3.4
  labs(title = "Public Transport in Leipzig") +
  theme_minimal()


# The bounding box is not very precise
# Routes are not always recorded in OSM
# Did the chosen tag really identify all relevant objects?
# Some form of validation is needed

Do you encounter any problems? Why do you think that is?

:::

Measuring space

In the social sciences, space is rarely studied as an absolute concept; rather, it is almost always understood relative to a reference point or object, such as the neighborhood around a person’s home or the distance to a resource. For example, measuring distance allows us to describe the location of something in relation to something else. Researchers also frequently define an egocentric neighborhood, such as a 1-kilometer radius around a person’s residence, to study how the built environment influences travel behavior (Frank, Andresen, and Schmid 2004). Due to limited access to detailed (and often confidential) location data, researchers often assign individuals to administrative units like census tracts or zip codes and approximate their location using the centroid of these areas.

Euclidean distance

Euclidean distance is the straight-line distance between two points, the shortest path “as the crow flies.” For two points \(A = (x_1, y_1)\) and \(B = (x_2, y_2)\), it is calculated as:

\[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \]

This follows from the Pythagorean theorem.

Euclidean distance is especially useful when approximating proximity in open, unobstructed spaces or as a baseline measure before considering more complex routes like travel time or street networks.
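The formula can be written as a short base R function. This is only a minimal sketch: the function name euclidean_distance and the two example points are made up for illustration, and the coordinates are assumed to be in a projected CRS (e.g., meters), not longitude/latitude.

```r
# Euclidean distance between two points given as numeric vectors (x, y).
# Assumption: coordinates come from a projected CRS, so units are meters.
euclidean_distance <- function(a, b) {
  sqrt(sum((b - a)^2))
}

# Hypothetical example points (not taken from the Leipzig data)
point_a <- c(x = 0, y = 0)
point_b <- c(x = 3, y = 4)

d_euclid <- euclidean_distance(point_a, point_b)
print(d_euclid)  # 3-4-5 triangle, so the distance is 5
```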

Code

# Transform the Leipzig boundary to WGS 84 first, so that the bounding box,
# the OSM streets, and the points below all share the same CRS
leipzig <- st_transform(leipzig, 4326)

bb <- st_bbox(leipzig)

streets <- opq(bbox = bb) |>
  add_osm_feature(key = "highway") |>
  osmdata_sf()

# Keep only the larger streets (for speed) and clip them to the city boundary
leipzig_streets <- streets$osm_lines |>
  filter(highway %in% c("motorway", "primary", "secondary", "tertiary")) |>
  st_transform(4326) |>
  st_intersection(st_union(leipzig))

# Two example locations as sf points (longitude, latitude)
doeni <- st_sf(
  name = "Lieblingsdöner",
  geometry = st_sfc(st_point(c(12.3563, 51.3419)),
                    crs = 4326)
)

gwz <- st_sf(
  name = "GWZ",
  geometry = st_sfc(st_point(c(12.36845, 51.3319)),
                    crs = 4326)
)

# Plot everything
ggplot() +
  geom_sf(data = leipzig, fill = "lavender", color = "lightgray") +
  geom_sf(data = leipzig_streets, color = "bisque3", linewidth = 0.3) +
  geom_sf(data = doeni, color = "pink", size = 3) +
  geom_sf_text(data = doeni, aes(label = name), nudge_y = 0.005, size = 3) +
  geom_sf(data = gwz, color = "pink", size = 3) +
  geom_sf_text(data = gwz, aes(label = name), nudge_y = 0.005, size = 3) +
  theme_minimal()

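Using the sf package, we can compute the distance between the two points with st_distance(). Note that both points are stored in WGS 84 (a geographic CRS), so st_distance() returns the great-circle (geodesic) distance in meters rather than a planar Euclidean distance. Over such a short distance the two are nearly identical; to obtain a strictly Euclidean distance, transform both points to a projected CRS (for example UTM zone 33N, EPSG:32633) first.

dist <- st_distance(doeni, gwz)
dist

Units: [m]
         [,1]
[1,] 1396.006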

Manhattan distance

The Manhattan distance is a metric used to calculate the distance between two points along a grid-like path. In contrast to the Euclidean distance, the Manhattan distance is the sum of the absolute differences between the coordinates of the two points.

For two points \(A = (x_1, y_1)\) and \(B = (x_2, y_2)\) in the plane:

\[ d = |x_2 - x_1| + |y_2 - y_1| \]

More generally, the Manhattan distance between two n-dimensional vectors \(p\) and \(q\) is:

\[ d(p, q) = \sum_{i=1}^{n} |p_i - q_i| \]
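As a quick illustration, the Manhattan distance can be computed in base R as follows. This is a minimal sketch with made-up coordinates; the function name manhattan_distance is hypothetical, and the coordinates are assumed to be projected (in meters) on a grid aligned with the axes.

```r
# Manhattan (city-block) distance: sum of absolute coordinate differences.
# Works for vectors of any (equal) length.
manhattan_distance <- function(a, b) {
  sum(abs(b - a))
}

# Hypothetical example points (not taken from the Leipzig data)
point_a <- c(x = 0, y = 0)
point_b <- c(x = 3, y = 4)

d_manhattan <- manhattan_distance(point_a, point_b)
print(d_manhattan)  # |3 - 0| + |4 - 0| = 7
```

For the same pair of points the straight-line distance is 5, which shows that the Manhattan distance is never shorter than the Euclidean distance.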

Geodesic distance

The geodesic distance is the shortest distance between two points on a curved surface, measured along the surface itself, rather than a straight line through space.
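One common way to approximate the geodesic distance on the Earth is the haversine formula, which treats the Earth as a sphere. The sketch below is illustrative only: the function name haversine_m and the mean Earth radius of 6,371,008.8 m are assumptions, and an ellipsoidal model would give slightly different values.

```r
# Haversine great-circle distance in meters between two lon/lat points.
# Assumption: spherical Earth with a mean radius of 6,371,008.8 m.
haversine_m <- function(lon1, lat1, lon2, lat2, radius = 6371008.8) {
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  dphi <- (lat2 - lat1) * to_rad
  dlambda <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  2 * radius * asin(sqrt(a))
}

# The two Leipzig example points used in the Euclidean distance section
d_geo <- haversine_m(12.3563, 51.3419, 12.36845, 51.3319)
print(d_geo)
```

The result should be close to the value of roughly 1396 m that st_distance() reports for the same two points, since sf also measures along the Earth's surface when coordinates are in longitude/latitude.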

Network distance

Straight-line (Euclidean) distance is simple and fast to compute but doesn’t reflect the actual travel path people use. In real life, movement is constrained by the layout of the street network.

To capture network-constrained distances (e.g., for walking, cycling, or driving), we can use tools like dodgr (distances on directed graphs) or osrm, which allow for routing on real streets.

In this example, we use dodgr to calculate the shortest distance for cyclists between two points in Leipzig.

library(dodgr)

# Download road network for Leipzig
net <- dodgr_streetnet("Leipzig", expand = 0.05)

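# Weight the graph (e.g., with the bicycle profile)
graph <- weight_streetnet(net, wt_profile = "bicycle")

# Start and end points as (longitude, latitude)
from <- c(12.3731, 51.3397)
to <- c(12.4, 51.35)

# Shortest network distance between the two points, in meters
path_dist <- dodgr_dists(graph, from = from, to = to)
print(path_dist)

The resulting network distance is 2519.564 meters.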

Isochronic distance

Isochronic distance represents how far one can travel from a starting point within a certain amount of time, accounting for actual travel networks, speeds, and possibly time of day.

Isochrones are useful for operationalizing accessibility — for example, measuring how many schools or services are reachable within 10 minutes. They are vital in urban planning, transportation analysis, emergency response, and studying spatial inequalities.
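Counting how many services fall inside an isochrone is a point-in-polygon operation. The following sketch uses entirely made-up toy geometries (a square standing in for a 10-minute isochrone and three hypothetical schools) rather than real OSRM output; the object names iso_toy and schools_toy are assumptions for illustration.

```r
library(sf)

# Hypothetical 10-minute isochrone, drawn as a simple square
# in UTM zone 33N (EPSG:32633, units in meters)
iso_toy <- st_sf(
  isomax = 10,
  geometry = st_sfc(
    st_polygon(list(rbind(
      c(0, 0), c(1000, 0), c(1000, 1000), c(0, 1000), c(0, 0)
    ))),
    crs = 32633
  )
)

# Three hypothetical schools; the third lies outside the square
schools_toy <- st_sf(
  name = c("School A", "School B", "School C"),
  geometry = st_sfc(
    st_point(c(200, 200)),
    st_point(c(500, 900)),
    st_point(c(1200, 300)),
    crs = 32633
  )
)

# For each isochrone polygon, count the schools that intersect it
n_reachable <- lengths(st_intersects(iso_toy, schools_toy))
print(n_reachable)  # two of the three schools lie inside the square
```

With real data, the same pattern could be applied to the polygons returned by an isochrone query and to point features downloaded from OpenStreetMap, provided both layers share the same CRS.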

Calculating travel time is more complex than straight-line distance because it requires integrating transportation modes, routes, and speeds.

The osrm package provides access to travel time and distance data through routing services, enabling isochrone calculations.

Code

library(osrm)


# Create an sf POINT for Leipzig's city center
# (named leipzig_center so that it does not overwrite the leipzig boundary object)
leipzig_center <- data.frame(
  id = "leipzig",
  lon = 12.3731,
  lat = 51.3397
)

leipzig_sf <- st_as_sf(leipzig_center, coords = c("lon", "lat"), crs = 4326)


# Get isochrones (returns polygons for each time range)
iso <- osrmIsochrone(loc = leipzig_sf, breaks = c(5, 10, 15))



ggplot() +
  geom_sf(data = placebb, fill = NA, color = "grey30", linewidth = 0.8) +
  geom_sf(data = iso, aes(fill = factor(isomax)), color = NA, alpha = 0.6) +
  geom_sf(data = leipzig_sf, color = "black", size = 2) +
  scale_fill_brewer(palette = "YlOrRd", name = "Minutes") +
  labs(
    title = "Isochrones from Leipzig City Center",
    subtitle = "Areas reachable within 5, 10, and 15 minutes",
    caption = "Source: OpenStreetMap via OSRM"
  ) +
  theme_minimal()

For comparison, we can measure the area of these isochrone polygons using the core functions of the sf package.

Code

library(units)

# Split by isomax value
iso_5 <- iso |> filter(isomax == 5)
iso_10 <- iso |> filter(isomax == 10)
iso_15 <- iso |> filter(isomax == 15)


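# Project to UTM zone 33N (EPSG:32633) before measuring areas:
# Web Mercator (EPSG:3857) strongly inflates areas at Leipzig's latitude
iso_5_m <- st_transform(iso_5, 32633)
iso_10_m <- st_transform(iso_10, 32633)
iso_15_m <- st_transform(iso_15, 32633)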

area_5 <- st_area(iso_5_m)
area_10 <- st_area(iso_10_m)
area_15 <- st_area(iso_15_m)

# Print areas in square kilometers
print(set_units(area_5, km^2))
print(set_units(area_10, km^2))
print(set_units(area_15, km^2))

Voronoi-Cells

Voronoi cells (or Thiessen polygons) divide space such that each point in a polygon is closest to one specific service point. Each cell thus represents a catchment area of influence.

When we intersect these polygons with population data (e.g., census blocks or grids), we can approximately measure how many people are assigned (by proximity) to each facility.
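Belonging to a Voronoi cell is equivalent to being closest to that cell's facility, so the assignment logic can be sketched in base R without any polygons. The facilities, residential locations, and population counts below are invented for illustration, and coordinates are assumed to be projected (in meters).

```r
# Hypothetical facilities (e.g., hospitals) in projected coordinates (meters)
facilities <- data.frame(
  id = c("H1", "H2", "H3"),
  x = c(0, 1000, 500),
  y = c(0, 0, 1000)
)

# Hypothetical residential locations with population counts
residents <- data.frame(
  x = c(100, 900, 450, 600, 50),
  y = c(100, 50, 800, 700, 900),
  pop = c(120, 80, 200, 150, 60)
)

# Squared Euclidean distances: rows are residents, columns are facilities
d2 <- outer(residents$x, facilities$x, "-")^2 +
  outer(residents$y, facilities$y, "-")^2

# Assign each location to its nearest facility (its Voronoi cell)
residents$facility <- facilities$id[apply(d2, 1, which.min)]

# Total population per catchment
catchment_pop <- tapply(residents$pop, residents$facility, sum)
print(catchment_pop)
```

In this toy example most of the population falls into the catchment of H3. With real data, the same question is usually answered by intersecting the Voronoi polygons with population grids or census units.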

We can also compare real administrative zones (e.g. school catchments) to Voronoi-derived zones. Discrepancies reveal potential mismatch between planning and reality.

We can use st_voronoi() to generate Voronoi cells. With st_intersection(), we can then overlay the cells with population or income data.

# Combine features

hospital_areas <- bind_rows(
  hospitals$osm_polygons,
  hospitals$osm_multipolygons
)


# Get the centroid of each hospital feature
hospital_pts <- hospital_areas |>
  st_centroid() |>
  st_transform(4326)  # WGS84 for consistency


# Project to UTM zone 33N so that the Voronoi cells are computed in meters
hpts_proj <- st_transform(hospital_pts, 32633)

bb_proj <- st_transform(placebb, 32633)


# Create Voronoi diagram
voro_raw <- st_voronoi(st_union(hpts_proj), envelope = st_as_sfc(st_bbox(bb_proj)))

# Extract and convert to sf
voro_polygons <- st_collection_extract(voro_raw, "POLYGON")
voro_sf <- st_sf(geometry = voro_polygons, crs = 32633)

voro_clipped <- st_intersection(voro_sf, bb_proj)

# Plot Voronoi cells + hospital locations
ggplot() +
  # Background: study area outline (optional)
  geom_sf(data = placebb, fill = NA, color = "grey40", linetype = "dashed") +

  # Voronoi polygons
  geom_sf(data = voro_clipped, fill = "lightblue", color = "pink", alpha = 0.4) +

  # Hospital points
  geom_sf(data = hospital_pts, color = "purple", size = 2) +

  labs(
    title = "Voronoi Cells Around Hospitals in Leipzig",
    subtitle = "Each polygon shows the area closest to a hospital",
    caption = "Data: OpenStreetMap via osmdata"
  ) +
  theme_minimal()

Bonus: Advanced visualisation using leaflet

Leaflet is a powerful and flexible R package for creating interactive maps. It allows you to combine multiple map layers, including base maps, markers, polygons, and custom shapes. You can customize markers with icons, popups, and labels to provide more context. Leaflet supports different base tiles like OpenStreetMap, Stamen, and CartoDB, giving you options for map style and detail. You can add layers control so users can toggle different datasets on and off. It also supports clustering of many points to keep the map clean and performant.

Advanced visualizations can use polygons or polylines to show boundaries, routes, or regions. Leaflet allows dynamic styling of features based on attributes, such as coloring districts by population density. You can integrate leaflet with other R packages like sf for spatial data handling, making it easy to plot shapefiles or GeoJSON. It supports adding legends, scale bars, and even custom JavaScript for interactivity. Popups and tooltips provide detailed information when users hover over or click on map elements. Heatmaps and choropleth maps are possible through add-on packages or custom coding.

Leaflet maps can be saved as HTML files or embedded in R Markdown documents and Shiny apps for interactive web dashboards. Using leaflet proxies, you can update maps dynamically in response to user input without redrawing everything. Overall, leaflet provides a versatile environment to create maps from simple point plots to complex, multi-layered geographic visualizations.

Code

library(leaflet)      # interactive maps
library(htmlwidgets)  # provides saveWidget()

# Define Leipzig coordinates
leipzig_coords <- c(12.3731, 51.3397)  # lon, lat

# Coordinates of your favorite döner shop (example: replace with your actual location)
doener_coords <- c(12.3848, 51.3390)



# Create leaflet map
m <- leaflet() |>
        addProviderTiles(providers$CartoDB.Positron) |>
        # Add polygon of Germany outline (optional: simplified for this example)
        addPolygons(data = nuts0_de, # you can load Germany shape as sf object here for boundary
                    fillColor = "lavender", color = "black", weight = 2) |>
        # Add a circle around Leipzig to highlight it
        addCircles(lng = leipzig_coords[1], lat = leipzig_coords[2],
                   radius = 10000, color = "pink", fillOpacity = 0.2,
                   popup = "Leipzig") |>
        # Add a marker for your Lieblingsdoener
        addMarkers(lng = doener_coords[1], lat = doener_coords[2],
                   popup = "Mein Lieblingsdöner 🌯") |>
        # Set initial view roughly centered on Germany
        setView(lng = 10.0, lat = 51.0, zoom = 6)


saveWidget(m, "my_leaflet_map.html")

(15 minutes)

In groups of three, think of a sociological question where space plays an important role—either as a key variable or as a context.

Discuss how you might operationalize “space” in this question.
Reflect on the possible methods you could use to measure spatial aspects.
Identify potential limitations or challenges in using spatial data for your question.

References

Abbott, Andrew. 1997. “Of Time and Space: The Contemporary Relevance of the Chicago School*.” Social Forces 75 (4): 1149–82. https://doi.org/10.1093/sf/75.4.1149.

Frank, Lawrence D., Martin A. Andresen, and Thomas L. Schmid. 2004. “Obesity Relationships with Community Design, Physical Activity, and Time Spent in Cars.” American Journal of Preventive Medicine 27 (2): 87–96. https://doi.org/10.1016/j.amepre.2004.04.011.

GESIS-Leibniz-Institut Für Sozialwissenschaften. 2019. “ALLBUS/GGSS 2018 (Allgemeine Bevölkerungsumfrage der Sozialwissenschaften/German General Social Survey 2018).” GESIS Data Archive. https://doi.org/10.4232/1.13250.

Hernangómez, Diego. 2020. “giscoR: Download Map Data from GISCO API - Eurostat.” Comprehensive R Archive Network. https://doi.org/10.32614/CRAN.package.giscoR.

Logan, John R. 2012. “Making a Place for Space: Spatial Thinking in Social Science.” Annual Review of Sociology 38 (1): 507–24. https://doi.org/10.1146/annurev-soc-071811-145531.

Hijmans, Robert J., Márcia Barbosa, Aniruddha Ghosh, and Alex Mandel. 2021. “Geodata: Download Geographic Data.” Comprehensive R Archive Network. https://doi.org/10.32614/CRAN.package.geodata.

McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. 2001. “Birds of a Feather: Homophily in Social Networks.” Annual Review of Sociology 27: 415–44. https://www.jstor.org/stable/2678628.

Moraga, Paula. 2024. Spatial Statistics for Data Science: Theory and Practice with R. First edition. Chapman & Hall/CRC Data Science Series. Boca Raton: CRC Press, Taylor & Francis Group.

Moraga, Paula, and Laurie Baker. 2022. “Rspatialdata: A Collection of Data Sources and Tutorials on Downloading and Visualising Spatial Data Using R.” https://f1000research.com/articles/11-770. https://doi.org/10.12688/f1000research.122764.1.

Padgham, Mark, Bob Rudis, Robin Lovelace, Maëlle Salmon, Joan Maspons, Andrew Smith, James Smith, et al. 2023. “Osmdata: Import ’OpenStreetMap’ Data as Simple Features or Spatial Objects.”

Pebesma, Edzer. n.d. “Simple Features for R.” https://r-spatial.github.io/sf/articles/sf1.html. Accessed May 27, 2025.

Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. 1st ed. New York: Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.

Reardon, Sean F., and David O’Sullivan. 2004. “Measures of Spatial Segregation.” Sociological Methodology 34 (1): 121–62. https://doi.org/10.1111/j.0081-1750.2004.00150.x.

Snow, John. 1855. On the Mode of Communication of Cholera. London: John Churchill.

Tobler, W. R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46: 234–40. https://doi.org/10.2307/143141.

White, Michael. 1983. “The Measurement of Spatial Segregation.” American Journal of Sociology 88 (5): 1008–18. https://www.jstor.org/stable/2779449.

Week 05 - Spatial analysis I

What is Spatial Data?

Coordinate Reference Systems

GIS

Types of spatial data

The `sf package`

Geometry creation

Geometric confirmation

Geometric operations

Add data to simple features

Sources of data collection

Enriching survey data with administrative boundaries

Climatic data

OpenStreetMap Data with `osmdata`

Measuring space

Euclidean distance

Manhattan distance

Geodesic distance

Network distance

Isochronic distance

Voronoi-Cells

Segregation

Bonus: Advanced visualisation using leaflet

References

Copyright
