Maybe sv-map is done now?

Date: 2025-02-07

For this year I have a soft intent of completing one project per month, along with a short blog post. Don't call it a resolution!

I have many long-running projects that just need a few touches to be called done, so I think it'll be possible to do one per month, though I'm not going to stress if I don't reach it.

For January, I finished sv-map, the street view blue line archive. With sv-map, you can compare Google Street View evolution over time.

Changes since July 2024

sv-map has already been running for several years, but over the past few months I reworked the backend to make it as maintenance-free as possible.

Downloading tiles

Google Maps uses a slippy map system. This is a way to divide a very large "image" (the entire world) into tiles at different zoom levels that can be sent quickly over the internet. All the way zoomed out, the entire world fits in a single tile:

Every time you zoom in, each tile gets split into quarters, and more detail is visible.

We can download these images; for street view, the tiles contain just the blue coverage lines on a transparent background.

To determine what to download, we need to set some limits. Google Maps can zoom in to level 22, where the world is 2^22 = 4,194,304 tiles across. That many tiles horizontally and vertically means about 17.6 trillion (2^44) tiles to cover the whole earth! It's not possible to archive that much every day. We need to pick a maximum zoom level we're happy with, and if we have a strategy to avoid downloading unnecessary tiles, we can afford a more detailed zoom level!
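
As a quick back-of-the-envelope check (just illustrative TypeScript, not part of the archiver):

// A slippy map at zoom level z is 2^z tiles wide and 2^z tiles tall
const tilesAcross = (zoom: number) => 2 ** zoom
const totalTiles = (zoom: number) => tilesAcross(zoom) ** 2

console.log(totalTiles(0))  // 1: the whole world in one tile
console.log(totalTiles(10)) // 1,048,576
console.log(totalTiles(22)) // 17,592,186,044,416: ~17.6 trillion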

Reducing the number

By just using Google Maps, I found that for the most part, blue lines stay when you zoom out. That means, if there is a very small amount of street view coverage in an isolated place, you will still see it at very zoomed-out levels: blue lines get simplified and clustered, but not dropped. The vast majority of the world has no street view, or even roads.

This gives us our first strategy. We can start downloading at zoom level 0, where the whole world fits in a single tile, and then recursively drill down to the next zoom level. If we download a street view tile that has no blue pixels at all, we do not need to drill down further, as we can assume that all tiles that we could explore there will also be empty. This simple idea reduces the amount we need to download by about 95%.
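
A sketch of that recursion might look like this (fetchTile, hasBluePixels, and storeTile are hypothetical stand-ins for the real archiver code):

// Hypothetical helpers; the real archiver downloads Google's coverage tiles
// and inspects their pixels
declare function fetchTile(z: number, x: number, y: number): Promise<Uint8Array>
declare function hasBluePixels(tile: Uint8Array): boolean
declare function storeTile(z: number, x: number, y: number, tile: Uint8Array): Promise<void>

const MAX_ZOOM = 10

async function archiveTile(z: number, x: number, y: number): Promise<void> {
  const tile = await fetchTile(z, x, y)
  // An empty tile means every tile underneath it is empty too: stop here
  if (!hasBluePixels(tile)) return
  await storeTile(z, x, y, tile)
  if (z === MAX_ZOOM) return
  // Each tile splits into four children at the next zoom level
  for (const [cx, cy] of [
    [2 * x, 2 * y],
    [2 * x + 1, 2 * y],
    [2 * x, 2 * y + 1],
    [2 * x + 1, 2 * y + 1],
  ]) {
    await archiveTile(z + 1, cx, cy)
  }
}

// Zoom level 0: the single tile covering the whole world
await archiveTile(0, 0, 0)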

Experimentally, I found that zoom level 10 gives a reasonable amount of detail, so that individual towns in densely covered areas are visible, while not resulting in a prohibitive amount of tiles to download.

To archive at this level, we need to download about 75,000 tiles per day. That's a lot, but it's not something that would be a problem for a website operating at Google-scale.

All downloaded tiles go into a PMTiles archive. PMTiles is a single-file archive format for slippy maps. A PMTiles archive for the street view blue lines up to zoom level 10 is about 300MB.

Diffing tiles

Displaying the difference between tiles is actually very simple. Tiles have blue pixels if there's a blue line, and transparent pixels if there's no blue line.

On the client side, we can load the archived tiles for two different dates, and then create a new tile with the Canvas API, simply by comparing each pixel:

Old tile       New tile       Output
Blue           Blue           Blue
Blue           Transparent    Red
Transparent    Blue           Pink
Transparent    Transparent    Transparent

This gives minor artifacting at the edges of blue lines, but it's almost perfect for a human exploring the data.
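
A minimal sketch of that per-pixel comparison (assuming both tiles have already been decoded into same-sized ImageData objects; the exact colours are placeholders):

// Compare two tiles pixel by pixel, following the table above.
// A pixel counts as "blue" if it is not fully transparent (alpha > 0).
function diffTiles(oldTile: ImageData, newTile: ImageData): ImageData {
  const out = new ImageData(oldTile.width, oldTile.height)
  for (let i = 0; i < out.data.length; i += 4) {
    const inOld = oldTile.data[i + 3] > 0
    const inNew = newTile.data[i + 3] > 0
    if (inOld && inNew) out.data.set([0, 102, 255, 255], i)         // still there: blue
    else if (inOld && !inNew) out.data.set([255, 0, 0, 255], i)     // removed: red
    else if (!inOld && inNew) out.data.set([255, 105, 180, 255], i) // added: pink
    // neither: stays transparent (ImageData starts out all zeroes)
  }
  return out
}

The resulting ImageData can then be drawn onto a canvas with putImageData and used as the tile for that area.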

Making it run forever

All of the above has basically been done for about a year already, with some iterations before that. In the past, I used a SQLite database to store all the tiles. This worked pretty well, and let me reuse tiles that didn't change (which is the vast majority on any given day), by using a table as a content-addressable store.
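
The content-addressable idea was roughly this (a hypothetical sketch, not the actual schema): store each unique tile image once, keyed by a hash of its bytes, and let each day's index reference tiles by hash.

import Database from 'better-sqlite3'

// Hypothetical schema; the real database differed in the details
const db = new Database('tiles.db')
db.exec(`
  -- Each distinct tile image is stored exactly once, keyed by its content hash
  CREATE TABLE IF NOT EXISTS tile_data (
    hash BLOB PRIMARY KEY,
    png  BLOB NOT NULL
  );
  -- A day's archive is just a mapping from (date, z, x, y) to a content hash,
  -- so a tile that didn't change costs only one small index row per day
  CREATE TABLE IF NOT EXISTS tile_index (
    date TEXT NOT NULL,
    z INTEGER NOT NULL, x INTEGER NOT NULL, y INTEGER NOT NULL,
    hash BLOB NOT NULL REFERENCES tile_data(hash),
    PRIMARY KEY (date, z, x, y)
  );
`)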

I did have a few problems, though:

  1. Serving tiles from the database required a server application.
  2. All tiles were stored on fast, expensive NVMe storage when the vast majority of tiles are almost never accessed.
  3. Periodically, the database would reach the size of the volume I provisioned for it, requiring manual intervention (and increased storage cost).
  4. The server application and the archiver ran on an always-on VPS, which requires periodic manual maintenance for security updates etc.

Some months ago, I moved from SQLite to PMTiles. The archiver downloads all the tiles to a PMTiles file, which gets stored on S3. This solves problems #1, #2, and #3: S3 storage is "bottomless", so I never need to increase the available disk space. PMTiles files can be read directly by clients using HTTP Range requests, so I no longer need a server application.
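
For the client, reading a tile straight from S3 looks roughly like this with the pmtiles JavaScript library (the archive URL is a placeholder, and this is a sketch of the idea rather than the site's actual code):

import { PMTiles } from 'pmtiles'

// Placeholder URL; each day's archive is its own PMTiles file in the bucket
const archive = new PMTiles('https://example-bucket.s3.amazonaws.com/2025-02-07.pmtiles')

// The library issues HTTP Range requests: first for the archive's directory,
// then for the byte range of the tile itself, so no server sits in between
const tile = await archive.getZxy(10, 536, 347)
if (tile) {
  const png = new Blob([tile.data], { type: 'image/png' }) // decode / draw from here
}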

In January, I also finally resolved problem #4. I essentially wanted a managed Cron job that could run the archiver at a set time every day. I used GitHub Actions for a while, which actually worked great, but scheduled workflows stop running if the repository has no activity for too long.

So, the goal was to use Amazon ECS scheduled tasks. If I made a change to the archiver, I'd just push a new Docker image to the container registry, and the scheduled task would pick it up automatically the next day.

I poked around the AWS console for several hours trying to set everything up, but all the JSON configs and all the different services that are apparently involved kind of broke my brain. I'm not looking to become an AWS expert. So, I reached for a tool that I never thought I'd use, as someone who mostly runs small side projects: SST.

SST

SST is mainly used for deploying serverless infrastructure. It turns out that my little side project, which was once a single Rust binary I manually copied onto a single server, is now technically a "serverless" application. And SST can write all the tricky JSON configs for me!

With SST, you write the end-state of what you want your setup to be in TypeScript. SST then figures out how to get there.

I found it actually really nice to have TypeScript here. I knew that I wanted a scheduled task for the archiver, and a bucket to store the PMTiles archives. So I could start writing my sst.aws.Bucket and sst.aws.Cron resources, and TypeScript would tell me what other resources I needed because they are required arguments. It's much, much easier than getting lost in the AWS console and only finding out I'm missing a certain resource on page 3 of the setup. I definitely would not think to start with whatever a VPC is if it wasn't for this.

// ECS tasks need a VPC and a cluster to run in
const vpc = new sst.aws.Vpc('SvTrackerVpc')
const cluster = new sst.aws.Cluster('SvTrackerCluster', { vpc })

const bucket = new sst.aws.Bucket('SvArchiveBucket', {
  access: 'public',
  cors: {
    // These are needed to load parts of the PMTiles archive
    allowHeaders: ['range', 'if-match'],
    allowMethods: ['GET', 'HEAD'],
    allowOrigins: ['https://sv-map.netlify.app'],
    exposeHeaders: ['etag'],
    maxAge: '3000 seconds',
  },
})

const archiveTask = cluster.addTask('Archive', {
  // Linking the bucket lets the archiver container write the PMTiles files to it
  link: [bucket],
  image: { context: '.', dockerfile: 'Dockerfile' },
})

new sst.aws.Cron('SvTrackerArchive', {
  task: archiveTask,
  // 12:00 (noon) UTC daily
  schedule: 'cron(00 12 * * ? *)',
})

That's where we're at now! It seems to be working fine, and it should keep working forever!