Work with full-history OpenStreetMap files

I’m working with a small team to develop ProjetDuMois.fr (project of the month), which encourages thematic contributions to OpenStreetMap in France over the course of a month. The website will offer the community a dashboard with contribution statistics, a web map for efficient mapping, and badges for gamification. To offer precise and up-to-date statistics, we had to start working with full-history OpenStreetMap files (.osh.pbf). This was a new topic for me, and I will share the journey here, hoping to make it easier for others to work with these files.

What are full-history OpenStreetMap files?

As a reminder, OpenStreetMap is a worldwide geographic database, sometimes called the Wikipedia of maps. Anyone can contribute and make it better. Like many other collaborative systems, it keeps a history of all changes. When someone edits the map, a changeset is created to describe what has been edited. Accumulating all changesets leads you to the current state of the database.

10 years of OpenStreetMap edits

Most of the time, when you want to play with OpenStreetMap data, you’re interested in the current state of the database. This is available in many forms and from many providers. These extracts are a perfect fit for creating web maps, geographical or statistical analysis, routing and so on. But what if you want to look at OpenStreetMap’s history? What if you want to see how OSM looked 1 week ago? 1 month ago? Or 5 years ago? In that case, you need to use full-history files. These files differ from simple extracts in that they accumulate all changesets since the very beginning of OSM (all changes since 2007; data from 2004 to 2007 is provided in its 2007 state). They are quite like a history book that you can read with the appropriate toolbox. Full-history files offer a rich set of information:

  • Geometry (nodes, ways, relations) and their attributes (tags)
  • In every version they went through
  • With change metadata (user, date, software used…)

This is a literal goldmine for statistics. And as with all treasures, you have to go on a quest before getting it!

Note (2020/08/18): this article goes into the technical details of manipulating OSM historical files. If you want a simpler way to do OSM historical analysis, you can try the Ohsome API.

Picture by Sajan Shakya24 via Wikimedia Commons under CC-By-SA

Getting full-history OSM files

Let’s get practical, and see how we can work with full-history files, beginning with download.

Manual downloading of full-history files

Depending on what you want to do, you might be interested in different files :

Note that, as these files contain personal information about European Union citizens, you may need to comply with the General Data Protection Regulation (GDPR) if you go beyond playing around with these files on your personal computer. That’s also why Geofabrik extracts are only accessible with an OpenStreetMap account. To quote their homepage:

These files may only be used for OpenStreetMap internal purposes, e.g. quality assurance. You must ensure that derived databases and works based on these files are only accessible to OpenStreetMap contributors.

Now that the legal aspects are clear, let’s continue our journey. In this article, we will work with a regional extract, because regional extracts are lighter to download and process. All the instructions below are the same for the world file. So let’s go to Geofabrik: what we are looking for is a file called your_region-internal.osh.pbf. These files can be found on the dedicated page of your area or country.

Link to download « Bretagne » area (French subdivision)

And that’s it for downloading manually. If you’re not interested in automating this part, you can skip to the section « Completing full-history file ».

Automatic download of full-history files

As you have seen, we first need to log in with an OSM account to get regional extracts with user metadata. The process is straightforward when done by hand, but needs more steps when done through a script. Let’s see how you can automate this task.

To protect its files with a login page, the Geofabrik team has created sendfile-osm-oauth-protector, an Apache module that enables OpenStreetMap authentication. Here we are only interested in the Python client, which enables file access with an OSM login and password. Before going further, you have to download the software repository and install the client dependencies.

Once done, you should be able to use the Python script called oauth_cookie_client.py. Its job is to give you cookies (yummy!) which we will use later for downloading the OSM full-history file. Run the following command to generate a cookie file:

python3 oauth_cookie_client.py \
	-u "MyOSMAccount" -p "MyOSMPassword" \
	-c https://osm-internal.download.geofabrik.de/get_cookie \
	-o cookie.txt

The cookie file will contain a token which enables you to connect to the Geofabrik site. Now, we can download the regional extract of our choice from the command line, for example with wget. For some (unknown) reason, the generated cookie file can’t be used directly by wget, which is why the command processes it before use. So, run the following command (just change the URL to get the extract you want):

wget -N --no-cookies --header "Cookie: $(cat cookie.txt | cut -d ';' -f 1)" https://osm-internal.download.geofabrik.de/europe/france/bretagne-internal.osh.pbf
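The two commands above can be wrapped into a pair of small shell functions. This is only a sketch: the function names and the OSM_USER and OSM_PASSWORD environment variables are my own assumptions for illustration, not something provided by the Geofabrik tools. It also shows what the cut call does: it keeps the part of the cookie file located before the first semicolon.

```shell
# Sketch only: cookie_header, download_extract, OSM_USER and OSM_PASSWORD
# are hypothetical names chosen for this example.

# Keep only the first ';'-separated field of the cookie file,
# exactly as the cut call in the wget command above does.
cookie_header() {
    cut -d ';' -f 1 "$1"
}

# Usage: download_extract <url of the -internal.osh.pbf file>
# Expects oauth_cookie_client.py in the current directory.
download_extract() {
    python3 oauth_cookie_client.py \
        -u "$OSM_USER" -p "$OSM_PASSWORD" \
        -c https://osm-internal.download.geofabrik.de/get_cookie \
        -o cookie.txt || return 1
    wget -N --no-cookies \
        --header "Cookie: $(cookie_header cookie.txt)" "$1"
}
```

Once the functions are loaded in your shell, calling download_extract with the URL of the Bretagne extract should do the same job as the two manual steps.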

You’re good to go, the automatic way!

Completing full-history file

Now, we have a full-history file of OpenStreetMap. Full-history… really? Well, not really. As said above, everything before 2007 is summarized as how the database looked when the API changed from 0.5 to 0.6. But that’s not the worst part… The history only runs until the last export, which is made available roughly every weekend. And this export contains data from 4 days before the extract is made available. So the full-history file contains data until ~7 to 14 days ago, which is fine for many uses. But the statistics I was looking for need to be at most 1 day late, so I needed to catch up on the missing data.

If you’re not interested in catching up on very recent data, you can go directly to the section « Playing around with your full-history file ».

So, to complete our history file, we need to retrieve very fresh data. OpenStreetMap’s freshest data is available through its replication system (also called diff files). Replication files are minutely, hourly or daily XML files which contain all changes that happened to the database during a given amount of time. These files are made available in OsmChange format (.osc.gz), basically the same as classic OSM XML files, but with an indication of what to create, edit or delete. Like other data files from OSM, they can be found either on the official website or through mirrors.

What we will do here is complete our history file using replication files. There are several ways to do so; here is the most convenient one I found. We will use two tools: Osmupdate and Osmium. The first one generates a complete OsmChange file, from the minute your full-history file stops until now. The second one merges this OsmChange file with your full-history file to get you a single, up-to-date, full-history file.

Let’s start with Osmupdate. You can install it following instructions here (or through your distribution package, sometimes it is called osmctools). Once done, we will ask it to download recent changes from OSM :

osmupdate --keep-tempfiles --day -t=/somewhere/to/store/diff/files -v /your/current/full-history.osh.pbf /path/to/resulting/changes.osc.gz

Note the --day option: it only retrieves daily replication files. This gets your history up to last midnight, which is enough for my use case. You can change it to --minute or --hour if you want to, but retrieval will take longer because many small files are downloaded. You can also use a source other than the official one with the --base-url option. It’s better if the mirror offers daily and hourly replication files, otherwise catching up will take quite long.

Osmupdate looks at your full-history file, checks the date of the last contained data, then downloads everything that is missing and produces the desired OsmChange file. This file contains more or less data depending on whether you use a mirror with regional data or the official website with planet data. In my case, I used the main website, so the file contains planet data. However, I’m working with a regional extract. Merging it as is would result in a mix of regional and worldwide data, which would be inconsistent. I first need to filter the changes to keep only regional ones, and that’s where Osmium comes to the rescue.

Install and configure Osmium according to its documentation. Before going further, we also need another small file named your-region.poly. It is available on Geofabrik next to the extract you chose. This file describes the area of your region, which is what we need to cut the planet data. Once it is downloaded, you can run this Osmium command to reduce the OsmChange file so that it only contains changes in your region:

osmium extract -p bretagne.poly -s simple changes.osc.gz -O -o changes.local.osc.gz

Depending on the size of your region and how old your full-history file is, this file will be more or less heavy. The last step is to merge this changes file with your full-history file, to make it really complete:

osmium apply-changes -H /your/current/full-history.osh.pbf changes.local.osc.gz -O -o your-new-complete.osh.pbf

The merge can take a few minutes depending on the input file size and the capacity of your computer. Now, that’s what I call a full-history OpenStreetMap file, so here starts our time travel through OSM data!
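The three steps of this section (osmupdate, osmium extract, osmium apply-changes) can be chained in one function, for example to run them from a daily cron job. The following is a minimal sketch using exactly the commands shown above. The function names, the DRY_RUN variable and the working directory layout are assumptions of mine, not part of either tool.

```shell
# Sketch only: run, update_history and DRY_RUN are hypothetical names.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: update_history <old.osh.pbf> <region.poly> <new.osh.pbf> [workdir]
# Note: variables are global, as plain POSIX sh has no local keyword.
update_history() {
    old=$1; poly=$2; new=$3; work=${4:-./diffs}
    # 1. Download every daily diff published since the history file ends
    run mkdir -p "$work" &&
    run osmupdate --keep-tempfiles --day -t="$work/temp" -v \
        "$old" "$work/changes.osc.gz" &&
    # 2. Keep only the changes located inside the region polygon
    run osmium extract -p "$poly" -s simple "$work/changes.osc.gz" \
        -O -o "$work/changes.local.osc.gz" &&
    # 3. Merge the regional changes into the full-history file
    run osmium apply-changes -H "$old" "$work/changes.local.osc.gz" \
        -O -o "$new"
}
```

Setting DRY_RUN=1 before calling update_history prints the commands instead of running them, which is handy to check the paths before launching a long download.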

Picture by JMortonPhoto.com & OtoGodfrey.com via Wikimedia Commons under CC-By-SA

Playing around with your full-history file

These files open up many possibilities; we will see two of them here. The Osmium documentation gives a pretty good overview of what is possible and how to proceed. Let’s start with a simple use case.

Counting features at a certain time

For ProjetDuMois.fr, I wanted to show how the number of features in OSM evolved over a certain period. To do so, we will use two Osmium commands: time-filter and tags-count. What they do is pretty self-explanatory, but you can read their documentation using, for example, osmium time-filter --help. Let’s say we want to know how many charging stations there were on 1 August 2020. Just run:

osmium time-filter your-full.osh.pbf 2020-08-01T00:00:00Z -o - -f osm.pbf \
    | osmium tags-count - -F osm.pbf amenity=charging_station

Note that we really have two separate steps here: first we get a classic .osm.pbf file of what the data looked like at that time, then this result is sent to the counting operation. This processing can be automated with a loop to get a count for every day or every hour for a certain type of feature. If you want to loop over time, you may want to pre-filter your full-history file to keep only the features you’re counting (for faster processing). This can be done with this command:

osmium tags-filter your-full.osh.pbf amenity=charging_station -O -o charging_stations.osh.pbf
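Here is a minimal sketch of such a loop, producing one count per day of a month as CSV lines. The function names are hypothetical. I also assume that the first tab-separated column printed by osmium tags-count is the count, so check the output of your Osmium version before relying on it.

```shell
# Sketch only: month_days and count_over_time are hypothetical names.

# Print one timestamp (midnight UTC) per day of a month.
# Usage: month_days <YYYY-MM> <number_of_days>
month_days() {
    d=1
    while [ "$d" -le "$2" ]; do
        printf '%s-%02dT00:00:00Z\n' "$1" "$d"
        d=$((d + 1))
    done
}

# Read timestamps on stdin and print "timestamp,count" for each of them.
# Usage: month_days 2020-08 31 | count_over_time <file.osh.pbf> <key=value>
# Assumption: tags-count prints the count in its first tab-separated column.
count_over_time() {
    while read -r ts; do
        n=$(osmium time-filter "$1" "$ts" -o - -f osm.pbf < /dev/null \
            | osmium tags-count - -F osm.pbf "$2" | cut -f 1)
        # No output line means no feature had the tag at that time
        echo "$ts,${n:-0}"
    done
}
```

For example, month_days 2020-08 31 | count_over_time charging_stations.osh.pbf amenity=charging_station > counts.csv would give the daily evolution for August 2020, working on the pre-filtered file for speed.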

List all changes on certain features

For a leaderboard of contributors on a specific theme, I needed to know who edited a certain type of feature over time, whether by creating, editing or deleting it. Getting a complete view of changes takes several steps, because we want to keep track of each feature before and after it gets the wanted tag:

  • Extract features over time having wanted tags
  • Summarize this as a list of OSM IDs
  • Extract list of all changes happening to features based on their ID
  • Convert result into a single CSV file

So the first step is to extract features based on their tags. We’ve seen it before: it’s done with the Osmium tags-filter command:

osmium tags-filter your-full.osh.pbf -R amenity=charging_station -O -o extract_filtered.osc.gz

Note the -R option to only keep features themselves and not their dependencies (nodes being part of a way, nodes and ways part of a relation).

We now have an OsmChange file, which is not really convenient for the next steps. We need a simple list of OSM IDs to use osmium getid. To transform the XML file into a simple list of IDs, we will use the xsltproc command, an old-fashioned tool for manipulating XML. Download the XSLT file here and then launch this command:

xsltproc osc2ids.xslt extract_filtered.osc.gz | sort | uniq  > extract_osm_ids.txt
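If you would rather not use XSLT, here is an alternative sketch that is not the method used in this article. It relies on Osmium’s OPL text format, where each line describes one object version and starts with the type letter immediately followed by the ID (n123, w45, r6). As far as I know, osmium getid accepts ID files written that way. The function name and the extract_filtered.osh.pbf file name are assumptions made for this example.

```shell
# Sketch only: opl_ids is a hypothetical helper, not an Osmium command.

# Read OPL lines on stdin and print each object ID once.
# The ID (with its n/w/r type letter) is the first space-separated field.
opl_ids() {
    cut -d ' ' -f 1 | sort -u
}

# Possible usage, with the filtered history written as .osh.pbf:
#   osmium tags-filter your-full.osh.pbf -R amenity=charging_station \
#       -O -o extract_filtered.osh.pbf
#   osmium cat extract_filtered.osh.pbf -f opl | opl_ids > extract_osm_ids.txt
```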

The xsltproc pipeline transforms the XML into a list of IDs, then sorts and deduplicates it. Now, we can use Osmium to filter features by their IDs using this list:

osmium getid your-full.osh.pbf -H -i extract_osm_ids.txt -O -o extract_osm_ids.osc.gz

The getid command reads the list of IDs, finds the matching features in your full-history file, then writes them into an OsmChange file. Note that if you work with a large dataset, Osmium can be killed because it uses too much memory. Splitting the list of IDs into smaller files (with the split utility for example) and looping over osmium getid solves the issue. Now, you have an OsmChange file containing only the history of features which had, at some point, the tag you’re looking for. The last part is to make this usable in classic spreadsheet software or a database. We transform it into a CSV file, again using xsltproc. You can find another XSLT configuration file here for this part, then run:

xsltproc osc2csv.xslt extract_osm_ids.osc.gz > changes.csv

Awesome, you can see whoever created, edited or deleted a charging station in your area!
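If osmium getid runs out of memory during the ID lookup step, the splitting workaround mentioned above can be sketched as follows. The function names, chunk size and file prefix are illustrative assumptions. Merging the partial results with osmium merge-changes is my own choice for getting back a single OsmChange file to feed into xsltproc.

```shell
# Sketch only: split_ids and getid_chunked are hypothetical names.

# Split an ID list into files of at most <size> lines,
# named <prefix>aa, <prefix>ab, and so on.
# Usage: split_ids <ids.txt> <size> <prefix>
split_ids() {
    split -l "$2" "$1" "$3"
}

# Run osmium getid once per chunk, then merge the partial results.
# Usage: getid_chunked <file.osh.pbf> <prefix> <out.osc.gz>
getid_chunked() {
    for chunk in "$2"*; do
        # Skip result files left over from a previous run
        case $chunk in *.osc.gz) continue ;; esac
        osmium getid "$1" -H -i "$chunk" -O -o "$chunk.osc.gz" || return 1
    done
    osmium merge-changes "$2"*.osc.gz -O -o "$3"
}
```

For example, split_ids extract_osm_ids.txt 100000 ids_chunk_ followed by getid_chunked your-full.osh.pbf ids_chunk_ extract_osm_ids.osc.gz would replace the single osmium getid call. Tune the chunk size to the memory available on your machine.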

Conclusion

We made it through full-history OpenStreetMap data analysis. We saw together where to find these files, how to complete them so they are up to date, and what kind of processing we can do. They need quite a different toolbox compared to other OSM files, but they open up a whole new world of possibilities. Remember that they should only be used for analysis that serves the community. When using them, keep in mind that you’re manipulating metadata about people’s workflows and daily contributions. With great power comes great responsibility! Also, thanks to Frederik Ramm and Michael Reichert for their help on using Geofabrik extracts behind the OSM login system.
