Aligning German Power Plant Data Sets

This page is an effort to integrate the work that we've done on locating open data about power plants. It will also survey the comprehensiveness of the different sources, which range from "officially" compiled data sets to crowdsourced data.

This is also a good excuse to further our work on the Enipedia Power Plant Dataset Reconciliation API and to figure out how to improve the work on Enipedia Maps to better help with aligning/comparing data sets. There's quite a bit of work here which can be reused. One of the big issues with aligning data sets is that you often have to use many contextual clues to disambiguate power plants. Being able to compare data sets via maps, tables/spreadsheets and visualizations can really help to bring these together.

List of sources:

Linking to Identifiers used across Different Data Sets

Information about links to external data sets could possibly be managed via Semantic Internal Objects which contain the fields below. The current method is to specify a new property for each identifier, but this isn't a very scalable solution, as people would have to edit the PowerplantTest template, which is currently locked down as it is used on 50,000 pages.

  • Identifier - the identifier used within the external data set.
  • Source - link to some page/URL that further describes the source.
  • Mapping type - same (default), broader, narrower. This is especially needed for things like wind farms built in stages. There needs to be some way to specify whether something is a subset of an entry in a different data set.
  • Notes - loose text to provide context
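
As a rough illustration of the fields listed above, the record could look like the following Python sketch. This is not how Semantic Internal Objects are actually declared on the wiki; the class names, the example identifier and the example URL are all hypothetical, and the reading of "narrower" (the external entry covers only part of the Enipedia entry) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class MappingType(Enum):
    SAME = "same"          # both entries describe the same thing (default)
    BROADER = "broader"    # assumed: external entry covers more than the Enipedia entry
    NARROWER = "narrower"  # assumed: external entry covers only part of the Enipedia entry


@dataclass(frozen=True)
class ExternalLink:
    """One link from an Enipedia page to an entry in an external data set."""
    identifier: str                               # identifier used within the external data set
    source: str                                   # page/URL that further describes the source
    mapping_type: MappingType = MappingType.SAME  # defaults to "same"
    notes: str = ""                               # loose text to provide context


# Hypothetical example: an external entry describing one stage of a wind farm.
stage_link = ExternalLink(
    identifier="example-123",
    source="https://example.org/dataset",
    mapping_type=MappingType.NARROWER,
    notes="Covers the first construction stage only",
)
```

Keeping the links as separate records means a page can carry any number of them without a new property having to be added to the template for each data set.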

OpenStreetMap

The OSM data for Germany is incredibly extensive, in some areas even showing PV systems on houses, with information about their capacity. To align the data, it would be helpful to split it into different layers/files based on properties such as fuel type.

For linking OSM identifiers to other data sets, it's probably necessary to record whether the ID refers to a node, way, or relation. I'm currently not sure whether the type of object can be retrieved given only a numeric ID. Doing this correctly is a bit complicated:

"What if you want to link to a specific object in OpenStreetMap? You can but you shouldn't use an object ID, because the OSM IDs may change at any time. In the context of license change, this even may happen rather frequently. Also, every split of a way leaves one half with a different ID than the original way.

The solution is to link to the object with a certain property, usually a certain combination of tags. If a unique object exists, you are redirected to the object's web page at Otherwise, if for example the referred way has been split, a search result page shows all possible objects. As an extra, you can customize both the link target and the appearance of the result page."

If you can't reliably link to an object, then this becomes more of an issue of making the power plants in OpenStreetMap less anonymous. This means figuring out how to create quasi-identifiers that can be used to uniquely identify something. This also ties into the concept of k-Anonymity. While the data sets that we're dealing with aren't anonymized, the common lack of unique identifiers and the differing levels of detail mean that we're practically facing the same types of problems as researchers who try to determine the level of anonymity within a data set.
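
To make the k-anonymity idea concrete: a data set is k-anonymous with respect to a set of quasi-identifiers if every record shares its quasi-identifier values with at least k-1 other records. For alignment we want the opposite of anonymity, namely k = 1 with every group of size one. The sketch below assumes a quasi-identifier built from fuel type and rounded coordinates; that choice, the field names and the sample records are illustrative assumptions, not real data.

```python
from collections import Counter


def quasi_identifier(plant, decimals=2):
    """Combine fuel type and rounded coordinates into a candidate quasi-identifier."""
    return (plant["fuel"], round(plant["lat"], decimals), round(plant["lon"], decimals))


def group_sizes(plants, decimals=2):
    """Count how many plants share each quasi-identifier value."""
    return Counter(quasi_identifier(p, decimals) for p in plants)


def k_anonymity(plants, decimals=2):
    """Smallest group size; 0 for an empty data set."""
    return min(group_sizes(plants, decimals).values(), default=0)


def ambiguous_groups(plants, decimals=2):
    """Quasi-identifier values that do not single out exactly one plant."""
    return [key for key, n in group_sizes(plants, decimals).items() if n > 1]


# Made-up records: two nearby wind turbines and one coal plant.
sample = [
    {"name": "A", "fuel": "wind", "lat": 53.5012, "lon": 8.1003},
    {"name": "B", "fuel": "wind", "lat": 53.5031, "lon": 8.0991},
    {"name": "C", "fuel": "coal", "lat": 51.5360, "lon": 14.3540},
]
```

Here the two turbines fall into the same group, so this quasi-identifier is too coarse to tell them apart; keeping more decimals or adding another attribute such as capacity would be needed before it could stand in for an ID.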

For OSM data, the IDs for power plants may not actually change that much. Wind turbines and even power plants are often represented as nodes. IDs may be lost if someone re-maps a power plant in terms of its buildings and site layout, but as long as you have the coordinates of the item with the missing ID, you can go and inspect what happened to make it disappear.
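
A minimal sketch of that workflow follows, under two assumptions. First, as far as I'm aware OSM numbers nodes, ways and relations independently, so the stored reference keeps the object type next to the numeric ID. Second, a search radius of 0.5 km is an arbitrary illustrative default. The class and function names and the sample coordinates are hypothetical.

```python
import math
from typing import NamedTuple


class OsmRef(NamedTuple):
    osm_type: str  # "node", "way" or "relation"
    osm_id: int
    lat: float
    lon: float

    def url(self) -> str:
        """Object page on openstreetmap.org for this type and ID."""
        return f"https://www.openstreetmap.org/{self.osm_type}/{self.osm_id}"


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def candidates_near(ref, elements, radius_km=0.5):
    """Current elements within radius_km of a stored reference, nearest first."""
    scored = [(haversine_km(ref.lat, ref.lon, e.lat, e.lon), e) for e in elements]
    return [e for dist, e in sorted(scored, key=lambda pair: pair[0]) if dist <= radius_km]


# Hypothetical case: node 1 has disappeared; a way now sits at almost the same spot.
lost = OsmRef("node", 1, 53.5, 8.1)
current = [OsmRef("way", 2, 53.5005, 8.1005), OsmRef("node", 3, 54.0, 9.0)]
```

Calling `candidates_near(lost, current)` returns only the nearby way, which is then the object to inspect by hand.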

The number of occurrences of various fuel types can be found via:

./osmfilter GermanyPowerGenerators.osm --out-count=power_source
./osmfilter GermanyPowerGenerators.osm --out-count=generator:source
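
For a rough cross-check without osmfilter, the same kind of count can be sketched in Python with the standard library. This is not a reimplementation of osmfilter and its output format differs; it only assumes the usual OSM XML layout, in which node, way and relation elements carry tag children with k and v attributes. The inline sample is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tag_values(root, key):
    """Count the values of one tag key over all nodes, ways and relations."""
    counts = Counter()
    for element in root:
        if element.tag not in ("node", "way", "relation"):
            continue
        for tag in element.findall("tag"):
            if tag.get("k") == key:
                counts[tag.get("v")] += 1
    return counts


# Made-up sample in OSM XML layout.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="53.5" lon="8.1">
    <tag k="power" v="generator"/>
    <tag k="generator:source" v="wind"/>
  </node>
  <node id="2" lat="53.6" lon="8.2">
    <tag k="power" v="generator"/>
    <tag k="generator:source" v="wind"/>
  </node>
  <way id="3">
    <nd ref="1"/>
    <tag k="power" v="generator"/>
    <tag k="generator:source" v="solar"/>
  </way>
</osm>"""

sample_counts = count_tag_values(ET.fromstring(SAMPLE), "generator:source")
```

For a real extract, `ET.parse("GermanyPowerGenerators.osm").getroot()` would supply the root element, although a country-sized file may be too large to load comfortably this way.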

EU-ETS

List of EU-ETS data entries with optional columns featuring Enipedia links for current matches.

PREFIX euets: <>
select * where {
  # EU ETS installations located in Germany
  ?x euets:countryCode "DE" .
  OPTIONAL{?x euets:city ?city }.
  ?x euets:latitude ?lat .
  ?x euets:longitude ?lon .
  ?x rdfs:label ?name .
  ?x euets:euetsID ?euetsID .
  OPTIONAL{?x euets:account ?account .
    ?account euets:AccountHolder ?accountHolder }.
  # Enipedia power plants in Germany that carry an EU ETS ID
  ?eni_x prop:Country a:Germany .
  ?eni_x prop:EU_ETS_ID ?ID .
  ?eni_x prop:Ownercompany ?owner .
  # Join the two on the identifier, compared as a string
  FILTER(?euetsID = str(?ID)) .
} ORDER BY ?city

See results here

Of the 1,955 distinct identifiers in the EU ETS, 168 are linked to entries in Enipedia. The total number of German power plants in Enipedia linked to EU ETS entries is 174, which indicates that there are either improper matches, or that some entries on Enipedia may be for multiple generation units of the same site.

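The entries that share an EU ETS ID and need to be checked are listed below. Six IDs are each used by two Enipedia entries, which accounts for the difference of six between 174 and 168. Each pair appears twice, once in each order.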

eni_1                               eni_2                               ID
Schwarze Pumpe Oxy Powerplant       Schwarze Pumpe Powerplant           1984
Schwarze Pumpe Powerplant           Schwarze Pumpe Oxy Powerplant       1984
Marl Veba Powerplant                Werk Marl Powerplant                1997
Werk Marl Powerplant                Marl Veba Powerplant                1997
Heinrich-fischer-bad Powerplant     Weststadt Powerplant                2666
Weststadt Powerplant                Heinrich-fischer-bad Powerplant     2666
Mhkw Darmstadt Powerplant           Bhkw Darmstadt Powerplant           2678
Bhkw Darmstadt Powerplant           Mhkw Darmstadt Powerplant           2678
Erfurt-ost Powerplant               Mva Erfurt-ost Powerplant           4217
Mva Erfurt-ost Powerplant           Erfurt-ost Powerplant               4217
Neumunster Powerplant               Wittorfer Feld Landfill Powerplant  4221
Wittorfer Feld Landfill Powerplant  Neumunster Powerplant               4221
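
Pairs like these can also be found outside of SPARQL by grouping page names by identifier. The sketch below is one way to do that in Python, not the query that produced the table. It uses two of the pairs listed above, plus one placeholder entry that is hypothetical.

```python
from collections import defaultdict


def shared_ids(links):
    """Map each EU ETS ID used by more than one page to the sorted page names."""
    pages_by_id = defaultdict(set)
    for page, ets_id in links:
        pages_by_id[ets_id].add(page)
    return {ets_id: sorted(pages) for ets_id, pages in pages_by_id.items() if len(pages) > 1}


links = [
    ("Schwarze Pumpe Powerplant", "1984"),
    ("Schwarze Pumpe Oxy Powerplant", "1984"),
    ("Marl Veba Powerplant", "1997"),
    ("Werk Marl Powerplant", "1997"),
    ("Example Powerplant", "0000"),  # hypothetical entry with a unique ID
]

duplicates = shared_ids(links)
```

Because the pages are collected in a set per ID, each pair is reported once rather than in both orders.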

The EU-ETS Data on Enipedia needs to be updated to match the latest data from the CITL. Aside from there being newer data available, the current data in use on Enipedia has issues, such as the installation name being the same as the installation identifier. An initial survey of the latest data shows that this has been fixed, at least for some installations (current version vs. Enipedia version).

This data should be compared to the EU ETS data in Excel to see whether the spreadsheets contain any useful information that is not available in the XML files. The spreadsheets all have slightly different formats and various inconsistencies. In the Italian data, several Installation IDs are "TBD (account not open)", although these entries do have a Permit ID. The Belgian data has Installation IDs, but they are just the numbers 3 through 741; Permit IDs are available there as well. The German data doesn't list Permit IDs.
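
One possible way to cope with these inconsistencies when reading the spreadsheets is to fall back to the Permit ID whenever the Installation ID is missing or a placeholder. The sketch below assumes that placeholders always start with "TBD", which is based only on the Italian example above. The function name and the sample Permit ID are hypothetical.

```python
PLACEHOLDER_PREFIX = "TBD"  # assumption: placeholders look like "TBD (account not open)"


def usable_identifier(installation_id, permit_id):
    """Return (kind, value) for the best available identifier, or None if there is none."""
    installation = (installation_id or "").strip()
    permit = (permit_id or "").strip()
    if installation and not installation.upper().startswith(PLACEHOLDER_PREFIX):
        return ("installation", installation)
    if permit:
        return ("permit", permit)
    return None
```

For example, `usable_identifier("TBD (account not open)", "PERMIT-001")` yields the permit, while a German row without a Permit ID keeps its Installation ID. Returning the kind alongside the value keeps the two ID spaces from being mixed up later.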
