Enipedia Power Plant Dataset Reconciliation API


Note: This work has been superseded by that of Elasticsearch on Enipedia


Background

Energy and Industry Data Sets contains a list of a large number of data sets that it would be nice to integrate into Enipedia. In order to do this, you need to figure out which entry on Enipedia corresponds to an entry in those data sets. This isn't always straightforward, as there is no standard naming convention for power plants, and people may call the same power plant by different names.

One could match these by hand, but this quickly becomes an overwhelming task. It is desirable to have an automated process that can suggest likely matches and then rely on human input for verification. This page documents several of the approaches that we have looked into in order to deal with this issue. One approach is to use the Reconciliation API defined by Open Refine (formerly Google Refine). The page Power plant dataset reconciliation example with Google Refine shows a detailed example of how this can be done. While this works to an extent, an issue is that multiple types of data fields are often needed in order to do accurate matching, and the importance of different fields may depend on the data sets being used. The Reconciliation API defined by Open Refine works with a single score and doesn't have the flexibility to return a multitude of scores, such as how well the company names match, how close possible matches are geographically, etc.

Google Refine

We've developed a Reconciliation service that works with Google Refine and is based on these specifications. The URL is http://enipedia.tudelft.nl/matching/reconcile.php, and you need to specify the Country column in order for the matching to work.

The more data that is available, the better the matches will be. You can also specify which columns correspond to which properties by filling out the "As Property" field with Owner, State, Latitude, Longitude, and/or Point (latitude and longitude separated by a comma).

The name that is returned is a combination of the power plant name, the owner, and the city. This is done to help the user disambiguate matches.

Once you have created a project in Google Refine, you can reconcile the data using the steps outlined in the pictures below. We've placed more detailed instructions at Power plant dataset reconciliation example with Google Refine.

Click on the column to match on, then "Reconcile" → "Start reconciling..."
"Add Standard Service...", then fill in http://enipedia.tudelft.nl/matching/reconcile.php and click "Add Service". Check the column name corresponding to the country, type in "Country" in the "As Property" field, and click "Start Reconciling"
Example results for a single entry

Implementation

There are two ways to use the API. The user-friendly approach is via Google Refine, which provides a visual web interface through which you can upload spreadsheets. The more expert approach is to send requests to the API directly, which gives more flexibility in terms of how you process the data.

The backend is based on the R script here.

Sending POST requests directly (advanced)

The API can also be accessed without Google Refine, by using a tool such as curl to send POST requests directly to the API and getting JSON results back. This isn't a very user-friendly approach, but it is more of an expert mode, as it gives greater flexibility in terms of how you send queries and process the results. See Accessing Enipedia Power Plant Dataset Reconciliation API via POST requests for more details.
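As a rough sketch, such a request body can be built following the Open Refine reconciliation conventions. The "q0"/"query"/"properties" structure below follows the general Reconciliation API format, and the "Country" property name comes from this page, but the exact field names that reconcile.php expects are assumptions; see the detailed page mentioned above before relying on them.

```python
# Hypothetical sketch: build a batch reconciliation query as JSON.
# The structure follows the Open Refine Reconciliation API conventions;
# the exact field names reconcile.php expects are assumptions.
import json

def build_queries(entries):
    """entries: list of (plant_name, country) tuples."""
    return {
        "q%d" % i: {
            "query": name,
            "properties": [{"pid": "Country", "v": country}],
        }
        for i, (name, country) in enumerate(entries)
    }

payload = json.dumps(build_queries([("Amercentrale", "Netherlands")]))
print(payload)
# The payload string would then be sent as the "queries" POST parameter:
#   curl --data-urlencode "queries=<payload>" \
#        http://enipedia.tudelft.nl/matching/reconcile.php
```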

Matching Algorithm

The code currently looks at the Jaro-Winkler distance, the Levenshtein distance, and the Jaccard index for strings. These values are used to define a three-dimensional vector, and the final score for a match is the Euclidean distance from the end of this vector to the origin. The idea is that "good" matches should have high scores for several different types of matching techniques.
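A minimal sketch of this scoring idea, with each similarity measure scaled to [0, 1]; note that difflib's SequenceMatcher ratio is used here as a stand-in for the Jaro-Winkler measure computed by the R backend:

```python
# Sketch of the combined-score idea: three similarity measures define a
# vector, and the final score is that vector's Euclidean length.
import difflib
import math

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def levenshtein_similarity(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

def jaccard(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def combined_score(a, b):
    scores = [
        difflib.SequenceMatcher(None, a, b).ratio(),  # Jaro-Winkler stand-in
        levenshtein_similarity(a, b),
        jaccard(a, b),
    ]
    return math.sqrt(sum(s * s for s in scores))

print(round(combined_score("wilton power station", "wilton power stations"), 3))
```

A perfect match scores sqrt(3) ≈ 1.73, since all three measures reach 1.0 at once.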

Known Issues / Areas for Improvement

  • Country must be specified in the JSON query string. This is done to speed up the matching code and reduce the query time for the data on Enipedia.
    • TODO - The code needs to be improved to give a better response when the country does not exist or cannot be matched directly. Google Refine seems to send out batches of several entries to match, and if the country for one of them cannot be found, this seems to generate an error for the whole set.
  • This presentation on Record Linkage: Theory and Practice gives a good overview of different approaches that can be used to improve the code.
  • Allow for matching even if owner or powerplant name are blank. This occurs in some of the German data.
  • Allow for matching on Capacity.
  • It would be nice if scores for different types of matches could be returned. This would allow you to do things like first filter out all the good matches based on company name (rejecting power plants for all the other companies). One difficulty with this is that you would then have to create a new row for each possible match, which may not be possible.
    • We can also employ blocking to narrow down the search space. If there is a good match for the owner name, then only the power plants owned by that company need to be considered for matching. To increase the accuracy of this, it would help to also know about the ownership structure of companies and be able to perform matches on both multinational companies and their subsidiaries. A first step for that would be to also use company names for linked entities from the Carma v3 dataset, allowing for the use of both subsidiaries (from v2) and international groups (from v3) for matching. A very good example is GDF-Suez
      • All the groups and their possible subsidiaries can be found via this query. Most of them seem to make sense, although there are a few ownership changes that are visible. For example, one of the results shows that Essent N.V. is a subsidiary of Akzo Nobel Nv, which is not the case. This particular case turns out to be a joint partnership between them under another name. Looking at pairs that co-occur frequently does seem to be a more reliable indicator of ownership structure.
  • In principle, the API could also be called from the code used by Enipedia Maps. Allowing for a visual interface may make matching a bit easier. This can also be tied into some of the work with scraperwiki.
  • When using Google Refine, the links to the "view" and "preview" services are not implemented yet. These should be set up to give some view of all the data for a power plant in addition to a map.
  • The default number of matches returned is 5. The code could be modified to allow this to be specified in the query request.
  • For the returned matches, "match" is always set to false to avoid hiding obscure but possibly relevant weak matches.
  • The code retrieves data from Enipedia via our SPARQL Endpoint. This is behind a caching proxy server, which means that you may get old results after updating the data on Enipedia. We need to figure out a reasonable strategy for clearing out the old results while also not hitting the server with the same queries all the time.
  • The speed may be improved through parallel processing and also pre-excluding possible matches outside a certain geographical distance (if coordinates are known).
  • The matching algorithm should be developed further to be both probabilistic and rule-based. A particular issue is that the company listed for a power plant may be the subsidiary of another company. Having information on the relationships between companies would allow us to have more data to create more certain matches.
  • It would also be useful to incorporate information theory approaches, such as considering the self-information and mutual information of terms. The main idea is that you don't want to match on very general terms such as "Powerplant" or "kraftwerk", since these don't provide much additional information; it is better to match on terms that are relatively unique in both of the data sets.
  • We need to work on the process of integrating the matched data into Enipedia. To do this, we need the following:
    • Add references of data to allow for checking at a later period
    • Check if the name of a company is contained in the name of the power plant.
    • Distinguish between entries describing a power plant (i.e. a physical site) vs. power generation units, several of which may co-exist on the same site. For many of the UK power stations, it's common to see multiple power plants together (each with multiple generation units) that share the same name, but have the suffixes mentioned below. A few patterns to check for:
      • identifier for power generation unit: single letter " [A-Z]$", number "[0-9]$", roman numerals " [IVX]+$"
      • type of power plant: " GT" (gas turbine), " CCGT" (combined cycle), " OCGT" (open cycle), " CHP" (combined heat and power)
      • building stage: " Extension", " Rebuild", "Compensation Set"
      • geographical location: " (South|Central|North|East|West)$"
    • Add new entries for power plants not on Enipedia. Also set up system to check for duplicates in existing data.
    • Need standard terms for column names that correspond to fields in the Enipedia data structure.
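The unit-suffix patterns listed in the bullet points above could be checked along these lines (the plant names in the examples are illustrative, and the pattern order is a design choice, e.g. a trailing digit is stripped before the plant-type abbreviations are tested):

```python
# Sketch of applying the unit-suffix patterns from the list above to
# split a plant name into a base name and a generation-unit suffix.
import re

SUFFIX_PATTERNS = [
    r" [A-Z]$",                                 # single-letter unit identifier
    r" ?[0-9]+$",                               # numeric unit identifier
    r" [IVX]+$",                                # roman-numeral unit identifier
    r" (GT|CCGT|OCGT|CHP)$",                    # type of power plant
    r" (Extension|Rebuild|Compensation Set)$",  # building stage
    r" (South|Central|North|East|West)$",       # geographical location
]

def strip_unit_suffix(name):
    """Split a plant name into (base name, unit suffix or None)."""
    for pattern in SUFFIX_PATTERNS:
        m = re.search(pattern, name)
        if m:
            return name[:m.start()], m.group().strip()
    return name, None

print(strip_unit_suffix("Wilton GT2"))    # → ('Wilton GT', '2')
print(strip_unit_suffix("Ironbridge B"))  # → ('Ironbridge', 'B')
```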

Alternative Approaches

The experience so far has shown that string matching, filtering by distance, etc. isn't enough, and essentially some sort of expert system needs to be created in order to automate the process of alignment. The reconciliation feature of Google Refine runs into limitations due to the nature of the data that we're trying to match.

  • The data sets are not always talking about the same exact entity - there can be different levels of aggregation and overlap. The example of Wilton Power Station below shows this quite well. For example, the EU ETS data probably doesn't include Wilton 10 which burns biomass.
  • We need to be able to check for conflicts with existing data - It would be great to be able to just dump in the data to Enipedia, but we need to be able to check if there is existing data that conflicts, and if so, figure out why it conflicts.
  • Data in multiple columns needs to be aligned - Being able to align the company name to the entries on Enipedia would be useful to increase the efficiency of the matching process. All facilities which are mentioned in two data sets and are owned by the same company should be able to be matched (also checking for subsidiaries and joint partnerships).

It would also be useful to make links to indicate if entities are related in some way. This wouldn't necessarily indicate a one-to-one mapping between them, but could indicate overlap or nestedness. This would at least organize things enough to allow for later disambiguation.

The current code is in R, and isn't optimized to allow people to retrieve possible matches very quickly (i.e. milliseconds versus tens of seconds or more). It would be nice to use something more scalable that is being developed by a larger community to address the same types of issues. There are a few projects that may be promising:

  • Apache Stanbol seems promising, especially the Enhancement Engines and how it uses Natural Language Processing, disambiguation, etc. Initial tests on text containing "Navajo Generating Station" returned mixed results depending on which enhancement engine was selected or had priority. In general, it's interesting for its ability to recognize entities in plain text. There's also some interesting work on using Solr's More Like This function for disambiguating entities (github).
  • Elasticsearch seems quite interesting. See Elasticsearch on Enipedia for current development notes.
  • Interactive Deduplication using Active Learning - some initial tests on implementing this have been promising. The great thing about this approach is that you just dump in a lot of measurements comparing entities (string distance metrics, percent differences of numbers, etc.), and the decision tree figures out which of these measurements (and their cutoff values) are useful for figuring out what is a match and what isn't. Using active learning as they mention would be powerful, as it would allow for crowdsourcing of this process, letting people vote on whether suggestions are matches or not. In an initial test on the LCPD data, it first comes to the conclusion that only string metrics are needed for linking entities, but once you tell it that, for a single power plant, "gt 1" and "gt 2" aren't the same thing, it decides that fields like heat output and fuel consumption are important as well.
    • This ties into a larger lesson learned from the work on Enipedia. Linked Data is great if you're working with data that is well structured and has unique identifiers that are consistent across multiple data sets. Most energy data sets don't really have this (this has been a motivation for researching Elasticsearch), and the way forward is probably more along the lines of "Loosely Coupled Open Data", where you have clusters of things that are mostly similar. Decision trees are interesting since they give you a somewhat human-readable output that shows the decision pathways used for determining if something is a match or not. This also provides a way to aid the process of relatively quickly generating a training set that can be used to test ideas for algorithm improvements.

One vision for where this could head is something like how LinkedIn asks you to verify if people have certain skills. The idea would be to create some sort of system where people could quickly identify links without too much effort. Once links are in place, queries like this can help direct attention to entries that can be fixed up without much additional effort.

Test Cases / Examples of Difficulties

This is a list of things that we know should match, but which the matching software doesn't match yet. It can be used to build up training sets to evaluate the performance of improvements to the matching software.

Wilton Power Station

The table below illustrates some of the difficulties that occur in matching entities that are described in multiple databases. The leftmost column describes the different entities in the databases, while the rest of the columns show the names that are given to these.

Entity (based on most commonly encountered name) — names used by the data sources: SembCorp Website, UK Department of Energy and Climate Change, EU ETS, E-PRTR, Large Combustion Plant Directive, Carma.org (subset of WEPP), Wikipedia
(entire site, data for all units aggregated together)
Wilton Power Stations
  • SembCorp Utlilities Teesside Power Station

(This is likely without Wilton 10 as it burns biomass. Without the inclusion of the owner name, this could be confused with the other Teesside Power Station)

  • Sembcorp Utilities (uk) Ltd
  • Sembcorp Utilities (uk) Ltd Wilton 10 Power Station
  • Sembcorp Utilities (uk) Ltd, Wilton Power Station

(There's only one entry for a facility named Wilton that is owned by SembCorp. The labeling of this as Wilton 10 is likely wrong as mentioned in the discussion below)

  • Wilton Cogen

(aggregation not clear, are other units included?)

  • Wilton power stations
  • (a.k.a. SembCorp power station)
(Power stations within the site)
Wilton Power Station (main station)
  • Wilton Power Station
  • Wilton Power Station
  • Sembcorp Utilities U.K Ltd Wilton
  • Wilton Power Station
Wilton GT2
  • Wilton GT2
  • Wilton GT2
  • Sembcorp Utilities U.K Ltd Wilton
  • Sembcorp Utilities U.K Ltd Wilton 3
Wilton 10
  • Sembcorp Biomass Power Station
  • Wilton 10
  • Wilton 10
  • Sembcorp Utilities U.K Ltd Wilton
  • Sembcorp Utilities U.K Ltd Wilton 2
  • Wilton 10
Wilton 11 (planned)
  • Wilton 11 (planned)

The UK Department of Energy and Climate Change lists this as consisting of Wilton Power Station, Wilton GT, and Wilton 10.

The owner website mentions Wilton Power Station, Sembcorp Biomass Power Station ("Wilton 10"), Gas Turbine (GT1) and Gas Turbine (GT2). GT1 seems to be a part of the main power station.

Wikipedia mentions that this is a collection of several power stations, and is also known as SembCorp power station, but is not the same as Teesside Power Station.

Finding this in the EU-ETS data is not straightforward from a string matching perspective. Searching for "Wilton" returns an entry for "Wilton Olefins 6 (Cracker)", which is not it. Searching for Sembcorp returns an entry for "SembCorp Utlilities Teesside Power Station", which is likely it. The EU-ETS also contains an entry for the Teesside Power Station.

The E-PRTR entries are confusing as well. Running a query shows that there is only one facility in Wilton owned by Sembcorp, which is listed with three different names: "Sembcorp Utilities (uk) Ltd", "Sembcorp Utilities (uk) Ltd, Wilton 10 Power Station", and "Sembcorp Utilities (uk) Ltd, Wilton Power Station". The fact that Wilton 10 is mentioned for this one facility is strange, since the main power plant at the site should be mentioned as well.

Looking at the entry for this facility on the E-PRTR site and examining the data for previous years shows that the name changes quite frequently. There's no indication that the main Wilton Power Station is shut down, so it seems erroneous to label this as Wilton 10.

Further evidence that this labeling is wrong is that the EU ETS data lists CO2 emissions for 2010 as 790,657 tons (so only CO2 from fossil fuels), while the E-PRTR estimate for the same year is nearly the same at 791,000 tons. The fact that these numbers match so well is a bit strange, as the E-PRTR should include CO2 from biomass, while the EU-ETS should exclude it. This assumption is backed up by the notice here that the CO2 emissions for EPER and E-PRTR are reported differently (EPER does exclude emissions from biomass). When searching for pollutant releases in the area, it's possible to select "Carbon Dioxide CO2 Excluding Biomass", but this turns up no facilities in the area. Searching for "Carbon Dioxide CO2" returns Wilton Power Station, which at least implies that the recorded CO2 emissions include those from biomass.

The LCP Directive data contains this:

Year | Plant number | Plant name | Plant location | MWth | Biomass (TJ) | Other solid fuels (TJ) | Liquid fuels (TJ) | Natural gas (TJ) | Other gases (TJ) | SO2 (t) | NOx (t) | Dust (t)
2007 | 94 | Wilton | SembCorp Utilities, Wilton P Stn | | 418.598 | 5576.306 | 143.159 | 606.001 | 0 | 5303.3 | 3446.1 | 129.4
2008 | 60 | Sembcorp Utilities U.K Ltd Wilton | England | 714 | 0 | 8302.55 | 10.214 | 1161.335 | 0 | 2570 | 1456.2 | 211.7
2008 | 204 | Sembcorp Utilities U.K Ltd Wilton | England | 100 | 2139 | 0 | 0 | 88.492 | 0 | 1.1 | 135.1 | 4.6
2008 | 205 | Sembcorp Utilities U.K Ltd Wilton | England | 100 | 0 | 0 | 0 | 50.672 | 0 | 0.01 | 0.1 | 0.02
2009 | 74 | Sembcorp Utilities U.K Ltd Wilton | England | 714 | 0 | 4246.646 | 3.562 | 5647.128 | 0 | 1164.2495795525 | 992.0418719788 | 87.9914050096
2009 | 268 | Sembcorp Utilities U.K Ltd Wilton 2 | England | 100 | 2669 | 0 | 0 | 100.07 | 0 | 2.01 | 193.6 | 1.07
2009 | 269 | Sembcorp Utilities U.K Ltd Wilton 3 | England | 100 | 0 | 0 | 0 | 204.3 | 0 | 0.17 | 0.17 | 0.29

This is quite horrible, as the plant number doesn't seem to be a unique identifier; it changes every year. The name may change as well. The entry for Wilton is likely the main plant. Wilton 2 and 3 are not entirely clear. Just based on the fuel type, it seems that Wilton 2 is Wilton 10, due to the use of biomass. Wilton 3 may be GT2. As a quick calculation, with 204.3 TJ ≈ 56,750 MWh, and assuming 100 MWth in addition to 44 MWe, this results in roughly 394 hours of operation per year at full capacity.
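The quick calculation above can be checked directly (the 100 MWth and 44 MWe figures are the assumptions stated in the text):

```python
# Back-of-the-envelope check of the Wilton 3 operating-hours estimate.
tj_gas = 204.3               # natural gas input in 2009, TJ (from the table)
mwh = tj_gas * 1e6 / 3600    # 1 MWh = 3.6 GJ, so 204.3 TJ = 56,750 MWh
capacity_mw = 100 + 44       # assumed 100 MWth plus 44 MWe
hours = mwh / capacity_mw
print(round(mwh), round(hours))  # → 56750 394
```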

GDF Suez power plants

  • (passed) RÖMERBRÜCKE to Romerbrucke Powerplant - The PHP code passes a JSON string to R, and something goes wrong with the encoding in the process. Trying to keep everything in UTF-8 doesn't seem to work, and PHP thinks that it's looking at ASCII.
  • Awirs 5 to Les Awirs Powerplant - Looking at the self-information of terms may help with this. The other candidate matches don't contain 'Awirs', which likely occurs relatively rarely in the data

Belgian power plants

Base data was taken from the Elia Production Park, supplemented with addresses from network access points and some entries added from the corporate web sites of Electrabel and EDF Luminus.

  • A first reconciliation was attempted by matching ARP to ownercompany in addition to plant name. This proved to give too much weight to companies with respect to names, especially since company structures and responsibilities may be complex to figure out. Some examples:
    • Beveren 2 Indaver with Electrabel did not suggest Beveren Wte Powerplant with Indaver Nv as owner but only other Electrabel plants
      • This instance is a good test for how to weight or interpret string matching scores. When using a soup approach of matching a list containing the plant name and the owner name, the correct match has the highest Jaccard index, although it is tied with another entity. The Jaro-Winkler and Levenshtein measures don't return this as the top candidate, but as a close second. Here a human mind would probably make unconscious use of information entropy, considering Indaver more identifying than Electrabel (especially here, where we know ARP doesn't mean owner). In any case, there will always be a need for human review; what matters is that the reviewer is provided with the complete picture, not a partial one missing some good candidates.
        • This test is also a good challenge in terms of getting the right candidates to the humans in the first place. While the correct match is tied with the highest value for the Jaccard index, it's ranked 100th for the Levenshtein distance and 165th for the Jaro-Winkler distance, since with a soup approach it's matching a lot of the Electrabel plants. This raises a few questions and observations:
          • Is Google Refine the right tool for this? It only shows the three highest matches, so for this particular case, it would be easy to not return the correct match depending on how the candidates are ranked. If a user is free to examine several different lists of candidates, each generated using a different method, then this would likely address some of these situations. Also, we're not always doing one-to-one matching, and sometimes matching involves disambiguating clusters of things that are related, where we have to add information in order to be able to distinguish them. For us, setting up an API isn't that difficult, and it's also possible to create standalone code based on the code here, but what are the types of interfaces that we should consider having the API talk to?
          • Currently the Levenshtein and Jaro-Winkler distances are calculated using strings composed of multiple tokens, and it might be better to use these to compare the similarity of the individual tokens contained within those strings. A weakness of the current Jaccard index implementation is that it doesn't handle misspellings or similar forms. Implementing calculations related to information entropy could indeed also help with ranking results.
    • Froidchapelle Wind with Electrabel did not suggest Froidchapelle Powerplant (which had no owner in Enipedia), but only plants from Electrabel, even if their names were very different, such as Rodenhuize Wind, Schelle and Ixelles
    • Langerbrugge Stora ST 1 with Electrabel did not suggest Langerbrugge Mill Powerplant with owner Stora Enso Iyj, but Electrabel's Langerbrugge (normal) as well as others as far away as Zeebrugge Distrigas Powerplant. This could easily have resulted in a false match, since it was hiding the better of two good candidates.
    • Plate Taille T with Electrabel did not suggest La Plate Taille Powerplant with owner "Ministere Wallon Lequip (Met)" but things like Tielt Balta Powerplant
    • Wilrijk Isvag with Electrabel did not find Wilrijk Powerplant with owner Isvag but instead things like Saint-ghislain Powerplant or Zwijndrecht Van Hauteghem Powerplant
  • A second attempt was made with plant name only, which gave better results for the aforementioned cases but showed weaknesses whenever names do not start the same way.
  • A third reconciliation was attempted adding coordinates resulting from address geocoding but it didn't seem to add any benefit

In all those cases the old Probabilistic Instance Matching with its soup approach gives far better results.

These examples are being used to create a collection of test cases with verified matching entities, and the code is being rewritten to support easier experimentation with different strategies, with measurements generated indicating the matching accuracy. Returning to the soup approach is probably a good intermediate step, with some improvements being the inclusion of stop words to remove terms that are not useful in the matching process.

Many of the stop words can be identified by examining the information entropy of tokens used for power plants and owners. Tokens with low information content (or more simply, tokens that are just used more frequently) are in some cases common terms that don't provide much additional information and may bias the matching process. For example, many Dutch companies have their legal status abbreviated as N.V., B.V., etc., and the presence or absence of these terms isn't as important as matching on the name of the company itself. The flip side of this approach is that it also shows common terms that may give clues about what sort of power plant we're looking at. See Power Plant Terms for a list of what has been compiled using this technique. With (quite?) a bit more work, this could eventually be incorporated into some sort of Expert System.

The drawback of stop words is their finite number and the fact that they are language dependent. This is quite clear from looking at Power Plant Terms: the types of companies depend on both language and country, and if we include the types of industries hosting a CHP, the list can grow with virtually every new dataset. Information entropy theory seemed more promising, since it can adapt to almost any language and dynamically stretches with any new content, but it may also be complex to tune. For example, the name of a large company owning many power plants can be seen as low-information, but is still useful in distinguishing facilities in the same city. On the opposite side, typical stop words such as articles ("Les", "L'" in French) will be less common than in a regular textual corpus.

Matching Tests Involving Calculation of Self Information

So far, an initial test incorporating calculations of the self-information of shared tokens has been much more successful, with the correct match chosen in all but one case for the Belgian plants listed above. The approach creates a set of tokens containing the plant name and the company, and first identifies how many of those tokens it has in common with a candidate from the Enipedia data. The self-information of these shared tokens is then summed, and it is assumed that the candidate match with the highest value indicates the correct match. The only case that failed was for "electrabel froidchapelle wind", where the correct match was ranked 2nd ("electrabel rodenhuize wind" was first, followed by "froidchapelle").

When creating plots of the sorted values of the self information of shared tokens, it's quite clear that the best candidate is ranked much higher than the rest of the candidates. A drawback of the current implementation is that it's not set up yet to handle misspellings or alternative forms of the same token.
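A minimal sketch of this scoring approach, using a small invented corpus (the names are loosely based on the Belgian examples above, but the token frequencies are made up for illustration). Even this toy version reproduces the failure case mentioned above: the two shared common-ish tokens "electrabel" and "wind" outweigh the single rare token "froidchapelle":

```python
# Toy sketch of summing the self-information of shared tokens.
# The corpus below is invented for illustration, not real Enipedia data.
import math
from collections import Counter

corpus = [
    "electrabel schelle powerplant",
    "electrabel rodenhuize wind",
    "electrabel ixelles powerplant",
    "froidchapelle powerplant",
    "indaver beveren wte powerplant",
]
counts = Counter(t for name in corpus for t in name.split())
total = sum(counts.values())

def self_information(token):
    # rarer tokens carry more information: I(t) = -log2 p(t)
    return -math.log2(counts[token] / total)

def match_score(query, candidate):
    shared = set(query.split()) & set(candidate.split())
    return sum(self_information(t) for t in shared)

query = "electrabel froidchapelle wind"
ranked = sorted(corpus, key=lambda c: match_score(query, c), reverse=True)
print(ranked[0], "|", ranked[1])
# → electrabel rodenhuize wind | froidchapelle powerplant
```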

See Initial Test Results Using Self Information Calculations for Entity Matching for output of the work in progress. In short, the algorithm still needs a bit of tuning, but some of the results seem promising compared to the previous techniques we have used.

The current development of the code on github is focusing on matching two complete data sets against each other (with blocking by country), instead of the approach with Google Refine where a single entity from one data set is matched against all the entities from another data set (although it could be adapted to that use case as well). A giant matrix is generated with the matching scores between entities in different data sets. This is currently being tested with help from researchers interested in aligning E-PRTR data to the EU ETS for Spain (see here for output of matching code). The (possible) advantage of this approach is that it gives us more information to work with that can indicate how good the matches may be. For example, if we find all the matching candidates across two data sets, we can find out if two entities are each other's best matching candidates. If the candidates rank each other differently, then this shows that there's some uncertainty and that we will have to disambiguate things.
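The mutual-best-candidate check can be sketched as follows, with a small invented score matrix standing in for the giant matrix of matching scores:

```python
# Sketch of the "mutual best candidate" check: pairs where each entity
# ranks the other first can be treated as confident matches.
def mutual_best_matches(scores):
    """scores[i][j]: similarity of A[i] vs B[j]; returns mutual pairs."""
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)                  # A[i]'s best B
        i_back = max(range(len(scores)), key=lambda k: scores[k][j])   # B[j]'s best A
        if i_back == i:
            pairs.append((i, j))
    return pairs

scores = [[0.9, 0.2, 0.1],
          [0.3, 0.1, 0.8],
          [0.4, 0.7, 0.2]]
print(mutual_best_matches(scores))  # → [(0, 0), (1, 2), (2, 1)]
```

Entries of either data set that appear in no mutual pair are exactly the ones that need disambiguation.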

This approach could be developed further to allow for first investigating the best candidates and then using some other techniques (fuzzy string matching?) to work through the rest. The general concept could be to prioritize taking care of the easy matches, and then use increasingly desperate/not-always-accurate techniques to try to find matches in the rest. The github code currently allows for approximate string matching of tokens, but it's not known yet how enabling this would influence the matching process.

See Example Code for Using Self Information Calculations for Entity Matching for some of the current in-progress work.

Versions

  • 0.3 (17 December 2012) Implemented matching based on owner name and coordinates.
  • 0.2 (5 June 2012)
    • Using Rserve daemon and the PHP Rserve client (thanks to suggestion by Nono) instead of system call to Rscript that loads the required libraries every time the API is accessed. The code speedup appears to be decent, and further gains may be had via parallel processing (foreach package?) with string matching and distance calculations with coordinates.
    • Fixed UTF-8 issue with Rserve implementation. Previously R code would report NA with strings such as RÖMERBRÜCKE.
  • 0.1 (31 May 2012)
    • Initial implementation