Energy and Industry Data Sets

From Enipedia

This is a list of energy and industry data sets available on the web. Feel free to edit this list. We are always looking for new sources of information.

Power plants

Feel free to edit this page and add sources that may be useful.

Global resources

  • CARMA has now published its newest data set (v3), with updated data and power plant list, upgraded models with more realistic estimations, and up-to-date company names.
  • Global Energy Observatory has detailed data about the most significant power plants in the world (see scraper here, and compare its coverage versus Enipedia here).
  • IndustCards also has basic data for numerous facilities.
  • IndustryAbout has a large number of power plants in its energy category (see scraper here). Compare its coverage versus Enipedia here.
  • For nuclear plants, the IAEA is the primary source; its data is also hosted at Enipedia here.
  • For thermal plants, TurbineTec has good KML maps.
  • For wind farms, TheWindPower has a fairly comprehensive listing. They sell the data commercially, so we should focus on primary sources (owners/operators) instead. See here for a list of the sources that they use.
  • See the list of top global energy companies here.
  • The Clean Development Mechanism Registry contains information about many of the new power stations that are being developed outside the US & EU.
  • Look into data sets formerly hosted by Kasabi. There is one for renewable energy generators and there may be a few more interesting ones which we may be able to host as well.
  • Wikipedia has coverage of a large number of power plants (several thousand). This ScraperWiki code finds many of these by traversing the hierarchy of categories starting at http://en.wikipedia.org/wiki/Category:Power_stations_by_country. A clickable map based on this data is here. This only finds power plants that have their own dedicated page, and many more can be found in tables on pages that list all the power plants in a country or region. It's possible to extract the data from these tables, although the headers in use are not standardized, which would result in a large number of columns returned in the resulting data.
    • The ScraperWiki code is interesting as it retrieves the data for over 3000 power plants via a single query. Much of the energy related data on Wikipedia is becoming organized through the use of hierarchies of categories, and this same strategy can be applied to retrieving data for things like oil refineries among others.
      • There seems to be a glitch in the way DBPedia handles categories: some are returned with the colon after Category encoded as %3A and thus do not match any plant. This causes some prominent power plants, such as these, to be missing from the scraper's datastore. See the affected categories here.
        • I've just posted this issue on the dbpedia-discussion list. Searching through the discussion archives didn't turn up anything, and this doesn't seem to be a known issue (yet).
      • On a more systemic level, DBPedia is very handy for querying, but it relies entirely on the English Wikipedia for its nomenclature, while versions in other languages often contain much more information on their local power plants. Compare, for example, Swiss hydro with its German counterpart. There are localized versions of DBPedia too, but they are still less mature.
        • This ScraperWiki blog post talks about some of the ways to parse templates on Wikipedia. Overall, this is a bit of a painful solution compared to the ability to run a SPARQL query, but it could perhaps be combined with an API call to "what links here" for each of the different language versions of some sort of power plant template. API calls could then be used to list which of a set of pages have been changed. Another option would be to maintain on Wikipedia a list of links to all the power plant pages. The functionality enabled by Special:RecentChangesLinked can be used to show which pages have changed recently, as demonstrated here, using a list such as the one shown here.
    • The work with DBPedia described above is being superseded by a new scraper which traverses the Wikipedia category hierarchy here. The scraper doesn't process the template data, and as such doesn't contain as much diverse data, but it is designed to locate all the power plants with their coordinates on all the different language versions of Wikipedia. The strategies employed are:
      • The hierarchy of categories in English is used as a backbone in the traversal since that makes it easier to ensure that the relevant subcategories are being explored. Relevant categories are determined by looking for specific key phrases such as "power plant", "wind farms", etc. Without this, it's possible that it would start exploring subcategories that have nothing to do with power plants. Other languages could be included if we know which terms to look for.
      • For every category, and every page in a category, the corresponding page or category is found in the different language versions of Wikipedia.
      • For all pages in every language, coordinates are retrieved via an API call.
  • OpenStreetMap also has coverage for a large number of power plants. This ScraperWiki code finds power generators (over 100,000 by now) by leveraging the Overpass API. In OSM, "power generators" range from a large nuclear plant down to a single wind turbine or a private home with only a couple of PV panels, but this can still be very useful, especially for small to medium wind farms that are not always well covered elsewhere but can be easily spotted on recent imagery by OSM volunteers.
    • As something that could augment the ScraperWiki code, or at least reduce traffic on the Overpass API, we've got a copy of all the OpenStreetMap data (i.e. the planet.osm file), which is updated daily. After each update, all the power-related data is filtered out into a separate osm file (available here as a zip file). We haven't had time (yet) to think further about the next steps, but we can certainly make this available and develop some scripts to make the data accessible in different formats. See Extracting Power Data from OpenStreetMap for notes and ideas.
  • Wikimapia has quite good coverage, and has an API available.
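The category traversal strategy described above for Wikipedia (follow only subcategories whose names contain a relevant key phrase) can be sketched roughly as follows. This is a minimal illustration, not the actual scraper: the function names, the extra key phrase "power station", and the in-memory sample graph with made-up page names are all assumptions, and a real version would replace get_members with calls to the MediaWiki API.

```python
from collections import deque

# Key phrases used to decide whether a subcategory is worth exploring.
# "power plant" and "wind farm" come from the text; "power station" is an assumption.
KEY_PHRASES = ("power plant", "power station", "wind farm")


def is_relevant(category, phrases=KEY_PHRASES):
    """Return True if the category name contains any of the key phrases."""
    name = category.lower()
    return any(phrase in name for phrase in phrases)


def traverse(root, get_members, phrases=KEY_PHRASES):
    """Breadth-first walk of a category hierarchy starting at root.

    get_members(category) must return a (subcategories, pages) pair.
    Only subcategories that pass is_relevant() are explored further.
    """
    seen = {root}
    queue = deque([root])
    pages = set()
    while queue:
        category = queue.popleft()
        subcategories, category_pages = get_members(category)
        pages.update(category_pages)
        for sub in subcategories:
            if sub not in seen and is_relevant(sub, phrases):
                seen.add(sub)
                queue.append(sub)
    return pages


# Hypothetical in-memory hierarchy standing in for live API calls.
SAMPLE_GRAPH = {
    "Category:Power stations by country": (
        ["Category:Power stations in Switzerland", "Category:Energy ministers"],
        [],
    ),
    "Category:Power stations in Switzerland": (
        ["Category:Wind farms in Switzerland"],
        ["Plant A"],
    ),
    "Category:Wind farms in Switzerland": ([], ["Wind Farm B"]),
    "Category:Energy ministers": ([], ["Some Person"]),
}

found = traverse("Category:Power stations by country", lambda c: SAMPLE_GRAPH[c])
print(sorted(found))
```

In this sample the unrelated "Energy ministers" branch is never visited, which is the point of the key-phrase filter. Looking up the corresponding pages in other languages and retrieving coordinates would be further steps applied to each page in the result.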

Power plants by region

Europe

North America

South America

Asia

Power plants by operator

Transnational power groups:

Also developers:

To do

  • Link to companies on http://OpenEI.org. See here for an example entry.
  • Also look at companies such as ABB that build power systems. See here for info on hydro plants they've worked on (15 pages).
  • Look into the World Bank database on infrastructure projects.
  • The EIA provides information on the total production by country. This should be included as an indicator of completeness.
  • Integrate data from the Large Combustion Plant Directive and the EU-ETS.
  • Connect to the Toxics Release Inventory.
  • European plants need to be linked to their E-PRTR entries.
  • The Renewable Energy Foundation (REF) has a database about renewable energy generators in the United Kingdom, and has the data available via a SPARQL endpoint. We should align our pages with the identifiers they use, link to their pages, and use this to update our own data.
  • Work on eGRID vs. CARMA data for the US. We've tried to match entries from the CARMA dataset with their corresponding entries in the eGRID dataset using a matching technique that compares plant names, owner names, coordinates, emissions, and power outputs. This has led to 3,101 matches for the 9,443 US power plant entries in CARMA, although there are still several thousand power plants from the CARMA dataset without a corresponding eGRID entry, and vice versa. Where there are matches, data from eGRID should be given precedence, since some of the CARMA data is calculated using an estimation technique (see Calculating CARMA: Global Estimation of CO2 Emissions from the Power Sector).
  • Efforts so far have been focused on power plants, and not their owners. These should be fixed up as well and linked to appropriate sources. OpenCorporates.com and DBpedia would allow us to link to unique identifiers.
    • See RWE Group for an initial attempt at organizing the relationship between an owner company and its subsidiaries.
  • We need to refactor the data to more clearly make the distinction between power plants (i.e. site/location) and the multiple power generating units that exist on that site. Currently these two concepts are mixed together. Searching for terms like "Ii" and "-2" will show this in the data. This is needed since we deal with a mix of data at both the site and power generation unit level.
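As a rough sketch of the site/unit refactoring mentioned in the last item, the snippet below splits a trailing unit designator (a Roman numeral such as "Ii", or a number such as "-2") from an entry name and groups entries by site. The class names, the regular expression, and the example names are assumptions made for illustration; real data would need review, since a site name can legitimately end in a token that looks like a numeral.

```python
import re
from dataclasses import dataclass, field

# Trailing unit designator after a space or hyphen: a Roman numeral built from
# I, V, X (any letter case, e.g. "Ii") or a plain number (e.g. "-2").
_UNIT_SUFFIX = re.compile(r"^(?P<site>.+?)[\s-]+(?P<unit>[IVX]+|\d+)$", re.IGNORECASE)
_ROMAN = {"I": 1, "V": 5, "X": 10}


def roman_to_int(text):
    """Convert a Roman numeral made of I, V, X to an integer."""
    total, previous = 0, 0
    for char in reversed(text.upper()):
        value = _ROMAN[char]
        total += -value if value < previous else value
        previous = value
    return total


@dataclass
class Unit:
    number: int
    source_name: str  # the original entry name, kept for traceability


@dataclass
class Site:
    name: str
    units: list = field(default_factory=list)


def split_name(raw):
    """Return (site_name, unit_number); unit_number is None if no suffix is found."""
    name = raw.strip()
    match = _UNIT_SUFFIX.match(name)
    if not match:
        return name, None
    token = match.group("unit")
    number = int(token) if token.isdigit() else roman_to_int(token)
    return match.group("site"), number


def group_by_site(names):
    """Group entry names into Site objects holding their generating units."""
    sites = {}
    for raw in names:
        site_name, number = split_name(raw)
        site = sites.setdefault(site_name, Site(site_name))
        if number is not None:
            site.units.append(Unit(number, raw))
    return sites


# Hypothetical entry names.
sites = group_by_site(["Example I", "Example Ii", "Other Plant"])
print({name: [u.number for u in s.units] for name, s in sites.items()})
```

Keeping source_name on each unit means the original entry can still be traced after the split, which helps when a suffix was misread.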

Bringing this all together

This page lists many of the useful data sets that are out there, and we'd like to bring together information from these into a single data set that gives a comprehensive view of the power industry.

One of the main issues we face is figuring out which entries in a data set link to which article on Enipedia. It's possible to automate this using the reconciliation feature of Google Refine, but this doesn't have a high rate of success with matching. This could be addressed by creating a Reconciliation Service API. The key thing is that we need to match on more than just name alone (e.g. there are often multiple power plants in large cities such as Berlin). We need to be able to match on a mix of the country, company name, capacity, etc. in order to avoid false positives. In other words, we need a sort of "data fingerprint" of a power plant, with enough information that there is a high probability that no other power plant has the same combination of characteristics.
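To make the "data fingerprint" idea more concrete, here is a minimal sketch of scoring a candidate match on several attributes at once rather than on name alone. The field names, the equal weighting, the 10 km distance cutoff, the 0.8 threshold, and the sample records are all assumptions chosen for illustration, not a description of any existing reconciliation service.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Plant:
    name: str
    country: str
    capacity_mw: Optional[float] = None
    lat: Optional[float] = None
    lon: Optional[float] = None


def name_similarity(a, b):
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distance_km(a, b):
    """Great-circle (haversine) distance between two plants in kilometres."""
    radius = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


def match_score(a, b):
    """Average the per-attribute similarities that both records provide."""
    if a.country != b.country:
        return 0.0  # different countries are treated as a hard mismatch
    parts = [name_similarity(a.name, b.name)]
    if a.capacity_mw and b.capacity_mw:
        low, high = sorted((a.capacity_mw, b.capacity_mw))
        parts.append(low / high)
    if None not in (a.lat, a.lon, b.lat, b.lon):
        # Similarity falls linearly to zero at 10 km (an assumed cutoff).
        parts.append(max(0.0, 1.0 - distance_km(a, b) / 10.0))
    return sum(parts) / len(parts)


def best_match(query, candidates, threshold=0.8):
    """Return the highest-scoring candidate, or None if it scores below threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: match_score(query, c))
    return best if match_score(query, best) >= threshold else None


# Hypothetical records: two plants with the same name in the same city.
query = Plant("Example Plant", "DE", 440.0, 52.510, 13.420)
near = Plant("Example Plant", "DE", 444.0, 52.511, 13.421)
far = Plant("Example Plant", "DE", 100.0, 52.450, 13.550)
print(best_match(query, [far, near]) is near)
```

Both candidates have an identical name, so a name-only match could not tell them apart; capacity and location are what separate them here.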

This Scraperwiki view created by Nono shows a great way to compare and align different data sets. The code behind the views that you can find here has some features that are quite powerful and have interesting implications:

  • It sources map layers from multiple mapping services (Google Maps and ItoWorld)
  • The "Edit in Enipedia" link uses a feature of Semantic Forms that allows for values to be set for parameters in a template of a new page.
    • We can improve this further by upgrading the Enipedia software to the latest version of Semantic Forms, as that would allow us to take advantage of the sfautoedit feature. This allows scripts to edit values of template parameters on existing pages without people even having to use the wiki. We can then still use the wiki for revision control of user edits.
      • Using the sfautoedit feature with the marker option draggable set to true allows us to create a user-friendly interface for fixing up coordinates.
  • Queries are run against remote SPARQL endpoints and then used to dynamically create content via the Google Maps API.
    • We need to expose the eGRID, EU-ETS and E-PRTR datasets via SPARQL endpoints as well. The original data publishers don't have anything close to this type of visual interface. We should set up an endpoint using 4store or Virtuoso to better handle a large number of queries. Aligning data sets is a non-trivial long-term issue, and we should facilitate ways of allowing people to view the "original" data in addition to the combined data.
  • The SPARQL queries are run via AJAX calls, which means that the code can be adapted to create calls to our Reconciliation API. This would allow someone to click on an icon from an "outside" data set and then highlight the icons representing possible matches in Enipedia. This can really go a long way towards providing a better interface for aligning data sets. By using sfautoedit we can also automatically add unique identifiers to help keep track of the connections between these data sets.
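The pattern of querying a remote SPARQL endpoint and turning the result into map markers can be sketched as below. The view described above does this in the browser via AJAX; this Python version only illustrates the request and response shapes defined by the SPARQL protocol and the SPARQL JSON results format. The endpoint URL, the ex: vocabulary, and the sample response are hypothetical placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical query; the ex: prefix and its properties are placeholders.
QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?plant ?lat ?lon WHERE {
  ?plant ex:latitude ?lat ;
         ex:longitude ?lon .
} LIMIT 10
"""


def build_query_url(endpoint, query):
    """Build a SPARQL protocol GET URL with the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def fetch_results(endpoint, query):
    """Run the query and return the raw JSON text (needs network access)."""
    request = Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


# Hypothetical response in the SPARQL JSON results format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["plant", "lat", "lon"]},
    "results": {"bindings": [{
        "plant": {"type": "uri", "value": "http://example.org/plant/1"},
        "lat": {"type": "literal", "value": "52.5"},
        "lon": {"type": "literal", "value": "13.4"},
    }]},
})

rows = parse_bindings(SAMPLE_RESPONSE)
markers = [(row["plant"], float(row["lat"]), float(row["lon"])) for row in rows]
print(markers)
```

Each tuple in markers holds what a mapping layer needs to place one icon. The same request pattern could be pointed at a reconciliation service instead of a SPARQL endpoint.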

The long-term view is that automation needs to be employed to keep track of data from different sources and of the efforts to integrate it. For inspiration, there are a few interesting projects, such as ITO Map's comparison of the Vector Map District with OpenStreetMap, OpenStreetBugs, and MapDust.

General Energy

  • Look at converting the BP Statistical Review into RDF, which would allow querying trends across different years. See Talk:LNGTrade for some interesting ideas about trends to show.
  • The Global Energy Assessment by the IIASA is out, covering data on current and future energy systems.
    The Global Energy Assessment (GEA), launched in 2012, defines a new global energy policy agenda – one that transforms the way society thinks about, uses, and delivers energy. Involving specialists from a range of disciplines, industry groups, and policy areas, GEA research aims to facilitate equitable and sustainable energy services for all, in particular the two billion people who currently lack access to clean, modern energy.
    There is also a Public GEA Scenario Database with all data and scenarios from the report. This could be used for future work on scenarios, or current work on the Energy Mix and market.

Natural Gas

  • For natural gas storage, see discussion at Talk:NaturalGasStorage. The IGU has a database from 2009 that is useful.
  • For gas production, there are good data sets about the North Sea area (Norway, UK, Netherlands). These may also be useful for oil.

Power grids

There is a large amount of power grid data in OpenStreetMap. Some raw data can be seen here, but a more visual way to inspect it is here. This data must, however, be organized. This can be done by looking at overview maps of operators' grids, such as:

We've started initial efforts at Portal:OpenGridData where we're using Enipedia as a repository of data (currently for the Netherlands) that we use to run load flow calculations. Please contact us if you're interested in helping us to spread this to other countries as well.
