Carma v3 dataset

From Enipedia

Also see work on comparing Carma version 2 vs version 3.


Overview

As of July 2012, Carma has published its newest dataset (v3), with updated data and an updated power plant list, upgraded models with more realistic estimations, and up-to-date company names.

Checking the data

This data is currently loaded in the triplestore as a named graph. This allows people to easily create tools such as IndustryAbout vs. Enipedia, along with other tools that can be used to audit the values in the data.

PREFIX rdfs: <>
PREFIX prop: <>
select * where {
  GRAPH <> {
    ?x prop:country "Netherlands" .
    ?x rdfs:label ?plantName .
    ?x prop:energy_2004 ?energy_2004 .
  }
}

  • It seems that the company field from the csv download is missing from the triplestore. This would have been especially useful since Carma changed its policy for companies and now uses the highest company level (corporate group). Matching these against the v2 values could have helped the effort to trace companies owned by the same group.
    • Values for prop:company are now included in the data.
  • Just running the csv through Google Refine reveals about 20 possibly duplicated plants based on name. A similar number exists for owner names.
  • The Dig Deeper tool allows all of the data to be downloaded. Known issues are mentioned here, and technical notes are here and here (detailed).
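
A name-based duplicate check like the one above can be roughly reproduced offline. The following is a minimal sketch loosely modelled on key-collision "fingerprint" clustering; whether this is the exact method that was used in Google Refine is not stated on this page, and the sample names are made up for illustration.

```python
import string
from collections import defaultdict


def fingerprint(name):
    """Build a comparison key: lowercase, strip punctuation, sort unique tokens."""
    cleaned = name.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def possible_duplicates(names):
    """Return groups of names that share the same fingerprint."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical plant names, for illustration only.
sample = ["Amer Centrale", "Centrale, Amer", "Eems", "Maasvlakte"]
print(possible_duplicates(sample))
```

Names that collide this way are only candidates; two different plants can legitimately share a name, so each group still needs a manual look.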

Import into the wiki and main graph

As of August 2013, most of the entries (but not all?) from the v3 dataset have been imported into the wiki (and thus merged into the main graph). They can be seen in this category. However, there are some inconsistencies that sometimes make it difficult to craft queries for checking data:

  • country names:
    • some names differ between the two datasets (Bahamas, Brunei, both Congos..., see here)
    • overseas territories are considered separate in v3 but were listed as states within their home countries in v2 (most cases can be seen here)
  • state names:
    • some names differ between the two datasets (spelling, transliteration, translation)
    • in some cases they do not even follow the same scheme (UK)
  • city names:
    • many entries have the city name set to 0, which causes geocoding to fail and coordinates to be set to 0,0. Blanking the field instead should at least enable geocoding on the state, as was done for the v2 data
      • This is currently being fixed with a bot script --ChrisDavis (talk) 13:52, 28 August 2013 (CEST)
  • company names:
    • in v3 they are at the group level, not the actual company, and may also differ from v2, creating duplicates
    • in v3 there may be more than one owner (separated with a slash); this has led to the creation of "company" pages in the wiki for every combination (some examples here)
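
Two of the issues listed above (the city placeholder 0 and slash-separated owners) lend themselves to a simple pre-import cleanup. This is only a sketch under assumptions: the field names `city`, `state` and `company` and the sample record are hypothetical, and it is not the bot script mentioned above.

```python
def clean_city(city):
    """Return None for the placeholder "0" (or an empty value) so that
    geocoding can fall back to the state instead of producing 0,0."""
    if city is None:
        return None
    text = str(city).strip()
    return None if text in ("", "0") else text


def split_owners(company):
    """Split a slash-separated owner field into individual owner names."""
    if not company:
        return []
    return [part.strip() for part in company.split("/") if part.strip()]


# Hypothetical record, for illustration only.
record = {"city": "0", "state": "Noord-Brabant", "company": "Owner A / Owner B"}
print(clean_city(record["city"]))
print(split_owners(record["company"]))
```

Splitting on every slash would also break up a company whose own name contains a slash, so the resulting owner lists would still need checking before pages are created from them.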

The new entries also lack the tables for electricity production and CO2 emissions over the years, which makes it less straightforward to check the fuel type or activity period of a plant.

It also looks like in some cases the import bot has overwritten an existing entry while trying to create a new one, since power plant names are not always unique (at least if not including the "(planned)" attribute) and some may have been altered in existing entries and then reused for new ones.

Cleaning up Carma v2 vs. Carma v3

Some of the above can be cleaned up with a few bot scripts. Other aspects are more difficult, and the (original) Carma v2 data needs to be referred to in order to clear this up further and figure out what has changed on Enipedia since the original import.

Plants whose names have changed (identifier is the same) can be found via:

PREFIX prop2: <>
PREFIX prop3: <>
select * where {
  GRAPH <> {
    ?plant2 rdfs:label ?plant2label .
    ?plant2 prop2:plant_id ?id .
  }
  GRAPH <> {
    ?plant3 rdfs:label ?plant3label .
    ?plant3 prop3:plant_id ?id .
  }
  FILTER(?plant2label != ?plant3label) .
}

Plants that have the same name in both versions but different identifiers can be found via:

PREFIX prop2: <>
PREFIX prop3: <>
select * where {
  GRAPH <> {
    ?plant2 rdfs:label ?label .
    ?plant2 prop2:plant_id ?id2 .
  }
  GRAPH <> {
    ?plant3 rdfs:label ?label .
    ?plant3 prop3:plant_id ?id3 .
  }
  FILTER(?id2 != ?id3) .
}

The same id also occurs on multiple pages, although in most cases this seems to come from splitting one power plant into two plants, as is done in the UK DECC data:

select * where {
  ?x rdf:type cat:Powerplant .
  ?x prop:CarmaId ?carmaID .
  ?y rdf:type cat:Powerplant .
  ?y prop:CarmaId ?carmaID .
  filter(?x != ?y) .
}

Overall, this is a bit of a mess, and at least for the results of the second query the bot edits need to be reverted.
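
For auditing outside the triplestore, the three checks above (renamed plants, reused names, ids shared by several pages) can be expressed on plain dictionaries. This is a minimal sketch, assuming the v2 and v3 data have been exported as mappings from plant id to label; the function names and input shapes are hypothetical.

```python
from collections import defaultdict


def renamed_plants(v2, v3):
    """Same plant id in both versions, different label.
    v2 and v3 map plant id -> label."""
    common = v2.keys() & v3.keys()
    return {pid: (v2[pid], v3[pid]) for pid in common if v2[pid] != v3[pid]}


def reused_names(v2, v3):
    """Same label in both versions, different plant id."""
    ids_by_label = defaultdict(set)
    for pid, label in v2.items():
        ids_by_label[label].add(pid)
    clashes = []
    for pid3, label in v3.items():
        for pid2 in ids_by_label.get(label, ()):
            if pid2 != pid3:
                clashes.append((label, pid2, pid3))
    return sorted(clashes)


def shared_ids(pages):
    """Wiki pages that carry the same Carma id.
    pages is an iterable of (page_title, carma_id) pairs."""
    titles = defaultdict(list)
    for title, carma_id in pages:
        titles[carma_id].append(title)
    return {cid: sorted(found) for cid, found in titles.items() if len(found) > 1}
```

The output of `reused_names` corresponds to the second query, so it would give the list of entries whose bot edits are candidates for reverting.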

One of the issues is that some of the emissions and electricity production data have been amended and possibly overwritten since the original import. The query below shows the power plants that have values for years other than 2000, 2007, and 2020. Many of these are in Russia, and are probably just Carma v3 data added into the Carma v2 data, but this isn't the case for all of the values. A check needs to be done to figure out which data have been amended from sources outside of Carma.

PREFIX xsd: <>
select distinct ?pp ?country where {
  {
    ?x prop:Is_electricity_production_of ?pp .
  } UNION {
    ?x prop:Is_emission_of ?pp .
  }
  ?pp prop:Country ?country .
  ?x prop:Year ?year .
  FILTER(?year != "2007-01-01T00:00:00Z"^^xsd:dateTime &&
         ?year != "2000-01-01T00:00:00Z"^^xsd:dateTime &&
         ?year != "2020-01-01T00:00:00Z"^^xsd:dateTime ) .
} order by ?country

It might make sense to make the emissions and electricity production data available only as a separate named graph, since these data are very rarely updated by users.

Work on eGRID vs. CARMA data for the US

We've tried to match entries from the CARMA dataset with their corresponding entries in the eGRID dataset using a matching technique that compares plant names, owner names, coordinates, emissions, and power outputs. This has led to 3,102 matches for the 13,814 US power plant entries in CARMA, although there are still several thousand power plants from the CARMA dataset without a corresponding eGRID entry, and vice versa. Where there are matches, data from eGRID should be given precedence, since some of the CARMA data are calculated using an estimation technique (see Calculating CARMA: Global Estimation of CO2 Emissions from the Power Sector).
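
To illustrate the general idea of such a multi-criteria comparison, here is a minimal sketch that scores a candidate pair on name similarity and geographic proximity only. It is not the matching technique actually used; the record fields (`name`, `lat`, `lon`), the equal weights and the 10 km cut-off are assumptions chosen for illustration.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(carma, egrid, max_km=10.0):
    """Combine name similarity and proximity into a score between 0 and 1.
    The 50/50 weighting and max_km are arbitrary assumptions."""
    distance = haversine_km(carma["lat"], carma["lon"], egrid["lat"], egrid["lon"])
    proximity = max(0.0, 1.0 - distance / max_km)
    return 0.5 * name_similarity(carma["name"], egrid["name"]) + 0.5 * proximity
```

Owner names, emissions and power outputs could be added as further terms in the same way; candidates would then be ranked by score and only the best pair above some threshold accepted.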

  • The latest CARMA release (v3) is supposed to use actual values where they were publicly disclosed in selected countries (including the US). A check based on entries that had previously been linked shows that the values for energy_2004 in the v3 data set do indeed almost match those of eGRID for the same year, differing only by rounding to no more than 5 digits. The same doesn't seem to be true for emissions in Europe, where there are still significant discrepancies with EU ETS, but no comprehensive test has been done yet.
    • For Europe, CARMA uses E-PRTR data instead of EU ETS data. This raises the interesting question of whether the EU ETS data match the E-PRTR data. A quick check shown below for the Amercentrale shows they are off by a few hundred kilotonnes. One sanity check for this comparison is that the property euets:calculatedEmissions needs to be used, as this shows the emissions per year; the other emission numbers are cumulative per reporting period. There is also a chance that multiple units are listed for a single facility, which could make direct comparison more difficult. Also see - "Please also note that under E-PRTR there is a facility-based reporting and under EU ETS an installation-based reporting which makes the data not directly comparable."
    • Values for the euets:eperIdentification property should help match up the E-PRTR and EU ETS data for comparison. However, something seems wrong, as I can't find the values of euets:eperIdentification in the EPER data set itself (downloaded from here, converted to CSV via the script below, then cat *.csv | grep "064ACI207"). See query results here for the values of euets:eperIdentification, based on the query shown below.

PREFIX rdfs: <>
PREFIX euets: <>
select * where {
  ?x rdf:type euets:Installation .
  ?x euets:eperIdentification ?eper .
  ?x euets:countryCode "FR" .
}

Conversion of EPER MS Access database to CSV:

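# Export every table of the EPER Access database to its own CSV file
tables=`mdb-tables EPER_dataset_05-02-2007.mdb`
for i in $tables
do
        mdb-export -d ' ' EPER_dataset_05-02-2007.mdb $i \
                | iconv -c -f UTF-8 -t US-ASCII \
                | sed ':a;N;$!ba;s/\r\n/ /g' \
                | sed 's/"//g' > \
                "$i.csv"   # output file name is an assumption; the original target is missing from this page
done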

CO2 emissions from the Amercentrale according to the E-PRTR. This facility has two identifiers in use (for different years?) - 9537 and 120840:

PREFIX rdf: <>
PREFIX rdfs: <>
PREFIX eprtr: <>
select ?totalQuantity ?year where {
?x rdf:type eprtr:FacilityReport .
?x eprtr:FacilityID ?facilityID .
?x eprtr:CountryName ?country .
?x eprtr:FacilityName "Essent Energie Productie B.v. (amer)" .
?x eprtr:PollutantReleaseAndTransferReportID ?prtrID .
?prtrID eprtr:ReportingYear ?year .
?country rdfs:label "Netherlands" .
?pr rdf:type eprtr:PollutantRelease .
?pr eprtr:FacilityReportID ?x .
?pr eprtr:PollutantName "Carbon dioxide (CO2)" .
?pr eprtr:TotalQuantity ?totalQuantity .
} order by ?year

CO2 emissions from the Amercentrale according to the EU ETS:

PREFIX installation: <>
PREFIX euets: <>
select * where {
  GRAPH <> {
    installation:172 euets:napInfo ?napInfo .
    ?napInfo euets:calculatedEmissions ?calculatedEmissions .
    ?napInfo euets:periodYear ?year .
  }
} order by ?year

Comparison between values (in kg; EU ETS data converted from tonnes):

year   E-PRTR data   EU ETS data
2004   6280000000    -
2005   -             5828901000
2006   -             5398314000
2007   5690000000    5193887000
2008   4190000000    4348652000
2009   5750000000    5760411000
2010   -             5284701000

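
The size of the discrepancy can be worked out for the three years where both sources report a value (2007 to 2009). This is a small sketch using only the figures from the comparison above; the helper names are made up for illustration.

```python
def tonnes_to_kg(tonnes):
    """EU ETS emissions are reported in tonnes; the comparison uses kg."""
    return tonnes * 1000


def differences_kt(eprtr_kg, euets_kg):
    """E-PRTR minus EU ETS in kilotonnes, for years present in both sources."""
    years = sorted(eprtr_kg.keys() & euets_kg.keys())
    return {year: (eprtr_kg[year] - euets_kg[year]) / 1e6 for year in years}


# Values in kg for the years where both sources have a figure.
eprtr_kg = {2007: 5690000000, 2008: 4190000000, 2009: 5750000000}
euets_kg = {2007: 5193887000, 2008: 4348652000, 2009: 5760411000}

for year, diff in differences_kt(eprtr_kg, euets_kg).items():
    share = 100 * diff * 1e6 / euets_kg[year]
    print(f"{year}: {diff:+.1f} kt ({share:+.1f}% of the EU ETS value)")
```

The gap is largest in 2007 (roughly 496 kt) and much smaller in 2009 (roughly 10 kt), and the sign changes between years, so the two sources do not differ by a constant offset.
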
  • In comparing eGRID vs. CARMA, the names may not match up exactly due to the naming system used in the underlying Platts database. Comparing the coordinates along with energy_2004, energy_2009, carbon_2004, carbon_2009 should allow for matches to be found easily and unambiguously.
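
A value-based match of this kind could be sketched as follows. The sketch assumes that "5 digits" means five significant digits, that both datasets use the same units, and that records are dictionaries with hypothetical `lat`, `lon` and `energy_2004` fields; the coordinate tolerance is arbitrary.

```python
from math import floor, log10


def round_sig(value, digits=5):
    """Round a number to the given count of significant digits."""
    if value == 0:
        return 0.0
    magnitude = floor(log10(abs(value)))
    return round(value, digits - 1 - magnitude)


def same_after_rounding(carma_value, egrid_value, digits=5):
    """True if both values agree once rounded to the same precision."""
    return round_sig(carma_value, digits) == round_sig(egrid_value, digits)


def find_matches(carma, egrid_plants, max_deg=0.01):
    """eGRID plants near the CARMA plant whose energy_2004 agrees after rounding.
    max_deg is an assumed tolerance on latitude and longitude, in degrees."""
    return [
        plant for plant in egrid_plants
        if abs(plant["lat"] - carma["lat"]) <= max_deg
        and abs(plant["lon"] - carma["lon"]) <= max_deg
        and same_after_rounding(carma["energy_2004"], plant["energy_2004"])
    ]
```

The same test can be repeated for energy_2009, carbon_2004 and carbon_2009; requiring agreement on several of these fields makes an accidental match much less likely than relying on a single value.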