Enipedia Data Quality Checks

From Enipedia

Data Quality Checks

This page collects quality checks that can be incorporated into an automated system that alerts users to common issues in the data. A semi-automatic process using RSemanticMediaWikiBot can be developed to fix these quickly. During the current development work, the edits are made by a bot account and do not show up in the default view of the recent changes.

These checks can be used to create a service similar to MapDust and OpenStreetBugs.

Current Bot Edits

The current state of the bot code can be found in the EnipediaDataQualityBot repository on GitHub.

A bot script is being run on a semi-regular basis to deal with issues for which no human intervention is (likely) needed.

  • Create pages for energy companies with at least two power plants.
  • Remove empty company pages, but check first that there's no text on them. Deletion occurs if the only text is the EnergyCompany template, and no other notes are included.
  • Fill in the missing primary fuel type when only a single fuel type is specified; that fuel type is assumed to be the primary one as well.
  • Fix links where Wikimapia is listed instead of Wikipedia. In this case, the value is moved to the template parameter for the Wikimapia link, and the Wikipedia link is then made blank.
  • Fill in the Wikimapia link if a link to Wikimapia is found in the references.
  • Reformat Wikimapia links to a consistent format based on their identifier and the redirects present on the Wikimapia site.
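
As an illustration of the Wikimapia/Wikipedia fix in the list above, here is a minimal Python sketch. It is not taken from the EnipediaDataQualityBot code. It assumes the template parameters of a page have already been parsed into a dictionary, and the parameter names wikipedia_page and wikimapia_link are hypothetical. Leaving an existing Wikimapia link untouched is also an assumption, not something the list specifies.

```python
def move_wikimapia_link(params):
    """Return a copy of the template parameters with a misplaced Wikimapia URL moved.

    `params` maps template parameter names to values. The keys
    "wikipedia_page" and "wikimapia_link" are hypothetical names.
    """
    fixed = dict(params)
    wikipedia_value = fixed.get("wikipedia_page", "")
    if "wikimapia" in wikipedia_value.lower():
        # Assumption: never overwrite a Wikimapia link that is already set.
        if not fixed.get("wikimapia_link"):
            fixed["wikimapia_link"] = wikipedia_value
        # The Wikipedia field is blanked in either case.
        fixed["wikipedia_page"] = ""
    return fixed


# Example with a made-up URL sitting in the wrong field.
plant = {"wikipedia_page": "http://wikimapia.org/123456/", "wikimapia_link": ""}
fixed_plant = move_wikimapia_link(plant)
```

The function returns a new dictionary rather than modifying its input, so the original values stay available for logging what the bot changed.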

Power plants without countries (sometimes county is set instead)

This currently affects 14 plants.

select * where {
?x rdf:type cat:Powerplant . 
OPTIONAL {?x prop:Country ?country }. 
FILTER (!BOUND(?country)) . 
}

Check for EnergyCompany that doesn't have an actual page

Results are ranked by the number of plants owned or operated.

select ?owner count(?owner) as ?ownerCount where {
?x ?prop ?owner . 
FILTER(?prop = prop:Ownercompany || ?prop = prop:Operator) . 
Filter Not Exists {?owner a cat:Energy_Company} .
} group by ?owner order by DESC(?ownerCount)

These can be fixed by creating a page including:

{{EnergyCompany}}

Energy companies with a page, but no power plants

There are 46 instances of this.

Before deleting these, the page should be checked to see if there is any text besides just the standard template.

select * where {
?owner a cat:Energy_Company . 
Filter Not Exists {?x prop:Ownercompany ?owner . } .
Filter Not Exists {?x prop:Operator ?owner . } .
} 
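
The manual check described above (whether a page contains anything besides the standard template) could be automated along these lines. This is a sketch, not the bot's actual logic. It assumes the raw wikitext of the page is available as a string and that the EnergyCompany template call contains no nested templates. It also treats a template call with parameters as "template only", which may be too permissive if notes are stored in parameters.

```python
import re

# One {{EnergyCompany}} call, optionally with parameters, surrounded only by
# whitespace. Assumption: no nested templates inside the call.
TEMPLATE_ONLY = re.compile(r"\s*\{\{\s*EnergyCompany\s*(\|[^{}]*)?\}\}\s*")


def is_template_only(wikitext):
    """Return True if the page text is nothing but the EnergyCompany template."""
    return TEMPLATE_ONLY.fullmatch(wikitext) is not None
```

A page would then be a deletion candidate only if it is returned by the query above and is_template_only returns True for its text.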

Biggest energy companies (by power plant count) without Wikipedia links

select ?owner count(?x) as ?plantCount where {
?x rdf:type cat:Powerplant . 
?x prop:Ownercompany ?owner . 
FILTER NOT EXISTS {?owner prop:Wikipedia_page ?wp } . 
} group by ?owner order by DESC(?plantCount)


Fuel Types

Single fuel type, but no primary fuel type specified

In this case, the fuel type should just be copied over.

This occurs with 1 plant.

select ?x ?fuelType where {
?x prop:Fuel_type ?fuelType . 
{
select ?x count(?fuelType) as ?fuelCount where {
?x rdf:type cat:Powerplant . 
?x prop:Fuel_type ?fuelType . 
OPTIONAL{?x prop:Primary_fuel_type ?primaryFuelType} . 
FILTER(!BOUND(?primaryFuelType)) . 
} group by ?x 
}
FILTER(?fuelCount = 1) . 
}
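
The copy-over rule for this case can be sketched in Python as follows. The function name and the idea of passing the fuel types as a list of strings are assumptions for illustration. The rule itself (copy only when exactly one fuel type is given) follows the description above.

```python
def fill_primary_fuel_type(fuel_types, primary_fuel_type=None):
    """Return the primary fuel type, filling it in only when it is unambiguous."""
    if primary_fuel_type:
        # Already set: leave it untouched.
        return primary_fuel_type
    # Ignore blank entries and duplicates before counting.
    distinct = {fuel.strip() for fuel in fuel_types if fuel.strip()}
    if len(distinct) == 1:
        return next(iter(distinct))
    # Zero or several fuel types: this needs a human decision.
    return None
```

Returning None for plants with several fuel types keeps them on the list for manual review instead of guessing a primary fuel.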

Multiple fuel types, but no primary fuel type specified

select ?x where {
{
select ?x count(?fuelType) as ?fuelCount where {
?x rdf:type cat:Powerplant . 
?x prop:Fuel_type ?fuelType . 
OPTIONAL{?x prop:Primary_fuel_type ?primaryFuelType} . 
FILTER(!BOUND(?primaryFuelType)) . 
} group by ?x 
}
FILTER(?fuelCount > 1) . 
}

Primary fuel type is set, but no corresponding fuel type exists

The fuel type may be empty, or none of its values may match the primary fuel type.

select ?x where {
?x rdf:type cat:Powerplant . 
?x prop:Primary_fuel_type ?primaryFuelType .
Filter Not Exists {?x prop:Fuel_type ?primaryFuelType } .
} 

Primary fuel type is set, but no fuel type is set at all

select ?x where {
?x rdf:type cat:Powerplant . 
?x prop:Primary_fuel_type ?primaryFuelType .
Filter Not Exists {?x prop:Fuel_type ?fuel_type } .
} 

Find all country fuel type overview pages that need to be created

select * where {
FILTER(NOT EXISTS { ?countryOverviewFuelType rdf:type <http://semantic-mediawiki.org/swivt/1.0#Subject> } ) . 
{
select iri(replace(bif:concat("http://enipedia.tudelft.nl/wiki/", ?countryName, "/", ?fuelName), " ", "_")) as ?countryOverviewFuelType where {
?pp prop:Fuel_type ?fuelType . 
?fuelType rdfs:label ?fuelName . 
?pp prop:Country ?country . 
?country rdfs:label ?countryName . 
} group by ?country ?fuelType
}
}


References

Possibly incorrect references

Check whether "enipedia" appears in the reference link. This may mean that the reference is blank.

select * where {
?plant rdf:type cat:Powerplant . 
?ref prop:Is_reference_and_notes_of ?plant . 
?ref prop:Reference ?referenceURL . 
FILTER(regex(?referenceURL, "enipedia", "i"))
}

Find Energy Companies Without Websites and Recommend Possible URLs based on References of Owned Power Plants

select distinct ?x ?referenceURL where {
?x rdf:type cat:Energy_Company . 
OPTIONAL{?x prop:Website ?url } .
FILTER(!BOUND(?url)) . 
?plant prop:Ownercompany ?x . 
?ref prop:Is_reference_and_notes_of ?plant . 
?ref prop:Reference ?referenceURL . 
FILTER(!regex(?referenceURL, "wikipedia|globalenergyobservatory|bundesnetzagentur|enipedia|wikimapia|industryabout|industcards", "i"))
} order by ?x

Links to external data sources

Wikimapia link in Wikipedia field

select * where {
?x rdf:type cat:Powerplant . 
?x prop:Wikipedia_page ?wikiPage . 
FILTER(regex(?wikiPage, "wikimapia", "i"))
}

No Wikimapia link set, but one appears in the references

select * where {
?x rdf:type cat:Powerplant . 
OPTIONAL{?x prop:Wikimapia_link ?wikimapia . }
FILTER(!BOUND(?wikimapia))
?refNotes prop:Is_reference_and_notes_of ?x . 
?refNotes prop:Reference ?reference . 
?refNotes prop:Notes ?notes . 
FILTER(regex(?reference, "wikimapia", "i"))
}

No Wikipedia page set, but one appears in the references

This also excludes references to Wikipedia "List of" pages.

select * where {
?x rdf:type cat:Powerplant . 
OPTIONAL{?x prop:Wikipedia_page ?wikipedia . }
FILTER(!BOUND(?wikipedia))
?refNotes prop:Is_reference_and_notes_of ?x . 
?refNotes prop:Reference ?reference . 
?refNotes prop:Notes ?notes . 
FILTER(regex(?reference, "wikipedia", "i")) . 
FILTER(!regex(?reference, "List", "i"))
}

EU ETS

Find largest companies in Europe that are not associated with an account holder in the EU ETS

select ?owner sum(?output) as ?totalOutput where {
   ?plant prop:Ownercompany ?owner . 
   ?plant prop:Country ?country . 
   ?plant prop:Annual_Energyoutput_MWh ?output . 
   ?country rdf:type cat:Europe . 
   FILTER (?country != a:Russia) . 
   FILTER (?country != a:Ukraine) . 
   FILTER (?country != a:Turkey) . 
   FILTER NOT EXISTS {?owner prop:AccountHolderNameInEUETS ?accountHolderName } . 
} group by ?owner order by DESC(?totalOutput)

Find companies that are not associated with an account holder, but own a power plant that is linked to its EU ETS entry

The results show that there is not always a 1:1 relationship between the owners indicated in the two data sets.

PREFIX euets: <http://enipedia.tudelft.nl/data/EU-ETS/>
select ?plant ?euetsPlant ?owner ?accountHolder where {
   GRAPH <http://enipedia.tudelft.nl/wiki/> {
      ?plant prop:Ownercompany ?owner .
      ?plant prop:Annual_Energyoutput_MWh ?output .
      ?plant prop:EU_ETS_ID ?euetsID .
      ?plant prop:Country ?country .
      ?country rdf:type cat:Europe .
      FILTER(?country != a:Russia) .
      FILTER(?country != a:Ukraine) .
      FILTER(?country != a:Turkey) .
      FILTER NOT EXISTS {?owner prop:AccountHolderNameInEUETS ?accountHolderName } .
   }
   GRAPH <http://enipedia.tudelft.nl/data/EU-ETS> {
      ?euetsPlant euets:euetsID ?id .
      ?euetsPlant euets:account ?account .
      ?account euets:AccountHolder ?accountHolder .
   }
   FILTER(?id = xsd:string(?euetsID)) .
} order by ?owner ?accountHolder


Find plants that should be in the EU ETS, but don't have a link specified yet

This is useful for improving how Elasticsearch on Enipedia is configured.

select * where {
   ?x rdf:type cat:Powerplant . 
   FILTER NOT EXISTS {?x prop:EU_ETS_ID ?euets } . 
   ?x prop:Country ?country . 
   ?country rdf:type cat:Europe . 
   filter(?country != a:Ukraine) . 
   filter(?country != a:Russia) . 
   filter(?country != a:Turkey) . 
   ?x prop:Annual_Carbonemissions_kg ?co2 . 
   filter(?co2 > 1000000) . 
} order by DESC(?co2)

Links to external data, but basic data is not filled out

Link to Wikipedia Article exists but no fuel type set

select ?country ?x ?wikipedia where {
?x rdf:type cat:Powerplant . 
?x prop:Wikipedia_page ?wikipedia . 
?x prop:Country ?country . 
FILTER NOT EXISTS {?x prop:Fuel_type ?fuel_type } . 
} order by ?country

Link to any external data set, but no fuel type set

select ?country ?x ?propertyName ?id where {
?x rdf:type cat:Powerplant . 
?x ?property ?id . 
?property rdfs:label ?propertyName . 
?x prop:Country ?countryPage . 
?countryPage rdfs:label ?country . 
?property rdf:type cat:External_Data_Properties . 
FILTER NOT EXISTS {?x prop:Fuel_type ?fuel_type } . 
# skip EU ETS IDs, since the EU ETS data doesn't have the fuel type
FILTER(?property != prop:EU_ETS_ID)
# skip DBpedia pages, which are redundant since they are generated from the Wikipedia link
FILTER(?property != prop:DBpedia_Page)
} order by ?country ?x

Other Checks

  • The Exclusive Economic Zones (EEZ) shapefile from Marineregions.org can be used to check whether power plants are located within the correct country, using only a few lines of R code and aligning the results to the country's ISO 3166-1 alpha-3 code. Checking for the correct state could be done (with a bit more alignment work) using the states/provinces shapefile from Natural Earth. The limitation is that power plants on the coastline or in boundary areas will cause problems if they fall outside the bounding polygon. Adding a buffer around the polygons before performing the check should help with that.
  • Company titles are not standardized (SA, S.A., sa, ltd, LTD, etc.) which leads to duplicate entries.
  • Wikipedia links that aren't links, or that link to other sites such as Wikimapia. The original intent was to link to the English version of Wikipedia so that DBpedia links would work, but the templates should be updated to allow for all language versions of Wikipedia. Wikidata could play a role since it already links together different language versions of the same article. Data from templates is not included yet, but eventually this will happen.
    • DBpedia templates have been fixed for display, but the actual property values still need to be adjusted to allow dataset linking through SPARQL. This could be done by a bot running s/resourcepedia/dbpedia/
      • This has been added to the bot script. The fix involves the page being resaved since the DBpedia link is generated from the Wikipedia link.
select * where {
?x prop:DBpedia_Page ?dbpedia . 
filter(regex(?dbpedia, "resourcepedia", "i")) .
}
  • Encoding issues - build up list of characters that could cause problems, check RDF representation
    • A number of state values still contain "-2D" instead of "-" (0 occurrences), same for cities (0 occurrences)
      • This is currently being fixed by a bot script that resaves the pages.
  • Redirect issues, owl:sameAs
  • Capacity not specified correctly, missing "MW", values that seem bogus. The missing "MW" is basically cosmetic, as "MW" is assumed by default. Its presence or absence can't be found via a SPARQL query, and there may be no way to locate these cases other than processing the raw text of the pages for all plants with a capacity specified. However, a number of them seem to originate in the DECC data import through User:Chrisbot.
  • Different entries for the same company. This can be seen at http://enipedia.tudelft.nl/wiki/Germany/Energy_Companies, where there are several entries similar to "E.on Kernkraft Gmbh" and "E.ON Kernkraft GmbH".
  • Stopword lists to be used for matching can be generated by looking at the self-information of terms per country. Manual review is still required to verify these.
  • Entities in different data sets can be compared. Easy checks are coordinates and country. More difficult checks are fuel types, etc. (lookup tables likely required, plus some sort of reasoning).
  • Check for duplicate entries. The work on enipedia-openrefine-reconcile can be employed for this. Instead of using it to match two data sets, it can try to match entities within the same data set.
  • Power plants with same exact coordinates - likely due to geocoding of location.
    • There are indeed large numbers of power plants for which only the country is geocoded. These coordinates convey no more information than having no coordinates at all.
      • This is being fixed by a bot script that updates a power plant's coordinates to those in Carma v3 if the plant has the same coordinates as at least one other power station. Before integrating the rest of Carma v3 we would need to do a check to see if people have updated the coordinates, or if the coordinates are a result of geocoding.
      • The queries on comparing Carma_v2_vs_V3 are a good starting point for seeing what's been updated, and initial queries show that the latest version is much improved.
select ?lat ?lon ?country count(?x) as ?nb where {
?x rdf:type cat:Powerplant . 
?x prop:Latitude ?lat . 
?x prop:Longitude ?lon . 
OPTIONAL {?x prop:Country ?country }. 
OPTIONAL {?x prop:State ?state}. 
FILTER (!BOUND(?state)) . 
FILTER (?lat!=0) . 
} group by ?lat ?lon ?country
order by desc(?nb)
    • Then there are power plants for which only the state is geocoded:
select ?lat ?lon ?country ?state count(?x) as ?nb where {
?x rdf:type cat:Powerplant . 
?x prop:Latitude ?lat . 
?x prop:Longitude ?lon . 
OPTIONAL {?x prop:Country ?country }. 
OPTIONAL {?x prop:State ?state}. 
FILTER (BOUND(?state)) . 
FILTER (?lat!=0) . 
} group by ?lat ?lon ?country ?state
order by desc(?nb)
    • The Carma V3 data is much better:
PREFIX prop3: <http://enipedia.tudelft.nl/data/CARMA_v3/property/>
select ?lat ?lon ?country count(?x) as ?nb where {
?x prop3:plant_id ?carmaid .
?x prop3:latitude ?lat . 
?x prop3:longitude ?lon . 
OPTIONAL {?x prop3:country ?country }. 
OPTIONAL {?x prop3:state ?state}. 
FILTER (!BOUND(?state)) . 
FILTER (?lat!=0) . 
} group by ?lat ?lon ?country
order by desc(?nb)
    • Duplicate coordinates in the Carma V3 data for which there is also an entry on Enipedia
PREFIX prop3: <http://enipedia.tudelft.nl/data/CARMA_v3/property/>
select * where {
{select ?lat ?lon ?country count(?x) as ?nb where {
?x rdf:type cat:Powerplant . 
?x prop:CarmaId ?carmaid . 
?carmaPlant prop3:plant_id ?carmaid .
?carmaPlant prop3:latitude ?lat . 
?carmaPlant prop3:longitude ?lon . 
OPTIONAL {?x prop:Country ?country }. 
FILTER (?lat!=0) . 
} group by ?lat ?lon ?country
order by desc(?nb)
} 
filter(?nb > 1) 
}
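
For the company-name issues listed above (non-standardized suffixes such as SA, S.A., ltd, and near-identical entries like "E.on Kernkraft Gmbh" and "E.ON Kernkraft GmbH"), a simple normalization key can group likely duplicates. The following Python sketch is only one possible approach. The suffix list is hypothetical and incomplete, and any group it produces would still need manual review before pages are merged.

```python
import re
from collections import defaultdict

# Hypothetical, incomplete list of legal-form suffixes; extend per country.
LEGAL_SUFFIXES = {"sa", "ltd", "gmbh", "ag", "inc", "plc", "bv", "nv", "llc"}


def normalize_company_name(name):
    """Build a comparison key: lowercase, no dots, no trailing legal suffixes."""
    # Dropping dots first turns "S.A." into "sa" and "E.ON" into "eon".
    cleaned = name.lower().replace(".", "")
    tokens = [token for token in re.split(r"\W+", cleaned) if token]
    # Strip legal-form suffixes from the end, but always keep one token.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_possible_duplicates(names):
    """Return lists of names that share the same normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_company_name(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


candidates = group_possible_duplicates(
    ["E.on Kernkraft Gmbh", "E.ON Kernkraft GmbH", "Vattenfall"]
)
```

Exact matching on a normalized key only catches spelling and suffix variants. Fuzzier matching, such as that in enipedia-openrefine-reconcile mentioned above, would be needed for names that differ more substantially.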