Upgrading Enipedia


These notes are from older upgrade efforts in 2013/2014


Status

The site is currently running the latest software. It seems that everything is mostly working, but we're still tracking down things that may not be working. The notes here document current issues, and also contain notes from the upgrade process that should be refactored into other pages.

The discussion below can be grouped into several categories of issues & development directions: URL encoding in RDF, fuel type refactoring, other ways of working with Enipedia data, keeping track of queries embedded on pages, and integration with external data sets.

Issues that need to be fixed still

Feel free to add in any issues that have been spotted.

  • If a script such as SMW_refreshData.php is busy refreshing data on many pages, sometimes "Transaction Deadlocked" errors are received when running sparql queries. There should be a way to configure virtuoso to avoid these errors.
  • Data for power conversion units needs to be reloaded as the template was inserting spaces into some values for properties. References are also duplicated due to encoding issues - Amer_Powerplant/A9.
  • When clicking "edit with form" as an anonymous user, an error message is presented with wiki syntax mixed in.
  • Values for year built are exported to the triplestore using gYear, but represented in dateTime format. This causes problems when using a library such as Jena to process the literals, as it complains that a date-time representation is not compatible with gYear.
    • This can be patched in /SemanticMediaWiki/includes/export/SMW_Exporter.php, around line 592:
//$xsdtype = 'http://www.w3.org/2001/XMLSchema#gYear';
//CD - the format returned for a year is actually dateTime
$xsdtype = 'http://www.w3.org/2001/XMLSchema#dateTime';
    • The issue might be introduced a few lines further down with the code below. The value for $gregorianTime might be in dateTime format instead of gYear:
$lit = new SMWExpLiteral( $xsdvalue, $xsdtype, $gregorianTime );
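
A quick check of whether these malformed year literals are still in the triplestore can be done with a query along the following lines (prop:Year_built is an assumed property name here; substitute whichever property actually uses gYear):

select ?x ?year where {
?x prop:Year_built ?year . 
filter(regex(str(?year), "T00:00")) . 
} limit 10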

New features

  • Include links throughout Enipedia to allow for downloading the data in csv/excel format. A few things are needed first:
    • Figure out how to deal with the issues described in Article_redirects_and_owl:sameAs.
      • The issues described on that page are not reproducible any more given the example queries. It's not clear what has changed.
    • Refactor how we specify fuel types. The current implementation that allows for multiple values for the same property causes sparql queries to return (nearly) double entries which can be confusing to users.
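
As a concrete illustration of the duplication: a plant with two fuel type values is returned once per value, and any other properties in the query are repeated along with it. Property names below are assumptions (prop:Fuel_type in particular); the join pattern is the point:

select ?plant ?fuel ?capacity where {
?plant rdf:type cat:Powerplant . 
?plant prop:Fuel_type ?fuel . 
?plant prop:Generation_capacity ?capacity . 
} limit 20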

Refactoring/Cleanup

  • Fuel types, see discussion above.
  • For Template:PowerplantTest, links to EU-ETS, IAEA, etc should point to wiki documentation that tells more about the data sources and how people can work with/query them.
  • Enipedia SPARQL Endpoints is outdated, descriptions of data sets should be merged into Using SPARQL with Enipedia
  • References to http://enipedia.tudelft.nl/eprtr/ and http://enipedia.tudelft.nl/euets/ need to be removed
  • Portal:Advanced_Topics needs to be merged into the rest of the documentation
  • EU-ETS_Visualization_Examples only has a single example, should be integrated with the rest of the examples for this data set.
  • Maps need to be updated to remove the autozoom parameter. Leaving it in results in an error that autozoom is not a valid parameter.
  • Deal with owl:sameAs links created by page redirects in a sensible way. This is currently breaking some of the queries that don't explicitly handle redirects.
  • Template:EnergyCompany (and others) needs to be updated so that queries can handle redirect links. See the owner company of HVC_Afvalcentrale_Alkmaar for an example that needs to be fixed. A sketch of a redirect-aware query pattern is given after this list.
    • yep, this would be nice..
  • The Operating cost property doesn't have unit conversions properly set up.
  • Merge in and clean up the Report a Problem page. Information for upgrading should be archived somewhere, and we should have a page that lists all of the known issues that are being addressed or need attention.
  • Portal:Industry and Portal:Environment need to be overhauled. Some of the work with EU-ETS and E-PRTR data could be used for this.
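
As a sketch of what handling redirect links in a query could look like: this assumes redirect pages carry swivt:redirectsTo triples in the triplestore (that predicate appears in the SMW-generated query quoted further down this page; owl:sameAs may be what actually gets exported), and a:Example_Energy_Company is a placeholder page name:

PREFIX prop: <http://enipedia.tudelft.nl/wiki/Property:>
PREFIX swivt: <http://semantic-mediawiki.org/swivt/1.0#>
select ?plant where {
# plants that point at the company page directly...
{ ?plant prop:Ownercompany a:Example_Energy_Company }
UNION
# ...or that point at a redirect page forwarding to it
{ ?redirect swivt:redirectsTo a:Example_Energy_Company . 
  ?plant prop:Ownercompany ?redirect }
}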

URL encoding in RDF

The SMW_Exporter.php code encodes and decodes URLs according to this:

static public function encodeURI( $uri ) {
     $uri = str_replace( array( '"', '?', '\\', '^', '`' ),
                                    array( '-22', '-3F', '-5C', '-5E', '-60' ),
                                    $uri );

     //want to keep these as they are
     $uri = str_replace( array( '%21', '%24', '%26', '%27', '%28', '%29', '%2A', '%2B', '%2C', '%2E', '%2F', '%3A', '%3B', '%3D', '%40', '%7E', '%23'),
                                       array( '!', '$', '&', "'", '(', ')', '*', '+', ',', '.', '/', ':', ';', '=', '@', '~', '#'),
                         $uri );
     return $uri;
}

static public function decodeURI( $uri ) {
     $uri = str_replace( array( '-22', '-3F', '-5C', '-5E', '-60' ),
                                    array( '"', '?', '\\', '^', '`' ),
                                    $uri );

     $uri = str_replace( '-2D', '-', $uri );
     return $uri;
}
    • The dash is encoded here as -2D, which is quite weird since the encoded character is part of its own encoding...
    • There seem to be a number of unwanted side effects currently:
      • powerplants get duplicated with bogus entries in Sparql query results (see [1] for example)
      • Carma data can no longer be displayed on forms for the affected entries (see for example La-butte-de-frause_Powerplant). References and notes do not show either (see Opoul-perillos_Powerplant).
        • The encoding function has been updated to not encode dashes any more, the data is being reloaded to the triplestore for all the power plants with dashes in their names, and triples have been deleted where their subject contained "-2D".
          • This seems OK for power plants but still plagues companies, for which nothing shows (see Gesa-endesa). sparqlencode also still needs to be patched to correctly handle names containing a single quote, as discussed at the end of this section.
            • sparqlencode is now patched to handle single quotes and ampersands. It still needs to be aligned and tested with the rest of the conversions specified in the encode/decodeURI code.
              • First tests show it's now OK for ampersands but single quotes still have issues, being now encoded as &-2339-3B instead of -26-2339-3B previously. The root cause (conversion from character to HTML entity) may be outside sparqlencode.
    // decode HTML entities first (so &#39; becomes '), then urlencode and restore selected characters
    $text = html_entity_decode($text, ENT_QUOTES);
    $a = urlencode( $text );
    $a = strtr( $a, array( '+' => '_' ) );
    $a = str_replace( '%3A', ':', $a );
    $a = str_replace( '%2F', '/', $a );
    $a = str_replace( '%28', '(', $a );
    $a = str_replace( '%29', ')', $a );
    $a = str_replace( '%27', "'", $a );
    $a = str_replace( '%26', '&', $a );


            • Owners with hyphens are currently being fixed - the count returned for the query below decreases when the query results are refreshed
select count(distinct ?x) where {
?x prop:Ownercompany ?owner .
FILTER(regex(?owner, "-2D")) . 
} 
            • A more systematic check needs to be done via the query below, with checks included for patterns such as -2, %2, -3, %3, etc.
select distinct(?y) count(?y) as ?propCount where {
?x ?y ?z .
FILTER(regex(?z, "-2D")) . 
} group by ?y order by DESC(?propCount)
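
The same query can be broadened (at the cost of some false positives) to catch the other escape patterns mentioned above in one pass:

select distinct(?y) count(?y) as ?propCount where {
?x ?y ?z .
FILTER(regex(?z, "-2[0-9A-F]|%2[0-9A-F]|-3[0-9A-F]|%3[0-9A-F]")) . 
} group by ?y order by DESC(?propCount)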

The rest of the notes on this page need to be examined to come up with further checks for double entries occurring from related encoding issues. A quick scan based on the query below shows that there are quite a few entries related to the old way of performing unit conversion, which can be cleaned up:

select * where {
?x rdf:type ?type . 
filter(regex(str(?x), "-23kg")) . 
} limit 10

In the same file, the tail end of the getResourceElementForWikiPage function has been modified to properly encode the URLs used for Semantic Internal Objects. These URLs are of the form http://enipedia.tudelft.nl/wiki/Amer_Powerplant#1

//CD modified
if ( !empty($modifier) ) {
//if ( $modifier !== '' ) { //not an effective check of if the variable is empty
     //CD modified 
     //$localName .=  '-23' . $modifier;
     $localName .=  '#' . $modifier;
}

A "-" is used instead of percent encoding due to a comment in ./SemanticMediaWiki/includes/SMW_Infolink.php, which mentions:

// Escape certain problematic values. Use SMW-escape
// (like URLencode but - instead of % to prevent double encoding by later MW actions)
  • ./SparqlExtension/SparqlLinker.php
  • (sparqlencode has been patched to work again, but we should look at using percent encoding. Currently SMW uses a hyphen instead, which isn't normal practice. Before switching to percent encoding, we need to check if there are any dependencies in the code that may break because of this. We also need to check for shadow versions of pages as indicated in the discussion below.) The sparqlencode wiki function does not match the actual encoding in the triplestore for special characters, causing failures to retrieve data. See Electricité de France (EDF).
    • This discussion here shows that DBpedia only percent encodes the characters "#%<>?[\]^`{|}. We need to check that the SMW RDF Export and the SparqlExtension follow this. Also see http://dbpedia.org/URIencoding.
    • There are double versions of many of the Semantic Internal Objects due to differences in encoding. Occasionally (but surprisingly not all the time, due to caching?) these are seen to produce double entries on pages. We need to clean these out and verify that the current code is not still causing these.
select * where {
?x ?y <http://enipedia.tudelft.nl/wiki/Yanbu_Marafiq_Powerplant> .
}

results in:

http://enipedia.tudelft.nl/wiki/Yanbu_Marafiq_Powerplant%231	http://enipedia.tudelft.nl/wiki/Property:Is_reference_and_notes_of
http://enipedia.tudelft.nl/wiki/Yanbu_Marafiq_Powerplant-231-23	http://enipedia.tudelft.nl/wiki/Property:Is_reference_and_notes_of

More issues can be found via:

PREFIX prop: <http://enipedia.tudelft.nl/wiki/Property:>
select ?x where {
?x prop:Is_reference_and_notes_of ?z . 
filter(regex(?x, "%23")) . 
} limit 100
    • There are still orphaned wiki pages with the old encoding, such as Val-d-27isere Powerplant which is a shadow version of Val-d'isere Powerplant.
      • This specific problem was probably caused by an old bot used to rename EDF as it only seemed to affect power plants from that operator;
      • This doesn't seem to be affecting power plants anymore, but it does show up with quite a few energy companies. An example that is broken is Soc_Nigerienne_D'elec.
        • This is actually a different issue not caused by encoding in RDF itself but by the sparqlencode function used in the query, which encodes the HTML entity &#39; instead of the character itself (see below)
PREFIX cat: <http://enipedia.tudelft.nl/wiki/Category:>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?category count(?category) as ?catCount where {
?x ?y ?z . 
filter(regex(?x, "-27")) . 
?x rdf:type ?category . 
} group by ?category order by DESC(?catCount)

Caching

  • The various caching configurations are still being tweaked, which may lead to issues of old content still being displayed on pages.
  • Caching issues are causing havoc (see notes on Enipedia Maps); we need to check whether Squid is preventing the php scripts from running.
  • When you update coordinates for a power plant, the map on the page is not always updated with the new location. This doesn't seem to be an issue in Chrome, but it is a repeatable issue in Firefox. This seems to also be repeatable by just saving a page. There doesn't seem to be a way to refresh/purge the content in Firefox. We should use redbot.org to examine the response headers and figure out why the different browsers seem to have different caching strategies.
  • Regarding the cache, we need to more systematically look into what's going on with it. A few TODO's:
    • Contact mediawiki-l mailing list. The default recommended settings for squid don't seem to work (we have to explicitly not cache php files, which is not mentioned in the instructions), although there may be parts of our config that are impacting things.
    • Come up with a list of test cases to verify if caching strategy is working as expected.
    • Look into job queue set up. This may be part of the reason why changes show up a few days later.
      • Purging a page should load the latest content, but if say a power plant page is updated, I don't think there's a way currently to tell the summary pages that they need to be updated as well. A bot script could be set up to detect this and update related pages. Also, a good test of page load time of caching vs. not caching would be the EDF page due to the large number of power plants owned.
    • Look at how long pages stay fresh in the cache.

Other (not yet categorized)

  • Old Semantic Internal Objects are not deleted when a page is moved, see Omrin RestStoffenEnergiecentrale. These should be regenerated based on the new page.
    • It seems that the old page is not deleted either when the redirect is requested to be suppressed, see Syracuse for an example.
    • A way to find these is via:
select * where {
?x <http://semantic-mediawiki.org/swivt/1.0#page> ?y . 
filter(?x != ?y) . 
?x <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?modificationDate . 
?x rdf:type cat:Powerplant . 
?y rdf:type cat:Powerplant . 
} order by ?modificationDate
  • Special:Ask seems to be buggy. See example here. None of the values for properties that use units show up, and latitude/longitude doesn't show up for all of the values. Checking the values in the triplestore for a single power plant shows that all of the data is in there.
  • When using "Edit with Form" with power plants, the map doesn't always show up (at least in Firefox), which can cause loss of coordinates. The most robust fix for this would be to upgrade to the latest Maps, Semantic Maps, and SMW 1.8 when it's ready as this relates to a known issue.
  • zoom parameter doesn't work with Semantic Maps. Also not possible to select hybrid layer with Google Maps. Should just wait for Maps + Semantic Maps 2.0 to see if this fixes it. Current workaround is to just use the SparqlExtension for mapping.
  • Properties need to be refactored on the pages for Power Conversion Units.
  • __SHOWFACTBOX__ doesn't consistently show factboxes. Not sure yet how to repeat this issue.
  • The timeline on Investment in Rotterdam doesn't work. This is probably related to the bug of Semantic Internal Objects not working with inline queries.
  • MediaWiki:Common.js has a hack to allow for the clickable map in Semantic Forms to work. This should be cleaned up once SMW 1.8 and the latest version of Semantic Maps is ready. Bug is documented at https://bugzilla.wikimedia.org/show_bug.cgi?id=33560
  • When saving a property page for the first time after loading in the db and running the upgrade script, get a sparql query error with the query shown below. It seems in general that SMW has problems processing the results of sparql queries (from virtuoso at least) where no results are returned, or one of the values may be blank (i.e. nothing matching the optional clause). Issue is documented at https://bugzilla.wikimedia.org/show_bug.cgi?id=30542. We're not sure yet how to duplicate this, which prevents us from debugging it.
SELECT * WHERE {
rdf:type swivt:wikiPageSortKey ?s OPTIONAL { rdf:type swivt:redirectsTo ?r }
}
    • This query doesn't seem to make sense - not sure why rdf:type is a subject there. In SMW, properties can be subjects as well, since they also exist as pages and have things like types specified for them (owl:DatatypeProperty, etc). It could be that I need to run dumpRDF.php in order to synchronize the data, which may have been refactored in the upgrade. This can't be done until the units issue is fixed as this is visible when exporting RDF from a page, and would just pollute the triplestore.
  • Property:StorageCapacityLNG_m3 and Property:RegasificationCapacityLNG_m3 need to be created still. See notes here. Also, gas units should be changed to Bcm to make things more convenient.
  • Not all coordinates are properly encoded - this also leads to issues with splitting prop:Point into latitude and longitude values. First check is if the use of N/S/E/W in prop:Point is causing problems. See http://enipedia.tudelft.nl/enipediaTesting/index.php/Navajo_Powerplant for an example, also the query below lists all the pages with problems.
select * where {
?x <http://semantic-mediawiki.org/swivt/1.0#specialProperty_ERRP> <http://enipedia.tudelft.nl/wiki/Property:Latitude> .
}
  • The list of power generation units shows the date + time, instead of just the year (see Eems_Powerplant). The reason for this is that Virtuoso doesn't seem to allow functions (like the one used to extract the year) to be applied if the variable may be null (due to its specification in an OPTIONAL clause). Currently, the results of a sparql query are sent to a series of templates that clean up the year. The failing pattern is sketched at the end of this list.
  • There may be a few queries that use data from the old Joseki endpoint. This mostly applies for external data sets.
  • Editing buttons don't show up - things like bold, italic, etc. buttons. Initial observations seem to indicate that it shows up on Firefox, but not on Chrome.
  • (the query has been fixed but the template doesn't seem to pass parameters to the Google Viz constructor) Fix User:Alfredas/Charts/GeoMap. If the endpoint parameter is specified, the results are not converted to the format needed for the Google Visualizations.
    • Actually no sparql endpoint can understand the format=gds parameter passed by the templates since it is only implemented by Special:SparqlExtension. Without federated queries, a more robust way would be to set up a standalone proxy PHP page on Enipedia that would forward the query to any standard sparql endpoint with output as XML and then do an XSL transformation to generate the format expected by Google Visualizations (the transformation may already be done internally by SparqlExtension). This would enable the use of the endpoint parameter to target standard sparql endpoints and not only SparqlExtension.
    • The format=gds parameter is meant for Special:SparqlExtension, which is defined in SparqlExtension_body.php. Currently this is set up to handle requests to the local triplestore. We need to modify this to handle external requests as well.
      • We also need to verify that this can work with sparql endpoints running different software. Ideally they all use the same standards in terms of the parameters that need to be sent to them to run a query, but we haven't done a systematic check of this.
    • A related effort is sgvizler, which is interesting since everything is done client-side and doesn't require any server setup or specific capability for the endpoint. It supports all Google visualization types and can be quite easily extended.
      • There's an issue where "%C2%A0" is introduced into the query string during the url encoding process. It's not clear where this is coming from. Even with a hex editor, no extra characters are visible, and modifying the code to replace this sequence doesn't work either. This prevents data from being returned (as on User:Alfredas/Sgviztest). This library does work when used on a simple html page - http://enipedia.tudelft.nl/scripts/test.html. The code for the MediaWiki Spark Extension may show how to fix this as the syntax for the embedding queries seems to be the same.
        • "%C2%A0" is the way a non-breakable space is encoded in UTF-8 and indeed, looking at the HTML source, there are &nbsp; added to the query
        • Where do non-breaking spaces come from? They most likely are an advanced text-layout feature of MediaWiki: in some languages (such as French) typography rules ask for a non-breaking space before some punctuation characters such as the question mark, exclamation mark, or semi-colon. Looking at the query, non-breaking spaces are indeed inserted in place of the spaces before variable names starting with "?". If you add spaces before ";" they get converted too.
          • Looking through the Spark Extension source code reveals that a bit of a hack is used to get around this (Spark.class.php): "In MW 1.17 there seems to be the problem that ? after an empty space is replaced by a non-breaking space (&#160;). Therefore we remove all spaces before ? which should still make the SPARQL query work." As mentioned below, a parser hook function would be a better idea.
        • As a side note, one should not be able to enter HTML directly in the wiki. There should be a template that gets converted to the needed HTML, just like it's done currently with the SPARQL extension generating HTML code for GoogleViz. This is also probably why SPARQL queries do not get altered with the current technology: they are handled by the hook function and hidden from the general wiki processor.
      • This seems more mature than rdf-spark, which we have been considering. The original version of the SparqlExtension synchronized SMW with a triplestore. Newer versions of SMW have that functionality in their core code, meaning that the current version of the SparqlExtension (0.8) basically does the same thing as these javascript libraries. The main thing to test is how switching over to a client side implementation would affect page load speed. To help with site performance, we try to cache pages so that queries don't have to be re-run unless the user refreshes or purges the page. In any case, we should experiment with and promote the use of these libraries as it helps people use Enipedia data and embed visualizations on other sites as well.
      • Not sure what will actually change since the Google widgets are already making client calls against the Special:SparqlExtension. Making the format conversion client-side could even theoretically relieve the server from part of its burden. However, there are indeed currently a lot of caching problems where summary pages (for country, company, fuel type) don't reflect changes to individual power plants, sometimes for days, while Enipedia Maps displays them live by querying directly the sparql endpoint. So it may be worth a benchmark.
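
For reference, the failing pattern from the year built issue above looks roughly like this (prop:Year_built and the year() call stand in for whatever the templates actually use); applying the extraction function to a variable that may be unbound is what Virtuoso rejects, hence the template-side cleanup:

select ?x (year(?date) as ?yearBuilt) where {
?x rdf:type cat:Powerplant . 
OPTIONAL { ?x prop:Year_built ?date } . 
} limit 10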

What

The software in use on Enipedia needs to be upgraded to help deal with performance issues and also take advantage of the latest features of MediaWiki (MW) and Semantic MediaWiki (SMW). Before we can do this, a few issues need to be dealt with, as not everything will work out of the box.

A few benefits:

  • Generating RDF dumps for backups is faster (about 1 hour versus 6 hours when using dumpRDF.php). This regenerates all of the RDF that should be in the triplestore, which is helpful in case the data in the triplestore becomes corrupt.
    • Not finding this to be true currently as dumpRDF.php seems to take as long as before. May have been true with a different version of SMW used for testing, or results may have been misinterpreted.
  • sfautoedit in Semantic Forms allows for some interesting integration with javascript applications.
  • Page edits (especially the first one after a long period of no edits) can take a very long time. We're not exactly sure where the bottleneck is, but upgrading may help to address this. We've taken several steps to address this, such as using Squid as a caching proxy server, and using the APC opcode cache, but the problem seems to be a bit deeper than what these can address.
  • It looks like virtuoso has the means to handle page redirects in a way that helps us avoid double counting in queries. See Article redirects and owl:sameAs.

Upgrade Steps

For upgrading, the current site will be preserved, with the database copied over to a site currently used for testing, which will then become the production site if everything works out. There don't seem to be any major blocking issues, although some refactoring will need to be done.

Both MW and SMW will be upgraded, and the triplestore will be changed from Joseki to Virtuoso. 4store will still be running (on a cluster) with a static data set. Eventually we'd like to synchronize it with the data on Virtuoso, and have 4store act as both a backup and an endpoint that can be used for high-traffic applications like hackathons. This will help keep the wiki functional, even if there's a large number of queries happening.

Steps:

  • set current wiki to read-only until new site is ready
  • dump out the latest rdf from the current site and load it into virtuoso - the RDF dump takes about six hours. In the future we may want to see if it's possible to rewrite SMW_dumpRDF.php to be significantly faster.
  • copy over mysql database from current live site
  • run extensions/SemanticMediaWiki/maintenance/SMW_setup.php to upgrade the database tables
  • make changes specified in Templates and Properties to Refactor for Enipedia Upgrade
  • The new properties and their associated values need to be written to the triplestore.
    • Probably the best strategy to aid in quickly switching between the sites is to take a snapshot of the RDF dump of the current wiki, and perform the refactoring of properties on the test site. This will probably take a day. This will get the data in the triplestore 99% synchronized with what's on the live site. Once the upgrade process starts and the MySQL database is copied over, SMW_refreshData can be run for only those pages that have changed since the RDF dump was created.
    • A slower way (at least a day) is by running extensions/SemanticMediaWiki/maintenance/SMW_refreshData.php. To make things faster, we may want to just run a bot script to resave power plant pages, or use SMW_refreshData.php with the ids of everything of rdf:type cat:Powerplant.
      • It seems that multiple SMW_refreshData.php scripts can be run in parallel without slowing down each other. With 10 processes, about 3-4 pages are refreshed per second, meaning that all the power plant data can be refreshed in under five hours. Using 20 processes seems only marginally quicker. The script below takes as input a newline delimited text file with the names of all the wiki pages to refresh.
#!/bin/bash
rm small*
split -a 3 -l 10 ListOfAllPowerPlants.txt small
maxProgs=20
files=`ls small*`
for file in $files
do
        pagesToUpdate=`cat $file | paste -sd "|"`
        numProgs=$((`ps aux | grep refreshData | wc -l`-1))
        echo $file
        while [ $numProgs -ge $maxProgs ]; do
                echo "***************************sleeping**********************"  #debugging messages
                sleep 1
                #figure out how many processes are already running
                numProgs=$((`ps aux | grep refreshData | wc -l`-1))
        done
        php /WIKIPATH/extensions/SemanticMediaWiki/maintenance/SMW_refreshData.php -v --page="$pagesToUpdate" &
done
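
The ListOfAllPowerPlants.txt input for the script above can be generated from the triplestore with a simple query, stripping the http://enipedia.tudelft.nl/wiki/ prefix from the results afterwards:

select ?x where {
?x rdf:type cat:Powerplant . 
}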

Issues to deal with

Non-critical

These are general issues that can be looked into later.

  • Look into setting up a more intuitive way of adding power conversion units to power plants. Some have been added recently with URLs that are the same as the power plant, except with a slash on the end. Indeed there's a very appealing add power conversion unit button that you are tempted to click without even noticing there's a shallow text field on the left. Alas, doing so generates the aforementioned URLs with an empty unit name part: almost everyone got caught on their first attempts. Adding a simple validation test on button click would be a first effort if allowed by the wiki.
    • The current fix in place is to have a default name of "Unit 1", which is what would be generated anyway if you clicked on the link instead of the button. It seems that only the #formlink parser function is capable of auto-generating names of pages, while the #forminput parser function just reads what's in the text box, even if it's empty. With the #forminput function, there doesn't seem to be out-of-the-box functionality to check if the field is blank or not, although this would be a quite useful feature. See here for documentation.
  • Units are handled differently, and we need to refactor several of the properties. This will break old queries that reference things like prop:Energyoutput-23J. Need to check that at least the wiki pages dealing with power plants are working, as the rest can be dealt with rather quickly once this is ok.
  • Fix code for auto-updating list of spammer IP addresses.
  • There's a bug with moving pages - you just get a blank page, and the new page doesn't load.
    • A temporary fix is in place - the code for deleting SIOs was causing problems in the changeTitle function of SMW_SparqlStore.php. We need to check that old SIOs are deleted when moving pages. Old triples should also be deleted when moving pages. Also check different cases when redirect is and is not left behind.
[Mon Sep 03 13:55:39 2012] [error] [client 127.0.0.1] PHP Fatal error:  Call to a member function getDBkey() on a non-object in ./extensions/SemanticInternalObjects/SemanticInternalObjects_body.php on line 192, referer: http://enipedia.tudelft.nl/enipediaTesting/index.php/Special:MovePage/Test2
[Mon Sep 03 13:55:59 2012] [error] [client 127.0.0.1] PHP Fatal error:  Call to a member function getNamespace() on a non-object in ./includes/specials/SpecialMovepage.php on line 59
  • For natural gas data, default unit is m3, should use Bcm instead. Also, some of the capacities should be in GWh instead of MWh. This requires a different default unit conversion template. May want to have some sort of switch statement that selects the correct unit conversion table based on the units used for the property.
  • Queries to the CIA world factbook (Template:Country_Factbook) don't work with the latest version of the SparqlExtension. The specific issue is that Virtuoso doesn't support federated queries, and the factbook endpoint doesn't seem to be running Jena, which is required for the type of complex query that we need to retrieve the data in the correct format for a piechart. We may want to just leave this out for the time being. Also, the factbook data in RDF may not be maintained any more.
  • "wikitable sortable" seems to sort by string, and not by number with the new version.
  • (Fixed using the suggestion below, need to update query used in the template) - Template:Country Factbook breaks since federated queries are not supported - can use endpoint parameter, need to update SparqlExtension so that Google Charts are supported with external endpoints.
  • Template:POR needs to be updated - inline query coupled with googlemaps prevents page from rendering
  • Get strange error related to the parser: PHP Notice: Undefined variable: text in /includes/parser/Parser.php on line 3084. This happens with the current production site as well. It's not clear if this is causing any problems.
  • Make sure that rendered pages are cached.
    • Check setting of $wgMainCacheType, $wgParserCacheType, $wgMessageCacheType, and $wgEnableParserCache. These were causing problems previously by caching old content and then writing it to different pages.
    • Check that caching and purging of SPARQL query requests occurs in a sane way.
  • URLs with & and ' will no longer have these characters urlencoded. Clean out old values from the triplestore, as there may be a few URLs using the old encoding (a query for spotting them is sketched after this list).
    • Fântânele-Cogealac_Wind_Farm causes problems - check that the encoding is able to handle this. Figure out how to locate pages with similar issues.
  • swivt:page references localhost - why is this?
  • Need to submit bug report as Inline queries don't work with Semantic Internal Objects when a triplestore is used as the backend. Tried to debug this a while ago, and it seems that semantic internal objects do not have a wikiPageSortKey, which causes problems with the ask translation.
  • Virtuoso is currently set up with a limit of 100,000 rows. This will allow for downloading of the entire power plant data set without users having to write queries with LIMIT and OFFSET clauses. The virtuoso default limit is 10,000 rows, and we're not sure if raising this 10x will lead to any performance issues.
  • Update demo queries used for the SparqlExtension documentation - User:Alfredas/Charts
  • Clean out old properties which are not being used any more due to unit refactoring.
  • Pubby interface to external data can be set up in a more streamlined way. Also, data such as the EU-ETS isn't very clickable (example at http://enipedia.tudelft.nl/data/page/EU-ETS/country/NL/installation/172); we should look into expanding on Alfredas' work on coupling XSLT to the results of SPARQL queries. This could dramatically improve the interface and allow us to rapidly deploy new views of the data sets we import.
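
A sketch of a query for spotting leftover URIs that still use the old -26/-27 escapes for & and ':

select distinct ?x where {
?x ?y ?z . 
filter(regex(str(?x), "-26|-27")) . 
} limit 100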

Already Addressed

This mostly documents various patches that need to be submitted.

  • (A bot has been run to fix the pages identified by the SPARQL query below.) Check into issues of lat/lon not being set, as described on Enipedia Maps. Affected plants can be seen here using the query below. This seems to be due to things like "° N" being in prop:point.
PREFIX prop: <http://enipedia.tudelft.nl/wiki/Property:>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cat: <http://enipedia.tudelft.nl/wiki/Category:>
select * where {
   ?x prop:Point ?point .
   ?x rdf:type cat:Powerplant . 
   OPTIONAL {?x prop:Latitude ?lat} . 
   OPTIONAL {?x prop:Longitude ?lon } . 
   FILTER (!BOUND(?lat)) . 
   FILTER (!BOUND(?lon)) . 
} 
  • There's a bug with the xsl transformation used by Special:SparqlExtension to convert SPARQL results into JSON suitable for GoogleViz: when trying to remove possible newlines from SPARQL literals, it actually ends up removing spaces.
    • Is there a demo query/page that can be used to help verify if a fix is working? Otherwise this will be tested using an XML document that is set up to mimic the issue.
      • A country's powerplants page such as Netherlands/Powerplants is a good test. Wherever labels are used instead of url substrings in the charts, spaces get eaten up. After removing the second translate in
<xsl:value-of select="translate(translate(sparql:literal,"'"," ")," ","")"/>  

the generated JSON looks correct when testing the transform locally.

      • The second translate has been removed. The only situation I'm aware of where newlines may exist in literals is for the references and notes data, but no one's going to be making charts from this data.
  • Prop:Point is not exported to RDF. This is currently breaking our work on Enipedia Maps, although an update is running to populate the point data. Bug report + patch is here
    • The issue is in the getDataItemExpElement function of SemanticMediaWiki/includes/export/SMW_Exporter.php
      • The current fix is to change this:
case SMWDataItem::TYPE_GEO:
    /// TODO
    return null;

to this:

case SMWDataItem::TYPE_GEO:
     $lit = new SMWExpLiteral( $dataItem->getSerialization(), 'http://www.w3.org/2001/XMLSchema#string', $dataItem );
     return $lit;
  • Semantic Forms has been patched in order to show values in listboxes. Default install gives the error shown below. Bug report + fix has been submitted here. There's still a small issue in that "Hard_Coal" != "Hard Coal". If there's an underscore in the name, it will not match if the list is populated by category members.
PHP Notice:  Undefined variable: cur_value in ./extensions/SemanticForms/includes/forminputs/SF_ListBoxInput.php on line 47
  • Patched Virtuoso issue with many SIO's on a page generating 10000+ lines of SQL.
  • Patched inability to export RDF for SIO's.
  • Modified RDF export to not encode certain characters - &,',:, etc. Check if this is a sane strategy - no problems have been noticed.
  • Added code that allows for a SPARUL user to be specified for virtuoso. This allows for the SPARQL endpoint to be public, while SPARUL commands are run as a privileged user. The current documentation for SMW doesn't seem to expose the SPARQL endpoint to the outside world - both query and update URLs are on the localhost.
  • $wgProxyList doesn't work, or at least it allows banned IPs to create accounts, but not to edit. This is likely what's happening with the recent flood of account creation that doesn't lead to spam. There seems to be another issue where you can block a user but they can still create an account from the same IP address, even though MediaWiki gives you the option to prevent this from happening.
    • There's probably a better way to fix this, but updating ./includes/specials/SpecialUserlogin.php as shown below prevents banned ips from creating an account. Original:
if ( !$wgUser->isAllowed( 'createaccount' ) ) {
        $wgOut->permissionRequired( 'createaccount' );
        return false;
} elseif ( $wgUser->isBlockedFromCreateAccount() ) {
        $this->userBlockedMessage();
        return false;
}

New:

if ( $wgUser->blockedBy()){
        $this->userBlockedMessage();
        return false;
} elseif ( !$wgUser->isAllowed( 'createaccount' ) ) {
        $wgOut->permissionRequired( 'createaccount' );
        return false;
} elseif ( $wgUser->isBlockedFromCreateAccount() ) {
        $this->userBlockedMessage();
        return false;
}

Templates and Properties to Refactor

Templates and Properties to Refactor for Enipedia Upgrade contains the list of changes to be made. Using SPARQL with Enipedia needs to be updated to indicate the new properties in use.

Templates to Refactor

Templates such as Template:Electricity Production and Template:Emissions, which set values of Semantic Internal Objects via {{#set_internal:...}}, need to have the parameter names capitalized. If units are not set correctly for Semantic Internal Objects, this is probably what is happening.

Properties to Refactor

All of these properties use units, and will be affected by the SMW upgrade. Need to copy conversion factors from Type:Mass, Type:Energy, Type:ElectricPower, Type:Fuelcost, Type:Fuelemissions, Type:Currency_amount and Type:Volume. Special:Types has an overview of everything. Some of these are not widely used, and based on initial experiments we may want to clean them up if they're not really in use.

The most used properties (starting from prop:Generation_capacity) will be refactored first as it will probably take about a day to run the jobs that update these across all the power plant pages.

Property | Type | Number of values
Property:AdditionalIncome | Currency amount | 3
Property:AdditionalPrivateInvestment | Currency amount | 3
Property:Cost | Currency amount | 124
Property:GasPrice | Currency amount | 7
Property:Operating cost | Currency amount | 124
Property:Price | Currency amount | 31
Property:Amount fuel consumption | Energy | 16
Property:Amount heat production | Energy | 2
Property:Fuelprice | Fuelcost | 9
Property:Fuelemissions | Fuelemissions | 9
Property:Amount emissions | Mass | 8
Property:CoalConsumption | Mass | 67
Property:CoalProduction | Mass | 67
Property:CoalProvenReserves | Mass | 67
Property:Generation capacity thermal | Power | 82
Property:CoalBedMethaneReserves | Volume | 1
Property:LNGExport | Volume | 12
Property:OilConsumption | Volume | 80
Property:OilProduction | Volume | 80
Property:OilProvenReserves | Volume | 80
Property:PipelineMaximumAnnualCapacity | Volume | 69
Property:RegasificationCapacityGas | Volume | 120
Property:RegasificationCapacityLNG | Volume | 120
Property:ShaleGasReserves | Volume | 1
Property:StorageCapacityLNG | Volume | 120
Property:TechnicallyRecoverableShaleGasResources | Volume | 134
Property:TightGasReserves | Volume | 1

Pages to Check

These use queries that may break with the upgrade. Some of these are due to the refactoring of how units are used with properties. For others, it is due to the fact that the "fn:" prefix cannot be used to do things like retrieve substrings. The queries need to be updated to include the appropriate workaround.

See Searching for SPARQL queries in use on the wiki for several ways to find aspects of the queries in use.

Aspects to Verify with New Version of SMW

  • Check that URIs written out to the RDF store are in the expected form. It should be something like http://enipedia.tudelft.nl/wiki/ArticleName
  • Patch SMW code with regard to which characters it escapes in the RDF export. By default, it will generate URLs including "Property-3A" instead of "Property:".
  • Are Semantic Internal Objects exported correctly? Check that they work with both SPARQL and inline queries
  • Newer version of SMW specifies unit conversions differently. This changes how the data is represented in the triplestore, so we need to figure out how to best refactor this, and how long it would take for these changes to propagate.
    • Identify all properties with units specified (a query for spotting leftover unit-suffixed property URIs in the triplestore is sketched after this list)
    • Identify all SPARQL queries that reference these properties
    • Need to refactor names of properties to explicitly include the default unit, so that the properties are self-documenting. The new version of SMW automatically changes things like prop:Energyoutput-23J to prop:Energyoutput.
  • Check new SPARQL endpoint - still need to verify that these work with SMW, and pick which one to hook up to the wiki. Changing from Jena/Joseki to something else is an important step for taking advantage of some of the possibilities that Nono has highlighted with his work on maps. An interesting extension based on this code would involve using the jQuery functionality for remote autocompletion. With this, you could create maps where you can type in a few letters of a company name and then find all the power plants owned by them.
    • SparqlExtension - the SparqlExtension needs to be upgraded to fit into the latest version of SMW. This will become version 0.8. If the endpoint parameter is not set for a query embedded on a wiki page, then the query results are processed via this code.
      • Tests
        • normal mode where just return table
        • normal mode + google visualization
        • using endpoint parameter with table
        • using endpoint parameter with visualization
        • Test different output formats available in Special:SparqlExtension
          • Copy over recent work by Alfredas with custom xslt view.
      • Patches to SMW
        • Check that RDF URIs are escaped properly, according to: http://www.w3.org/TR/rdf-concepts/#dfn-URI-reference. Characters such as & and ' are encoded as -26 and -27 currently.
        • Check GasOverview/GasFlowsNetwork which queries SIOs from NaturalGasFlows. Semantic Internal Objects don't seem to be exported to the triplestore in the new version of SMW. This needs to be patched.
        • Credentials need to be passed in order to securely run SPARUL commands on Virtuoso. To restrict outside users from being able to delete all data in the triplestore, it is necessary to set up another user with update permissions, while the default sparql user has only select permissions. For 4store and Joseki, this is not necessary as you can just block access to the update URL, so that only requests from the localhost are accepted.
        • Virtuoso can't handle the SPARUL commands generated by resaving NaturalGasFlows. Here is a non-ideal workaround. The current solution is to split up SIO update statements into batches of 10 (NaturalGasFlows has 282 SIOs). It seems that at around 10000 characters the POST request fails, although the error message specifically mentions 10000 lines of code. The request generated for NaturalGasFlows is about 106,000 characters long, which works with Joseki.
          • Virtuoso 37000 Error SP031: SPARQL: Internal error: The length of generated SQL text has exceeded 10000 lines of code
      • Observations
    • Both 4store and virtuoso are running with static copies of the wiki data (snapshot from June) along with the external datasets (EU-ETS, eGRID, E-PRTR, IAEA). The URLs for the endpoints will change once we make sure that things are working.
    • Implementations
    • Named Graphs - both of these are set up with named graphs to help manage the data
    • Ideas/Considerations
      • May want to have virtuoso connected directly to the wiki data, while having 4store (currently running on a cluster) set up as a sort of high-traffic endpoint that people can send a large number of queries to. We want people to have the freedom to experiment with the data without (them or us) worrying about breaking things.
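
Since the old escaping turned unit suffixes into things like -23J or -23kg on the property URIs, a rough inventory of the properties affected by the unit refactoring can be pulled from the triplestore with:

select distinct ?y where {
?x ?y ?z . 
filter(regex(str(?y), "Property:.*-23")) . 
} limit 100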

Future Work

  • Wikipedia IP Range blocks are useful for tracking down IP ranges of known spammers.
    • This helps, but doesn't keep them all out since there are way too many proxies available. It seems that spammers have their own blacklist which you can get on if you complain enough to the companies hosting the proxies. The script below helps with this process by taking a text file of spammer IPs and then filtering the whois output to highlight the abuse email address you need to contact.
#!/bin/bash
ips=`cat listOfIPs.txt | sort | uniq`
rm abuseEmails.csv
for ip in $ips
do
        abuseEmails=`whois $ip | grep -e "OrgAbuseEmail\|abuse-mailbox\|abuse@" | sort | uniq | sed 's/OrgAbuseEmail: \+//g' | sed 's/abuse-mailbox: \+//g'`
        for abuseEmail in $abuseEmails
        do
                echo $ip, $abuseEmail >> abuseEmails.csv
        done
        # give whois a break
        sleep 5
done
  • Treemap visualization of Power Generation by Fuel Type is crashing on Netherlands/Powerplants - need to look into replacing this with something more stable.
  • Scraperwiki views are not working due to query timeouts, likely due to their transition to a new platform. A test case is select distinct Country from powerplants on the GEO scraper
  • Update pages such as Netherlands/Energy Companies to include subsidiary information.
  • There are some power plants showing up twice in Template:EnergyCompany due to the complexities of doing a hierarchical traversal of company structures over both owners and operators. Having columns for both owners and operators may be a way to deal with this. RAO Energy Systems of the East is a good test case.
  • There are character encoding issues with the org chart visualization, see [En+ Group].
  • The triplestore has been upgraded to Virtuoso 6.1.7 which should allow more of the SPARQL 1.1 features to be used in queries. Property paths have been verified to work, but LET and BIND don't seem to allow for constants to be assigned to variables.
  • Check for SIO orphans when moving and deleting pages
  • Semantic Forms allow for autocompletion on outside values. This is very cool for allowing for integration with outside data sources - EU ETS account holder, scraperwiki work, etc.
    • Initial tests with a static json haven't worked, need to follow the manual and create a web service. There may also be a bug fixed in the latest version of Semantic Forms (see discussion here).
  • We do ridiculously complex things with templates, which unfortunately leads to performance issues. The Scribunto extension in development would allow us to still do sophisticated things while also speeding up the site. From current testing, it seems that the version of MW in use on Enipedia needs to be upgraded to allow for this to work. Some patches need to be done as MW 1.21 breaks the SparqlExtension.
  • For country LOD links, include Reegle Country Profiles.
  • The work on aligning data sets is becoming mature enough that a bot script can be set up to automatically generate wiki pages highlighting very likely matches.
    • Development work on this will likely start around August. Design ideas are being fleshed out below.
    • This could be adapted to identify power plants that have similar names, locations, etc. which could cause confusion, e.g. Le Havre Powerplant vs. Le Havre Endesa Powerplant‎. It would be nice to have some system in place to create alerts about other entries that may have to be disambiguated.
    • Company entries could also be fixed up with this, as similar names might indicate the presence of owners and subsidiaries or alternative names for the same company.
    • Aside from integrating the data from several of the scrapers, the E-PRTR data should be included as well. The version in RDF now has links to the html versions of the facility pages.
    • Information about possible matches could be written to a subpage of the power plant page itself to avoid clutter and allow for experimentation. It would also be useful to have some indication on a power plant page if there are some likely matches which have not been verified and linked in yet via properties such as Property:Wikimapia link and Property:Wikipedia page.
    • Some sort of country/company overview would be useful also to highlight high scoring matches. A further feature would be to highlight high scoring matches for which the data on Enipedia is not very complete.
  • The Graphviz visualization for the SparqlExtension needs to be patched. The use of a unique identifier for the images allows for more than one graph to be embedded on a page, but it also results in even more copies once the page is reloaded.
  • Carma data needs to be better separated from our own data. Quite a few of the queries assume that the power plants will have electricity production and CO2 emission data for them, and this is not often the case for power plants that have been added by users. It would be nice to manage the Carma estimations for these values as a separate graph, but first a check needs to be done on how many power plants have had this data overwritten.
  • For plant start dates, should look at what IAEA has: iaea:commercialOperation, iaea:connectedToElectricityGrid, iaea:constructionStarted
  • For Hydro plants - dam, pumped storage, run of river.
  • In general, should re-evaluate the ontology we are using and expand as necessary.
  • Page redirects are not managed in the triplestore in a very good way. See Article redirects and owl:sameAs for an in-depth discussion of how these can affect the result of SPARQL queries.
  • Fuel types are not currently dealt with in a very robust way. May want to specify primary, secondary, and other_fuel_types. Only other_fuel_types could have multiple entries.
  • (This is current work - virtuoso will be the new default. The notes here describe various issues that we need to be aware of.) Change the RDF store to 4Store, Virtuoso, or OWLIM - first we would need to do an inventory of the SPARQL queries in use by templates on the site. We've used a bit of SPARQL 1.1, and it's not known if these fully support the spec yet like Jena/Joseki does. Changing the RDF store would make the site more robust, increase performance in terms of queries per second, and include timeouts for queries that are too complex to finish within a reasonable amount of time.
    • Good overview of different triplestores.
    • OWLIM is interesting due to its use of Jena (what we use currently) and its support of owl:sameAs statements. This second part is important since we don't currently employ a robust way of dealing with redirects. Currently if you have a redirect from one page to another, SPARQL queries will show both the redirect page and the preferred page.
    • bigdata is also worth a look
    • Support for sub-queries? federated queries?
    • 4store issues
      • Federated queries don't seem to be supported yet. We use these on a couple parts of Enipedia, but it's not widespread.
        • Via the SPARQLExtension, it's possible to directly query external endpoints, which reduces the server load on our end. This strategy can also be good for the server on the other end. At least for Jena, the aggregation syntax is not sent to the external endpoint, meaning that the external endpoint returns all the results, with aggregation only happening on the near end. This can lead to erroneous values if you're performing a query involving aggregation of a number of records that is greater than the number that the external endpoint is allowed to return. Virtuoso may do things differently.
      • 4store uses a slightly different syntax than Jena.
        • select count(?x) where {?x ?y ?z} doesn't work
        • select (count(?x) as ?count) where {?x ?y ?z} works
      • The soft-limit setting in 4store will influence query results for large and/or complex queries.
      • When filtering, you must explicitly specify the type. The original example (from our queries to find power plants without coordinates) was not copied into these notes; a guess at the general pattern is below:
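PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
select ?x where {
?x rdf:type cat:Powerplant . 
?x prop:Latitude ?lat . 
# the comparison literal may need an explicit datatype on 4store (illustrative guess, not the original query)
filter(?lat = "0.0"^^xsd:float) . 
} limit 10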
      • fn:substring isn't supported, but substr() is. Need to look up equivalent function names in the SPARQL 1.1 spec. We may use other similar functions in queries.
        • select (fn:substring(str(?x), 33) as ?name) doesn't work
        • select (substr(str(?x), 33) as ?name) works
      • LET (used in country gas infrastructure - Template:NaturalGasInfrastructure and Using linked data for energy calculations) isn't supported out of the box. Error message says it needs LAQRS.
        • Need to test this on a separate computer, as rebuilding everything with LAQRS enabled doesn't actually seem to enable it.
        • Doing this:
          ./configure --enable-query-languages="laqrs sparql rdql" && make
        • Tells us that:
Rasqal build summary:
  RDF query languages available : rdql sparql laqrs
  RDF query languages enabled   : laqrs sparql rdql
  Raptor version                : 2.0.8
    • Virtuoso issues
      • The Virtuoso Jena Provider might be promising in helping us to deal with the issues mentioned below. In short, SPARQL 1.1 is not completely supported by Virtuoso, but it has better performance than Jena which does support SPARQL 1.1.
      • (Virtuoso 7 is now installed and running) The open source version doesn't support spatial predicates. See feature matrix for more details. Currently, the PowerplantTest template splits up values for the point property into latitude and longitude values automatically. This enables us to run sparql queries over geographic bounding boxes.
        • They seem to have now been included in VOS v7.0. This could enable advanced checks as native sparql queries, without resorting to external tools such as R.
        • Virtuoso 7 has an issue that causes the wrong values of rdfs:label to be returned for queries involving power plants with the same coordinates - https://github.com/openlink/virtuoso-opensource/issues/86. If sorting is not done, then the value for rdfs:label is correct. This seems to be fixed with Virtuoso 7.10.
        • Queries for the named graphs don't show all the named graphs. This doesn't show the named graphs for the original carma data, although the data can be queried by specifying those graphs.
        • FILTER NOT EXISTS doesn't work correctly with variable length property paths. In virtuoso 6 this was ok.
          • Broken query:
select * where {
?x rdf:type cat:Powerplant . 
?x prop:Ownercompany ?owner . 
?x prop:Operator ?operator . 
?x prop:Country a:Russia . 
filter not exists {?owner prop:Subsidiary+ ?operator} .
filter(?owner != ?operator) . 
}
          • Current hack/workaround:
select * where {
?x rdf:type cat:Powerplant . 
?x prop:Ownercompany ?owner . 
?x prop:Operator ?operator . 
?x prop:Country a:Russia . 
filter not exists {?owner prop:Subsidiary ?operator} .
filter not exists {?owner prop:Subsidiary/prop:Subsidiary ?operator} .
filter not exists {?owner prop:Subsidiary/prop:Subsidiary/prop:Subsidiary ?operator} .
filter(?owner != ?operator) . 
}
      • If you specify an aggregate variable (i.e. sum(?x) as ?y) in the select statement, you can't re-use it in another calculation in the select statement (i.e. (?y/1000) as ?x). A subquery workaround is sketched at the end of this section.
      • It doesn't seem possible to apply a function to a variable if one of its values is NULL. An example of this is applying a substring function to a variable that is specified in an OPTIONAL clause.
      • LET isn't supported at all (maybe something equivalent?).
    • Jena non-issues - Both LET and BIND work fine in Jena.
select * where { LET(?s:=10) . } 
select * where { BIND(10 as ?s) . } 
    • Below is a workaround for where a literal needs to be assigned to a variable:
select ("10") as ?s where {}
    • With Virtuoso 7, the BIND examples with literals shown above don't work, although BIND seems to work in situations where you bind the value of one variable to another:
select * where {
?x rdf:type cat:Powerplant . 
?x prop:Point ?point . 
BIND(?point as ?coordinates)
} limit 10
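
For the aggregate-reuse limitation noted above, wrapping the aggregate in a subquery is a possible workaround (a sketch only; prop:Generation_capacity stands in for whatever is being summed):

select ?y (?y/1000 as ?z) where {
{ select (sum(?capacity) as ?y) where {
?x rdf:type cat:Powerplant . 
?x prop:Generation_capacity ?capacity . 
} }
}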

Philosophy/Why do we use the tools that we do

Advantages of current approach:

  • Revision control - we need to know who did what when and revert to previous versions. The site does get occasionally vandalized, and this makes it very easy to fix things up. Also, legitimate edits sometimes have mistakes and it's useful to be able to compare previous values.
  • Semantic Forms is awesome for providing user-friendly editing of data with things like autocompletion, maps input, and upcoming features like this (background info).
  • Embedding queries in templates + ability to have nested templates is incredibly powerful
  • Lots of extensions to MediaWiki make it easier to add new features than with a custom-built solution.
  • We're not aware of anything with this level of maturity that has these features - a significant amount of work has gone into these and dealing with all sorts of weird exceptions and corner cases. We'd rather not duplicate that effort if possible.
  • Features like [2] help with data cleanup. Also, see all the features available on Special:SpecialPages.

Disadvantages of current approach:

  • It's not as fast as working directly with a triplestore.
    • Alfredas has written a demo xslt stylesheet that transforms the results of a DESCRIBE query for a power plant into a web page with an embedded map. This also contains javascript which is aware of the subject, predicate, and object of the triples, meaning that this could be used to generate statements to update data in the triplestore. In any case, this shows a few interesting features that can be used to create lightweight interfaces to the data without having to load all the dependencies associated with the wiki software.
      • This proof of concept can be seen by going to Special:SparqlExtension, running a query such as DESCRIBE a:Amer_Powerplant, and for the output format, selecting Custom XSLT, or see result here
      • This doesn't (yet) help us with the issue of nested templates where a conditional statement over one value can then call another template which then runs another query.
      • Related efforts are http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtSPARQLXSLT and http://tw.rpi.edu/web/Help/UsingSparqlTag
      • May want to just use a javascript approach, built upon the sparql query functionality in Enipedia Maps. Also see http://www.w3schools.com/xsl/xsl_client.asp for an example of a xslt transformation performed within javascript.
      • It would be best to create a system using human-readable URLs, similar to what we've set up with Pubby.
      • XSLT+SPARQL could be promising, but needs a bit more work. This could be used to set up a system that pre-renders pages based on triplestore data, and could run updates on command.
        • One of the more complicated operations involves the creation of sparkline tables, since each row in the table may have missing values at different positions, but the column order needs to be based on the unique set of dates available for all rows. See http://www.biglist.com/lists/xsl-list/archives/200012/msg00354.html for a possible solution.
  • Wiki enforces editing one page at a time. It would be nice to be able to edit the data as if it were a spreadsheet. A common use case is that we know the owner for a set of power plants has changed. With the wiki setup, you have to edit every page individually. Redirects also cause issues with the triplestore and lead to double counting in the results.
  • Bot editing is slower than working directly with data in triplestore
    • There's an opportunity to create bots that are more 'context aware' and are able to flag things that may be weird edits. For example, if you wanted to change the owner of a plant, it would also check to see if two owners were listed, and would only change the desired one. We've done some work on bots that can write to templates given data in csv files, but they can't do things like remove parameters from templates.
  • The jobs queue in MediaWiki doesn't seem to be suited to what we do, in particular the RefreshLinks jobs take forever to run and occur in large numbers when we do things such as modify the main template used for power plants.

Older Notes

Upgrade test Aug 25-26

The upgrade test from the weekend of Aug 25-26 didn't work. A major issue was that unit conversions did not work. It was found that the reason for this is that there were hidden characters in the property names specified in the power plant template. A way to debug this is to look at the exported RDF, and see if there are any unusual characters (hex values for ascii characters) in there. Also, an issue was found where content saved to pages was replaced with content from previously saved pages. I'm not sure if this is an issue with the caching proxy server as the production site uses the same server, and these issues haven't been noticed there. It seems that commenting out the lines below in LocalSettings.php fixed things. It was noticed that some of the content that showed up on pages was from a previous test, and the content was not contained in the current database - meaning that the content was coming from outside the wiki itself.

//$wgMainCacheType = CACHE_ACCEL;
//$wgParserCacheType = CACHE_ACCEL;
//$wgMessageCacheType = CACHE_ACCEL;
//$wgEnableParserCache = true; //turn on parser caching