User talk:ChrisDavis


Hi Chris,

Would it be possible to show me a few examples of bot scripts (I mean scripts that actually make modifications) that do various things on the wiki? I took a look at your R Semantic MediaWiki Bot, and the README explains roughly how to use it. But do you also have scripts that use Python bots, for example (I know that Pywikipediabot comes with many Python bots)? I thought that looking at a script that is already running would be helpful. I suppose that if I gave you a script, you could check/correct it and run it.

Also, since last time I have taught myself the Google Refine expression language and managed to modify the dates the way you did. I wanted to edit the operator name so that it corresponds to the owner name, but the two entries are in different CSV files... Should that be fixed directly in the scraper, or is there something I can do directly in Google Refine? I couldn't work that out.

--Raph (talk) 11:03, 26 February 2013 (CET)


Hi Raph,

If you want to work with template data, I'm not sure whether Pywikipediabot has any functions that let you (easily) work with templates out of the box. My experience from a few years ago is that it lets you extract the text from a page; you then have to do some string operations on that text (replacements, concatenations, etc.) to include the template, and finally write the new text back to the page.

A working example for RSemanticMediaWikiBot is this:

#TODO change working directory to location of bot library code
setwd("/dir/to/this/code")
source("bot.R") 

#TODO fill in user credentials
username="USERNAME"
password="PASSWORD"
apiURL = "http://enipedia.tudelft.nl/enipedia/api.php"

bot = initializeBot(apiURL) #initialize the bot
login(username, password, bot) #login to the wiki

edit(title="User:ChrisDavis/BotTestPage", 
     text="this is the new page text", 
     bot, 
     summary="sample edit summary")

With RSemanticMediaWikiBot, I've also got some code (not committed yet) that can read data from a CSV file and then write this data to templates on specific pages. The main idea is that the first column lists the page name, the second is the name of the template, and the rest of the columns list the template parameters (in the headers) and their values. If you could put together a CSV file like this, it's not that hard for me to import it into the wiki, and also update the code & documentation for the bot.
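As a rough illustration of the layout (the page names, template name, and parameter columns below are just placeholders made up for the example), such a file could be put together in R like this:

#hypothetical example of the CSV layout: page name, template name, then one column per template parameter
csv_data = data.frame(page_name = c("Example Field A", "Example Field B"),
                      template = c("Some Template", "Some Template"),
                      operator = c("Operator A", "Operator B"),
                      water_depth = c("100 m", "85 m"),
                      stringsAsFactors = FALSE)
write.csv(csv_data, "template_data.csv", row.names = FALSE)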

For the owner & operator names, one approach is to copy the values in the owner/operator columns from each CSV and paste them all into a new single-column CSV. You can then load this into Google Refine, make a copy of this column, and run "Edit cells->Cluster and edit" on the new column. This way you preserve the original values in one column while having a second column with the preferred version of the name. This can then be used to create a lookup table or dictionary of preferred terms (like this) that can be used within the scraper or within some later data processing step.
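As a sketch of that last step (the file and column names here are made up for the example), the lookup table exported from Google Refine could be applied to the raw data in R like this:

#lookup table with one column of original owner/operator names and one with the preferred (clustered) version
lookup = read.csv("preferred_names.csv", stringsAsFactors = FALSE)
preferred = setNames(lookup$preferred, lookup$original)

fields = read.csv("fields.csv", stringsAsFactors = FALSE)
#replace each operator name with its preferred form, leaving names with no match as they are
cleaned = preferred[fields$operator]
fields$operator = ifelse(is.na(cleaned), fields$operator, cleaned)
write.csv(fields, "fields_cleaned.csv", row.names = FALSE)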

--ChrisDavis (talk) 20:45, 26 February 2013 (CET)

Ok, thanks for the tip. I had never used CSV files before, and I just discovered that you can import one to see the data in a tabular format (and not as a text file), which is much easier if you want to add columns, etc.
I have two files with the right columns (page_name, template, and then the corresponding template's parameters). In the first one there is a trick needed when editing the data, though: for the "water_depth" parameter, the depth_unit value is not part of the template but should, I suppose, be appended to the water_depth value (see the sketch below).
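Concretely, I mean something like this in R (the file names are just placeholders):

#append the unit to the water depth before import, then drop the now-redundant column
fields = read.csv("decc_approvals.csv", stringsAsFactors = FALSE)
fields$water_depth = paste(fields$water_depth, fields$depth_unit)
fields$depth_unit = NULL
write.csv(fields, "decc_approvals_for_import.csv", row.names = FALSE)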
In the second file, the "ownercompany" and "share" parameters are used with the #arraymap function, so new values need to be added to the existing parameter value without replacing the previous ones.
Tell me if I should do some more work on them to make them easier to import into the wiki.
I'll put the files on my Git account and then give you the link; I can't do it now, as my lab PC is on Windows and I'm not an admin, so I don't have the right tools.
--Raph (talk) 11:05, 27 February 2013 (CET)
For the DECC data, I would start by trying to work with the data currently in the triplestore. The main reason is that if you look at the data for a single field, there's quite a lot of data in there. If we don't expect to actually edit the original data that much, but we expect some outside data provider (i.e. DECC) to update it on a regular basis, then it's much easier and more efficient to keep the data in the triplestore (and set up scripts to make the conversion process to RDF easier) than to run bots over the wiki. With this strategy, the wiki page for a field can then be used to add additional documentation/reference materials about the field, and it can contain information about the identifiers used for it in other data sets. I'll make an example template and we can go from there and see how things work. Also, if you build up a list of the types of things you want to do with the data, we can explore where this approach would break and figure out different strategies to get around issues that arise.
--ChrisDavis (talk) 16:09, 28 February 2013 (CET)
Ok, I see what you mean. So I was thinking of writing something like this (a simple and rather useless example) to try using data in the triplestore. Would that work? I'm sure I'm missing something. I took my inspiration from Template:Display EUETS Emissions Sparkline.
<noinclude>
Example usage:
{{Display DECC Data|DECC_name=AFFLECK}}
The DECC_name parameter refers to the '''rdfs:label''' property.
</noinclude>
<includeonly>{{#sparql:
PREFIX deccprop: <http://enipedia.tudelft.nl/data/DECC_UK_PPRS/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?name ?operator where {
?installation rdfs:label ?name . 
FILTER(?name = "{{{DECC_name}}}") .
?installation deccprop:operator ?operator . 
}
}}</includeonly>
or this maybe:
<noinclude>
Example usage:
{{Display DECC Data|DECC_name=AFFLECK}}
The DECC_name parameter refers to the '''deccprop:name''' property.
</noinclude>
<includeonly>{{#sparql:
PREFIX deccprop: <http://enipedia.tudelft.nl/data/DECC_UK_PPRS/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?name ?operator where {
?installation rdfs:label ?name . 
?installation deccprop:name ?decc_name .
FILTER(?decc_name = "{{{DECC_name}}}") .
?installation deccprop:operator ?operator . 
}
}}</includeonly>
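To check whether a query like this returns anything at all, I suppose it could also be run from R first, outside the wiki. A rough sketch, assuming the SPARQL package and that the public endpoint is at http://enipedia.tudelft.nl/sparql (I'm not sure that's the right URL):

library(SPARQL) #CRAN package for querying SPARQL endpoints

endpoint = "http://enipedia.tudelft.nl/sparql" #assumed endpoint URL
query = '
PREFIX deccprop: <http://enipedia.tudelft.nl/data/DECC_UK_PPRS/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name ?operator WHERE {
?installation rdfs:label ?name .
FILTER(?name = "AFFLECK") .
?installation deccprop:operator ?operator .
}'
#run the query and look at the result table
result = SPARQL(endpoint, query)$results
print(result)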
I can also wait for the template you said you'd set up.
--Raph (talk) 20:02, 28 February 2013 (CET)
Take a look at User:ChrisDavis/TestField - this sketches out the basic idea using Template:Display_DECC_Data and Template:Display_DECC_Field_Meta_Data. The first query gets some of the metadata about the field, specifically things for which there is only a single fact. The results of the query are then passed to a template that redisplays this data in some other format (e.g. a bulleted list). The second query gets the production data and passes it to a sparkline visualization. I'm not sure that the sparkline visualization is the best idea since it runs off the page, but it at least gives a quick overview of the production trends.
Assuming that DECC doesn't change the names of the fields over time (and assuming that there's not some other unique identifier available), the next step would be to run a bot to create pages based on the names of the fields and populate these pages with template calls.
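A rough sketch of that step, reusing the edit() function from the bot example above, might look like this (the CSV file name and the DECC_name parameter are assumptions here; the actual templates may end up taking different parameters):

#create one wiki page per DECC field and fill it with the template calls
field_names = read.csv("decc_field_names.csv", stringsAsFactors = FALSE)$name
for (field in field_names) {
  page_text = paste0("{{Display DECC Field Meta Data|DECC_name=", field, "}}\n",
                     "{{Display DECC Data|DECC_name=", field, "}}")
  edit(title = field,
       text = page_text,
       bot,
       summary = "creating field page with DECC template calls")
}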
--ChrisDavis (talk) 20:39, 28 February 2013 (CET)
Thanks Chris for bootstrapping this so quickly.
Having used DECC production data for several years, I can confirm that field names do not change in that particular dataset (though new ones are regularly added as new fields come into production). But it's a pity they do not use unique identifiers, since field names differ between different kinds of publications and there is actually no such thing as a "DECC name". For example, in the approval file first brought up by Raph, a number of fields are listed under a (slightly) different name than in the production data. So, suppose you are interested in the time elapsed between discovery date, approval, and production start: you can't easily link them. It could also be useful to leverage the geographical data available here to locate each field on a map, but that uses yet another (slightly) different name for some fields... So I fear we will miss the most powerful aspect of Enipedia, linking different datasets together, unless you come up with some powerful reconciliation tool.
--Nono (talk) 22:48, 28 February 2013 (CET)
I'd be glad to work on the reconciliation issue. Data reconciliation is really a fundamental bottleneck with efforts around open energy data, and the DECC data gives us some more test cases for figuring out how to improve the current reconciliation code.
--ChrisDavis (talk) 09:36, 1 March 2013 (CET)
Thank you both, things are much clearer now. It's easier to understand when you follow things while they are being done than to look at a whole body of work after it is finished (like everything you did with the power plants, for example).

Journal Libre

I wasn't sure whether what I proposed was clearly understood, since Cosmin (one of my supervisors, with whom I'm writing the article that I'd like to see published freely) did not get it at first (or has not yet). So I produced a graph as an explanation.--RP87 (talk) 14:45, 3 November 2015 (CET)
