Elasticsearch on Enipedia


A working prototype is here and the source code is available on GitHub


What?

A big challenge of working with data is bringing together all of the diverse sources that describe the same entities. While there are many open data sets about the power industry, each usually covers only certain aspects, such as fuel types, generation, emissions, or coordinates. These data sets often do not reuse identifiers, so bringing this information together is quite a challenge, as has been documented here. We've made decent progress, as shown by the stats at Category:External_Data_Properties, but it would be great if this process were much easier.

Query languages such as SPARQL are great for working with structured data, but less useful for the unstructured (or semi-structured) data that is commonly encountered. A search server such as Elasticsearch or Solr seems to provide a way forward for dealing with messy data. Solr hasn't been evaluated yet, but as described below, Elasticsearch has some features that seem quite promising.

Elasticsearch

Interesting features:

  • Very easy setup (especially on Linux)
  • Very responsive (results in tens of milliseconds)
  • Data can be easily segregated into different databases
  • The URLs needed to run queries can be easily constructed.
  • Queries can be run over all databases, or a subset of them. This could be easily coupled with checkboxes on a webpage. This means that you could run a query that basically says "find me any source that is probably talking about this power plant".
  • Results are returned in json, with scores included
  • Different types of matching can be done on the different fields (also geographic queries)
  • Blocking can be employed in the queries (i.e. search through all databases, but only entries within a particular country). In other words, the types of queries run could be unstructured or structured to various extents.
  • Fuzzy Like This fuzzy search function.
  • Data from the SPARQL endpoint could be loaded in using JSON-LD. It seems that some of the data may have to be "flattened", as this doesn't seem to deal with intermediate data objects.
  • The flexibility of the data format means that you can upload data without having to first deal with disambiguation or trying to structure it. As long as it's findable, it's already useful. We'll probably end up standardizing on certain types of commonly encountered data fields (coordinates, fuel type, etc.), but this doesn't have to be done up front.
  • This isn't just about people running single queries. This could be part of an automated system that highlights likely matches among all data sets. This could guide people through the "low hanging fruit" of entries that are quite likely the same.
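To make the features above concrete, the sketch below builds the kind of search URL used elsewhere on this page: a common terms query run over a chosen subset of the data sets. The helper function is illustrative, and index names other than industryabout are assumptions.

```python
import json
from urllib.parse import quote

# Build a GET URL that searches a chosen subset of the Enipedia indices.
# An empty list falls back to Elasticsearch's "_all" convention.
def build_search_url(indices, text, size=5):
    query = {
        "from": 0,
        "size": size,
        # Common terms query: down-weights very frequent words like "power".
        "query": {"common": {"_all": {"query": text,
                                      "cutoff_frequency": 0.001}}},
    }
    index_part = ",".join(indices) if indices else "_all"
    return ("http://enipedia.tudelft.nl/search/%s/_search?source=%s"
            % (index_part, quote(json.dumps(query))))

# "Find me any source that is probably talking about this power plant":
url = build_search_url(["industryabout", "osm"], "Maasvlakte power plant")
```

Checkboxes on a webpage could simply feed the list of selected data sets into the `indices` argument.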

Development Ideas

  • (Read only) search API is available at http://enipedia.tudelft.nl/search
  • Have a different icon for each website, and if possible make it so the icon becomes highlighted when you hover the mouse over the listing.
  • When you click on a search result name, it should automatically open in a new tab.
  • Need to deal with special characters - Bełchatów != Belchatow
  • Investigate "snowball effect" for linked data - the more linkages that exist for an entity, the more information we have about it, which may make more linkages easier to find since there are more terms to match on (i.e. the entity has a better "fingerprint" or collection of terms which distinguishes it from other entities).
  • Results from one data set can get in the way of good results from another data set.
    • A test case is "E.On Maasvlakte". This quite clearly shows up in the LCPD, but does not rank high enough against other data sets to appear under the default settings.
    • I tend to use it one dataset at a time and usually have to remove the more obvious words (power etc) to get sensible results.
      • Incorporating a cutoff frequency should allow for removal of common words, and is currently enabled by default.
  • Highlighting would help to show which terms are showing up in the matches.
  • Search within bounding box of current map view
    • A geo_point mapping needs to be specified to get this to work
    • Searching by polygon could be used for countries, states, etc.
  • Some way to download results? A challenge is that the results would be a bit messy since there would likely be quite a few column headers that are more or less describing the same things.
  • The code here shows some initial tests to use Elasticsearch for matching across data sets. The work at elasticsearch-entity-resolution might also be useful to investigate.
    • This attempts to match all of the entities in one data set against all the entities in another using both common terms queries and fuzzy search queries.
    • This could be used to facilitate "Loosely Coupled" Open Data where likely connections are indicated, and some sort of interface would allow for people to verify or vote on which links are actual links.
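The special-character point above (Bełchatów != Belchatow) can be handled by ASCII folding before querying (Elasticsearch also ships an asciifolding token filter for this). A minimal Python sketch follows; the explicit fold table is needed because some letters, like the Polish "ł", have no Unicode decomposition and would otherwise be dropped.

```python
import unicodedata

# Letters with no Unicode decomposition must be mapped by hand;
# NFKD alone would silently drop them.
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O"})

def ascii_fold(text):
    text = text.translate(EXTRA_FOLDS)
    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("Bełchatów"))  # Belchatow
```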

Data sets to include

  • Carma v2, v3 - included
  • Enipedia
  • EU ETS (included, 2013 data, not on SPARQL endpoint yet)
    • The identifiers don't correspond with those in use on the SPARQL endpoint.
  • E-PRTR (included, need to fix links), EPER
  • GlobalEnergyObservatory
  • IndustryAbout - included, need to incorporate latest updates
  • LCPD - Large Combustion Plant Directive (included)
  • IAEA
  • Wikimapia
  • OpenStreetMap - included
    • updated nightly from the power data extracted here
    • This currently is all the data tagged as "power=generator". This should be expanded to have "power=*", as "power=plant" is also used to tag power plants, without them being indicated as a generator.
    • There's an issue where the user name of the person who made the edit on OSM comes up as a match for power plant names.
      • This could be a bug or a feature depending on the use case. As an example, with a bit more development work, it would be possible to run a query for all the osm edits by a particular user in a geographic region. On the other hand, Elasticsearch allows for querying data in particular fields, which could be quite useful to exclude fields that are not useful to the type of query being conducted. This should be integrated somehow into the current interface. The current default setup is that all fields are searched, which is helpful as all the different data sources don't use the same name for their data fields, and it's somewhat difficult to pre-state which fields are useful to search and which are not.
  • Wikipedia - included (category traversal on the English Wikipedia)
    • using category traversal starting at Category:Power_stations_by_country
    • would be useful to also grab different language versions
    • List pages such as List of power stations in France can be indexed per row, so that searches don't return the whole table, but just the entry that we are interested in.
      • The column names need to be cleaned up, "Name 1" shows up frequently.
    • published & unpublished code (mostly) exists for this already
    • There's some work done on indexing Wikipedia pages with Elasticsearch, but it's not clear if this can be modified to only process a subset of pages and incorporate updates of those pages.
  • Sourcewatch
  • Clean Development Registry / Joint Implementation
    • they don't have coordinate locations, but it's still useful to have the link, as you can find the location and everything else in the PDF on their site
      • Coordinates are in many of the PDFs, but it might be a challenge to consistently parse them. Also, some projects involve multiple sites (such as biogas production on multiple farms), so a collection of points should be gathered in addition to calculating a centroid as the average of all the points.
    • The spreadsheets here are easy to parse, but they don't contain a direct link back to the entry on the UNFCCC website. For example, for the Rio Blanco Small Hydroelectric Project, the identifier is "28" and the validator "DNV", which can be seen in the URL http://cdm.unfccc.int/Projects/DB/DNV-CUK1101980215.28/view, but it's not clear how to generate the entire URL from the data in the spreadsheet. Some combination of data from the UNFCCC website and the spreadsheets would have to be used.
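For list pages like the Wikipedia tables mentioned above, per-row indexing can be sketched with Elasticsearch's newline-delimited bulk format, so that a search returns the matching row rather than the whole table. The index name, document type, and row fields below are illustrative.

```python
import json

# Two made-up rows standing in for a parsed "List of power stations" table.
rows = [
    {"name": "Example Plant A", "country": "France", "fuel": "Nuclear"},
    {"name": "Example Plant B", "country": "France", "fuel": "Hydro"},
]

def bulk_lines(index, doc_type, rows):
    """Serialize rows into Elasticsearch's bulk format: an action line
    followed by the document itself, one pair per row."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"

# POST this payload to /_bulk to index each table row as its own document.
payload = bulk_lines("wikipedia", "table_row", rows)
```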

Advanced

This can be coupled with OpenRefine (formerly Google Refine). It usually helps to have as much text as possible to match on, especially if there are multiple facilities with similar names. In the example below, a column named "ColumnWithAllTextToMatch" is created. Then the option "Add column by fetching URLs" is selected, with the URL below. URL escaping must be done, since spaces are otherwise removed (behind the scenes) from the URL that is actually sent to the server. This procedure is similar to that for retrieving geographic coordinates from place names via the Google Maps API with OpenRefine.

The example below only looks at data from IndustryAbout. A list of all the data sets that can be queried can be found here. To query all sources at once, leave the data set name out of the path, so the URL begins like:

http://enipedia.tudelft.nl/search/_search?source=

To search over a list of specific data sources, put their names in the path as a comma-separated list (standard Elasticsearch multi-index syntax), using the data set names from the list above, e.g.:

http://enipedia.tudelft.nl/search/industryabout,osm/_search?source=

The query below will get only the highest scoring result, using a common terms query that minimizes the importance of frequently used words that may not contribute to matching.

'http://enipedia.tudelft.nl/search/industryabout/_search?source={"from":0,"size":"1","query":{"common":{"_all":{"query":"' + escape(cells["ColumnWithAllTextToMatch"].value, 'url') + '","cutoff_frequency":0.001}}}}'

A fuzzy search can be done via the code below. This handles alternative spellings, etc. much better.

'http://enipedia.tudelft.nl/search/industryabout/_search?source={"from":0,"size":"1","query":{"fuzzy_like_this":{"like_text":"' + escape(cells["ColumnWithAllTextToMatch"].value, 'url') + '"}}}'

Not everything will be matched, and matches with low scores will also be returned. To extract the scores so that you can filter on them, you need to "Add Column Based on this Column", pick a name for the new column, and copy-paste the code below:

with(value.parseJson().hits.max_score, x, x)

The "location" field is standard across all the data sets and is used to encode the coordinates for use in geographic queries.

with(value.parseJson().hits.hits[0]._source.location, x, x)
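Outside OpenRefine, the same extraction and score filtering can be sketched in Python against a response of the shape Elasticsearch returns. The sample response and the score cutoff below are made up for illustration.

```python
import json

# A made-up response with the structure Elasticsearch sends back.
raw = """
{"hits": {"max_score": 2.7,
          "hits": [{"_score": 2.7,
                    "_source": {"name": "Example Plant",
                                "location": "51.96,4.02"}}]}}
"""
response = json.loads(raw)

max_score = response["hits"]["max_score"]
top_source = response["hits"]["hits"][0]["_source"]

# Drop weak matches, as suggested above; 1.0 is an arbitrary cutoff.
match = top_source if max_score >= 1.0 else None
```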

Fields specific to data sets

Data fields for the candidate matches can be extracted into columns, although the exact syntax will depend on the data source that is being used. Below are some examples from a few specific data sets.

IndustryAbout


with(value.parseJson().hits.hits[0]._source.url, x, x)


with(value.parseJson().hits.hits[0]._source.name, x, x)

Carma v3


with(value.parseJson().hits.hits[0]._source.name, x, x)


with(value.parseJson().hits.hits[0]._source.company, x, x)

Customizing Queries

You can also retrieve more than just the top hit by changing the URL used for the queries. To do this, change this part of the query:

"size":"1"

To something like this:

"size":"10"

You can then separate out the results into multiple columns by using a different index besides 0 in the "hits.hits[0]" code above.

You can follow the Elasticsearch Query DSL documentation to create custom queries. The Elasticsearch documentation shows how to configure queries using JSON, which can be added directly to the URL by appending "?source=" as shown above.
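As one example of a custom query, the sketch below combines a free-text match with a geo bounding box, which would support the "search within bounding box of current map view" idea mentioned earlier; the "location" field follows the convention noted above. The exact nesting varies between Elasticsearch versions; this follows the 1.x "filtered" form, and the coordinates are arbitrary.

```python
import json
from urllib.parse import quote

# Match on free text, but only within a lat/lon box (e.g. the current map view).
query = {
    "from": 0,
    "size": 10,
    "query": {"filtered": {
        "query": {"match": {"_all": "Maasvlakte"}},
        "filter": {"geo_bounding_box": {"location": {
            "top_left":     {"lat": 52.1, "lon": 3.9},
            "bottom_right": {"lat": 51.8, "lon": 4.3},
        }}},
    }},
}

# The JSON goes straight into the URL via "?source=", as described above.
url = "http://enipedia.tudelft.nl/search/_search?source=" + quote(json.dumps(query))
```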
