Initial Test Results Using Self Information Calculations for Entity Matching
See Enipedia Power Plant Dataset Reconciliation API for background on the work shown here. The table below shows the output of a matching algorithm that aims to link the EU ETS data for France with the corresponding entries on Enipedia. The tokens for the EU ETS and Enipedia entries are based on the terms found in the power plant name and in the owner name. Terms in bold are those found in both sets of tokens. The self-information of each matching token is calculated and then summed. The algorithm is based on finding tokens that two entities have in common, and no checks are done (yet) to see whether some of the tokens may be misspelled. There are a few entries in the data below that can be used to test this (sochaux & sochau, aeoroports & aeroports). Some of these may be instances of characters like "é" being converted to "eo". Levenshtein distance could be used to treat tokens as shared when they are close enough.
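The scoring described above can be sketched as follows. This is a minimal Python illustration (the actual implementation appears to be in R, and the corpus below is made up): self-information is -log2 of a token's relative frequency in the corpus, summed over the tokens two entities share, so rare shared tokens contribute far more to a match than common ones.

```python
import math
from collections import Counter

def self_information_score(tokens_a, tokens_b, token_counts, total_tokens):
    """Sum the self-information (-log2 of corpus frequency) of the
    tokens the two entities share; rare shared tokens count for more."""
    shared = set(tokens_a) & set(tokens_b)
    return sum(-math.log2(token_counts[t] / total_tokens) for t in shared)

# Tiny made-up corpus of tokenized plant/owner names
corpus = [
    ["edf", "centrale", "de", "martigues"],
    ["edf", "centrale", "de", "cordemais"],
    ["total", "raffinerie", "de", "normandie"],
]
token_counts = Counter(t for name in corpus for t in name)
total_tokens = sum(token_counts.values())

# Shared tokens: edf, centrale, martigues; 'martigues' (rare)
# contributes more than 'edf' or 'centrale' (more common)
score = self_information_score(
    ["edf", "centrale", "de", "martigues"],
    ["centrale", "edf", "martigues"],
    token_counts, total_tokens)
```

The ranking behaviour falls out of the weighting: a match on a distinctive place name outweighs a match on a filler word like "de" or "centrale".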
The matching was performed by finding the top-scoring match on Enipedia for each entry in the EU ETS data for France. As a result, alternative candidates with lower scores are currently not shown.
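Near-miss tokens such as sochaux/sochau could be caught by treating tokens as shared when their edit distance is small. A hypothetical Python sketch (the threshold of 1 is a guess; it would need tuning against real cases):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def tokens_match(a, b, max_dist=1):
    """Treat two tokens as shared when they are within max_dist edits."""
    return levenshtein(a, b) <= max_dist

tokens_match("sochaux", "sochau")  # True: one deletion apart
```

A caveat: a loose threshold would also merge genuinely distinct short tokens, so the allowed distance should probably scale with token length.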
A few issues pop up when inspecting the results table. Numbers are included in the matching tokens, but they are not a reliable indicator of whether two entities are the same. They should probably only be included during a second matching phase, in which the plant name is examined to determine whether we are looking at a specific unit of a plant. Also, during tokenization, punctuation characters like '/. should first be replaced by spaces rather than the empty string, to avoid unwanted word concatenation (etdarcis, socarpapeterie). The original intent behind this strategy was to convert terms such as B.V. to BV, although it clearly has side effects. Abbreviations need to be handled better, since the same one can appear in multiple forms ("BV", "B.V.", "B. V.", "B V"). Entities containing punctuation like '/.- should be located to check that the preprocessing of the strings before matching does not cause any issues. The RUnit package seems promising for collecting cases such as these in order to perform unit testing on the code.
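The punctuation fix above amounts to a one-character change in the tokenizer. A hypothetical Python sketch (the project's own code appears to be in R; the character class here is only an example set of punctuation):

```python
import re

def tokenize(name):
    """Lowercase, replace punctuation with spaces (not the empty
    string), then split on whitespace. Replacing with spaces keeps
    names like "ET D'ARCIS" from collapsing into 'etdarcis'."""
    name = re.sub(r"[.,'/\-]", " ", name.lower())
    return [t for t in name.split() if t]

tokenize("Papeterie ET D'ARCIS")  # -> ['papeterie', 'et', 'd', 'arcis']
tokenize("B.V.")                  # -> ['b', 'v']
```

Note the trade-off visible in the second call: with spaces, "B.V." tokenizes to ['b', 'v'] rather than 'bv', which is why abbreviation normalization still needs its own separate step.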
It would also be useful to be able to expand abbreviations such as EDF into Électricité de France, and vice versa. The use of stop words could also be improved, but this needs careful checking, and further test cases such as these should be developed to make sure that fixes to the matching algorithm don't cause problems elsewhere. For some of the stop words, it may be necessary to limit their use to particular languages, i.e. to apply them only to power plants in particular countries.
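Both ideas, abbreviation expansion and language-restricted stop words, could be combined in one normalization pass. A hypothetical Python sketch (the lookup tables are toy examples; real ones would have to be curated per country, and accents are assumed stripped earlier):

```python
# Toy lookup tables for illustration only
ABBREVIATIONS = {"edf": ["electricite", "de", "france"]}
STOP_WORDS = {"fr": {"de", "la", "le", "du"}}

def normalize(tokens, lang="fr"):
    """Expand known abbreviations, then drop stop words for the
    given language, so 'EDF' and 'Electricite de France' end up
    with the same token set."""
    expanded = []
    for t in tokens:
        expanded.extend(ABBREVIATIONS.get(t, [t]))
    stops = STOP_WORDS.get(lang, set())
    return [t for t in expanded if t not in stops]

normalize(["edf"])  # -> ['electricite', 'france']
```

Keying the stop-word set on language means "de" is only dropped for French plants, which addresses the concern above about a fix in one country breaking matches in another.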
This could be really helpful in providing suggestions to a human, especially for a dataset as inconsistent as the EU ETS, where looking for matches by hand is both painful and tedious. The scores already give useful insight, but extensive tuning would probably be needed to cut the candidate list off before it degrades into futile matches.