Example Code for Using Self Information Calculations for Entity Matching


This is an example of code developed as part of the Enipedia Power Plant Dataset Reconciliation API. See here for more discussion around these efforts. The current efforts (as of March 2013) are not focused on the original Google Refine implementation, but are aimed at creating generic code that can be used for matching entities between two data sets. The main idea is to first address the fundamental issues in matching entities and then work on making the process more user-friendly (e.g. re-integrating with Google Refine).

The code below is a working example based on the enipedia-openrefine-reconcile code; matching efforts with other data sets (e.g. eGRID to E-PRTR) follow the same basic pattern. Once the enipedia-openrefine-reconcile library is installed, everything should work out of the box. The DataRetrieval.R code shows the functions developed so far for retrieving data from different sources.

The main idea is that you build two "soup" vectors containing the terms that you wish to use to match between the data sets. These soup vectors are passed to the calculateSelfInformationOfIntersectingTokens function, and candidate matches are scored on the assumption that good matches have the most words in common, with matches on rarely occurring words scored higher. After the data is returned from this function, some processing is done to present the results as a spreadsheet that can be visually inspected.
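To make the scoring concrete: the self-information of a token can be taken as -log2(p), where p is the token's relative frequency across the data set, and a candidate pair is then scored by summing this over the tokens the two soups share. The sketch below illustrates this idea with made-up token counts and a hypothetical helper function; the actual implementation is the calculateSelfInformationOfIntersectingTokens function in the enipedia-openrefine-reconcile library.

#Illustrative sketch only (not the library implementation): score a candidate
#pair by summing the self-information, -log2(p), of each token the two soups
#share, where p is the token's relative frequency in the corpus
scoreSharedTokens = function(soup1, soup2, tokenCounts){
  shared = intersect(unlist(strsplit(soup1, " ")), 
                     unlist(strsplit(soup2, " ")))
  sum(-log2(tokenCounts[shared] / sum(tokenCounts)))
}

#toy corpus statistics: "waigaoqiao" is far rarer than "power" or "plant"
tokenCounts = c(power = 500, plant = 480, shanghai = 3, waigaoqiao = 1)
scoreSharedTokens("waigaoqiao power plant shanghai", 
                  "shanghai waigaoqiao power station", tokenCounts)

With these toy counts, the rare tokens "waigaoqiao" and "shanghai" contribute roughly 10 and 8 bits respectively, while "power" contributes barely 1, which is exactly why a match on a distinctive name outweighs a match on a generic word.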

The code also contains some initial efforts at fuzzy string matching, but this does not work very well yet, as too many false positives are returned.
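For reference, base R already provides simple building blocks for approximate matching; the snippet below is only an illustration of the approach (the names and thresholds are invented), not the code used in the library. It also shows why loose edit-distance thresholds tend to let false positives through.

#Illustration only: approximate string matching with base R.
#adist computes Levenshtein edit distances between strings;
#agrep returns candidates within a maximum edit distance.
candidateNames = c("Waigaoqiao Power Station", "Wangqu Power Plant", "Weihe Power Plant")
adist("Waigaoqiau Power Station", candidateNames)
#max.distance below 1 is a fraction of the pattern length; looser values
#admit more false positives, tighter values miss legitimate variants
agrep("Waigaoqiau Power Station", candidateNames, max.distance = 0.1, value = TRUE)

The complete example script follows.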

#never ever ever convert strings to factors
options(stringsAsFactors = FALSE)

library(EnipediaOpenrefineReconcile)
library(sqldf)     # for SQL-style sorting of the results data frame
library(geosphere) # provides distCosine for great-circle distances

#TODO: record which columns went into the soup, along with the columns
#that should be printed out for comparison

#The country name must match exactly in order to retrieve the data, and the same country may be named differently in the two data sets, so check this first
enipediaData = retrieveCountryDataFromEnipedia("China")
geoData = retrievePlantDataFromGlobalEnergyObservatory("China")

enipediaData$soup = normalizeText(paste(enipediaData$CleanedOwnerName, 
                                        enipediaData$CleanedPlantName, 
                                        enipediaData$CleanedStateName, 
                                        enipediaData$city))

geoData$soup = normalizeText(paste(geoData$Location, 
                                   geoData$Name, 
                                   geoData$Name_of_Dam, 
                                   geoData$State, 
                                   geoData$Operating_Company, 
                                   geoData$Owners1))

data = calculateSelfInformationOfIntersectingTokens(enipediaData$soup, geoData$soup)

selfInfoOfIntersections = data$selfInformationOfEntitiesWithIntersectingTokens


matchingDataForSpreadsheet = c()
numResults = 5 #number of top-scoring candidates to keep for each GEO plant
for (colNum in 1:ncol(selfInfoOfIntersections)){
  print(colNum)
  matchingDistance = selfInfoOfIntersections[,colNum]
  if (max(matchingDistance) > 0){
    sortingStats = sort(matchingDistance, decreasing=TRUE, index.return=TRUE)
    #guard against data sets with fewer rows than numResults
    topN = min(numResults, length(matchingDistance))
    locs = sortingStats$ix[1:topN]
    scores = sortingStats$x[1:topN]
    
    #don't include scores that are zero
    if (min(scores) == 0){
      locs = locs[-which(scores == 0)]
      scores = scores[-which(scores == 0)]
    }
    
    #data1 (enipediaData$soup) is indexed by locs - Enipedia
    #data2 (geoData$soup) is indexed by colNum - GEO
    
    #append data together so that it's easier for the humans to verify
    
    candidatesInfo = c()
    
    #figure out which tokens the GEO entry and each Enipedia candidate have in common
    #in the token matrices the columns are the entities and the rows are the tokens,
    #so this needs to be done candidate by candidate
    
    sharedTokens = lapply(locs, 
                          function(x){data$allTokens[intersect(which(data$tokensMatrixData2[,colNum] == 1), 
                                                               which(data$tokensMatrixData1[,x] == 1))]})
    
    matchingTokens = unlist(lapply(sharedTokens, paste, collapse=", "))
    numberOfMatchingTokens = unlist(lapply(sharedTokens, length))

    candidatesInfo = cbind(candidatesInfo, scores) #scores
    candidatesInfo = cbind(candidatesInfo, numberOfMatchingTokens) #numberOfMatchingTokens
    candidatesInfo = cbind(candidatesInfo, matchingTokens) #matchingTokens
    
    candidatesInfo = cbind(candidatesInfo, geoData$Name[colNum]) # GEO
    candidatesInfo = cbind(candidatesInfo, enipediaData$name[locs]) # Enipedia

    candidatesInfo = cbind(candidatesInfo, geoData$GEO_Assigned_Identification_Number[colNum]) # GEO
    candidatesInfo = cbind(candidatesInfo, enipediaData$x[locs]) # Enipedia
    
    #calculate the great-circle distance (in meters) between the GEO plant and each Enipedia candidate
    distance = distCosine(cbind(geoData$Longitude_Start[colNum], 
                                geoData$Latitude_Start[colNum]),     
                          cbind(enipediaData$lon[locs], 
                                enipediaData$lat[locs]))
    
    candidatesInfo = cbind(candidatesInfo, distance) # geo_distance_m_between_candidates

    candidatesInfo = cbind(candidatesInfo, geoData$State[colNum]) # GEO
    candidatesInfo = cbind(candidatesInfo, enipediaData$CleanedStateName[locs]) # Enipedia
    
    candidatesInfo = cbind(candidatesInfo, geoData$Location[colNum]) # GEO
    candidatesInfo = cbind(candidatesInfo, enipediaData$city[locs]) # Enipedia
    
    candidatesInfo = cbind(candidatesInfo, geoData$Operating_Company[colNum]) # GEO
    candidatesInfo = cbind(candidatesInfo, enipediaData$CleanedOwnerName[locs]) # Enipedia
    
    matchingDataForSpreadsheet = rbind(matchingDataForSpreadsheet, candidatesInfo)
    
  } else { # if max score is zero, then we have no obvious matches
    print("no matches found")
  }
}

colnames(matchingDataForSpreadsheet) = c("scores", "numMatchingTokens", "matchingTokens", "geoName", "eniName", "geoID", "eniID", "dist", "geoState", "eniState", "geoCity", "eniCity", "geoOwner", "eniOwner")
matchingDataForSpreadsheet = as.data.frame(matchingDataForSpreadsheet)

#make sure that these are numeric so that the sorting works properly (otherwise they are sorted as text)
matchingDataForSpreadsheet$scores = as.numeric(matchingDataForSpreadsheet$scores)
matchingDataForSpreadsheet$numMatchingTokens = as.numeric(matchingDataForSpreadsheet$numMatchingTokens)
matchingDataForSpreadsheet = sqldf("select * from matchingDataForSpreadsheet order by geoID, scores DESC, numMatchingTokens DESC")

write.csv(matchingDataForSpreadsheet, file = "matchingDataForSpreadsheet.csv", row.names=FALSE)