EPRTR to RDF

From Enipedia
Jump to: navigation, search

[edit] What?

This is about our efforts to convert the European Pollutant Release and Transfer Register to an RDF format for use within Enipedia.

Note: this page is kept for reference as the code here may be useful for other applications. When Enipedia was started, the E-PRTR was not available in RDF, so we converted it ourselves from a MS Access database. Currently it is published as RDF by the European Environment Agency, and is also available via the SPARQL endpoint here. More information about their conversion process can be found here.

[edit] How? (using Ubuntu)

mdbtools needs to be installed so that you can convert the MS Access database to something else:

sudo apt-get install mdbtools

You can download the data here

Once you unzip the file, you can use the script below to convert the database into individual tab-delimited tables. The script is actually rather simple, and most of it is just comments.

#!/bin/bash
#make a directory for the csv files
mkdir eprtrTables

#get a list of tables in the database
tables=`mdb-tables Public_Database_14-10-2010.mdb`

#make a tab-delimited text file for all tables
for i in $tables
do
        #dump out the table in a text format with tab delimiters.
        #make sure a tab character is inserted for the -d flag below
        #it is not advisable to use comma as a delimiter since there are
        #some commas used within fields, and this may lead to parsing confusion.
        ### mdb-export -d ' ' Public_Database_14-10-2010.mdb $i \

        #convert from UTF-8 to US-ASCII
        #some of UTF-8 characters cause problems with xlwrap.  The issue motivating this conversion
        #related to something that should be a string, but xlwrap complained about wanting to convert it 
        #into a URL which should be in US-ASCII.
        ### | iconv -c -f UTF-8 -t US-ASCII \

        #convert \r\n to spaces.  This indicates a newline character in a data field, which confuses xlwrap
        ### | sed ':a;N;$!ba;s/\r\n/ /g' > \

        #strip out quotes, these will be escaped in RDF
        ### | sed 's/"//g' >

        # escape spaces if they are in any file names
        escapedTableName=`echo $i | sed 's/ /\\ /g'`

        # replace spaces with underscores in the filename that will receive the table contents
        filename=`echo $i | sed 's/ /_/g'`

        mdb-export -d '\t' Public_Database_14-10-2010.mdb $escapedTableName \
                | iconv -c -f UTF-8 -t US-ASCII \
                | sed ':a;N;$!ba;s/\r\n/ /g' \
                | sed 's/"//g' > \
                ./eprtrTables/$filename.csv
done

From here xlwrap can be used to convert the csv files into RDF. Previously we used D2RServer for this dataset, although this is a bit more complicated since you need to have a database server already running. Leaving the data in csv makes things a little more foolproof with fewer dependencies.

Personal tools
Namespaces

Variants
Actions
Navigation
Portals
Advanced
Toolbox