The Linkable Open Data Environment (LODE): Supporting Geospatial Analysis with Open Data
Over the past two years, Statistics Canada’s Data Exploration and Integration Lab (DEIL) has launched and developed the Linkable Open Data Environment (LODE), an exploratory initiative that aims to enhance the use and harmonization of open microdata, primarily from municipal, provincial and federal sources.
The results are a collection of datasets released under a single open data license (Open Government Licence – Canada), together with scripts for processing and visualizing the data, which are publicly available as open source tools to encourage collaboration with the open source community. Because all the data consist of individual records (i.e., microdata) tied to a geographic location, LODE datasets are georeferenced and can support a variety of geospatial analyses and visualizations! For example, you can download the health care facilities database for analyses related to COVID-19.
Open data are generally defined as structured data that are machine-readable and can be freely shared, used, and built on without restrictions (Canada Open data portal). Statistics Canada is a major producer of open data; it primarily releases aggregated datasets to protect the confidentiality of personal and sensitive statistical information and to comply with the requirements of the Statistics Act, under whose authority most Statistics Canada data are collected.
In recent years, however, a multitude of private and public entities have started producing and releasing microdata, such as civic addresses, building locations, business licenses, and public transit stops and schedules. These data are mainly in the public domain and are generally considered non-personal and non-sensitive information. The LODE focuses specifically on this type of data and on how Statistics Canada can further expand the open data ecosystem.
The LODE initiative has four pillars. First, the open databases on selected thematic subjects. Second, the open source tools developed to process and link these databases. Third, collaborations on the development and use of open data and open tools. Fourth, the Linkable Open Data Environment Viewer, an interactive map that displays the georeferenced records in the open databases.
Open databases are the core component of the LODE. Currently there are 3 released open datasets:
- The Open Database of Healthcare Facilities (ODHF), which contains the names, addresses and geo-coordinates of healthcare facilities across Canada. Facilities are classified by type. The current version (version 1.1) contains approximately 7,000 records compiled from open data sources, publicly available data, and data directly provided by sources for inclusion as open data.
- The Open Database of Educational Facilities (ODEF), which contains addresses of educational facilities across Canada. In its current version it contains over 20,000 records compiled from both open sources and publicly available data (with permission from the data owners).
- The Open Database of Buildings (ODB) version 2.0, which contains approximately 4.4 million building footprints compiled from 65 sources.
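Because these records are georeferenced, even a few lines of code can support a simple spatial query. The following is a minimal sketch, not an official example: it assumes a CSV extract with hypothetical column names (facility_name, facility_type, latitude, longitude) that may not match the published schema, and the three rows are placeholders rather than real records. It keeps only the rows that have coordinates and finds the facility nearest to a given point using the haversine great-circle distance.

```python
import csv
import io
import math

# Placeholder rows for illustration only. The column names are assumptions
# and may differ from the published ODHF schema.
SAMPLE_CSV = """facility_name,facility_type,latitude,longitude
Facility A,Hospital,45.40,-75.70
Facility B,Clinic,45.50,-73.57
Facility C,Clinic,,
"""


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def load_georeferenced(text):
    """Parse CSV text and keep only the rows with usable coordinates."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            row["latitude"] = float(row["latitude"])
            row["longitude"] = float(row["longitude"])
        except (TypeError, ValueError):
            continue  # skip records without valid coordinates
        records.append(row)
    return records


def nearest_facility(records, lat, lon):
    """Return the record closest to the point (lat, lon)."""
    return min(
        records,
        key=lambda r: haversine_km(lat, lon, r["latitude"], r["longitude"]),
    )


facilities = load_georeferenced(SAMPLE_CSV)
closest = nearest_facility(facilities, 45.42, -75.69)
print(closest["facility_name"])
```

For a real analysis, the sample text would be replaced by the downloaded file, and the column names adjusted to whatever the release actually uses.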
There are also several datasets in open development. Most of the source datasets used to build them are listed in the LODE GitHub repository. You can contribute to the development of these open databases by reporting additional publicly available databases that should be considered for inclusion! For instance, given the wide demand for a dataset of civic addresses within Canada, DEIL is collaborating with OpenAddresses, a not-for-profit organization that compiles worldwide open address data. The provinces and municipalities of Canada have already released approximately 11 million civic addresses as open or public data, which OpenAddresses has identified. By collaborating with municipalities and OpenAddresses, a more comprehensive open database of civic addresses can be built. Furthermore, an open civic address database can be linked to the Open Database of Buildings, so that datasets containing civic addresses can be joined to building footprints for spatial analysis and visualization!
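One way to picture the link between civic addresses and building footprints is a point-in-polygon test: an address point is matched to the footprint that contains it. The sketch below is only an illustration under stated assumptions, not the method used by DEIL. The identifiers and coordinates are hypothetical, footprints are simple polygons given as lists of (x, y) vertices in a planar coordinate system, and the containment check is a standard ray-casting algorithm.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices; the last vertex is implicitly
    joined back to the first. Points exactly on an edge are not handled
    specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def link_addresses_to_buildings(addresses, buildings):
    """Map each address id to the first footprint containing it, or None."""
    links = {}
    for addr_id, (x, y) in addresses.items():
        links[addr_id] = next(
            (b_id for b_id, poly in buildings.items()
             if point_in_polygon(x, y, poly)),
            None,
        )
    return links


# Hypothetical footprints and address points (not real data).
buildings = {
    "bldg-1": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "bldg-2": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
addresses = {"addr-a": (1, 1), "addr-b": (11, 11), "addr-c": (6, 6)}

links = link_addresses_to_buildings(addresses, buildings)
print(links)
```

At the scale of millions of footprints, a spatial index and a dedicated geospatial library would be the more practical choice; the nested loop here is only meant to show the idea.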
The LODE is developed and implemented with open source tools whenever possible. The source code, scripts, and source files developed for the LODE initiative are hosted on the LODE GitHub page and are released under the MIT license. This allows for open development in a version-controlled environment, in which all users can contribute by suggesting improvements or cloning the entire project for their own use.
An example of one of LODE’s open tools is OpenTabulate. This is a Python package (available on the Python Package Index) designed to organize, tabulate, and process structured data. OpenTabulate is released under the MIT license, and features:
- automated data retrieval,
- a systematic way of organizing and retrieving data using source files (inspired by OpenAddresses),
- tabulation of data into a standardized CSV format that is suitable for merging and linkage,
- various methods to process data, including address parsing, cleaning and reformatting.
To follow or contribute to development of OpenTabulate, see the OpenTabulate GitHub repository.
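To make the idea of tabulating data into a standardized CSV format more concrete, here is a minimal sketch written with the Python standard library only. It does not use or reproduce OpenTabulate's actual API or source file format: the source_file dictionary, the column names, and the sample row are all hypothetical, and the cleaning step only collapses stray whitespace.

```python
import csv
import io

# Target schema shared by every provider (hypothetical column names).
STANDARD_COLUMNS = ["name", "street_address", "city", "province"]

# A hypothetical "source file": it records how one provider's columns
# map onto the standard schema. This is not OpenTabulate's real format.
source_file = {
    "provider": "example-municipality",
    "column_map": {
        "name": "FacilityName",
        "street_address": "Addr",
        "city": "Municipality",
        "province": "Prov",
    },
}

# Placeholder raw data as one provider might publish it.
RAW_CSV = """FacilityName,Addr,Municipality,Prov,Extra
  Example Clinic ,123  main st,exampleville,on,ignored
"""


def clean(value):
    """Trim the value and collapse runs of whitespace into single spaces."""
    return " ".join(value.split())


def tabulate(raw_text, source):
    """Rewrite raw CSV text into the standard column layout."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=STANDARD_COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    for row in csv.DictReader(io.StringIO(raw_text)):
        writer.writerow({
            std: clean(row.get(src) or "")
            for std, src in source["column_map"].items()
        })
    return out.getvalue()


standardized = tabulate(RAW_CSV, source_file)
print(standardized)
```

Keeping the column mapping in a small per-provider description, rather than in code, means that a new provider can be added without touching the tabulation logic, and files from different providers end up with identical headers that are ready for merging and linkage.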
As of August 7, 2020, the LODE viewer is an interactive map that displays the content of select databases released as part of the LODE initiative. Only records with coordinates are shown on the map. In this first version, the LODE viewer showcases the Open Database of Healthcare Facilities (ODHF), but it will eventually include the other released LODE datasets.
Interested in collaborating and/or contributing? The LODE initiative is developed and enriched through collaborative projects with external partners, involving data or code development as well as open data analysis. Feel free to contact the Data Exploration and Integration Lab’s LODE team to learn more about the LODE and how to collaborate: