Accuracy and Precision [Part II]: Identify GIS Data Inconsistency: Examples From Toronto’s Open Data Portal
This article is the second part of the accuracy and precision series, which discusses data characteristics, examples of data inconsistency and quality assurance guidelines. The previous article covered common errors often found during fieldwork.
Geographic data is the representation of real-world facts with a georeferenced location. Datasets are collected at various scales and accuracy levels and with different data structures. For instance, the first column of one attribute table may not contain the same type of information as the first column of another. If users cannot identify such dissimilarities, data mismatches or miscounts are likely, leading to undesirable results.
Spatial data often comes with detailed information that helps users understand environmental, social or economic phenomena. It is used to outline the characteristics of an event’s location, trend, flow or distribution. Metadata is also known as data about data. It provides information and creates a structured reference for identifying attributes in a systematic and consistent way. Variables such as producer, content, quality, description and date of an item or a location are typically part of the metadata. Metadata is a valuable resource for searching and retrieving specific data from a large collection of records. Each entity in a dataset contains related attributes, and the attribute domains define the types of value allowed in each attribute.
Data attributes of different types are stored in data tables along with geographical information, from which they are referenced for spatial queries and analyses. Attribute domains are rules that describe the valid values of a field type, which helps users categorize different types of data. The names of the attributes are set when the dataset is first created. Each attribute has a field type that stores numeric values, text or dates; the usual choices are short, long, float, double, text and date.
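To make the idea of field types and attribute domains concrete, here is a minimal sketch of a row validator. The schema, field names and allowed values are hypothetical and are not taken from any Toronto dataset; a GIS package would normally enforce such rules through its own domain settings.

```python
from datetime import date

# Hypothetical schema: field name -> (expected Python type, optional domain).
# A domain is any container of allowed values; None means "no restriction".
SCHEMA = {
    "OBJECTID": (int, None),
    "STREET_TYPE": (str, {"ST", "AVE", "RD", "BLVD"}),
    "EVENT_DATE": (date, None),
    "SPEED_LIMIT": (int, range(10, 121)),
}


def validate_row(row, schema=SCHEMA):
    """Return a list of human-readable problems found in one attribute row."""
    problems = []
    for field, (expected_type, domain) in schema.items():
        if field not in row:
            problems.append(f"{field}: missing")
            continue
        value = row[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif domain is not None and value not in domain:
            problems.append(f"{field}: value {value!r} is outside the domain")
    return problems
```

Running the validator over every row before an analysis gives a quick list of records that would otherwise be silently excluded or miscounted.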
Examples of Inconsistent Data
1: Attribute error
Attribute datasets must be formatted in the same way to be used in the same spatial analysis and processing tasks. If there are any data format inconsistencies, some data could be excluded when performing geoprocessing tasks.
The City of Toronto has an open data portal that contains different types of civic information. Data including government services, base maps and civic issues are updated regularly. The Toronto Police Services regularly uploads a dataset about motor vehicle collisions, including the geographical location of each event. The locations of accidents are expressed as street addresses in the format of street number, street name and street type. Grouping all three pieces of information into a single column is concise, but it is not ideal for geoprocessing.
Figure 1 shows the attribute table of the traffic accident record dataset. The “FIELD_#” columns hold the primary data from Toronto Police Services, and “FIELD_7” represents the locations of traffic accidents. Some street names recorded in that field do not include street numbers because the accident happened near an intersection (and were recorded in “FIELD_8”), so extra work is required to separate street numbers from street names. Two additional columns, “Street_01” and “Street_02”, were added to solve the data format issue, which is discussed with Figure 2.
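Separating street numbers from street names can be sketched with a regular expression. This is only an illustration: the function name and the sample strings are assumptions, and the real “FIELD_7” values may need additional rules.

```python
import re

# Leading digits (optionally followed by one unit letter, e.g. "12A"),
# then whitespace, then the rest of the location text.
_ADDRESS = re.compile(r"^\s*(\d+[A-Za-z]?)\s+(.+?)\s*$")


def split_street_number(location):
    """Split '123 YONGE ST' into ('123', 'YONGE ST').

    Records without a leading street number (for example intersections)
    are returned as (None, <trimmed text>).
    """
    match = _ADDRESS.match(location)
    if match:
        return match.group(1), match.group(2)
    return None, location.strip()
```

Because the pattern requires whitespace right after the number, a numbered street name such as "16TH AVE" is left intact rather than being split into "16" and "TH AVE".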
In Figure 2, the column “Field2” counts the occurrences of each street name in “FIELD_7”. Several rows refer to the same street because the street type is written inconsistently. For instance, “STREET”, “Stre”, “St” and “ST” all mean the same thing. The dataset was probably not created with spatial analysis in mind, so the street naming format was never standardized. An extra column, “Street_01”, was used to separate street names from street types (Figure 1) to solve the issue. This did not work well because some street names occur with more than one street type. For example, both “Woodbine Track Access” and “Woodbine Avenue” are categorized as “Woodbine” if the same command is applied.
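One alternative to stripping the street type is to map every variant to a single canonical form and keep it as part of the name. The sketch below assumes a small, hypothetical lookup table; it is not the method used to produce Figure 2, and a real table would need many more entries.

```python
from collections import Counter

# Hypothetical lookup of street-type variants -> one canonical form.
STREET_TYPES = {
    "STREET": "ST", "STRE": "ST", "ST": "ST",
    "AVENUE": "AVE", "AVE": "AVE", "AV": "AVE",
    "ROAD": "RD", "RD": "RD",
}


def normalize_street(name):
    """Upper-case the name and canonicalize only its final word (the type)."""
    words = name.upper().replace(".", "").split()
    if words and words[-1] in STREET_TYPES:
        words[-1] = STREET_TYPES[words[-1]]
    return " ".join(words)


def count_streets(names):
    """Count occurrences after normalization so variants share one row."""
    return Counter(normalize_street(n) for n in names)
```

With this approach “Woodbine Avenue” becomes "WOODBINE AVE" while “Woodbine Track Access” stays distinct, so the two are no longer merged. Names that end in a direction, such as "QUEEN ST W", are not handled here and would need an extra rule.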
2: Logical error
In some situations, data inaccuracy causes a logical error when different layers of data are combined. Figure 3 is a typical example of a logical mistake caused by combining layers with different coordinate systems. Buildings should sit within land lot boundaries if all layers use the same coordinate system. The coordinate systems of both layers were checked in ArcGIS Pro: the building polygon layer uses EPSG 3857, while the property boundaries layer uses EPSG 4326. Both are based on the WGS84 datum, but EPSG 3857 is a Pseudo-Mercator projection that maps the Earth onto a square with coordinates in metres. In contrast, EPSG 4326 is a geographic coordinate system that stores latitude and longitude in degrees; when drawn directly, it resembles an equirectangular projection. The coordinate system of every imported layer should be checked at the beginning (Figure 4) to determine whether a transformation is needed.
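The size of the mismatch between the two coordinate systems is easy to see with the standard spherical Web Mercator formula. The sketch below is for illustration only; in practice the reprojection would be done with the GIS software's own tools or a library such as pyproj. The function names and the sample coordinate (an approximate downtown Toronto location) are assumptions.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert EPSG:4326 degrees to EPSG:3857 metres (spherical formula)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


def looks_like_degrees(x, y):
    """Rough heuristic: values within +/-180 and +/-90 are probably degrees."""
    return abs(x) <= 180 and abs(y) <= 90


# Approximate downtown Toronto location, assumed for illustration.
x, y = lonlat_to_web_mercator(-79.3832, 43.6532)
```

The same point is expressed as tens of degrees in EPSG 4326 and as millions of metres in EPSG 3857, which is why untransformed layers do not line up. A quick range check like `looks_like_degrees` can flag a layer whose coordinates do not match its declared coordinate system.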
Define Data Quality
Definitions from Statistics Canada
Quality can be defined in terms of completeness and consistency. Quality measurement can evaluate the completeness of required metadata fields, including correct spelling and formatting. According to Statistics Canada, the quality of collected information can be assessed in terms of relevance, accuracy, timeliness, accessibility, coherence and interpretability. Consistency determines whether the database contains any contradictions. For instance, roads should not overlap lakes or parks when they are drawn on a map. The coordinate systems and map projections of different datasets should be unified before the layers are combined into the same map. By following the appropriate guidelines, users can readily spot unreasonable information.
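A completeness check on required metadata fields can be sketched in a few lines. The list of required fields below is an assumption based on the metadata variables mentioned earlier in this article; it is not an official Statistics Canada checklist.

```python
# Hypothetical list of required metadata fields.
REQUIRED_FIELDS = ("producer", "content", "quality", "description", "date")


def completeness(record, required=REQUIRED_FIELDS):
    """Return (score between 0 and 1, list of missing or blank fields)."""
    missing = [
        field for field in required
        if not str(record.get(field) or "").strip()
    ]
    score = 1 - len(missing) / len(required)
    return score, missing
```

A score below 1 points to the specific fields that need attention before the dataset is published or combined with others.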
The Canadian Metadata Standard
The Government of Canada Records Management Metadata Standard (GC RMMS) is a records management metadata standard used by the federal government. The standard was developed to describe, locate and manage the different types of metadata used every day. It follows the ISO 11179, ISO 15489 and ISO 23081 standards to regulate metadata record registry and management practices. The GC RMMS serves as a guideline that encourages local governments to create customized standards to meet various requirements. For example, the City of Toronto has its own version of the Records Management Metadata Standard (RMMS), and the City of Vancouver has a localized version of the Records and Information Management Standard and a related By-law.
ISO 191xx: The GIS version of the metadata standard
In the geospatial context, analysts can refer to the ISO 191xx series of standards for geographic information, which includes ISO 19115 for metadata.
Attributes and metadata can sometimes be complicated, but fortunately, guidelines are available to ensure a smooth data process. In the subsequent article of this series, common errors and misrepresentations of data visualization will be discussed.
Accuracy and Precision [Part I]
FGDC’s standard about metadata
Statistics Canada’s Quality Assurance Framework
Statistics Canada’s data quality assurance checklist
Geographical Information Systems: Principles, Techniques, Management and Applications
Selecting the right map projection
Esri’s definitions about attribute fields and domains