This project improves on and continues work done during Winter 2025 as part of a graduate statistical consulting course at Portland State University, in collaboration with TREC (Transportation Research and Education Center). The goal was to compare two datasets, PORTAL and INRIX, using the Maximum Mean Discrepancy (MMD) statistic to measure distributional similarity.
Executive Summary
Understanding the key differences between data collection approaches and their resulting datasets is crucial for transportation planning and traffic management. This project compares travel-time data from two prominent sources, PORTAL and INRIX, using the Maximum Mean Discrepancy (MMD) statistic. Both datasets record travel time on highway segments, but they differ in how the readings are collected: PORTAL uses fixed-point highway sensors, whereas INRIX uses OEM probe data from moving vehicles. Analyzing the travel-time distributions from these datasets reveals differences and similarities that may affect their use in various applications.
The Data
Data was collected from two sources: PORTAL (public data managed by the Transportation Research and Education Center (TREC) at Portland State University (PSU)) and INRIX (a commercial provider). The PORTAL data is aggregated from sensors maintained by the Oregon Department of Transportation (ODOT) and the Washington State Department of Transportation (WSDOT). The INRIX data is collected from GPS-enabled vehicles, mobile devices, and other third-party sensors.
The Analysis
An unbiased estimator of MMD with the Radial Basis Function (RBF) kernel was applied to several views of the travel-time data to investigate whether and how the distributions differ. The analysis was restricted to a subset of 15-minute-interval travel-time readings from 2019 through 2024 on I-5, I-205, and SR-14 in the Portland, Oregon - Vancouver, Washington metropolitan area.
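For readers unfamiliar with the statistic, the following is a minimal, dependency-free sketch of the standard unbiased estimator of squared MMD with an RBF kernel. It is not the project's code: the function names, the `gamma` bandwidth parameter, and the toy samples are assumptions made for illustration, and the linked repository should be consulted for the actual implementation and bandwidth choice.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) for equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of squared MMD between samples X and Y.

    X and Y are lists of equal-length numeric vectors. The within-sample
    sums skip the diagonal (i == j), which is what makes the estimator
    unbiased; as a result the estimate can be slightly negative.
    """
    m, n = len(X), len(Y)
    if m < 2 or n < 2:
        raise ValueError("each sample needs at least two observations")

    k_xx = sum(rbf_kernel(X[i], X[j], gamma)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(rbf_kernel(Y[i], Y[j], gamma)
               for i in range(n) for j in range(n) if i != j)
    k_xy = sum(rbf_kernel(x, y, gamma) for x in X for y in Y)

    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


if __name__ == "__main__":
    # Hypothetical toy data: each vector stands in for one "view" of
    # travel-time readings (for example, a few consecutive intervals).
    sample_a = [[10.0, 11.0], [10.5, 11.2], [9.8, 10.9]]
    sample_b = [[14.0, 15.5], [13.7, 15.0], [14.2, 15.8]]
    print(mmd2_unbiased(sample_a, sample_b, gamma=0.5))
```

Values near zero are what one would expect when both samples come from the same distribution, while larger values indicate greater discrepancy. The result depends on the kernel bandwidth, so `gamma` here is a placeholder rather than a recommended setting.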
Missing Intervals
The datasets differed substantially in data completeness, with PORTAL containing notable gaps at several stations and INRIX having comparatively few missing intervals. To ensure that these differences did not disproportionately influence the distributional comparisons, three complementary strategies were used to handle missing readings: standardization with zero-filling, masking during computation, and a combined standardized-masking approach. These methods allowed the analysis to separate effects due to missingness from genuine distributional differences.
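The summary does not spell out how each strategy was implemented, so the sketch below shows only one plausible reading. It assumes missing readings are represented as `None`, that "standardization" means z-scoring with statistics computed from the observed values, and that "masking" means comparing two vectors only on the coordinates observed in both. The helper names `standardize` and `masked_mean_sq_diff` are hypothetical.

```python
import math


def standardize(series, fill=0.0):
    """Z-score the observed values of a series; missing entries are None.

    Missing entries are replaced by `fill`. With fill=0.0 they land on the
    post-standardization mean (zero-filling); with fill=None the gaps are
    kept so that a later masking step can skip them.
    """
    observed = [v for v in series if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / len(observed)
    std = math.sqrt(var) or 1.0  # guard against a constant series
    return [fill if v is None else (v - mean) / std for v in series]


def masked_mean_sq_diff(x, y):
    """Mean squared difference over coordinates observed in both vectors.

    Returns None when the two vectors share no observed coordinate.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return None
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical readings with gaps, not taken from either dataset.
    portal = [61.0, None, 75.0, 80.0]
    inrix = [60.0, 66.0, None, 82.0]

    # Strategy 1: standardize, then zero-fill the gaps.
    print(standardize(portal), standardize(inrix))
    # Strategy 2: mask the gaps when comparing raw readings.
    print(masked_mean_sq_diff(portal, inrix))
    # Strategy 3: standardize but keep the gaps, then mask.
    print(masked_mean_sq_diff(standardize(portal, fill=None),
                              standardize(inrix, fill=None)))
```

Under these assumptions, a masked distance of this kind could stand in for the squared Euclidean distance inside a kernel, so that gaps contribute nothing to the comparison instead of being treated as real readings.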
Key Findings
Results indicate that the travel-time readings from PORTAL and INRIX can be considered to come from different distributions, as measured by the MMD statistic. This suggests that the two data sources may capture different aspects of traffic conditions, potentially due to differences in data collection methods and sensor types. However, the experiments also revealed a trend toward increasing similarity between the datasets over time, hinting at possible improvements in data collection, sensor coverage, or processing techniques.
Code and Report
Code located on GitHub: whitham-powell/TREC-PORTALvsINRIX-MMD
Link to the final report: Measuring Distributional Similarity via Maximum Mean Discrepancy (PDF)