# --------------------------------------
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")
# --------------------------------------
# A 'Pythonic' interface to DuckDB
import ibis
ibis.options.interactive = True
# --------------------------------------
# For printing human-readable file sizes
import humanize
# --------------------------------------
import streetscapes as scs
from streetscapes import logger
Converting CSV files to Parquet and merging them together
The CSV files of the original Global Streetscapes dataset add up to 64 GB in total. Moreover, the data is split across several files, which makes it cumbersome to work with. Here, we convert the data to Parquet, which reduces the file size and makes the data easier to load and manipulate. Additionally, we combine columns from several sources into a single dataset that should serve most use cases.
The Ibis library provides a Pythonic interface to DuckDB, so it is not necessary to write raw SQL. More importantly, it supports certain types of lazy evaluation that make it easier to work with large files, especially when merging (joining) tables.
First, let's declare some storage locations.
# The root directory for data files from Hugging Face
HF_ROOT_DIR = scs.mkdir(scs.conf.CSV_DIR / "csv")
# The subdirectory for the data files.
# This is necessary because Hugging Face mirrors the structure of the repository locally.
# We store this in a separate variable because it is used in the download function below.
CSV_SUBDIR = "data"
# The full path to the CSV files.
CSV_DIR = HF_ROOT_DIR / CSV_SUBDIR
# A directory for the individual Parquet files converted from CSV.
PARQUET_DIR = scs.mkdir(scs.conf.CSV_DIR / "parquet")
# A directory for the merged Parquet files.
MERGED_DIR = scs.mkdir(PARQUET_DIR / "merged")
# A DuckDB file on disk to avoid saturating the RAM
db_file = scs.conf.CSV_DIR / "duck.db"
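The resulting directory layout can be sketched with the standard library alone. The `mkdir` helper below is a hypothetical stand-in for `scs.mkdir`, which is assumed here to create the directory if needed and return its path. A temporary directory plays the role of `scs.conf.CSV_DIR`.

```python
import tempfile
from pathlib import Path


def mkdir(path: Path) -> Path:
    """Hypothetical stand-in for `scs.mkdir`: create the directory and return it."""
    path.mkdir(parents=True, exist_ok=True)
    return path


# A temporary directory stands in for `scs.conf.CSV_DIR` (assumption).
base = Path(tempfile.mkdtemp())

hf_root = mkdir(base / "csv")
csv_dir = hf_root / "data"  # not created here; expected to appear with the download
parquet_dir = mkdir(base / "parquet")
merged_dir = mkdir(parquet_dir / "merged")
db_file = base / "duck.db"  # not created here; DuckDB creates it on connection

# List everything that exists so far, relative to the base directory.
layout = sorted(p.relative_to(base).as_posix() for p in base.rglob("*"))
print(layout)
```

Only the directories passed through the helper exist at this point. The `data` subdirectory and the database file are plain paths until something writes to them.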
Create a DuckDB connection via Ibis. This will be used to manipulate all the data below.
con = ibis.duckdb.connect(f"{db_file}")
Show some metadata about the available CSV files.
scs.render_info_csv()
- climate.csv - Contains the Koppen climate zone associated with each image's location. The calculation is as accurate as the location of the image given by the source, which also relies on the accuracy of the capturing devices. The accuracy could also be affected by the accuracy of the Koppen climate zone classification API from https://github.com/sco-tt/Climate-Zone-API. - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - koppen_geiger_zone (string) - A zone code to identify the Koppen climate zone - zone_description (string) - Short description of the climate zone - contextual.csv - Contains the eight contextual attributes inferred for each image. Please refer to Table 3 in the paper for information on accuracy. - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - platform (string) - Indicates the type of platform/road the image was taken from (e.g. driving surface, walking surface, etc.) - weather (string) - Indicates the weather condition in the image - view_direction (string) - Indicates the viewing direction of the image (e.g. front/sideways), not applicable to panoramic images (pano_status==True) - lighting_condition (string) - Indicates the lighting condition in the image (e.g. 
day/night/dawn or dusk) - glare (string) - Indicates the presence of glare in the image - quality (string) - Indicates the level of quality of the image as observed by the labeller (good/slightly poor/very poor) - reflection (string) - Indicates the presence of windshield reflection in the image - pano_status (bool) - Indicates whether the image is panoramic - ephem.csv - Contains the temporal information of each image calculated using the python package 'PyEphem' with regards to the time of day. The accuracy of the calculation is as accurate as the timestamp of the image given by the source, which also relies on the accuracy of the capturing devices. The accuracy could also be affected by the accuracy of PyEphem. - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - timezone (string) - Timezone of the location where the image was captured - utc_offset_s (float) - UTC offset in seconds at the location where the image was captured - calculated_day_night (string) - Indicates whether the image was taken in day time or night time, calculated using PyEphem. For the polar regions, 'polar day' or 'polar night' might apply depending on the season. - hrs_aft_sunrise (float) - The number of hours after sunrise when the image was captured: negative values indicate the number of hours before sunrise - hrs_aft_sunset (float) - The number of hours after sunset when the image was captured: negative values indicate the number of hours before sunset - gadm.csv - Contains the administrative area associated with each image, at all available levels. The calculation is as accurate as the location of the image given by the source, which also relies on the accuracy of the capturing devices. The accuracy could also be affected by the accuracy of the GADM database. 
- uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - GID_0 (string) - Unique ID at level 0 (country), using ISO 3166-1 alpha-3 country code when available - COUNTRY (string) - Country Name in English - CC_1 (string) - Uniqe ID of the level 1 state/province/equivalent within the country - ENGTYPE_1 (string) - Administrative type in English (following commonly used translations), at level 1 - GID_1 (string) - Unique ID at level 1 (state/province/equivalent) - HASC_1 (string) - HASC (a unique ID from Statoids) at level 1 - ISO_1 (string) - ISO 3166-2 code of the level 1 state/province/equivalent - NAME_1 (string) - Official name in latin script, at level 1 - NL_NAME_1 (string) - Non-Latin name. Official name in a non-latin script (e.g. Arabic, Chinese, Russian, Korean), at level 1 - TYPE_1 (string) - Administrative type in local language, at level 1 - VARNAME_1 (string) - Variant name. Alternate names in usage for the place, separated by pipes |, at level 1 - CC_2 (string) - Uniqe ID of the level 2 county/district/equivalent within the country - ENGTYPE_2 (string) - Administrative type in English (following commonly used translations), at level 2 - GID_2 (string) - Unique ID at level 2 (county/district/equivalent) - HASC_2 (string) - HASC (a unique ID from Statoids) at level 2 - NAME_2 (string) - Official name in latin script, at level 2 - NL_NAME_2 (string) - Non-Latin name. Official name in a non-latin script (e.g. Arabic, Chinese, Russian, Korean), at level 2 - TYPE_2 (string) - Administrative type in local language, at level 2 - VARNAME_2 (string) - Variant name. 
Alternate names in usage for the place, separated by pipes |, at level 2 - CC_3 (string) - Uniqe ID of the level 3 commune/municipality/equivalent within the country - ENGTYPE_3 (string) - Administrative type in English (following commonly used translations), at level 3 - GID_3 (string) - Unique ID at level 3 (commune/municipality/equivalent) - HASC_3 (string) - HASC (a unique ID from Statoids) at level 3 - NAME_3 (string) - Official name in latin script, at level 3 - NL_NAME_3 (string) - Non-Latin name. Official name in a non-latin script (e.g. Arabic, Chinese, Russian, Korean), at level 3 - TYPE_3 (string) - Administrative type in local language, at level 3 - VARNAME_3 (string) - Variant name. Alternate names in usage for the place, separated by pipes |, at level 3 - CC_4 (string) - Uniqe ID of the level 4 adminstrative areas within the country - ENGTYPE_4 (string) - Administrative type in English (following commonly used translations), at level 4 - GID_4 (string) - Unique ID at level 4 - NAME_4 (string) - Official name in latin script, at level 4 - TYPE_4 (string) - Administrative type in local language, at level 4 - VARNAME_4 (string) - Variant name. Alternate names in usage for the place, separated by pipes |, at level 4 - CC_5 (string) - Uniqe ID of the level 5 adminstrative areas within the country - ENGTYPE_5 (string) - Administrative type in English (following commonly used translations), at level 5 - GID_5 (string) - Unique ID at level 5 - NAME_5 (string) - Official name in latin script, at level 5 - TYPE_5 (string) - Administrative type in local language, at level 5 - ghsl.csv - Contains the degree of urbanisation associated with the location of the image, calculated using the Global Human Settlement Layer (GHSL). The calculation is as accurate as the location of the image given by the source, which also relies on the accuracy of the capturing devices. The accuracy could also be affected by the accuracy of the GHSL dataset. 
- uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - urban_code (int) - A two-digit code that identifies the GHSL settlement typology: 30 - urban centre, 23 - dense urban cluster, 22 - semi-dense urban cluster, 21 - suburban or peri-urban, 13 - rural cluster, 12 - low density rural, 11 - very low density rural, 10 - water - urban_term (string) - The GHSL settlement typology term associated with the code - h3.csv - Contains the ID of the h3-indexed hexagon associated with each image, at all available resolution levels from level 0 to 15. The calculation is as accurate as the location of the image given by the source, which also relies on the accuracy of the capturing devices. - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - h3_0 (string) - Unique index of the hexagonal cell containing the location of the image, at level 0 (average edge length = 1281.256011 km) - h3_1 (string) - Unique index of the hexagonal cell containing the location of the image, at level 1 (average edge length = 483.0568391 km) - h3_2 (string) - Unique index of the hexagonal cell containing the location of the image, at level 2 (average edge length = 182.5129565 km) - h3_3 (string) - Unique index of the hexagonal cell containing the location of the image, at level 3 (average edge length = 68.97922179 km) - h3_4 (string) - Unique index of the hexagonal cell containing the location of the image, at level 4 (average edge length = 26.07175968 km) - h3_5 (string) - Unique index of the hexagonal cell containing the location of the image, at level 5 (average edge length = 9.854090990 km) - h3_6 (string) - Unique index of the hexagonal cell containing the 
location of the image, at level 6 (average edge length = 3.724532667 km) - h3_7 (string) - Unique index of the hexagonal cell containing the location of the image, at level 7 (average edge length = 1.406475763 km) - h3_8 (string) - Unique index of the hexagonal cell containing the location of the image, at level 8 (average edge length = 0.531414010 km) - h3_9 (string) - Unique index of the hexagonal cell containing the location of the image, at level 9 (average edge length = 0.200786148 km) - h3_10 (string) - Unique index of the hexagonal cell containing the location of the image, at level 10 (average edge length = 0.075863783 km) - h3_11 (string) - Unique index of the hexagonal cell containing the location of the image, at level 11 (average edge length = 0.028663897 km) - h3_12 (string) - Unique index of the hexagonal cell containing the location of the image, at level 12 (average edge length = 0.010830188 km) - h3_13 (string) - Unique index of the hexagonal cell containing the location of the image, at level 13 (average edge length = 0.004092010 km) - h3_14 (string) - Unique index of the hexagonal cell containing the location of the image, at level 14 (average edge length = 0.001546100 km) - h3_15 (string) - Unique index of the hexagonal cell containing the location of the image, at level 15 (average edge length = 0.000584169 km) - instances.csv - Contains the count of instances (65 categories) detected in each image. Based on panoptic segmentation results obtained with Mask2former model. Accuracy is dependent on model performance. 
- uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - Bird (int) - Number of segmented instance - Ground-Animal (int) - Number of segmented instance - Curb (int) - Number of segmented instance - Fence (int) - Number of segmented instance - Guard-Rail (int) - Number of segmented instance - Barrier (int) - Number of segmented instance - Wall (int) - Number of segmented instance - Bike-Lane (int) - Number of segmented instance - Crosswalk---Plain (int) - Number of segmented instance - Curb-Cut (int) - Number of segmented instance - Parking (int) - Number of segmented instance - Pedestrian-Area (int) - Number of segmented instance - Rail-Track (int) - Number of segmented instance - Road (int) - Number of segmented instance - Service-Lane (int) - Number of segmented instance - Sidewalk (int) - Number of segmented instance - Bridge (int) - Number of segmented instance - Building (int) - Number of segmented instance - Tunnel (int) - Number of segmented instance - Person (int) - Number of segmented instance - Bicyclist (int) - Number of segmented instance - Motorcyclist (int) - Number of segmented instance - Other-Rider (int) - Number of segmented instance - Lane-Marking---Crosswalk (int) - Number of segmented instance - Lane-Marking---General (int) - Number of segmented instance - Mountain (int) - Number of segmented instance - Sand (int) - Number of segmented instance - Sky (int) - Number of segmented instance - Snow (int) - Number of segmented instance - Terrain (int) - Number of segmented instance - Vegetation (int) - Number of segmented instance - Water (int) - Number of segmented instance - Banner (int) - Number of segmented instance - Bench (int) - Number of segmented instance - Bike-Rack (int) - Number of segmented instance - Billboard (int) - Number of segmented instance - Catch-Basin (int) - 
Number of segmented instance - CCTV-Camera (int) - Number of segmented instance - Fire-Hydrant (int) - Number of segmented instance - Junction-Box (int) - Number of segmented instance - Mailbox (int) - Number of segmented instance - Manhole (int) - Number of segmented instance - Phone-Booth (int) - Number of segmented instance - Pothole (int) - Number of segmented instance - Street-Light (int) - Number of segmented instance - Pole (int) - Number of segmented instance - Traffic-Sign-Frame (int) - Number of segmented instance - Utility-Pole (int) - Number of segmented instance - Traffic-Light (int) - Number of segmented instance - Traffic-Sign-(Back) (int) - Number of segmented instance - Traffic-Sign-(Front) (int) - Number of segmented instance - Trash-Can (int) - Number of segmented instance - Bicycle (int) - Number of segmented instance - Boat (int) - Number of segmented instance - Bus (int) - Number of segmented instance - Car (int) - Number of segmented instance - Caravan (int) - Number of segmented instance - Motorcycle (int) - Number of segmented instance - On-Rails (int) - Number of segmented instance - Other-Vehicle (int) - Number of segmented instance - Trailer (int) - Number of segmented instance - Truck (int) - Number of segmented instance - Wheeled-Slow (int) - Number of segmented instance - Car-Mount (int) - Number of segmented instance - Ego-Vehicle (int) - Number of segmented instance - metadata_common_attributes.csv - Contains the common basic metadata attributes that are provided by both Mapillary and KartaView, and those we computed for both sources. Accuracy is subject to that of the original metadata provided by Mapillary / KartaView, which also largely depends on the accuracy of data provided by the capturing devices. 
- uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - lat (float) - Latitude of the image - lon (float) - Longitude of the image - datetime_local (datetime) - Calculated datetime (string) at which the image was captured, in the local timezone of where the image was taken - year (int) - The year in which the image was taken - month (int) - The month in which the image was taken - day (int) - The day in which the image was taken - hour (int) - The hour in which the image was taken - width (int) - Width of the image - height (int) - Height of the image - heading (float) - Heading direction of the image, based on compass direction, in degrees - projection_type (string) - Projection type of the image (i.e. perspective, spherical, fisheye, equirectangular, etc.) - hFoV (float) - Horizontal field of view (for Mapillary images: calculated using focal ratio and image dimensions; for KartaView images: directly obtained from metadata) - vFoV (float) - Vertical field of view (for Mapillary images: calculated using focal ratio and image dimensions; for KartaView images: directly obtained from metadata) - sequence_index (int) - The order index of the image in its sequence - sequence_id (string) - Sequence ID of the image specified by Mapillary / KartaView - sequence_img_count (int) - Number of images available in the sequence - metadata_kv.csv - Contains the metadata of each SVI originally provided by KartaView. The explanation of the fields is largely based on our interpretation of the documentation of KartaView API, which can be incomplete or absent for many attributes. 
- uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - kv_shotDate (datetime) - Datetime string at which the image was captured, in GMT - kv_cameraParameters (json) - A json of floats: fNumber, fLen, vFOV, vZF, aperture - kv_qualityDetails (json) - [insufficient documentation] - kv_qualityLevel (int) - Seems to be a level that indicates the quality of the image [insufficient documentation] - kv_qualityStatus (string) - Seems to indicate the status of the quality calculation [insufficient documentation] - kv_autoImgProcessingResult (string) - Indicates how an image has been processed (e.g. blurred, original) [insufficient documentation] - kv_autoImgProcessingStatus (string) - Indicates whether an image has been processed (e.g. finished) [insufficient documentation] - kv_dateAdded (datetime) - Datetime string at which the image was uploaded [insufficient documentation] - kv_dateProcessed (datetime) - Datetime string at which the image was processed [insufficient documentation] - kv_distance (float) - Distance between the current image and its previous image, in metres - kv_fieldOfView (float) - Field of View of the image - kv_filepath (string) - File path of the image [insufficient documentation] - kv_filepathLTh (string) - File path of a large thumbnail of the image [insufficient documentation] - kv_filepathProc (string) - File path of the processed image [insufficient documentation] - kv_filepathTh (string) - File path of a small thumbnail of the image [insufficient documentation] - kv_fileurl (string) - URL to the image file [insufficient documentation] - kv_fileurlLTh (string) - URL to a large thumbnail of the image file [insufficient documentation] - kv_fileurlProc (string) - URL to the processed image [insufficient documentation] - kv_fileurlTh (string) - URL to a small thumbnail of the image 
[insufficient documentation] - kv_from (int) - Likely the 'from' node of the OSM Way found nearest to the image [insufficient documentation] - kv_gpsAccuracy (float) - Horizontal accuracy of the GPS-tracked photo, in metres - kv_hasObd (int) - A flag that confirms the presence of an OBD connection at the time the sequence was captured, in 1 or 0 - kv_headers (float) - Seems to be some sort of angle, in degrees [insufficient documentation] - kv_imagePartProjection (string) - Seems to indicate something about the projection of the image [insufficient documentation] - kv_isUnwrapped (int) - Seems to be a flag for something, in 1 or 0 [insufficient documentation] - kv_isWrapped (int) - Seems to be a flag for something, in 1 or 0 [insufficient documentation] - kv_matchLat (float) - Seems to be the latitude of some sort of matched location of the image [insufficient documentation] - kv_matchLng (float) - Seems to be the longitude of some sort of matched location of the image [insufficient documentation] - kv_matchSegmentId (float) - [insufficient documentation] - kv_name (string) - Name of the image [insufficient documentation] - kv_orgCode (string) - [insufficient documentation] - kv_projectionYaw (float) - The projection on yaw axis, horizontal rotation in degrees, min -180, max 180 - kv_rawDataId (int) - Some sort of ID for the raw data [insufficient documentation] - kv_status (string) - Status of the photo: "public", "uploading", "processing", "failed", "deleted" - kv_storage (string) - Storage of the image [insufficient documentation] - kv_to (int) - Likely the 'to' node of the OSM Way found nearest to the image [insufficient documentation] - kv_unwrapVersion (int) - [insufficient documentation] - kv_videoId (int) - ID of the video [insufficient documentation] - kv_videoIndex (int) - Identifier representing a specific video index [insufficient documentation] - kv_visibility (string) - Indicates the visibility of the image: "public", "private", etc. 
[insufficient documentation] - kv_wayId (int) - Likely the ID of the OSM Way found nearest to the image [insufficient documentation] - kv_wrapVersion (int) - [insufficient documentation] - kv_address (string) - Address of the image's location [insufficient documentation] - kv_countryCode (string) - Country code of the country where the image is located [insufficient documentation] - kv_deviceName (string) - Name of the image capturing device [insufficient documentation] - kv_distanceSeq (float) - Distance of the sequence recorded in kilometres [insufficient documentation] - kv_sequenceType (string) - Type of the sequence, e.g. video [insufficient documentation] - kv_user (json) - Identifier and username of the contributor [insufficient documentation] - metadata_mly1.csv - Contains the metadata of each SVI originally provided by Mapillary. Split into five parts (metadata_mly1, metadata_mly2, metadata_mly3, metadata_mly4, metadata_mly5) to reduce file size. - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - mly_captured_at (int) - UNIX timestamp at which the image was captured - mly_camera_parameters (array) - A string representation of an array of float: focal length, k1, k2 - mly_quality_score (float) - Score that indicates how good the image is (in experimental stage) [insufficient documentation] - mly_altitude (float) - Original altitude from camera Exif calculated from sea level - mly_atomic_scale (float) - Scale of the SfM reconstruction around the image - mly_exif_orientation (int) - Orientation of the camera as given by the Exif tag - mly_is_pano (bool) - Indicates whether an image is panoramic - mly_organization_id (int) - ID of the organisation under which the image was collected - metadata_mly2.csv - nan nan - uuid (string) - Universally Unique IDentifier, unique for every image - 
source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - mly_computed_altitude (float) - Altitude after running image processing, from sea level - mly_computed_compass_angle (float) - Compass angle (or heading) after running image processing - mly_computed_geometry (GeoJSON point) - Location after running image processing - mly_computed_rotation (array) - A string representation of an array of corrected orientation of the image - metadata_mly3.csv - nan nan - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - mly_make (string) - The manufacturer name of the camera device - mly_model (string) - The model or product series name of the camera device - mly_creator.username (string) - The username who owns and uploaded the image - mly_creator.id (int) - The user ID who owns and uploaded the image - metadata_mly4.csv - nan nan - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - mly_merge_cc (int) - ID of the connected component of images that were aligned together - mly_mesh.id (int) - Contains the ID of the mesh - mly_mesh.url (string) - Contains the URL to the mesh - metadata_mly5.csv - nan nan - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - mly_sfm_cluster.id (int) - Contains the ID of the point cloud in .ply format - mly_sfm_cluster.url (string) - Contains the URL to the point cloud in .ply format - osm.csv - Contains the information of the OSM street 
found nearest to each image within its 10 m radius. The calculation is as accurate as the location of the image given by the source, which also relies on the accuracy of the capturing devices. The accuracy could also be affected by the accuracy of the OSM dataset. OSM street networks were obtained through the OSMnx package. - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - snap_dist (float) - The distance between the location of the image and the nearest road to it - u (int) - Unique identifier of the end node associated with the edge - v (int) - Unique identifier of the other end node associated with the edge - key (int) - A value that differentiates between parallel edges using the same u,v pair - osmid (int) - ID(s) of the OSM streets associated with the edge - oneway (bool) - A flag that indicates whether the edge is one-way - lanes (int) - Number of lanes associated with the edge - name (string) - Street names associated with the edge - highway (string) - Type of road associated with the edge (e.g. primary, secondary, residential, footway, cycleway, etc.) 
- type_highway (string) - Type of road associated with the edge in terms of whether it's for driving, cycling, walking, or others - maxspeed (string) - Maximum speed limit associated with the edge - junction (string) - Indicates the type of junction associated with the edge - length (float) - length of the edge - from (int) - Unique identifier of the 'from' node associated with the edge (either u or v) - to (int) - Unique identifier of the 'to' node associated with the edge (either u or v) - ref (string) - Reference number or code for roads, highway exits, routes, entrances to big buildings etc - tunnel (string) - Indicates whether the road is passing a tunnel and/or the type of tunnel - bridge (string) - Indicates whether the road is on a bridge and/or the type of bridge - service (string) - Indicates whether the road is a service road and/or the type of service road - access (string) - Indicates legal permissions/restrictions of the accessibility of the road - road_width (float) - Width of the road from kerb to kerb - area (string) - Indicates whether the way is closed and used to define an area - est_width (float) - Estimated width of the road from kerb to kerb - reversed (bool) - Indicates roads that alternate between different directions regularly but infrequently, typically only a few times a day - perception.csv - Contains the scores, in scale of 0-10, predicted for each of six perceptual dimensions (Safe, Lively, Beautiful, Wealthy, Boring, Depressing). Contains the predicted scores (in scale of 0-10, high score indicates strong feeling) of human subjective perceptions, using model trained on the Place Pulse 2.0. Accuracy depends on model performance. 
- uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - Safe (float) - Score for 'safe' (in scale of 0-10) - Lively (float) - Score for 'lively' (in scale of 0-10) - Beautiful (float) - Score for 'beautiful' (in scale of 0-10) - Wealthy (float) - Score for 'Wealthy' (in scale of 0-10) - Boring (float) - Score for 'Boring' (in scale of 0-10) - Depressing (float) - Score for 'Depressing' (in scale of 0-10) - places365.csv - Contains the place/scene classification for each image. Obtained from model pre-trained on the Places dataset which has over 400 categories for a variety of places and scenes. Accuracy depends on model performance. - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - Safe (string) - The type of place or scene classified for the image - season.csv - Contains the season determined for each image based on climate zone and month. Accuracy depends on the temporal and spatial accuracy of the image given by source, and the accuracy of the Köppen climate clasiffication zones used. - uuid (string) - Universally Unique IDentifier, unique for every image - source (string) - Source of the image, either Mapillary or KartaView - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView - season (string) - Denotes which season the image is taken in: tropical, spring, summer, autumn, or winter - segmentation.csv - Contains the count of segmented pixels of each semantic classes (65 categories in total) identified in the image. Based on panoptic segmentation results obtained with Mask2former model. Accuracy is dependent on model performance. 
  - uuid (string) - Universally Unique IDentifier, unique for every image
  - source (string) - Source of the image, either Mapillary or KartaView
  - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView
  - Bird (int) - Number of pixels of target semantic class
  - Ground-Animal (int) - Number of pixels of target semantic class
  - Curb (int) - Number of pixels of target semantic class
  - Fence (int) - Number of pixels of target semantic class
  - Guard-Rail (int) - Number of pixels of target semantic class
  - Barrier (int) - Number of pixels of target semantic class
  - Wall (int) - Number of pixels of target semantic class
  - Bike-Lane (int) - Number of pixels of target semantic class
  - Crosswalk---Plain (int) - Number of pixels of target semantic class
  - Curb-Cut (int) - Number of pixels of target semantic class
  - Parking (int) - Number of pixels of target semantic class
  - Pedestrian-Area (int) - Number of pixels of target semantic class
  - Rail-Track (int) - Number of pixels of target semantic class
  - Road (int) - Number of pixels of target semantic class
  - Service-Lane (int) - Number of pixels of target semantic class
  - Sidewalk (int) - Number of pixels of target semantic class
  - Bridge (int) - Number of pixels of target semantic class
  - Building (int) - Number of pixels of target semantic class
  - Tunnel (int) - Number of pixels of target semantic class
  - Person (int) - Number of pixels of target semantic class
  - Bicyclist (int) - Number of pixels of target semantic class
  - Motorcyclist (int) - Number of pixels of target semantic class
  - Other-Rider (int) - Number of pixels of target semantic class
  - Lane-Marking---Crosswalk (int) - Number of pixels of target semantic class
  - Lane-Marking---General (int) - Number of pixels of target semantic class
  - Mountain (int) - Number of pixels of target semantic class
  - Sand (int) - Number of pixels of target semantic class
  - Sky (int) - Number of pixels of target semantic class
  - Snow (int) - Number of pixels of target semantic class
  - Terrain (int) - Number of pixels of target semantic class
  - Vegetation (int) - Number of pixels of target semantic class
  - Water (int) - Number of pixels of target semantic class
  - Banner (int) - Number of pixels of target semantic class
  - Bench (int) - Number of pixels of target semantic class
  - Bike-Rack (int) - Number of pixels of target semantic class
  - Billboard (int) - Number of pixels of target semantic class
  - Catch-Basin (int) - Number of pixels of target semantic class
  - CCTV-Camera (int) - Number of pixels of target semantic class
  - Fire-Hydrant (int) - Number of pixels of target semantic class
  - Junction-Box (int) - Number of pixels of target semantic class
  - Mailbox (int) - Number of pixels of target semantic class
  - Manhole (int) - Number of pixels of target semantic class
  - Phone-Booth (int) - Number of pixels of target semantic class
  - Pothole (int) - Number of pixels of target semantic class
  - Street-Light (int) - Number of pixels of target semantic class
  - Pole (int) - Number of pixels of target semantic class
  - Traffic-Sign-Frame (int) - Number of pixels of target semantic class
  - Utility-Pole (int) - Number of pixels of target semantic class
  - Traffic-Light (int) - Number of pixels of target semantic class
  - Traffic-Sign-(Back) (int) - Number of pixels of target semantic class
  - Traffic-Sign-(Front) (int) - Number of pixels of target semantic class
  - Trash-Can (int) - Number of pixels of target semantic class
  - Bicycle (int) - Number of pixels of target semantic class
  - Boat (int) - Number of pixels of target semantic class
  - Bus (int) - Number of pixels of target semantic class
  - Car (int) - Number of pixels of target semantic class
  - Caravan (int) - Number of pixels of target semantic class
  - Motorcycle (int) - Number of pixels of target semantic class
  - On-Rails (int) - Number of pixels of target semantic class
  - Other-Vehicle (int) - Number of pixels of target semantic class
  - Trailer (int) - Number of pixels of target semantic class
  - Truck (int) - Number of pixels of target semantic class
  - Wheeled-Slow (int) - Number of pixels of target semantic class
  - Car-Mount (int) - Number of pixels of target semantic class
  - Ego-Vehicle (int) - Number of pixels of target semantic class
  - Total (int) - Total number of pixels in image
  - green_view_index (float) - Calculated by dividing 'Vegetation' column by 'Total' column
  - sky_view_index (float) - Calculated by dividing 'Sky' column by 'Total' column
- simplemaps.csv - Contains information about the city associated with the image, obtained from the World Cities database by Simplemaps. The calculation is as accurate as the location of the image given by the source, which also relies on the accuracy of the capturing devices. The accuracy could also be affected by the accuracy of the World Cities database.
  - uuid (string) - Universally Unique IDentifier, unique for every image
  - source (string) - Source of the image, either Mapillary or KartaView
  - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView
  - city (string) - The name of the city/town as a Unicode string (e.g. Goiânia)
  - city_ascii (string) - The name of the city/town as an ASCII string (e.g. Goiania). Left blank if ASCII representation is not possible.
  - city_id (int) - A 10-digit unique id generated by SimpleMaps
  - city_lat (float) - The latitude of the city/town
  - city_lon (float) - The longitude of the city/town
  - country (string) - The name of the city/town's country
  - iso2 (string) - The alpha-2 ISO code of the country
  - iso3 (string) - The alpha-3 ISO code of the country
  - admin_name (string) - The name of the highest-level administration region of the city/town (e.g. a US state or Canadian province). Possibly blank.
  - capital (string) - Blank string if not a capital, otherwise: primary - country's capital (e.g. Washington D.C.), admin - first-level admin capital (e.g. Little Rock, AR), minor - lower-level admin capital (e.g. Fayetteville, AR)
  - population (float) - An estimate of the city's urban population. Only available for some (prominent) cities. If the urban population is not available, the municipal population is used.
  - continent (string) - The continent in which the image is located
- speed.csv - Contains various statistical values related to speed for each image. Calculation accuracy depends on the accuracy of location and capture time from the source metadata, which also depends on the accuracy of such data from the capturing device.
  - uuid (string) - Universally Unique IDentifier, unique for every image
  - source (string) - Source of the image, either Mapillary or KartaView
  - orig_id (int) - Original ID of the image as specified by Mapillary or KartaView
  - seq_dist_m (float) - Distance of the image's sequence in metres
  - seq_dist_km (float) - Distance of the image's sequence in kilometres
  - seq_time_hr (float) - Time duration of the image's sequence in hours
  - seq_speed_kph (float) - Average speed of the image's sequence in km/h, calculated as the ratio of seq_dist_km to seq_time_hr
  - seq_img_count (float) - The count of images in the image's sequence
  - segmt_speed_mean_kph (float) - The mean of all segment speeds in the image's sequence (a segment is the interval between each pair of consecutive points in a sequence)
  - segmt_speed_var_kph2 (float) - The variance of all segment speeds in the image's sequence
  - segmt_speed_max_kph (float) - The maximum of all segment speeds in the image's sequence
  - segmt_speed_max5_mean_kph (float) - The mean of the top five segment speeds in the image's sequence
  - distance_from_prev_m (float) - Distance between the image and the previous image in metres
  - distance_from_prev_km (float) - Distance between the image and the previous image in kilometres
  - time_from_prev_s (float) - Time duration between the image and the previous image in seconds
  - time_from_prev_hr (float) - Time duration between the image and the previous image in hours
  - avg_speed_from_prev_kph (float) - Average speed between the image and the previous image in km/h, calculated as the ratio of distance_from_prev_km to time_from_prev_hr
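The two derived segmentation columns, green_view_index and sky_view_index, are plain ratios of pixel counts. As a minimal sketch (the helper name `view_index` is hypothetical and not part of `streetscapes`; the pixel counts are copied, as displayed, from one row of the merged table printed further below), the calculation could look like this:

```python
def view_index(class_pixels: float, total_pixels: float) -> float:
    """Return the share of an image covered by one semantic class."""
    if total_pixels <= 0:
        raise ValueError("total_pixels must be positive")
    return class_pixels / total_pixels


# Pixel counts as displayed for one image in the merged table (illustrative).
vegetation, sky, total = 103878.0, 775167.0, 2248707.0

green_view_index = view_index(vegetation, total)
sky_view_index = view_index(sky, total)
print(f"green: {green_view_index:.6f} | sky: {sky_view_index:.6f}")
```

Up to rounding, the two ratios agree with the green_view_index (0.046195) and sky_view_index (0.344717) values stored for that image.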
We will select and download a subset of the available CSV files to work with below.
file_names = [
    "simplemaps",
    "perception",
    "osm",
    "places365",
    "segmentation",
    "contextual",
    "metadata_common_attributes",
    "ghsl",
]
scs.download_files_hf([f"{CSV_SUBDIR}/{f}.csv" for f in file_names], local_dir=HF_ROOT_DIR)
Streetscapes | 2025-02-10@08:51:05 | Downloading files from HuggingFace Hub...
# List of CSV file paths
csv_files = list(CSV_DIR.glob("*.csv"))
# Convert all csvs in data dir to parquet
for file_name in csv_files:
    # Compile the Parquet file name.
    parquet_file = PARQUET_DIR / file_name.with_suffix('.parquet').name
    logger.info(f"Converting '{file_name.name}' into '{parquet_file.name}'")
    con.read_csv(file_name).to_parquet(parquet_file, compression="ZSTD")
logger.info("Done!")
# List of Parquet file paths
parquet_files = list(PARQUET_DIR.glob("*.parquet"))
Streetscapes | 2025-02-10@08:51:06 | Converting 'metadata_common_attributes.csv' into 'metadata_common_attributes.parquet'
Streetscapes | 2025-02-10@08:51:12 | Converting 'ghsl.csv' into 'ghsl.parquet'
Streetscapes | 2025-02-10@08:51:14 | Converting 'places365.csv' into 'places365.parquet'
Streetscapes | 2025-02-10@08:51:16 | Converting 'contextual.csv' into 'contextual.parquet'
Streetscapes | 2025-02-10@08:51:21 | Converting 'segmentation.csv' into 'segmentation.parquet'
Streetscapes | 2025-02-10@08:51:43 | Converting 'simplemaps.csv' into 'simplemaps.parquet'
Streetscapes | 2025-02-10@08:51:50 | Converting 'osm.csv' into 'osm.parquet'
Streetscapes | 2025-02-10@08:52:02 | Converting 'perception.csv' into 'perception.parquet'
Streetscapes | 2025-02-10@08:52:06 | Done!
Verify that the CSV and Parquet files contain the same information.
csv_file = con.read_csv(CSV_DIR / "osm.csv")
csv_file.head()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓ ┃ uuid ┃ source ┃ orig_id ┃ snap_dist ┃ u ┃ v ┃ key ┃ osmid ┃ oneway ┃ lanes ┃ name ┃ highway ┃ type_highway ┃ maxspeed ┃ junction ┃ length ┃ from ┃ to ┃ ref ┃ tunnel ┃ bridge ┃ service ┃ access ┃ road_width ┃ area ┃ est_width ┃ reversed ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩ │ string │ string │ int64 │ float64 │ float64 │ float64 │ float64 │ string │ boolean │ string │ string │ string │ string │ string │ string │ float64 │ float64 │ float64 │ string │ string │ string │ string │ string │ string │ boolean │ string │ string │ ├──────────────────────────────────────┼───────────┼──────────────────┼───────────┼──────────────┼──────────────┼─────────┼────────────────────────────────────┼─────────┼────────┼───────────────────────────────────┼───────────┼──────────────┼──────────┼──────────┼─────────┼──────────────┼──────────────┼────────┼────────┼────────┼─────────┼────────┼────────────┼─────────┼───────────┼──────────┤ │ bc5862a5-5e4c-4f74-bdd5-598e140dbb8f │ Mapillary │ 941074783267368 │ 7.616701 │ 2.694255e+08 │ 2.694255e+08 │ 0.0 │ 1017964492 │ True │ 2 │ Carrera 33 │ tertiary │ drive │ NULL │ NULL │ 62.864 │ 2.694255e+08 │ 2.694255e+08 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 
4d445c9a-03e2-4dda-a494-caf907ad1620 │ Mapillary │ 2822175224761380 │ 4.063028 │ 1.605031e+08 │ 9.945572e+09 │ 0.0 │ 1085398478 │ False │ 1 │ Burgstraße │ footway │ walk │ 30 │ NULL │ 118.549 │ 9.945572e+09 │ 1.605031e+08 │ NULL │ NULL │ NULL │ NULL │ NULL │ 12.2 │ NULL │ NULL │ False │ │ 57713c58-62b2-465b-9df3-087b6d970603 │ Mapillary │ 387731282398462 │ 2.806557 │ 1.825447e+09 │ 5.656773e+09 │ 0.0 │ 171530233 │ False │ NULL │ 15th Street Northwest Cycle Track │ footway │ walk │ NULL │ NULL │ 53.521 │ 5.656773e+09 │ 1.825447e+09 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 5ce677fe-1f66-4a6e-a162-88ff5d6cd80a │ Mapillary │ 4331166880261832 │ 5.224566 │ 2.682150e+09 │ 7.694047e+09 │ 0.0 │ [771744361, 1108989117, 771743389] │ True │ 3 │ Jalan Pudu │ secondary │ drive │ NULL │ NULL │ 110.317 │ 7.694047e+09 │ 2.682150e+09 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 54fb768e-2864-4ed0-a658-e72f7a66cbc0 │ Mapillary │ 808249360075988 │ 5.111907 │ 2.473251e+08 │ 1.783876e+09 │ 0.0 │ 71197419 │ True │ 1 │ Avenue de la Liberté │ primary │ drive │ 50 │ NULL │ 106.098 │ 2.473251e+08 │ 1.783876e+09 │ N 3 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ └──────────────────────────────────────┴───────────┴──────────────────┴───────────┴──────────────┴──────────────┴─────────┴────────────────────────────────────┴─────────┴────────┴───────────────────────────────────┴───────────┴──────────────┴──────────┴──────────┴─────────┴──────────────┴──────────────┴────────┴────────┴────────┴─────────┴────────┴────────────┴─────────┴───────────┴──────────┘
parquet_file = con.read_parquet(PARQUET_DIR / "osm.parquet")
parquet_file.head()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓ ┃ uuid ┃ source ┃ orig_id ┃ snap_dist ┃ u ┃ v ┃ key ┃ osmid ┃ oneway ┃ lanes ┃ name ┃ highway ┃ type_highway ┃ maxspeed ┃ junction ┃ length ┃ from ┃ to ┃ ref ┃ tunnel ┃ bridge ┃ service ┃ access ┃ road_width ┃ area ┃ est_width ┃ reversed ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩ │ string │ string │ int64 │ float64 │ float64 │ float64 │ float64 │ string │ boolean │ string │ string │ string │ string │ string │ string │ float64 │ float64 │ float64 │ string │ string │ string │ string │ string │ string │ boolean │ string │ string │ ├──────────────────────────────────────┼───────────┼──────────────────┼───────────┼──────────────┼──────────────┼─────────┼────────────────────────────────────┼─────────┼────────┼───────────────────────────────────┼───────────┼──────────────┼──────────┼──────────┼─────────┼──────────────┼──────────────┼────────┼────────┼────────┼─────────┼────────┼────────────┼─────────┼───────────┼──────────┤ │ bc5862a5-5e4c-4f74-bdd5-598e140dbb8f │ Mapillary │ 941074783267368 │ 7.616701 │ 2.694255e+08 │ 2.694255e+08 │ 0.0 │ 1017964492 │ True │ 2 │ Carrera 33 │ tertiary │ drive │ NULL │ NULL │ 62.864 │ 2.694255e+08 │ 2.694255e+08 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 
4d445c9a-03e2-4dda-a494-caf907ad1620 │ Mapillary │ 2822175224761380 │ 4.063028 │ 1.605031e+08 │ 9.945572e+09 │ 0.0 │ 1085398478 │ False │ 1 │ Burgstraße │ footway │ walk │ 30 │ NULL │ 118.549 │ 9.945572e+09 │ 1.605031e+08 │ NULL │ NULL │ NULL │ NULL │ NULL │ 12.2 │ NULL │ NULL │ False │ │ 57713c58-62b2-465b-9df3-087b6d970603 │ Mapillary │ 387731282398462 │ 2.806557 │ 1.825447e+09 │ 5.656773e+09 │ 0.0 │ 171530233 │ False │ NULL │ 15th Street Northwest Cycle Track │ footway │ walk │ NULL │ NULL │ 53.521 │ 5.656773e+09 │ 1.825447e+09 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 5ce677fe-1f66-4a6e-a162-88ff5d6cd80a │ Mapillary │ 4331166880261832 │ 5.224566 │ 2.682150e+09 │ 7.694047e+09 │ 0.0 │ [771744361, 1108989117, 771743389] │ True │ 3 │ Jalan Pudu │ secondary │ drive │ NULL │ NULL │ 110.317 │ 7.694047e+09 │ 2.682150e+09 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 54fb768e-2864-4ed0-a658-e72f7a66cbc0 │ Mapillary │ 808249360075988 │ 5.111907 │ 2.473251e+08 │ 1.783876e+09 │ 0.0 │ 71197419 │ True │ 1 │ Avenue de la Liberté │ primary │ drive │ 50 │ NULL │ 106.098 │ 2.473251e+08 │ 1.783876e+09 │ N 3 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ └──────────────────────────────────────┴───────────┴──────────────────┴───────────┴──────────────┴──────────────┴─────────┴────────────────────────────────────┴─────────┴────────┴───────────────────────────────────┴───────────┴──────────────┴──────────┴──────────┴─────────┴──────────────┴──────────────┴────────┴────────┴────────┴─────────┴────────┴────────────┴─────────┴───────────┴──────────┘
csv_size = sum(file.stat().st_size for file in csv_files if file.is_file())
parquet_size = sum(file.stat().st_size for file in parquet_files if file.is_file())
reduction_factor = csv_size/parquet_size
logger.info(f"Total file size | CSV: {humanize.naturalsize(csv_size)} | Parquet: {humanize.naturalsize(parquet_size)} | Reduction factor: {reduction_factor:2.5f}")
Streetscapes | 2025-02-10@08:52:07 | Total file size | CSV: 14.1 GB | Parquet: 3.6 GB | Reduction factor: 3.86145
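The reduction factor is simply the ratio of the two totals: with the rounded figures above, 14.1 GB / 3.6 GB is roughly 3.9, while the logged 3.86145 is computed from the exact byte counts. The same bookkeeping can be wrapped in a small helper. The sketch below is an illustration only: the name `total_size` is hypothetical, and the files and their sizes are made up.

```python
import tempfile
from pathlib import Path


def total_size(directory: Path, pattern: str) -> int:
    """Sum the sizes in bytes of all regular files matching `pattern`."""
    return sum(p.stat().st_size for p in directory.glob(pattern) if p.is_file())


# Illustration with made-up file sizes (not the real dataset).
work_dir = Path(tempfile.mkdtemp())
(work_dir / "a.csv").write_bytes(b"x" * 4000)
(work_dir / "b.csv").write_bytes(b"x" * 6000)
(work_dir / "a.parquet").write_bytes(b"x" * 1000)
(work_dir / "b.parquet").write_bytes(b"x" * 1500)

csv_total = total_size(work_dir, "*.csv")
parquet_total = total_size(work_dir, "*.parquet")
factor = csv_total / parquet_total
print(f"CSV: {csv_total} B | Parquet: {parquet_total} B | factor: {factor:.2f}")
```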
We may want to combine the data from multiple files into a single Parquet file. If we JOIN the full tables directly with DuckDB, we quickly run into memory issues, because duckdb.sql(...) uses an in-memory database to load the data and keep track of intermediate results. This is why we created a DuckDB database on disk above. Ibis can use that database to perform the joins lazily, after which we can save the merged table as a Parquet file.
# Perform the joins.
logger.info(f"Starting merger with '{parquet_files[0].name}'...")
# Load the first file into a table.
# We are going to use it to perform incremental joins on that table.
joined = con.read_parquet(parquet_files[0]).as_table()
for parquet_file in parquet_files[1:]:
    # Lazy-join the next Parquet file on the UUID column.
    logger.info(f"Merging '{parquet_file.name}'...")
    joined = joined.join(con.read_parquet(parquet_file).as_table(), "uuid").as_table()
# Save the final joined table to a compressed Parquet file.
logger.info("Saving merged file...")
merged_full = MERGED_DIR / "streetscapes.parquet"
joined.to_parquet(merged_full, compression="ZSTD")
logger.info("Done!")
Streetscapes | 2025-02-10@08:52:07 | Starting merger with 'places365.parquet'...
Streetscapes | 2025-02-10@08:52:07 | Merging 'contextual.parquet'...
Streetscapes | 2025-02-10@08:52:07 | Merging 'segmentation.parquet'...
Streetscapes | 2025-02-10@08:52:07 | Merging 'metadata_common_attributes.parquet'...
Streetscapes | 2025-02-10@08:52:07 | Merging 'perception.parquet'...
Streetscapes | 2025-02-10@08:52:07 | Merging 'ghsl.parquet'...
Streetscapes | 2025-02-10@08:52:07 | Merging 'simplemaps.parquet'...
Streetscapes | 2025-02-10@08:52:07 | Merging 'osm.parquet'...
Streetscapes | 2025-02-10@08:52:07 | Saving merged file...
Streetscapes | 2025-02-10@08:52:51 | Done!
# Show the merged file size
merged_size = merged_full.stat().st_size
logger.info(f"Merged file size: {humanize.naturalsize(merged_size)}")
Streetscapes | 2025-02-10@08:52:51 | Merged file size: 2.1 GB
con.read_parquet(merged_full).head()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓ ┃ uuid ┃ source ┃ orig_id ┃ place ┃ source_right ┃ 
orig_id_right ┃ glare ┃ lighting_condition ┃ pano_status ┃ platform ┃ quality ┃ reflection ┃ view_direction ┃ weather ┃ Bird ┃ Ground-Animal ┃ Curb ┃ Fence ┃ Guard-Rail ┃ Barrier ┃ Wall ┃ Bike-Lane ┃ Crosswalk---Plain ┃ Curb-Cut ┃ Parking ┃ Pedestrian-Area ┃ Rail-Track ┃ Road ┃ Service-Lane ┃ Sidewalk ┃ Bridge ┃ Building ┃ Tunnel ┃ Person ┃ Bicyclist ┃ Motorcyclist ┃ Other-Rider ┃ Lane-Marking---Crosswalk ┃ Lane-Marking---General ┃ Mountain ┃ Sand ┃ Sky ┃ Snow ┃ Terrain ┃ Vegetation ┃ Water ┃ Banner ┃ Bench ┃ Bike-Rack ┃ Billboard ┃ Catch-Basin ┃ CCTV-Camera ┃ Fire-Hydrant ┃ Junction-Box ┃ Mailbox ┃ Manhole ┃ Phone-Booth ┃ Pothole ┃ Street-Light ┃ Pole ┃ Traffic-Sign-Frame ┃ Utility-Pole ┃ Traffic-Light ┃ Traffic-Sign-(Back) ┃ Traffic-Sign-(Front) ┃ Trash-Can ┃ Bicycle ┃ Boat ┃ Bus ┃ Car ┃ Caravan ┃ Motorcycle ┃ On-Rails ┃ Other-Vehicle ┃ Trailer ┃ Truck ┃ Wheeled-Slow ┃ Car-Mount ┃ Ego-Vehicle ┃ Total ┃ green_view_index ┃ sky_view_index ┃ lat ┃ lon ┃ datetime_local ┃ year ┃ month ┃ day ┃ hour ┃ width ┃ height ┃ heading ┃ projection_type ┃ hFoV ┃ vFoV ┃ sequence_index ┃ sequence_id ┃ sequence_img_count ┃ Beautiful ┃ Boring ┃ Depressing ┃ Lively ┃ Safe ┃ Wealthy ┃ urban_code ┃ urban_term ┃ city ┃ city_ascii ┃ city_id ┃ city_lat ┃ city_lon ┃ country ┃ iso2 ┃ iso3 ┃ admin_name ┃ capital ┃ population ┃ continent ┃ snap_dist ┃ u ┃ v ┃ key ┃ osmid ┃ oneway ┃ lanes ┃ name ┃ highway ┃ type_highway ┃ maxspeed ┃ junction ┃ length ┃ from ┃ to ┃ ref ┃ tunnel_1 ┃ bridge_1 ┃ service ┃ access ┃ road_width ┃ area ┃ est_width ┃ reversed ┃ 
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩ │ string │ string │ int64 │ string │ string │ int64 
│ boolean │ string │ boolean │ string │ string │ boolean │ string │ string │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ timestamp(6) │ int64 │ int64 │ int64 │ int64 │ float64 │ float64 │ float64 │ string │ float64 │ float64 │ int64 │ string │ int64 │ float64 │ float64 │ float64 │ float64 │ float64 │ float64 │ int64 │ string │ string │ string │ int64 │ float64 │ float64 │ string │ string │ string │ string │ string │ float64 │ string │ float64 │ float64 │ float64 │ float64 │ string │ boolean │ string │ string │ string │ string │ string │ string │ float64 │ float64 │ float64 │ string │ string │ string │ string │ string │ string │ boolean │ string │ string │ 
├──────────────────────────────────────┼───────────┼──────────────────┼──────────────────┼──────────────┼──────────────────┼─────────┼────────────────────┼─────────────┼─────────────────┼───────────────┼────────────┼────────────────┼─────────┼─────────┼───────────────┼─────────┼─────────┼────────────┼─────────┼──────────┼───────────┼───────────────────┼──────────┼─────────┼─────────────────┼────────────┼──────────────┼──────────────┼──────────┼──────────────┼──────────┼─────────┼──────────┼───────────┼──────────────┼─────────────┼──────────────────────────┼────────────────────────┼──────────┼─────────┼──────────┼─────────┼─────────┼────────────┼─────────┼─────────┼─────────┼───────────┼───────────┼─────────────┼─────────────┼──────────────┼──────────────┼─────────┼─────────┼─────────────┼─────────┼──────────────┼─────────┼────────────────────┼──────────────┼───────────────┼─────────────────────┼──────────────────────┼───────────┼─────────┼─────────┼─────────┼──────────┼─────────┼────────────┼──────────┼───────────────┼─────────┼─────────┼──────────────┼───────────┼─────────────┼──────────────┼──────────────────┼────────────────┼───────────┼────────────┼─────────────────────────┼───────┼───────┼───────┼───────┼─────────┼─────────┼────────────┼─────────────────┼───────────┼───────────┼────────────────┼────────────────────────┼────────────────────┼───────────┼─────────┼────────────┼─────────┼─────────┼─────────┼────────────┼──────────────┼─────────────┼─────────────┼────────────┼──────────┼──────────┼────────────┼────────┼────────┼────────────────────┼─────────┼──────────────┼───────────────┼───────────┼──────────────┼──────────────┼─────────┼────────────────────────┼─────────┼────────────┼──────────────────────────────────────┼──────────────┼──────────────┼──────────┼──────────┼─────────┼──────────────┼──────────────┼───────────┼──────────┼──────────┼─────────┼────────┼────────────┼─────────┼───────────┼──────────┤ │ 1efa6715-5d26-49ef-8d21-e15f63cf9fe9 │ Mapillary │ 
176200881083293 │ highway │ Mapillary │ 176200881083293 │ False │ day │ False │ driving surface │ good │ True │ front/back │ rainy │ 0.0 │ 0.0 │ 12693.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 1.521630e+05 │ 0.0 │ 0.0 │ 0.000000e+00 │ 168672.0 │ 0.0 │ 1326.0 │ 0.0 │ 0.0 │ 0.0 │ 11169.0 │ 4308.0 │ 0.0 │ 0.0 │ 775167.0 │ 0.0 │ 0.0 │ 103878.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 78279.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 3024.0 │ 6270.0 │ 0.0 │ 24846.0 │ 7905.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 375252.0 │ 0.0 │ 0.0 │ 160167.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 363588.0 │ 2.248707e+06 │ 0.046195 │ 0.344717 │ 44.810785 │ 20.466301 │ 2019-12-25 12:38:54.689 │ 2019 │ 12 │ 25 │ 13 │ 3840.0 │ 2160.0 │ 270.000000 │ fisheye │ 96.233149 │ 64.198573 │ 6 │ d9gafdvzocqzo3q6hivjd0 │ 107 │ 4.92 │ 5.37 │ 9.26 │ 6.31 │ 1.39 │ 0.83 │ 30 │ urban centre │ Belgrade │ Belgrade │ 1688374696 │ 44.8167 │ 20.4667 │ Serbia │ RS │ SRB │ Beograd │ primary │ 1.378682e+06 │ Europe │ 3.528198 │ 1.637649e+09 │ 1.637649e+09 │ 0.0 │ 150899584 │ True │ 1 │ Таковска │ primary_link │ drive │ 50 │ NULL │ 48.871 │ 1.637649e+09 │ 1.637649e+09 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 34de479f-e85a-4233-a6a6-9f776de6312b │ Mapillary │ 2654377988189349 │ plaza │ Mapillary │ 2654377988189349 │ False │ day │ False │ driving surface │ good │ False │ side │ cloudy │ 0.0 │ 0.0 │ 2735.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 2.661500e+04 │ 0.0 │ 0.0 │ 0.000000e+00 │ 368485.0 │ 0.0 │ 20110.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 931160.0 │ 0.0 │ 0.0 │ 35125.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 2230.0 │ 35095.0 │ 0.0 │ 0.0 │ 2885.0 │ 0.0 │ 1350.0 │ 0.0 │ 0.0 │ 0.0 │ 11275.0 │ 316010.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 1.753075e+06 │ 0.020036 │ 0.531158 │ 34.017047 │ -6.835733 │ 2018-05-22 09:51:30.000 │ 2018 │ 5 │ 22 │ 9 │ 1920.0 │ 1080.0 │ 109.129488 │ perspective │ 
78.582521 │ 49.429226 │ 98 │ zFhb78wZ_pcA7tq6MkJHyQ │ 140 │ 2.02 │ 3.10 │ 6.16 │ 8.02 │ 2.14 │ 5.91 │ 30 │ urban centre │ Rabat │ Rabat │ 1504023252 │ 34.0253 │ -6.8361 │ Morocco │ MA │ MAR │ Rabat-Salé-Kénitra │ primary │ 5.727170e+05 │ Africa │ 1.331304 │ 2.613010e+09 │ 4.521762e+09 │ 0.0 │ 255624925 │ True │ 2 │ Rue Taïf زنقة الطائف │ residential │ drive │ NULL │ NULL │ 63.169 │ 4.521762e+09 │ 2.613010e+09 │ RR401 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 3e2d8c45-c8d5-4080-9733-388f222f260a │ Mapillary │ 233835585180942 │ downtown │ Mapillary │ 233835585180942 │ False │ day │ True │ cycling surface │ good │ False │ NULL │ clear │ 0.0 │ 0.0 │ 591.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 1.567120e+05 │ 0.0 │ 9703.0 │ 0.000000e+00 │ 667694.0 │ 0.0 │ 2522.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 347902.0 │ 0.0 │ 0.0 │ 69053.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 27912.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 2028.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 630068.0 │ 0.0 │ 17165.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 1.931350e+06 │ 0.035754 │ 0.180134 │ 33.886587 │ 35.519169 │ 2020-10-15 15:40:39.000 │ 2020 │ 10 │ 15 │ 18 │ 5760.0 │ 2880.0 │ 0.000000 │ equirectangular │ NULL │ NULL │ 35 │ yEpZX20yhILTT5NKCtRnCq │ 29 │ 3.61 │ 1.39 │ 5.68 │ 6.30 │ 4.96 │ 5.56 │ 30 │ urban centre │ Beirut │ Beirut │ 1422847713 │ 33.8869 │ 35.5131 │ Lebanon │ LB │ LBN │ Beyrouth │ primary │ 3.613660e+05 │ Asia │ 1.480610 │ 2.765663e+08 │ 2.765663e+08 │ 0.0 │ 227467833 │ True │ 1.0 │ Rue Adiba Ishac │ tertiary │ drive │ NULL │ NULL │ 50.724 │ 2.765663e+08 │ 2.765663e+08 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ │ 06baa25b-6a67-435b-8197-7cfcc76393aa │ Mapillary │ 474759637083312 │ ice_skating_rink │ Mapillary │ 474759637083312 │ False │ day │ False │ walking surface │ slightly poor │ False │ front/back │ clear │ 0.0 │ 0.0 │ 9884.0 │ 11096.0 │ 0.0 │ 0.0 │ 284208.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 
0.0 │ 0.0 │ 1.949924e+06 │ 0.0 │ 19988.0 │ 0.000000e+00 │ 9276.0 │ 0.0 │ 119556.0 │ 0.0 │ 272.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 483920.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 9044.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 2160.0 │ 0.0 │ 11700.0 │ 0.0 │ 0.0 │ 8864.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 63684.0 │ 0.0 │ 148.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 2.983724e+06 │ 0.162187 │ 0.000000 │ 23.742577 │ 90.382349 │ 2018-01-10 03:04:54.641 │ 2018 │ 1 │ 10 │ 9 │ 3264.0 │ 2448.0 │ 252.479889 │ perspective │ 59.423003 │ 46.341075 │ 58 │ tQYs5LvCeWKv9Qblel_LfA │ 97 │ 3.82 │ 5.39 │ 6.92 │ 1.45 │ 1.42 │ 1.58 │ 30 │ urban centre │ Dhaka │ Dhaka │ 1050529279 │ 23.7289 │ 90.3944 │ Bangladesh │ BD │ BGD │ Dhaka │ primary │ 1.683900e+07 │ Asia │ 2.445759 │ 3.727451e+08 │ 1.798125e+09 │ 0.0 │ 33051555 │ False │ NULL │ Road 5 │ residential │ drive │ NULL │ NULL │ 243.555 │ 1.798125e+09 │ 3.727451e+08 │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ True │ │ 7f7d21fd-ce6b-4737-90ea-bca221e67521 │ Mapillary │ 2872981302961624 │ airport_terminal │ Mapillary │ 2872981302961624 │ False │ day │ True │ cycling surface │ slightly poor │ False │ NULL │ snowy │ 0.0 │ 0.0 │ 10013.0 │ 22991.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 9.946400e+04 │ 0.0 │ 0.0 │ 1.558717e+06 │ 17737.0 │ 0.0 │ 74577.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 4345.0 │ 18948.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 9199.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 15497.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 0.0 │ 1.831488e+06 │ 0.010346 │ 0.000000 │ 19.423489 │ -99.129826 │ 2017-07-30 13:12:42.735 │ 2017 │ 7 │ 30 │ 8 │ 5660.0 │ 2830.0 │ 0.000000 │ equirectangular │ NULL │ NULL │ 87 │ heM59GlGAguXpVLrvIZ_AQ │ 104 │ 1.85 │ 5.24 │ 6.81 │ 2.07 │ 1.64 │ 3.62 │ 30 │ urban centre │ Mexico City │ Mexico City │ 1484247881 │ 19.4333 │ -99.1333 │ Mexico │ MX │ MEX │ Ciudad de México │ 
primary │ 2.150500e+07 │ North America │ 2.518635 │ 2.709783e+08 │ 2.709810e+08 │ 0.0 │ [866722817, 867017411] │ True │ ['4', '3'] │ Avenida Fray Servando Teresa de Mier │ trunk │ drive │ NULL │ NULL │ 188.753 │ 2.709810e+08 │ 2.709783e+08 │ EJE 1 SUR │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ NULL │ False │ └──────────────────────────────────────┴───────────┴──────────────────┴──────────────────┴──────────────┴──────────────────┴─────────┴────────────────────┴─────────────┴─────────────────┴───────────────┴────────────┴────────────────┴─────────┴─────────┴───────────────┴─────────┴─────────┴────────────┴─────────┴──────────┴───────────┴───────────────────┴──────────┴─────────┴─────────────────┴────────────┴──────────────┴──────────────┴──────────┴──────────────┴──────────┴─────────┴──────────┴───────────┴──────────────┴─────────────┴──────────────────────────┴────────────────────────┴──────────┴─────────┴──────────┴─────────┴─────────┴────────────┴─────────┴─────────┴─────────┴───────────┴───────────┴─────────────┴─────────────┴──────────────┴──────────────┴─────────┴─────────┴─────────────┴─────────┴──────────────┴─────────┴────────────────────┴──────────────┴───────────────┴─────────────────────┴──────────────────────┴───────────┴─────────┴─────────┴─────────┴──────────┴─────────┴────────────┴──────────┴───────────────┴─────────┴─────────┴──────────────┴───────────┴─────────────┴──────────────┴──────────────────┴────────────────┴───────────┴────────────┴─────────────────────────┴───────┴───────┴───────┴───────┴─────────┴─────────┴────────────┴─────────────────┴───────────┴───────────┴────────────────┴────────────────────────┴────────────────────┴───────────┴─────────┴────────────┴─────────┴─────────┴─────────┴────────────┴──────────────┴─────────────┴─────────────┴────────────┴──────────┴──────────┴────────────┴────────┴────────┴────────────────────┴─────────┴──────────────┴───────────────┴───────────┴──────────────┴──────────────┴─────────┴─────────────────────
───┴─────────┴────────────┴──────────────────────────────────────┴──────────────┴──────────────┴──────────┴──────────┴─────────┴──────────────┴──────────────┴───────────┴──────────┴──────────┴─────────┴────────┴────────────┴─────────┴───────────┴──────────┘
For some use cases it is more convenient to gather selected columns from different files into a single table. This works much like the previous example: we create a dictionary that maps each file name to the columns we want from it. Every selected file must also contain a column that is common to all of them to join on, which here is uuid.
# Create dictionary choosing files and columns
selection = {
    "contextual": ["uuid", "source", "orig_id"],
    "osm": ["uuid", "road_width", "type_highway"],
    "simplemaps": ["uuid", "city"],
    "metadata_common_attributes": ["uuid", "lat", "lon"],
}
# Turn the selection into a list for easier traversal
selection = list(selection.items())
# Load the first file into a table.
# We are going to use it to perform incremental joins on that table.
parquet_file = PARQUET_DIR / f"{selection[0][0]}.parquet"
cols = selection[0][1]
logger.info(f"Starting merger with '{parquet_file.name}'...")
joined = con.read_parquet(parquet_file).select(*cols).as_table()
for file_name, cols in selection[1:]:
    parquet_file = PARQUET_DIR / f"{file_name}.parquet"
    logger.info(f"Merging table '{parquet_file.name}'...")
    joined = joined.join(con.read_parquet(parquet_file).select(*cols).as_table(), "uuid").as_table()
# Save the final joined table to a compressed Parquet file.
logger.info("Saving merged file...")
merged_selection = MERGED_DIR / "streetscapes_selection.parquet"
joined.to_parquet(merged_selection, compression="ZSTD")
logger.info("Done!")
Streetscapes | 2025-02-10@08:52:51 | Starting merger with 'contextual.parquet'...
Streetscapes | 2025-02-10@08:52:51 | Merging table 'osm.parquet'...
Streetscapes | 2025-02-10@08:52:51 | Merging table 'simplemaps.parquet'...
Streetscapes | 2025-02-10@08:52:51 | Merging table 'metadata_common_attributes.parquet'...
Streetscapes | 2025-02-10@08:52:51 | Saving merged file...
Streetscapes | 2025-02-10@08:52:58 | Done!
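Note that the join used in the loop above is an inner join (the default for the Ibis join method), so a row survives only if its uuid appears in every selected file. The following is a minimal, standard-library sketch of that behaviour, not the Ibis implementation: the toy tables, their values and the inner_join helper are all hypothetical and exist only for illustration.

```python
from functools import reduce

# Toy stand-ins for the Parquet tables: each maps uuid -> selected columns.
# All keys and values below are made up for illustration.
contextual = {
    "a1": {"source": "Mapillary"},
    "b2": {"source": "Mapillary"},
    "c3": {"source": "KartaView"},
}
osm = {
    "a1": {"type_highway": "walk"},
    "b2": {"type_highway": "drive"},
}
simplemaps = {
    "a1": {"city": "Zürich"},
    "c3": {"city": "Rabat"},
}


def inner_join(left, right):
    """Keep only the uuids present in both tables and merge their columns."""
    return {
        uuid: {**left[uuid], **right[uuid]}
        for uuid in left
        if uuid in right
    }


# Fold the tables together one at a time, mirroring the incremental loop.
joined = reduce(inner_join, [contextual, osm, simplemaps])

print(len(joined))
print(joined)
```

Only the uuid that is present in all three toy tables remains. If the merged file turns out to have fewer rows than expected, a missing uuid in one of the source files is a likely cause.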
# Show the merged file size
merged_size = merged_selection.stat().st_size
logger.info(f"Merged file size: {humanize.naturalsize(merged_size)}")
Streetscapes | 2025-02-10@08:52:58 | Merged file size: 333.1 MB
# Let's inspect the new file to see if the join has worked
con.read_parquet(merged_selection).head()
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ uuid                                 ┃ source    ┃ orig_id          ┃ road_width ┃ type_highway ┃ city   ┃ lat       ┃ lon      ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ string                               │ string    │ int64            │ string     │ string       │ string │ float64   │ float64  │
├──────────────────────────────────────┼───────────┼──────────────────┼────────────┼──────────────┼────────┼───────────┼──────────┤
│ 090314f3-d97e-4d5f-b3e9-889c7bb959a7 │ Mapillary │ 1123699064858321 │ NULL       │ walk         │ Zürich │ 47.379369 │ 8.525423 │
│ 7ed13e37-763b-43e8-8206-5c61226fb3c6 │ Mapillary │ 2042189375946817 │ 2          │ walk         │ Zürich │ 47.379042 │ 8.547106 │
│ ac0af75c-a788-42b6-ba4b-f6e7cd120cf1 │ Mapillary │ 2974164709523083 │ NULL       │ drive        │ Zürich │ 47.382700 │ 8.546596 │
│ 1aa6a5d5-401d-4ac8-ba2c-595e65ec1aed │ Mapillary │  189752149666038 │ NULL       │ walk/cycle   │ Zürich │ 47.376921 │ 8.539161 │
│ fe3060ce-3590-4829-b8a6-1cf82939d4a4 │ Mapillary │  531344007857891 │ 2          │ walk         │ Zürich │ 47.372355 │ 8.546553 │
└──────────────────────────────────────┴───────────┴──────────────────┴────────────┴──────────────┴────────┴───────────┴──────────┘
We are in touch with the developers of the original Global Streetscapes dataset about adding these Parquet files to the dataset on Hugging Face.