In [1]:
import sys
sys.path.append('..')
import numpy as np
import pandas as pd
import country_converter as coco
from haversine import haversine

Use world_cities csv as the unlabeled dataset, as it is more complete

From there population will be filled

In [2]:
from data.unlabeled import WORLD_CITIES
from data.labeled.preprocessed import LABELED_CITIES
WORLD_CITIES.head()
Out[2]:
city city_ascii lat lng country iso2 iso3 admin_name capital population id
0 Tokyo Tokyo 35.6897 139.6922 Japan JP JPN Tōkyō primary 37977000.0 1392685764
1 Jakarta Jakarta -6.2146 106.8451 Indonesia ID IDN Jakarta primary 34540000.0 1360771077
2 Delhi Delhi 28.6600 77.2300 India IN IND Delhi admin 29617000.0 1356872604
3 Mumbai Mumbai 18.9667 72.8333 India IN IND Mahārāshtra admin 23355000.0 1356226629
4 Manila Manila 14.5958 120.9772 Philippines PH PHL Manila primary 23088000.0 1608618140

Create the 3 letter symbol code mapping for all the countries existing inside

In [3]:
code_dict = {x: coco.convert(x) for x in set(WORLD_CITIES['country'].unique()).union(LABELED_CITIES['country'].unique())}

Add the corresponding column to both datasets

In [4]:
WORLD_CITIES['country_code'] = WORLD_CITIES['country'].apply(lambda x:  code_dict[x])
LABELED_CITIES['country_code']= LABELED_CITIES['country'].apply(lambda x:  code_dict[x])

Merge those datasets together using the following steps:

  1. Create a dictionary that contains the distances between cities that belong to the same country (with the same code). For every city whose name exists as is in both datasets, set the distance 0
  2. Compute the minimum distance and its location per city, that resides in the labeled cities
  3. If this distance is less than a threshold, that has been set to 10 kilometers, assume that the cities are the same. Use this mapping to fill the unlabeled cities with the labeled risks. If not, expand the unlabeled dataset with the city
  4. Drop any row that has no recorded population.
In [5]:
mapping_dists = {x['city']:{y['city']: (haversine((x['latitude'],x['longitude']),(y['lat'],y['lng'])) if x['city']!=y['city'] else 0)
                            for (_, y) in WORLD_CITIES[WORLD_CITIES['country_code']==x['country_code']].iterrows()} for (_, x) in LABELED_CITIES.iterrows()}
In [6]:
WORLD_CITIES.set_index('city', inplace=True)
LABELED_CITIES.set_index('city', inplace=True)

Fix some cases where the latitude and the longitude are in the wrong place

In [7]:
mapping_dists_fixed = {
    x:(
        {y: val for y,val in d.items()} if min(d.values()) < 2000 else 
        {y: haversine((LABELED_CITIES.loc[[x], 'longitude'].values[0],LABELED_CITIES.loc[[x], 'latitude'].values[0]),
                      (WORLD_CITIES.loc[[y], 'lat'].values[0],WORLD_CITIES.loc[[y], 'lng'].values[0]))
        for  y,val in d.items()}) for x,d in mapping_dists.items()}
In [8]:
mapping_df = pd.DataFrame(mapping_dists_fixed).T
mapping_df.head()
Out[8]:
Al Quds New York Los Angeles Chicago Miami Dallas Philadelphia Houston Atlanta Washington ... Wum Kribi Batouri Kalfou Bélabo Eséka Abong Mbang Kontcha Aiyomojok Mbé
Abasan Al-Kabira 97.672711 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Abington NaN 119.372096 3857.172736 1070.945249 1752.117379 154.8733 1492.55584 2167.739061 1865.051282 1400.845831 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Abuja NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Addis Ababa NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Adelaide NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 17304 columns

In [9]:
closest_match_df = pd.DataFrame(mapping_df.fillna(100000000).apply(lambda x: {'match':mapping_df.columns[np.argmin(x)], 'distance':np.min(x)}, axis=1).tolist())
In [10]:
closest_match_df.index = mapping_df.index
closest_match_df
Out[10]:
match distance
Abasan Al-Kabira Al Quds 97.672711
Abington Abington 0.000000
Abuja Abuja 0.000000
Addis Ababa Addis Ababa 0.000000
Adelaide Adelaide 0.000000
... ... ...
Xalapa Xalapa 0.000000
Yaoundé 6 Yaoundé 0.459487
Yokohama Yokohama 0.000000
Ílhavo Estremoz 1.115491
Örebro Örebro 0.985037

282 rows × 2 columns

In [11]:
dist_thres = 10 # the distance threshold for a city to be considered mapped correctly, in km
In [12]:
unmapped = closest_match_df[closest_match_df['distance']>dist_thres]
In [13]:
to_add_whole = LABELED_CITIES.loc[unmapped.index].copy() # add those rows as a whole to the augmented dataset
to_add_labels = LABELED_CITIES.loc[closest_match_df[closest_match_df['distance']<=dist_thres].index].copy() # only add the labels from these rows
to_add_labels.index = closest_match_df.loc[closest_match_df['distance']<=dist_thres,'match']

Creating an augmented dataset with all the cities, annotated or not

In [14]:
from data.labeled.preprocessed import RISKS_MAPPING
risks_cols = list(RISKS_MAPPING.keys())
In [15]:
augmented_cities_dataset = WORLD_CITIES[['lat','lng','country','population','country_code']].rename(columns={'lat':'latitude','lng':'longitude'}).copy()
augmented_cities_dataset = augmented_cities_dataset.merge(to_add_labels[['c40'] + risks_cols],left_index=True, right_index=True,how='left')
In [16]:
augmented_cities_dataset = pd.concat([augmented_cities_dataset, to_add_whole],axis=0)
augmented_cities_dataset['c40']= augmented_cities_dataset['c40'].fillna(False)
In [17]:
augmented_cities_dataset = augmented_cities_dataset[~pd.isna(augmented_cities_dataset['population'])]
In [18]:
augmented_cities_dataset
Out[18]:
latitude longitude country population country_code c40 risk0 risk1 risk2 risk3 risk4 risk5 risk6
A Coruña 43.3713 -8.4188 Spain 370610.000 ESP False NaN NaN NaN NaN NaN NaN NaN
Aachen 50.7762 6.0838 Germany 247380.000 DEU False NaN NaN NaN NaN NaN NaN NaN
Aalborg 57.0337 9.9166 Denmark 122219.000 DNK False NaN NaN NaN NaN NaN NaN NaN
Aalen 48.8372 10.0936 Germany 68456.000 DEU False NaN NaN NaN NaN NaN NaN NaN
Aalst 50.9333 4.0333 Belgium 85715.000 BEL False NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Sekhukhune District Municipality -24.8335 29.9740 South Africa 1076840.000 ZAF False NaN NaN NaN 2.0 NaN NaN NaN
Sopó 4.9075 -73.9384 Colombia 27932.000 COL False NaN NaN NaN NaN 1.0 NaN NaN
Tangará da Serra -14.6194 -57.4858 Brazil 83.431 BRA False NaN NaN 2.0 NaN NaN NaN NaN
Tuzla 44.5375 18.6735 Turkey 235000.000 TUR False NaN NaN 2.0 NaN NaN NaN NaN
Vhembe -22.7695 29.9740 South Africa 1441178.000 ZAF False NaN NaN 3.0 3.0 NaN NaN NaN

25633 rows × 13 columns

Fix population for specific cities with a float one

which probably is cause by used "."

In [19]:
augmented_cities_dataset.population = augmented_cities_dataset.population.apply(lambda x: x * 1000 if int(x)!=x else x)

Filter out cities that are unlabeled and are not big enough, consulting BIG CITIES ALL COUNTRIES csv

In [20]:
from data.unlabeled import BIG_CITIES_ALL_COUNTRIES
In [21]:
unlab_mask = augmented_cities_dataset[risks_cols].isnull().all(axis=1)
not_in_big_cities_mask = augmented_cities_dataset.index.to_series().apply(lambda x: x not in BIG_CITIES_ALL_COUNTRIES.city.values)
augmented_cities_dataset = augmented_cities_dataset[~(unlab_mask&not_in_big_cities_mask)]
In [22]:
augmented_cities_dataset
Out[22]:
latitude longitude country population country_code c40 risk0 risk1 risk2 risk3 risk4 risk5 risk6
Aalborg 57.0337 9.9166 Denmark 122219.0 DNK False NaN NaN NaN NaN NaN NaN NaN
Aarhus 56.1572 10.2107 Denmark 237551.0 DNK False NaN NaN NaN NaN NaN NaN NaN
Aasiaat 68.7097 -52.8694 Greenland 3134.0 GRL False NaN NaN NaN NaN NaN NaN NaN
Abidjan 5.3364 -4.0267 Côte D’Ivoire 4980000.0 CIV False NaN NaN NaN NaN 1.0 NaN NaN
Abington 40.1108 -75.1146 United States 55573.0 USA False NaN 1.0 1.0 2.0 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Sekhukhune District Municipality -24.8335 29.9740 South Africa 1076840.0 ZAF False NaN NaN NaN 2.0 NaN NaN NaN
Sopó 4.9075 -73.9384 Colombia 27932.0 COL False NaN NaN NaN NaN 1.0 NaN NaN
Tangará da Serra -14.6194 -57.4858 Brazil 83431.0 BRA False NaN NaN 2.0 NaN NaN NaN NaN
Tuzla 44.5375 18.6735 Turkey 235000.0 TUR False NaN NaN 2.0 NaN NaN NaN NaN
Vhembe -22.7695 29.9740 South Africa 1441178.0 ZAF False NaN NaN 3.0 3.0 NaN NaN NaN

1437 rows × 13 columns

Load and merge countries features

In [23]:
from data.unlabeled import COUNTRIES_DATASET
In [24]:
set(augmented_cities_dataset.columns).intersection(COUNTRIES_DATASET.columns)
Out[24]:
set()
In [25]:
augmented_cities_dataset = augmented_cities_dataset.merge(COUNTRIES_DATASET, left_on='country_code', right_index=True)
In [26]:
all(augmented_cities_dataset.Country == augmented_cities_dataset.country)
Out[26]:
False
In [27]:
augmented_cities_dataset.drop(columns='country',inplace=True)
augmented_cities_dataset.rename(columns={'Country':'country'},inplace=True)
In [28]:
augmented_cities_dataset= augmented_cities_dataset.reset_index().rename(columns={'index':'city'}).drop_duplicates(subset=['city','country'])
In [29]:
augmented_cities_dataset
Out[29]:
city latitude longitude population country_code c40 risk0 risk1 risk2 risk3 ... SDG 6.4.1. Services Water Use Efficiency SDG 6.4.1. Water Use Efficiency SDG 6.4.2. Water Stress Seasonal variability (WRI) Total internal renewable water resources per capita Total population with access to safe drinking-water (JMP) Total renewable water resources per capita Total water withdrawal per capita Urban population with access to safe drinking-water (JMP) country
0 Aalborg 57.0337 9.9166 122219.0 DNK False NaN NaN NaN NaN ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
1 Aarhus 56.1572 10.2107 237551.0 DNK False NaN NaN NaN NaN ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
2 Copenhagen 55.6786 12.5635 1085000.0 DNK False NaN 2.0 NaN 2.0 ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
3 Esbjerg 55.4670 8.4500 72205.0 DNK False NaN NaN NaN NaN ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
4 Frederikshavn 57.4337 10.5333 24103.0 DNK False NaN 2.0 NaN NaN ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1380 Teaoraereke 1.3333 173.0000 5105.0 KIR False NaN NaN NaN NaN ... 110.288104 80.302831 68.505925 2.532741 -1779.609434 66.9 0.000000 414.127038 87.3 Kiribati
1381 Trindade 0.3000 6.6667 6636.0 STP False NaN NaN NaN NaN ... 17.162254 7.489641 1.876147 2.502891 10526.874919 97.1 10526.874919 197.499626 98.9 Sao Tome and Principe
1382 Victoria -4.6236 55.4544 26450.0 SYC False NaN NaN NaN NaN ... 120.471023 89.323035 69.243227 1.601611 -2157.596566 95.7 0.000000 142.089651 95.7 Seychelles
1383 Warsaw 52.2167 21.0333 1790658.0 POL True 0.0 0.0 2.0 0.0 ... 153.691550 43.570222 34.890966 0.600000 1412.266377 98.3 1594.069324 265.616741 99.3 Poland
1384 Abasan Al-Kabira 31.3231 34.3440 32000.0 PSE False NaN 2.0 2.0 2.0 ... 51.887163 34.748970 41.078306 3.100000 171.047224 58.4 176.313456 79.035614 50.7 Palestine

1211 rows × 157 columns

In [30]:
from data.dataset import DATASET_PATH
augmented_cities_dataset.to_csv(DATASET_PATH,index=False)
In [31]:
augmented_cities_dataset
Out[31]:
city latitude longitude population country_code c40 risk0 risk1 risk2 risk3 ... SDG 6.4.1. Services Water Use Efficiency SDG 6.4.1. Water Use Efficiency SDG 6.4.2. Water Stress Seasonal variability (WRI) Total internal renewable water resources per capita Total population with access to safe drinking-water (JMP) Total renewable water resources per capita Total water withdrawal per capita Urban population with access to safe drinking-water (JMP) country
0 Aalborg 57.0337 9.9166 122219.0 DNK False NaN NaN NaN NaN ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
1 Aarhus 56.1572 10.2107 237551.0 DNK False NaN NaN NaN NaN ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
2 Copenhagen 55.6786 12.5635 1085000.0 DNK False NaN 2.0 NaN 2.0 ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
3 Esbjerg 55.4670 8.4500 72205.0 DNK False NaN NaN NaN NaN ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
4 Frederikshavn 57.4337 10.5333 24103.0 DNK False NaN 2.0 NaN NaN ... 558.335628 368.612902 20.040562 1.300000 1046.705025 100.0 1046.705025 129.285516 100.0 Denmark
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1380 Teaoraereke 1.3333 173.0000 5105.0 KIR False NaN NaN NaN NaN ... 110.288104 80.302831 68.505925 2.532741 -1779.609434 66.9 0.000000 414.127038 87.3 Kiribati
1381 Trindade 0.3000 6.6667 6636.0 STP False NaN NaN NaN NaN ... 17.162254 7.489641 1.876147 2.502891 10526.874919 97.1 10526.874919 197.499626 98.9 Sao Tome and Principe
1382 Victoria -4.6236 55.4544 26450.0 SYC False NaN NaN NaN NaN ... 120.471023 89.323035 69.243227 1.601611 -2157.596566 95.7 0.000000 142.089651 95.7 Seychelles
1383 Warsaw 52.2167 21.0333 1790658.0 POL True 0.0 0.0 2.0 0.0 ... 153.691550 43.570222 34.890966 0.600000 1412.266377 98.3 1594.069324 265.616741 99.3 Poland
1384 Abasan Al-Kabira 31.3231 34.3440 32000.0 PSE False NaN 2.0 2.0 2.0 ... 51.887163 34.748970 41.078306 3.100000 171.047224 58.4 176.313456 79.035614 50.7 Palestine

1211 rows × 157 columns

In [32]:
(~(augmented_cities_dataset[risks_cols].isnull()).all(axis=1)).sum() # the labeled samples
Out[32]:
324