import sys
sys.path.append('..')
import numpy as np
import pandas as pd
import country_converter as coco
from haversine import haversine

Use world_cities csv as the unlabeled dataset, as it is more complete¶

From there population will be filled

from data.unlabeled import WORLD_CITIES
from data.labeled.preprocessed import LABELED_CITIES
WORLD_CITIES.head()

Create the 3 letter symbol code mapping for all the countries existing inside¶

code_dict = {x: coco.convert(x) for x in set(WORLD_CITIES['country'].unique()).union(LABELED_CITIES['country'].unique())}

Add the corresponding column to both datasets¶

WORLD_CITIES['country_code'] = WORLD_CITIES['country'].apply(lambda x:  code_dict[x])
LABELED_CITIES['country_code']= LABELED_CITIES['country'].apply(lambda x:  code_dict[x])

Merge those datasets together using the following steps:¶

Create a dictionary that contains the distances between cities that belong to the same country (with the same code). For every city whose name exists as is in both datasets, set the distance 0
Compute the minimum distance and its location per city, that resides in the labeled cities
If this distance is less than a threshold, that has been set to 10 kilometers, assume that the cities are the same. Use this mapping to fill the unlabeled cities with the labeled risks. If not, expand the unlabeled dataset with the city
Drop any row that has no recorded population.

mapping_dists = {x['city']:{y['city']: (haversine((x['latitude'],x['longitude']),(y['lat'],y['lng'])) if x['city']!=y['city'] else 0)
                            for (_, y) in WORLD_CITIES[WORLD_CITIES['country_code']==x['country_code']].iterrows()} for (_, x) in LABELED_CITIES.iterrows()}

WORLD_CITIES.set_index('city', inplace=True)
LABELED_CITIES.set_index('city', inplace=True)

Fix some cases where the latitude and the longitude are in the wrong place¶

mapping_dists_fixed = {
    x:(
        {y: val for y,val in d.items()} if min(d.values()) < 2000 else 
        {y: haversine((LABELED_CITIES.loc[[x], 'longitude'].values[0],LABELED_CITIES.loc[[x], 'latitude'].values[0]),
                      (WORLD_CITIES.loc[[y], 'lat'].values[0],WORLD_CITIES.loc[[y], 'lng'].values[0]))
        for  y,val in d.items()}) for x,d in mapping_dists.items()}

mapping_df = pd.DataFrame(mapping_dists_fixed).T
mapping_df.head()

closest_match_df = pd.DataFrame(mapping_df.fillna(100000000).apply(lambda x: {'match':mapping_df.columns[np.argmin(x)], 'distance':np.min(x)}, axis=1).tolist())

closest_match_df.index = mapping_df.index
closest_match_df

dist_thres = 10 # the distance threshold for a city to be considered mapped correctly, in km

unmapped = closest_match_df[closest_match_df['distance']>dist_thres]

to_add_whole = LABELED_CITIES.loc[unmapped.index].copy() # add those rows as a whole to the augmented dataset
to_add_labels = LABELED_CITIES.loc[closest_match_df[closest_match_df['distance']<=dist_thres].index].copy() # only add the labels from these rows
to_add_labels.index = closest_match_df.loc[closest_match_df['distance']<=dist_thres,'match']

Creating an augmented dataset with all the cities, annotated or not

from data.labeled.preprocessed import RISKS_MAPPING
risks_cols = list(RISKS_MAPPING.keys())

augmented_cities_dataset = WORLD_CITIES[['lat','lng','country','population','country_code']].rename(columns={'lat':'latitude','lng':'longitude'}).copy()
augmented_cities_dataset = augmented_cities_dataset.merge(to_add_labels[['c40'] + risks_cols],left_index=True, right_index=True,how='left')

augmented_cities_dataset = pd.concat([augmented_cities_dataset, to_add_whole],axis=0)
augmented_cities_dataset['c40']= augmented_cities_dataset['c40'].fillna(False)

augmented_cities_dataset = augmented_cities_dataset[~pd.isna(augmented_cities_dataset['population'])]

augmented_cities_dataset

Fix population for specific cities with a float one¶

which probably is cause by used "."

augmented_cities_dataset.population = augmented_cities_dataset.population.apply(lambda x: x * 1000 if int(x)!=x else x)

Filter out cities that are unlabeled and are not big enough, consulting BIG CITIES ALL COUNTRIES csv¶

from data.unlabeled import BIG_CITIES_ALL_COUNTRIES

unlab_mask = augmented_cities_dataset[risks_cols].isnull().all(axis=1)
not_in_big_cities_mask = augmented_cities_dataset.index.to_series().apply(lambda x: x not in BIG_CITIES_ALL_COUNTRIES.city.values)
augmented_cities_dataset = augmented_cities_dataset[~(unlab_mask&not_in_big_cities_mask)]

augmented_cities_dataset

Load and merge countries features¶

from data.unlabeled import COUNTRIES_DATASET

set(augmented_cities_dataset.columns).intersection(COUNTRIES_DATASET.columns)

set()

augmented_cities_dataset = augmented_cities_dataset.merge(COUNTRIES_DATASET, left_on='country_code', right_index=True)

all(augmented_cities_dataset.Country == augmented_cities_dataset.country)

False

augmented_cities_dataset.drop(columns='country',inplace=True)
augmented_cities_dataset.rename(columns={'Country':'country'},inplace=True)

augmented_cities_dataset= augmented_cities_dataset.reset_index().rename(columns={'index':'city'}).drop_duplicates(subset=['city','country'])

augmented_cities_dataset

from data.dataset import DATASET_PATH
augmented_cities_dataset.to_csv(DATASET_PATH,index=False)

augmented_cities_dataset

(~(augmented_cities_dataset[risks_cols].isnull()).all(axis=1)).sum() # the labeled samples

324

	city	city_ascii	lat	lng	country	iso2	iso3	admin_name	capital	population	id
0	Tokyo	Tokyo	35.6897	139.6922	Japan	JP	JPN	Tōkyō	primary	37977000.0	1392685764
1	Jakarta	Jakarta	-6.2146	106.8451	Indonesia	ID	IDN	Jakarta	primary	34540000.0	1360771077
2	Delhi	Delhi	28.6600	77.2300	India	IN	IND	Delhi	admin	29617000.0	1356872604
3	Mumbai	Mumbai	18.9667	72.8333	India	IN	IND	Mahārāshtra	admin	23355000.0	1356226629
4	Manila	Manila	14.5958	120.9772	Philippines	PH	PHL	Manila	primary	23088000.0	1608618140

	Al Quds	New York	Los Angeles	Chicago	Miami	Dallas	Philadelphia	Houston	Atlanta	Washington	...	Wum	Kribi	Batouri	Kalfou	Bélabo	Eséka	Abong Mbang	Kontcha	Aiyomojok	Mbé
Abasan Al-Kabira	97.672711	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Abington	NaN	119.372096	3857.172736	1070.945249	1752.117379	154.8733	1492.55584	2167.739061	1865.051282	1400.845831	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Abuja	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Addis Ababa	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Adelaide	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	match	distance
Abasan Al-Kabira	Al Quds	97.672711
Abington	Abington	0.000000
Abuja	Abuja	0.000000
Addis Ababa	Addis Ababa	0.000000
Adelaide	Adelaide	0.000000
...	...	...
Xalapa	Xalapa	0.000000
YaoundÃ© 6	Yaoundé	0.459487
Yokohama	Yokohama	0.000000
Ãlhavo	Estremoz	1.115491
Ã–rebro	Örebro	0.985037

	latitude	longitude	country	population	country_code	c40	risk0	risk1	risk2	risk3	risk4	risk5	risk6
A Coruña	43.3713	-8.4188	Spain	370610.000	ESP	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aachen	50.7762	6.0838	Germany	247380.000	DEU	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aalborg	57.0337	9.9166	Denmark	122219.000	DNK	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aalen	48.8372	10.0936	Germany	68456.000	DEU	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aalst	50.9333	4.0333	Belgium	85715.000	BEL	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...
Sekhukhune District Municipality	-24.8335	29.9740	South Africa	1076840.000	ZAF	False	NaN	NaN	NaN	2.0	NaN	NaN	NaN
SopÃ³	4.9075	-73.9384	Colombia	27932.000	COL	False	NaN	NaN	NaN	NaN	1.0	NaN	NaN
TangarÃ¡ da Serra	-14.6194	-57.4858	Brazil	83.431	BRA	False	NaN	NaN	2.0	NaN	NaN	NaN	NaN
Tuzla	44.5375	18.6735	Turkey	235000.000	TUR	False	NaN	NaN	2.0	NaN	NaN	NaN	NaN
Vhembe	-22.7695	29.9740	South Africa	1441178.000	ZAF	False	NaN	NaN	3.0	3.0	NaN	NaN	NaN

	latitude	longitude	country	population	country_code	c40	risk0	risk1	risk2	risk3	risk4	risk5	risk6
Aalborg	57.0337	9.9166	Denmark	122219.0	DNK	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aarhus	56.1572	10.2107	Denmark	237551.0	DNK	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Aasiaat	68.7097	-52.8694	Greenland	3134.0	GRL	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Abidjan	5.3364	-4.0267	Côte D’Ivoire	4980000.0	CIV	False	NaN	NaN	NaN	NaN	1.0	NaN	NaN
Abington	40.1108	-75.1146	United States	55573.0	USA	False	NaN	1.0	1.0	2.0	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...
Sekhukhune District Municipality	-24.8335	29.9740	South Africa	1076840.0	ZAF	False	NaN	NaN	NaN	2.0	NaN	NaN	NaN
SopÃ³	4.9075	-73.9384	Colombia	27932.0	COL	False	NaN	NaN	NaN	NaN	1.0	NaN	NaN
TangarÃ¡ da Serra	-14.6194	-57.4858	Brazil	83431.0	BRA	False	NaN	NaN	2.0	NaN	NaN	NaN	NaN
Tuzla	44.5375	18.6735	Turkey	235000.0	TUR	False	NaN	NaN	2.0	NaN	NaN	NaN	NaN
Vhembe	-22.7695	29.9740	South Africa	1441178.0	ZAF	False	NaN	NaN	3.0	3.0	NaN	NaN	NaN

	city	latitude	longitude	population	country_code	c40	risk0	risk1	risk2	risk3	...	SDG 6.4.1. Services Water Use Efficiency	SDG 6.4.1. Water Use Efficiency	SDG 6.4.2. Water Stress	Seasonal variability (WRI)	Total internal renewable water resources per capita	Total population with access to safe drinking-water (JMP)	Total renewable water resources per capita	Total water withdrawal per capita	Urban population with access to safe drinking-water (JMP)	country
0	Aalborg	57.0337	9.9166	122219.0	DNK	False	NaN	NaN	NaN	NaN	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
1	Aarhus	56.1572	10.2107	237551.0	DNK	False	NaN	NaN	NaN	NaN	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
2	Copenhagen	55.6786	12.5635	1085000.0	DNK	False	NaN	2.0	NaN	2.0	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
3	Esbjerg	55.4670	8.4500	72205.0	DNK	False	NaN	NaN	NaN	NaN	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
4	Frederikshavn	57.4337	10.5333	24103.0	DNK	False	NaN	2.0	NaN	NaN	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1380	Teaoraereke	1.3333	173.0000	5105.0	KIR	False	NaN	NaN	NaN	NaN	...	110.288104	80.302831	68.505925	2.532741	-1779.609434	66.9	0.000000	414.127038	87.3	Kiribati
1381	Trindade	0.3000	6.6667	6636.0	STP	False	NaN	NaN	NaN	NaN	...	17.162254	7.489641	1.876147	2.502891	10526.874919	97.1	10526.874919	197.499626	98.9	Sao Tome and Principe
1382	Victoria	-4.6236	55.4544	26450.0	SYC	False	NaN	NaN	NaN	NaN	...	120.471023	89.323035	69.243227	1.601611	-2157.596566	95.7	0.000000	142.089651	95.7	Seychelles
1383	Warsaw	52.2167	21.0333	1790658.0	POL	True	0.0	0.0	2.0	0.0	...	153.691550	43.570222	34.890966	0.600000	1412.266377	98.3	1594.069324	265.616741	99.3	Poland
1384	Abasan Al-Kabira	31.3231	34.3440	32000.0	PSE	False	NaN	2.0	2.0	2.0	...	51.887163	34.748970	41.078306	3.100000	171.047224	58.4	176.313456	79.035614	50.7	Palestine

	city	latitude	longitude	population	country_code	c40	risk0	risk1	risk2	risk3	...	SDG 6.4.1. Services Water Use Efficiency	SDG 6.4.1. Water Use Efficiency	SDG 6.4.2. Water Stress	Seasonal variability (WRI)	Total internal renewable water resources per capita	Total population with access to safe drinking-water (JMP)	Total renewable water resources per capita	Total water withdrawal per capita	Urban population with access to safe drinking-water (JMP)	country
0	Aalborg	57.0337	9.9166	122219.0	DNK	False	NaN	NaN	NaN	NaN	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
1	Aarhus	56.1572	10.2107	237551.0	DNK	False	NaN	NaN	NaN	NaN	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
2	Copenhagen	55.6786	12.5635	1085000.0	DNK	False	NaN	2.0	NaN	2.0	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
3	Esbjerg	55.4670	8.4500	72205.0	DNK	False	NaN	NaN	NaN	NaN	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
4	Frederikshavn	57.4337	10.5333	24103.0	DNK	False	NaN	2.0	NaN	NaN	...	558.335628	368.612902	20.040562	1.300000	1046.705025	100.0	1046.705025	129.285516	100.0	Denmark
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1380	Teaoraereke	1.3333	173.0000	5105.0	KIR	False	NaN	NaN	NaN	NaN	...	110.288104	80.302831	68.505925	2.532741	-1779.609434	66.9	0.000000	414.127038	87.3	Kiribati
1381	Trindade	0.3000	6.6667	6636.0	STP	False	NaN	NaN	NaN	NaN	...	17.162254	7.489641	1.876147	2.502891	10526.874919	97.1	10526.874919	197.499626	98.9	Sao Tome and Principe
1382	Victoria	-4.6236	55.4544	26450.0	SYC	False	NaN	NaN	NaN	NaN	...	120.471023	89.323035	69.243227	1.601611	-2157.596566	95.7	0.000000	142.089651	95.7	Seychelles
1383	Warsaw	52.2167	21.0333	1790658.0	POL	True	0.0	0.0	2.0	0.0	...	153.691550	43.570222	34.890966	0.600000	1412.266377	98.3	1594.069324	265.616741	99.3	Poland
1384	Abasan Al-Kabira	31.3231	34.3440	32000.0	PSE	False	NaN	2.0	2.0	2.0	...	51.887163	34.748970	41.078306	3.100000	171.047224	58.4	176.313456	79.035614	50.7	Palestine