import pandas as pd
import sys
sys.path.append("..")
from data.unlabeled.preprocessed import econ, aqua, edu, humdev

Differences in country representation

print("Aqua shape:",aqua.shape,"Econ shape:",econ.shape,"Edu shape:",edu.shape,"Humdev shape:",humdev.shape)

Aqua shape: (200, 23) Econ shape: (149, 33) Edu shape: (241, 85) Humdev shape: (195, 67)

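# Rows whose index (country code) appears in edu/aqua but not in humdev.
countries_diff_edu_humdev = edu.loc[list(set(edu.index) - set(humdev.index)), 'Short Name']
countries_diff_aqua_humd = aqua.loc[list(set(aqua.index) - set(humdev.index)), 'Country']
# Series.append was removed in pandas 2.0; pd.concat is the replacement.
countries_not_in_humdev = pd.concat([countries_diff_edu_humdev, countries_diff_aqua_humd])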

print("Entries in edu or aqua that are not in humdev:", countries_not_in_humdev)

Entries in edu or aqua that are not in humdev: NAC                                     North America
SSA              Sub-Saharan Africa (developing only)
LIC                                        Low income
ABW                                             Aruba
ASM                                    American Samoa
HIC                                       High income
EAS           East Asia & Pacific (all income levels)
VGB                                               NaN
PRI                                       Puerto Rico
ECA           Europe & Central Asia (developing only)
LMY                               Low & middle income
FRO                                    Faeroe Islands
CYM                                    Cayman Islands
UMC                               Upper middle income
TCA                          Turks and Caicos Islands
LDC      Least developed countries: UN classification
OED                                      OECD members
WLD                                             World
CHI                                   Channel Islands
ECS         Europe & Central Asia (all income levels)
EAP             East Asia & Pacific (developing only)
LMC                               Lower middle income
VIR                                    Virgin Islands
CUW                                           Curaçao
GUM                                              Guam
SAS                                        South Asia
PYF                                  French Polynesia
HPC            Heavily indebted poor countries (HIPC)
LAC       Latin America & Caribbean (developing only)
SXM                         Sint Maarten (Dutch part)
MIC                                     Middle income
GIB                                         Gibraltar
LCN     Latin America & Caribbean (all income levels)
XKX                                            Kosovo
BMU                                           Bermuda
ARB                                        Arab World
SSF            Sub-Saharan Africa (all income levels)
EMU                                         Euro area
EUU                                    European Union
GRL                                         Greenland
MNP                          Northern Mariana Islands
MNA      Middle East & North Africa (developing only)
IMN                                       Isle of Man
MEA    Middle East & North Africa (all income levels)
NCL                                     New Caledonia
MAC                                  Macao SAR, China
NIU                                              Niue
VAT                                          Holy See
PRI                                       Puerto Rico
FRO                                     Faroe Islands
COK                                      Cook Islands
TKL                                           Tokelau
dtype: object
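
The same comparison can be written with `Index.difference`, which returns the labels in sorted order so the result is reproducible across runs. A minimal sketch on toy frames; `humdev_demo`, `edu_demo`, and `aqua_demo` are hypothetical stand-ins for the real tables, not the actual data:

```python
import pandas as pd

# Hypothetical stand-ins for the real tables, indexed by country code.
humdev_demo = pd.DataFrame({"hdi": [0.5, 0.8]}, index=["AFG", "ALB"])
edu_demo = pd.DataFrame(
    {"Short Name": ["Afghanistan", "World", "Aruba"]}, index=["AFG", "WLD", "ABW"]
)
aqua_demo = pd.DataFrame({"Country": ["Albania", "Niue"]}, index=["ALB", "NIU"])

# Index.difference keeps the labels found on the left but not on the right.
extra_edu = edu_demo.loc[edu_demo.index.difference(humdev_demo.index), "Short Name"]
extra_aqua = aqua_demo.loc[aqua_demo.index.difference(humdev_demo.index), "Country"]

# Stack both results into one Series of names keyed by code.
not_in_humdev = pd.concat([extra_edu, extra_aqua])
print(not_in_humdev)
```

Because the name columns differ between tables ('Short Name' in edu, 'Country' in aqua), a code present in both can appear twice with slightly different spellings, as Puerto Rico and the Faroe Islands do in the output above.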

# Countries present in humdev but missing from econ.
countries_diff_humdev_econ = set(humdev.index) - set(econ.index)
print("In humdev but not in econ:", countries_diff_humdev_econ)

In humdev but not in econ: {'LKA', 'FJI', 'PLW', 'LCA', 'JAM', 'SSD', 'SMR', 'GNQ', 'TLS', 'CUB', 'LIE', 'HKG', 'GRD', 'SWZ', 'LUX', 'VUT', 'ATG', 'WSM', 'FSM', 'KNA', 'MHL', 'MCO', 'BRB', 'HTI', 'DOM', 'MDA', 'PNG', 'COM', 'SLB', 'NRU', 'TTO', 'BHS', 'VCT', 'DJI', 'BWA', 'TUV', 'MUS', 'CPV', 'SYC', 'KIR', 'MDV', 'LSO', 'STP', 'DMA', 'TON', 'NAM', 'PSE'}

Econ contains too little information and is not used in the final dataset
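
One way to quantify this is to count how many humdev countries each candidate table would drop in an inner join. A minimal sketch with hypothetical toy indexes (`humdev_idx` and the entries of `candidates` are made up for illustration, not the real data):

```python
import pandas as pd

# Hypothetical country-code indexes standing in for the real tables.
humdev_idx = pd.Index(["AFG", "ALB", "FJI", "JAM", "ZWE"])
candidates = {
    "econ": pd.Index(["AFG", "ALB", "ZWE"]),
    "edu": pd.Index(["AFG", "ALB", "FJI", "JAM", "ZWE", "WLD"]),
}

# Number of humdev countries lost by inner-joining each candidate table.
lost = {name: len(humdev_idx.difference(idx)) for name, idx in candidates.items()}
print(lost)
```

On the real data, the set printed above lists 47 humdev countries that have no row in econ, so an inner join with econ would remove all of them.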

big_table = humdev.join(edu, how="inner").join(aqua, how="inner")

big_table.dropna(axis=1, inplace=True)
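
The effect of the inner join followed by `dropna(axis=1)` can be seen on a small example. This is a sketch with hypothetical frames `left` and `right`, assuming (as in the real tables) that both are indexed by country code and share no column names:

```python
import pandas as pd

left = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=["AFG", "ALB", "ZWE"])
right = pd.DataFrame(
    {"b": [10.0, float("nan")], "c": [5.0, 6.0]}, index=["AFG", "ALB"]
)

# Inner join keeps only the index labels present in both frames.
joined = left.join(right, how="inner")

# axis=1 drops every column that has at least one missing value.
complete = joined.dropna(axis=1)
print(complete)
```

Here `ZWE` disappears in the join and column `b` disappears in `dropna`, so the result contains only countries and indicators with complete coverage.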

name_columns = set(big_table.columns) - set(big_table.select_dtypes(include="number").columns)
print("Info columns",name_columns)

Info columns {'Table Name', 'Country', 'Short Name', 'Long Name'}
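
An alternative to the set difference is `select_dtypes(exclude="number")`, which returns the non-numeric columns directly and preserves their order. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

# Hypothetical frame mixing label columns with one numeric indicator.
demo = pd.DataFrame(
    {
        "Country": ["Albania", "Angola"],
        "Urban population (%)": [61.2, 66.2],
        "Short Name": ["Albania", "Angola"],
    }
)

# exclude="number" selects every column that is not numeric.
info_columns = list(demo.select_dtypes(exclude="number").columns)
print(info_columns)
```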

Unneeded label columns are removed

big_table.drop(['Short Name','Long Name','Table Name'],inplace=True, axis=1)

big_table.to_csv("../data/unlabeled/preprocessed_countries_dataset.csv")

big_table

	Population with at least some secondary education (% ages 25 and older)	Population with at least some secondary education, female (% ages 25 and older)	Population with at least some secondary education, male (% ages 25 and older)	Mean years of schooling, female (years)	Mean years of schooling, male (years)	Share of seats in parliament (% held by women)	Adolescent birth rate (births per 1,000 women ages 15-19)	Vulnerable employment (% of total employment)	Total population (millions)	Urban population (%)	...	SDG 6.4.1. Services Water Use Efficiency	SDG 6.4.1. Water Use Efficiency	SDG 6.4.2. Water Stress	Seasonal variability (WRI)	Total internal renewable water resources per capita	Total population with access to safe drinking-water (JMP)	Total renewable water resources per capita	Total water withdrawal per capita	Urban population with access to safe drinking-water (JMP)	Country
AFG	26.080	13.220	36.920	1.94800	6.006000	27.244	68.957000	79.726000	38.042	25.8	...	57.148622	0.923778	54.757019	2.500000	1299.037172	55.3	1799.917253	561.297018	78.2	Afghanistan
AGO	30.232	23.133	38.056	4.02300	6.359000	30.000	150.526000	65.995000	31.825	66.2	...	167.030879	142.467836	1.871883	3.100000	4963.650317	49.0	4977.065588	23.671246	75.4	Angola
ALB	93.174	93.700	92.497	9.70200	10.614000	29.508	19.642000	52.852000	2.881	61.2	...	21.852239	6.656907	7.139423	2.400000	9326.776621	95.1	10470.953679	492.273511	94.9	Albania
AND	72.327	71.484	73.327	10.43900	10.564000	46.429	18.266334	4.461035	0.077	88.0	...	146.632709	86.300426	69.033809	1.600000	4098.648070	100.0	4098.648070	422.680401	100.0	Andorra
ARG	57.158	59.161	54.828	11.12300	10.729000	39.877	62.782000	21.805000	44.781	92.0	...	65.054956	13.616564	10.456664	1.800000	6645.858151	99.1	19943.036802	859.864798	99.0	Argentina
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
WSM	74.942	79.127	71.583	11.16287	10.415249	10.000	23.886000	29.983000	0.197	18.1	...	157.705946	89.401811	69.028170	2.092752	-2094.186715	99.0	0.000000	413.381306	97.5	Samoa
YEM	28.020	19.920	36.918	2.88000	5.146000	0.971	60.352000	45.627000	29.162	37.3	...	47.023411	5.219357	169.761905	2.400000	75.445075	54.9	75.445075	128.076996	72.0	Yemen
ZAF	75.478	74.977	78.207	10.03100	10.291000	45.333	67.908000	10.298000	58.558	66.9	...	53.518514	14.659097	62.055716	2.100000	785.830411	93.2	900.723027	339.941816	99.6	South Africa
ZMB	44.440	38.488	54.068	6.28300	8.176000	17.964	120.112000	78.134000	17.861	44.1	...	43.217366	12.764894	2.835498	4.400000	4758.627519	65.4	6218.256409	93.273846	85.6	Zambia
ZWE	64.935	59.792	70.783	8.06600	8.923000	34.571	86.135000	64.739000	14.645	32.2	...	27.194488	5.213329	31.346226	3.700000	861.160973	76.9	1404.830298	234.543442	97.0	Zimbabwe
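
The CSV written by `to_csv` stores the country codes as an unnamed first column, so reading it back needs `index_col=0` to restore them as the index. A minimal round-trip sketch using an in-memory buffer instead of the real file; `demo` is a hypothetical two-row excerpt:

```python
import io

import pandas as pd

# Hypothetical two-row excerpt with the same layout as the saved table.
demo = pd.DataFrame(
    {"Urban population (%)": [25.8, 66.2], "Country": ["Afghanistan", "Angola"]},
    index=["AFG", "AGO"],
)

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first column back into the country-code index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```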