In [13]:
import pandas as pd
import sys
sys.path.append("..")
from data.unlabeled.preprocessed import econ, aqua, edu, humdev

Find out how country coverage differs between the datasets

In [14]:
print("Aqua shape:",aqua.shape,"Econ shape:",econ.shape,"Edu shape:",edu.shape,"Humdev shape:",humdev.shape)
Aqua shape: (200, 23) Econ shape: (149, 33) Edu shape: (241, 85) Humdev shape: (195, 67)
In [15]:
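# Convert the set differences to lists: newer pandas versions reject a raw set as a .loc indexer
countries_diff_edu_humdev = edu.loc[list(set(edu.index) - set(humdev.index)), 'Short Name']
countries_diff_aqua_humd = aqua.loc[list(set(aqua.index) - set(humdev.index)), 'Country']
# Series.append was removed in pandas 2.0; pd.concat is the supported equivalent
countries_not_in_humdev = pd.concat([countries_diff_edu_humdev, countries_diff_aqua_humd])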
In [16]:
print("Entries in edu or aqua but not in humdev:", countries_not_in_humdev)
Entries in edu or aqua but not in humdev: NAC                                     North America
SSA              Sub-Saharan Africa (developing only)
LIC                                        Low income
ABW                                             Aruba
ASM                                    American Samoa
HIC                                       High income
EAS           East Asia & Pacific (all income levels)
VGB                                               NaN
PRI                                       Puerto Rico
ECA           Europe & Central Asia (developing only)
LMY                               Low & middle income
FRO                                    Faeroe Islands
CYM                                    Cayman Islands
UMC                               Upper middle income
TCA                          Turks and Caicos Islands
LDC      Least developed countries: UN classification
OED                                      OECD members
WLD                                             World
CHI                                   Channel Islands
ECS         Europe & Central Asia (all income levels)
EAP             East Asia & Pacific (developing only)
LMC                               Lower middle income
VIR                                    Virgin Islands
CUW                                           Curaçao
GUM                                              Guam
SAS                                        South Asia
PYF                                  French Polynesia
HPC            Heavily indebted poor countries (HIPC)
LAC       Latin America & Caribbean (developing only)
SXM                         Sint Maarten (Dutch part)
MIC                                     Middle income
GIB                                         Gibraltar
LCN     Latin America & Caribbean (all income levels)
XKX                                            Kosovo
BMU                                           Bermuda
ARB                                        Arab World
SSF            Sub-Saharan Africa (all income levels)
EMU                                         Euro area
EUU                                    European Union
GRL                                         Greenland
MNP                          Northern Mariana Islands
MNA      Middle East & North Africa (developing only)
IMN                                       Isle of Man
MEA    Middle East & North Africa (all income levels)
NCL                                     New Caledonia
MAC                                  Macao SAR, China
NIU                                              Niue
VAT                                          Holy See
PRI                                       Puerto Rico
FRO                                     Faroe Islands
COK                                      Cook Islands
TKL                                           Tokelau
dtype: object
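
The comparison above can also be sketched with `Index.difference` and `pd.concat`, which returns the labels in sorted order. The listing above contains PRI and the Faroe Islands twice because both `edu` and `aqua` carry them, so the sketch also drops repeated index labels. The three small frames are made-up stand-ins for `edu`, `aqua` and `humdev`, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the real tables, indexed by ISO-3 code.
edu_toy = pd.DataFrame(
    {"Short Name": ["Aruba", "Albania", "World"]},
    index=["ABW", "ALB", "WLD"],
)
aqua_toy = pd.DataFrame(
    {"Country": ["Albania", "Niue", "Aruba"]},
    index=["ALB", "NIU", "ABW"],
)
humdev_toy = pd.DataFrame({"Country": ["Albania"]}, index=["ALB"])

# Index.difference returns labels present on the left but not on the right (sorted).
extra_edu = edu_toy.loc[edu_toy.index.difference(humdev_toy.index), "Short Name"]
extra_aqua = aqua_toy.loc[aqua_toy.index.difference(humdev_toy.index), "Country"]

# Stack both results, then keep only the first occurrence of each code.
not_in_humdev = pd.concat([extra_edu, extra_aqua])
not_in_humdev = not_in_humdev[~not_in_humdev.index.duplicated(keep="first")]
print(not_in_humdev)
```

Whether to keep the first or the last duplicate is a choice; here the `edu` name wins because it is concatenated first.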
In [17]:
countries_diff_humdev_econ = set(humdev.index) - set(econ.index)
print("In humdev but not in econ", countries_diff_humdev_econ)
In humdev but not in econ {'LKA', 'FJI', 'PLW', 'LCA', 'JAM', 'SSD', 'SMR', 'GNQ', 'TLS', 'CUB', 'LIE', 'HKG', 'GRD', 'SWZ', 'LUX', 'VUT', 'ATG', 'WSM', 'FSM', 'KNA', 'MHL', 'MCO', 'BRB', 'HTI', 'DOM', 'MDA', 'PNG', 'COM', 'SLB', 'NRU', 'TTO', 'BHS', 'VCT', 'DJI', 'BWA', 'TUV', 'MUS', 'CPV', 'SYC', 'KIR', 'MDV', 'LSO', 'STP', 'DMA', 'TON', 'NAM', 'PSE'}

Econ covers too few countries (it lacks the 47 humdev countries listed above), so it is not used in the final dataset

In [18]:
big_table = humdev.join(edu, how="inner").join(aqua, how="inner")
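
An inner join on the index keeps only the country codes present in every table, which is how the regional aggregates and extra territories listed earlier drop out. A minimal sketch with made-up frames (the column names and values are illustrative, not taken from the real tables):

```python
import pandas as pd

# Hypothetical miniature tables sharing an ISO-3 index; all values are made up.
humdev_toy = pd.DataFrame({"hdi": [0.5, 0.8]}, index=["AFG", "ALB"])
edu_toy = pd.DataFrame({"schooling": [3.9, 10.1, 8.0]}, index=["AFG", "ALB", "WLD"])
aqua_toy = pd.DataFrame({"water_stress": [54.8, 7.1, 1.0]}, index=["AFG", "ALB", "NIU"])

# how="inner" keeps only index labels found on both sides of each join,
# so WLD (only in edu_toy) and NIU (only in aqua_toy) are discarded.
joined = humdev_toy.join(edu_toy, how="inner").join(aqua_toy, how="inner")
print(joined.shape)
```

Note that `DataFrame.join` raises an error when both sides share a column name and no `lsuffix`/`rsuffix` is given, so the chained call only works as written if the tables have disjoint column names.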
In [19]:
big_table.dropna(axis=1, inplace=True)
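
`dropna(axis=1)` discards every column that has at least one missing value, which is a strict rule. The sketch below shows its effect on made-up data next to the `thresh` variant, which keeps columns with a minimum number of non-missing values. The notebook itself uses only the strict form; the threshold of 3 is an arbitrary illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: "a" is complete, "b" has one gap, "c" is mostly empty.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [1.0, np.nan, 3.0, 4.0],
        "c": [np.nan, np.nan, np.nan, 4.0],
    },
    index=["AFG", "ALB", "AND", "ARG"],
)

# Strict: drop any column containing a NaN (what the notebook does).
strict = toy.dropna(axis=1)

# Lenient alternative: keep columns with at least 3 non-missing values.
lenient = toy.dropna(axis=1, thresh=3)

print(list(strict.columns), list(lenient.columns))
```

The lenient form leaves gaps in the result, so it would need an imputation step before any model that cannot handle missing values.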
In [20]:
name_columns = set(big_table.columns) - set(big_table.select_dtypes(include="number").columns)
print("Info columns",name_columns)
Info columns {'Table Name', 'Country', 'Short Name', 'Long Name'}

The redundant name columns are removed; only 'Country' is kept as a label

In [21]:
big_table.drop(['Short Name','Long Name','Table Name'],inplace=True, axis=1)
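
The last two steps can be combined so that the columns to drop are derived rather than typed out. This is a sketch on a made-up frame, assuming `'Country'` is the only label worth keeping:

```python
import pandas as pd

# Made-up frame with two label columns and one numeric indicator.
toy = pd.DataFrame(
    {
        "Country": ["Albania"],
        "Short Name": ["Albania"],
        "Urban population (%)": [61.2],
    },
    index=["ALB"],
)

# Every column that is not numeric is treated as a label column.
label_columns = list(toy.select_dtypes(exclude="number").columns)

# Drop all label columns except the one we want to keep.
cleaned = toy.drop(columns=[c for c in label_columns if c != "Country"])
print(list(cleaned.columns))
```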
In [22]:
big_table
Out[22]:
| | Population with at least some secondary education (% ages 25 and older) | Population with at least some secondary education, female (% ages 25 and older) | Population with at least some secondary education, male (% ages 25 and older) | Mean years of schooling, female (years) | Mean years of schooling, male (years) | Share of seats in parliament (% held by women) | Adolescent birth rate (births per 1,000 women ages 15-19) | Vulnerable employment (% of total employment) | Total population (millions) | Urban population (%) | ... | SDG 6.4.1. Services Water Use Efficiency | SDG 6.4.1. Water Use Efficiency | SDG 6.4.2. Water Stress | Seasonal variability (WRI) | Total internal renewable water resources per capita | Total population with access to safe drinking-water (JMP) | Total renewable water resources per capita | Total water withdrawal per capita | Urban population with access to safe drinking-water (JMP) | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AFG | 26.080 | 13.220 | 36.920 | 1.94800 | 6.006000 | 27.244 | 68.957000 | 79.726000 | 38.042 | 25.8 | ... | 57.148622 | 0.923778 | 54.757019 | 2.500000 | 1299.037172 | 55.3 | 1799.917253 | 561.297018 | 78.2 | Afghanistan |
| AGO | 30.232 | 23.133 | 38.056 | 4.02300 | 6.359000 | 30.000 | 150.526000 | 65.995000 | 31.825 | 66.2 | ... | 167.030879 | 142.467836 | 1.871883 | 3.100000 | 4963.650317 | 49.0 | 4977.065588 | 23.671246 | 75.4 | Angola |
| ALB | 93.174 | 93.700 | 92.497 | 9.70200 | 10.614000 | 29.508 | 19.642000 | 52.852000 | 2.881 | 61.2 | ... | 21.852239 | 6.656907 | 7.139423 | 2.400000 | 9326.776621 | 95.1 | 10470.953679 | 492.273511 | 94.9 | Albania |
| AND | 72.327 | 71.484 | 73.327 | 10.43900 | 10.564000 | 46.429 | 18.266334 | 4.461035 | 0.077 | 88.0 | ... | 146.632709 | 86.300426 | 69.033809 | 1.600000 | 4098.648070 | 100.0 | 4098.648070 | 422.680401 | 100.0 | Andorra |
| ARG | 57.158 | 59.161 | 54.828 | 11.12300 | 10.729000 | 39.877 | 62.782000 | 21.805000 | 44.781 | 92.0 | ... | 65.054956 | 13.616564 | 10.456664 | 1.800000 | 6645.858151 | 99.1 | 19943.036802 | 859.864798 | 99.0 | Argentina |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| WSM | 74.942 | 79.127 | 71.583 | 11.16287 | 10.415249 | 10.000 | 23.886000 | 29.983000 | 0.197 | 18.1 | ... | 157.705946 | 89.401811 | 69.028170 | 2.092752 | -2094.186715 | 99.0 | 0.000000 | 413.381306 | 97.5 | Samoa |
| YEM | 28.020 | 19.920 | 36.918 | 2.88000 | 5.146000 | 0.971 | 60.352000 | 45.627000 | 29.162 | 37.3 | ... | 47.023411 | 5.219357 | 169.761905 | 2.400000 | 75.445075 | 54.9 | 75.445075 | 128.076996 | 72.0 | Yemen |
| ZAF | 75.478 | 74.977 | 78.207 | 10.03100 | 10.291000 | 45.333 | 67.908000 | 10.298000 | 58.558 | 66.9 | ... | 53.518514 | 14.659097 | 62.055716 | 2.100000 | 785.830411 | 93.2 | 900.723027 | 339.941816 | 99.6 | South Africa |
| ZMB | 44.440 | 38.488 | 54.068 | 6.28300 | 8.176000 | 17.964 | 120.112000 | 78.134000 | 17.861 | 44.1 | ... | 43.217366 | 12.764894 | 2.835498 | 4.400000 | 4758.627519 | 65.4 | 6218.256409 | 93.273846 | 85.6 | Zambia |
| ZWE | 64.935 | 59.792 | 70.783 | 8.06600 | 8.923000 | 34.571 | 86.135000 | 64.739000 | 14.645 | 32.2 | ... | 27.194488 | 5.213329 | 31.346226 | 3.700000 | 861.160973 | 76.9 | 1404.830298 | 234.543442 | 97.0 | Zimbabwe |

194 rows × 144 columns

In [23]:
big_table.to_csv("../data/unlabeled/preprocessed_countries_dataset.csv")
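
Because `to_csv` writes the index as an unnamed first column, reading the file back needs `index_col=0` to restore the country codes as the index. A small round-trip sketch, using an in-memory buffer and a made-up two-row frame instead of the real file:

```python
import io

import pandas as pd

# Made-up two-row stand-in for big_table.
toy = pd.DataFrame(
    {"Urban population (%)": [25.8, 66.2], "Country": ["Afghanistan", "Angola"]},
    index=["AFG", "AGO"],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first CSV column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored.loc["AGO", "Country"])
```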