Tutorial 1.

Basic statistics

In this tutorial we load data from a CSV file in Jupyter and use several Pharo frameworks to obtain basic statistical information about it.

First, make sure that NeoCSV, DataFrame, Roassal and JupyterTalk are installed in your kernel image. Although packages can be loaded from the Jupyter client, we suggest loading them from the workspace or the catalog browser and then saving the image to keep the changes. Evaluating "Smalltalk saveSession" from the notebook could break the image: normally the kernel only loses the connection and you have to close and halt it from the File menu, but sometimes the Pharo image itself is damaged.

In [1]:
"Install NeoCSV"
Gofer it
   smalltalkhubUser: 'SvenVanCaekenberghe' project: 'Neo';
   configurationOf: 'NeoCSV';
   loadStable.

"Install Roassal"
Gofer it
    smalltalkhubUser: 'ObjectProfile' project: 'Roassal2';
    configurationOf: 'Roassal2';
    loadStable.
    
"Install DataFrame"
Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame';
  load.

"Install JupyterTalk"
Metacello new 
    baseline: 'JupyterTalk';
    repository: 'github://jmari/JupyterTalk/repository';
    load:'all'
Out[1]:
In [2]:
"This file is windows-1258 encoded, so we have to read it into the Pharo kernel with the matching encoding."

stream := ZnCharacterReadStream 
                on:'/Users/Cat/Dropbox/Master Ciencia de les dades/S1.1.mineria de dades/PAC2/countries.csv'
                        asFileReference binaryReadStream
                encoding: #windows1258.
                
                  
arrayOfRows := (NeoCSVReader on: stream)
                            separator: $;;
                            upToEnd.

paisos := DataFrame fromRows:(arrayOfRows copyFrom:2 to:arrayOfRows size).
paisos columnNames: (arrayOfRows at:1)
Out[2]:

Loading CSV content.

Before loading the countries.csv file, open it in your favourite text editor and check its character encoding, field separator and decimal mark. As you will see, it is windows-1258 encoded, its fields are separated by ';' and it uses the European number format, with ',' as the decimal mark.
We use Zinc streams to decode the content and NeoCSV to parse it into an Array of rows.
Next we create a DataFrame. Since the first row of this Array holds the column headers, we build the DataFrame from rows two up to the end and assign the column names from the first row.
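Because the file uses ',' as the decimal mark, fields such as the density stay Strings until they are converted explicitly. The following is a minimal sketch of one way to do that in plain Pharo; the sample values and the variable names rawDensities and densities are made up for illustration and are not part of the tutorial's notebook.

```smalltalk
| rawDensities densities |
"Hypothetical sample values, shaped the way the reader above returns them:
 Strings for filled fields, nil for empty ones."
rawDensities := #('37,6' '120,1' nil '12,3').

"Swap the decimal comma for a point, then parse. Empty (nil) fields stay nil."
densities := rawDensities collect: [ :each |
    each ifNotNil: [ (each copyReplaceAll: ',' with: '.') asNumber ] ].
densities
```

The same block could be passed to collect: on a DataFrame column, in the way the later cells convert POPULATION and URBAN_POPULATION.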

Let's show the first thirty-three rows together with the header row. Note that the encoding is correct.

In [3]:
"JupyterTalk converts these Strings to UTF-8; see row 32, Cameroon | Yaoundé"
self display openInJupyter: (OrderedCollection new 
                                    add:paisos columnNames;
                                    addAll: (paisos asArrayOfRows copyFrom:1 to:33);
                                    yourself) .
Out[3]:
NAME | CAPITAL | TOTAL_AREA_KM2 | POPULATION | DENSITY_KM2 | CONTINENT | URBAN_POPULATION | RURAL_POPULATION | CRUDE_OIL_BAR_DAY | OIL_RES_MIL_BAR | GAS_TONS | GAS_RES_MIL_M_3 | GDP_MILLION_$ | GDP_GROW_RATE | GDP_$_PER_CAPITA | DOCTORS | ILLITERACY | PRIMARY_SECTOR | SECONDARY_SECTOR | TERTIARY_SECTOR | PHONES_RATE | NATURAL_GROWTH | MEM_LIFE_EXP | WOMEN_LIFE_EXP | MAIN_SECTOR | MAIN_GROUP
Afghanistan Kabul 6475002432695937,6Asia2080nilnilnilnil006004797nil5713290264746PrimaryRural
Albania Tirane287503453505120,1Europe3763nilnilnilnil410061210574050nil18nil19,969,975,5PrimaryRural
Algeria Algiers 23817402918145612,3Africa5644139500092005930000037000001087003,538001119501324360,0328,365,866,6Tertiary Urban
Andorra Andorra la Vella45067569150,2Europenilnilnilnilnilnil1000016200nilnilnilnilnilnilnilnilnilnilnil
AngolaLuanda1246700103393648,3Africa3268nilnilnilnil74004700134897271nil1802742,946,1PrimaryRural
Antigua & Barbuda Saint John's44065619149,1America 3664nilnilnilnil04,26600nilnilnil1872nil9,57073Tertiary Rural
Argentina Buenos Aires27668903467339112,5America 8812805000nilnilnil0-4,48100340nilnil30500,1116774Tertiary Urban
Armenia Yerevan 298003590722120,5Asia6931nilnilnilnil05,22560nilnilnilnilnilnil15nilnilnilUrban
Australia Canberra7686850185622522,4Oceania 8515615000nilnilnil4054003,322100438nil515720,478,473,379,6Tertiary Urban
Austria Vienna83850801461795,6Europe5644nilnilnilnil1520002,4190003450735550,530,972,178,8Tertiary Urban
AzerbaijanBaku86600789271291,1Asia5644nilnilnilnil0-171480nilnilnilnilnilnil26,4nilnilnilUrban
Bahamas; TheNassau1394025941318,6America 8713nilnilnilnil02187008095111620,4814,36774Tertiary Urban
Bahrain Manama620590784952,9Asia9010nilnilnilnil0-212000713nilnil32650,2323,36568,4Tertiary Urban
BangladeshDhaka 144000131066751910,2Asia1882nilnilnilnil04,611306166nil57nil30021,65656PrimaryRural
BarbadosBridgetown430257010597,7America 4753nilnilnilnil029800104225nilnil0,418,271,976,9PrimaryRural
Belarus Minsk 2076001046873050,4Europe7129nilnilnilnil49200-104700nilnilnilnilnilnilnilnilnilnilUrban
Belgium Brussels3051010099019331Europe973nilnilnilnil1970002,3195003170223620,51,47077Tertiary Urban
BelizeBelmopan229602192419,5America 4753nilnilnilnil022750204672713450,0730nilnilTertiary Rural
Benin Porto-Novo112620570658250,7Africa3169nilnilnilnil76006138016025nil67724nil29,9nilnilPrimaryRural
BhutanThimphu 47000182230538,8Asia694nilnilnilnil06730nilnil8703nil21,949,247,8PrimaryRural
Bolivia Sucre / La Paz109858080739207,3America 6139nilnilnilnil03,725302214nil4217380,0229nilnilPrimaryUrban
Bosnia - HerzegovinaSarajevo51233322263562,9Europe4951nilnilnilnil10000300nilnilnilnilnilnilnilnilnilnilRural
BotswanaGaborone60037014252752,4Africa2872nilnilnilnil4500132007185nil4nil660,0235,652,759,3Tertiary Rural
BrazilBrasilia851196516269848619,1America 7822800000nilnilnil04,26100685nil242150,0820,762,367,6PrimaryUrban
BruneiBandar Seri Begawan 577029995352Asia5842nilnilnilnil02158001469nil317690,1624,372,676,4Tertiary Urban
BulgariaSofia 110910875326078,9Europe7129nilnilnilnil432002,44920nilnilnil4833nil11,768,274,4SecondaryUrban
Burkina Ouagadougou 2742001071362539,1Africa2773nilnilnilnil740047001359nil9224028,745,648,9PrimaryRural
Burma Rangoon 6785004593371967,7Asianilnilnilnilnilnil06,81000nilnilnilnilnilnilnilnilnilnilnil
Burundi Bujumbura 278306398950229,9Africa892nilnilnilnil40002,76005725093240325054PrimaryRural
CambodiaPhnom Penh1810401086026060Asia2179nilnilnilnil06,7660270005270nil14024,946,549,4PrimaryRural
CameroonYaoundé 4754401391581329,3Africa4555nilnilnilnil165001,8120012540nilnil6nil032,65154nilRural
CanadaOttawa9976140287444822,9America 772324600006900137700000190000002,1244004494319680,77,673,380Tertiary Urban
Cape VerdePraia 4030448975111,4Africa5446nilnilnilnil4404,610404208nil3332140,0124,46367PrimaryUrban

Drawing a Boxplot.

Now, let's draw a basic boxplot of the 'URBAN_POPULATION' column. To do so, we first need to change the column type from String to Integer.
The boxplot is part of the Roassal package, and DataFrame provides helper methods that draw basic statistical charts with Roassal.
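The conversion in the next cell turns missing values into 0, which keeps every row but pulls the minimum and the lower quartile of the boxplot towards zero. As a small sketch on made-up values (the names raw, zeroFilled and dropped are illustrative only), the two options look like this:

```smalltalk
| raw zeroFilled dropped |
"Hypothetical column content: Strings, with nil for a missing value."
raw := #('20' nil '56' '37').

"Option used in this tutorial: keep the row and count the missing value as 0."
zeroFilled := raw collect: [ :v | v ifNil: [ 0 ] ifNotNil: [ v asInteger ] ].

"Alternative: leave missing values out before converting."
dropped := (raw reject: [ :v | v isNil ]) collect: [ :v | v asInteger ].

"zeroFilled has four elements, dropped has three."
{ zeroFilled. dropped }
```

Which option is appropriate depends on whether a missing field really means zero in the data set.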

In [4]:
"CONVERT URBAN_POPULATION TO INTEGER"

newCol := (paisos column:#URBAN_POPULATION) collect:[:v| v ifNil:[0] ifNotNil:[v asInteger]].
paisos column:#URBAN_POPULATION put:newCol.
Out[4]:
In [5]:
b := (paisos column:#URBAN_POPULATION) boxplot.
self display openInJupyter: b
Out[5]:

Drawing a Histogram.

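Now, let's create a histogram from a quantitative column. We will use the POPULATION column, so we first change its type to Integer and compute its maximum and minimum with the DataSeries methods max and min.
We then use a Bag to group the values into buckets: each population is mapped to its position, in percent, within the range from minimum to maximum, and the Bag counts how many countries fall into each bucket.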
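To make the bucketing step concrete, here is a minimal sketch on a handful of made-up numbers (not taken from the countries file), using the same arithmetic as the cell below:

```smalltalk
| populations minP maxP range buckets |
"Hypothetical values chosen so that the range is exactly 100."
populations := #(10 20 35 60 110).
minP := populations min.
maxP := populations max.
range := maxP - minP.

"Map each value to its position within the range, as a whole percent,
 and let the Bag count how many values share a bucket."
buckets := Bag new.
populations do: [ :each |
    buckets add: ((each - minP) / range * 100) asInteger ].

"The values land in buckets 0, 10, 25, 50 and 100, one value each."
buckets occurrencesOf: 25
```

Note that the largest value always ends up alone in bucket 100, so this scheme produces up to 101 buckets.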

In [6]:
"CONVERT POPULATION TO INTEGER"

newCol := (paisos column:#POPULATION) collect:[:v| v ifNil:[nil] ifNotNil:[v asInteger]].
paisos column:#POPULATION put:newCol.
maxPopulation := (paisos column:#POPULATION) max.
minPopulation := (paisos column:#POPULATION) min.
domainSize := maxPopulation - minPopulation.
bag := Bag new.
newCol do: [ :each | bag add: ((each - minPopulation) / domainSize * 100) asInteger ]
Out[6]:
In [7]:
b:= RTGrapher new.
ds := RTData new.
ds barShape width: 10.
ds points: bag valuesAndCounts.
b add: ds.

uuid :=self display openInJupyter: b  extent:600@650
Out[7]:

We can modify an object that is already displayed. #openInJupyter returns a UUID for each displayed object, and we can use this UUID to refresh the picture after changing any of its properties.

In [8]:
"We can modify an object that was displayed by an earlier cell."
"The x values are the percent buckets computed above, not absolute populations."
b axisX title:'Population (% of the min-max range)'.
b axisY title:'Number of countries'.
self display refresh: uuid
Out[8]:

Interactive visualization with Roassal.

Let's build a visualization with Roassal that we can interact with.
To do so, we first need to include the Roassal JavaScript library in our Jupyter document.

In [9]:
self loadScript: IPRoassal js
Out[9]:
In [10]:
| view coll col2 n b |
view := RTView new.
coll := (paisos column:#POPULATION).
n := RTMultiLinearColorForIdentity new objects: coll.
coll
    doWithIndex: [ :r :index | 
        view
            add:
            ((RTBox new
                color: [ :value | n rtValue: r ];
                size: r/ 1000)
            elementOn: index) ].
col2 := (paisos column:#NAME).
RTFlowLayout new applyOn: view elements.
view elements do: [ :e | e @ (RTPopup text: [ :el | col2 at:el ]) ].

b := RTAxisAdaptedBuilder new.
b view: view.
b margin: 20.
b objects: view elements.
b build.

self display 
        interactionOn;
        openInJupyter: b extent:650@640
Out[10]:

Drawing a Bar chart.

Now, let's get a bar chart showing the number of countries in each continent. We use a Bag to get a count for each continent.
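As a minimal sketch of the counting step, on made-up values: a Bag compares Strings by equality, so 'America ' with a trailing space and 'America' would be counted as two different continents. The table output above suggests that some fields may carry trailing spaces; treating that as an assumption about the data, trimBoth can be applied before counting.

```smalltalk
| continents counts |
"Hypothetical sample, including one value with a trailing space."
continents := #('Asia' 'Europe' 'Africa' 'Europe' 'America ' 'America').

counts := Bag new.
continents do: [ :each | counts add: each trimBoth ].

"After trimming, both spellings of America fall into the same bucket."
counts occurrencesOf: 'America'
```

Without trimBoth the same query would only see the value that was written without the trailing space.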

In [13]:
bag := Bag new.
(paisos column:#CONTINENT) do:[:each| bag add:each ]
Out[13]:
In [14]:
b:= RTGrapher new.
ds := RTData new.
ds barShape width: 20.
ds interaction highlight.
ds points: bag valuesAndCounts associations.
ds y: [:each| each value].
ds barChartWithBarTitle: [:each|each key].
b add: ds.
b axisX
	noTick;
	noLabel.

self display openInJupyter: b extent:640@600
Out[14]:

Basic statistics.

We will finish this tutorial by getting basic statistics for the POPULATION column.
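In the summary printed below, the mean (about 31.2 million) is larger than the third quartile (about 20.0 million), which indicates that a few very populous countries dominate the average. The following sketch, on made-up numbers and with illustrative variable names, shows how a single large value separates mean and median:

```smalltalk
| values sorted mean median |
"Hypothetical values with one outlier."
values := #(7 1 3 100 4).
sorted := values asSortedCollection asArray.

"Arithmetic mean: sum of all values divided by their count."
mean := (values inject: 0 into: [ :sum :each | sum + each ]) / values size.

"Median of an odd-sized collection: the middle element after sorting."
median := sorted at: (sorted size + 1) // 2.

"Here the mean is 23 while the median is 4."
{ mean. median }
```

For skewed columns such as POPULATION, the median and the quartiles therefore describe a typical country better than the mean does.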

In [15]:
self display openInJupyter:(paisos column:#POPULATION) summary asStringTable
Out[15]:
         |     POPULATION
---------+-----------------
Min      | 16954
1st Qu.  | 1952975
Median   | 5989065
Mean     | 3.122843657e7
3rd Qu.  | 20013660
Max      | 1215609480