First, make sure that NeoCSV, DataFrame, Roassal and JupyterTalk are installed in your kernel image. Although it is possible to load packages from the Jupyter client, we suggest loading them from a workspace or the catalog browser and then saving the image to keep the changes. Evaluating "Smalltalk saveSession" from the client could break the image: normally the kernel just loses the connection and you have to close and halt it from the File menu, but sometimes the Pharo image itself can be damaged.
"Install NeoCSV"
Gofer it
smalltalkhubUser: 'SvenVanCaekenberghe' project: 'Neo';
configurationOf: 'NeoCSV';
loadStable.
"Install Roassal"
Gofer it
smalltalkhubUser: 'ObjectProfile' project: 'Roassal2';
configurationOf: 'Roassal2';
loadStable.
"Install DataFrame and JupyterTalk"
Metacello new
baseline: 'DataFrame';
repository: 'github://PolyMathOrg/DataFrame';
load.
Metacello new
baseline: 'JupyterTalk';
repository: 'github://jmari/JupyterTalk/repository';
load:'all'
"This file is windows-1258 encoded, so we have to read it into the Pharo kernel using the matching encoding"
stream := ZnCharacterReadStream
on:'/Users/Cat/Dropbox/Master Ciencia de les dades/S1.1.mineria de dades/PAC2/countries.csv'
asFileReference binaryReadStream
encoding: #windows1258.
arrayOfRows := (NeoCSVReader on: stream)
separator: $;;
upToEnd.
paisos := DataFrame fromRows:(arrayOfRows copyFrom:2 to:arrayOfRows size).
paisos columnNames: (arrayOfRows at:1)
Before loading the countries.csv file, open it in your favourite text editor and check its character encoding, its field separator and its decimal point character. As you will see, it is windows-1258 encoded, its fields are separated by ';' and it uses the European floating point format, where ',' is the decimal point.
We use Zinc streams to decode the content and NeoCSV to parse it into an Array of rows.
Let's create a DataFrame. Since the first row of this Array contains the column headers, we create the DataFrame from the rows starting at the second one up to the end, and we assign the column names from the first row of the Array.
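Because the file uses ',' as the decimal point, fractional values read by NeoCSV stay as Strings until they are converted explicitly. A minimal sketch of one way to do that conversion, assuming a made-up raw value (`'3,14'` is illustrative and not taken from countries.csv) and using only core String methods:

```smalltalk
"Hypothetical raw value, as it might arrive from a CSV that uses a decimal comma."
raw := '3,14'.

"Swap the decimal comma for a point, then let Pharo parse the number."
number := (raw copyReplaceAll: ',' with: '.') asNumber.
```

The same expression could be used inside a `collect:` block when a whole column needs converting.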
Let's show rows one to thirty-three, including the header row. Note that the encoding is correct.
"JupyterTalk will transform those Strings to utf-8, look at:#32 Cameroon | Yaoundé"
self display openInJupyter: (OrderedCollection new
add:paisos columnNames;
addAll: (paisos asArrayOfRows copyFrom:1 to:33);
yourself) .
Now, let's get a basic boxplot from the 'URBAN_POPULATION' column. We first need to change the column type from String to Integer.
The boxplot is part of the Roassal package. DataFrame has helper methods to draw basic statistical plots using Roassal.
"CONVERT URBAN_POPULATION TO INTEGER"
newCol := (paisos column:#URBAN_POPULATION) collect:[:v| v ifNil:[0] ifNotNil:[v asInteger]].
paisos column:#URBAN_POPULATION put:newCol.
b := (paisos column:#URBAN_POPULATION) boxplot.
self display openInJupyter: b
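The conversion step above can be sketched on a plain Array, without DataFrame. This is only an illustration under assumed data (the three values are made up): missing cells are nil and are replaced by 0, while the remaining Strings are parsed with `asInteger`.

```smalltalk
"Made-up column content: two numeric strings and one missing value."
rawColumn := #('12' nil '7').

"Replace nil by 0 and parse everything else as an Integer."
converted := rawColumn collect: [ :v |
	v ifNil: [ 0 ] ifNotNil: [ :s | s asInteger ] ].
```

Note that replacing missing values by 0 shifts the statistics shown in the boxplot, so it is a choice worth keeping in mind when reading the plot.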
Now, let's create a histogram from a quantitative column. We will use the POPULATION column, so
first we need to change the column type to Integer and calculate its maximum and minimum using the DataSeries methods max and min.
We will use a Bag to group the POPULATION values into bins: each value is mapped to its position, from 0 to 100, within the range between the minimum and the maximum.
"CONVERT POPULATION TO INTEGER"
newCol := (paisos column:#POPULATION) collect:[:v| v ifNil:[nil] ifNotNil:[v asInteger]].
paisos column:#POPULATION put:newCol.
maxPopulation := (paisos column:#POPULATION) max.
minPopulation := (paisos column:#POPULATION) min.
domainSize := maxPopulation - minPopulation.
bag := Bag new.
newCol do:[:each| bag add:((((each - minPopulation)/ domainSize*100) asInteger) ) ].
b:= RTGrapher new.
ds := RTData new.
ds barShape width: 10.
ds points: bag valuesAndCounts.
b add: ds.
uuid :=self display openInJupyter: b extent:600@650
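The binning used for the histogram can be tried on its own with a small collection. A minimal sketch, assuming four made-up values; each value is mapped to an integer between 0 and 100 according to its position in the range, and the Bag counts how many values fall on each position:

```smalltalk
"Made-up sample values."
values := #(10 10 60 110).
lowest := values min.
highest := values max.
range := highest - lowest.

"Map every value to 0..100 and count the occurrences of each bin."
bins := Bag new.
values do: [ :each |
	bins add: ((each - lowest) / range * 100) asInteger ].

bins occurrencesOf: 0.   "2, both values equal to 10 fall in bin 0"
bins occurrencesOf: 50.  "1"
```

If all values were equal, `range` would be 0 and the division would fail, so real data may need a guard for that case.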
We can modify an object shown above. #openInJupyter: returns a UUID for each displayed object. We can use this UUID to refresh the picture after changing any of its properties.
"we can modify an object shown in other response."
b axisX title:'Population in million of people'.
b axisY title:'Number of countries'.
self display refresh: uuid
Let's build a richer visualization using Roassal, one that we can interact with.
To do so, we first need to include the Roassal JavaScript library in our Jupyter notebook.
self loadScript: IPRoassal js
| view coll col2 n b |
view := RTView new.
coll := (paisos column:#POPULATION).
n := RTMultiLinearColorForIdentity new objects: coll.
coll
doWithIndex: [ :r :index |
view
add:
((RTBox new
color: [ :value | n rtValue: r ];
size: r/ 1000)
elementOn: index) ].
col2 := (paisos column:#NAME).
RTFlowLayout new applyOn: view elements.
view elements do: [ :e | e @ (RTPopup text: [ :el | col2 at:el ]) ].
b := RTAxisAdaptedBuilder new.
b view: view.
b margin: 20.
b objects: view elements.
b build.
self display
interactionOn;
openInJupyter: b extent:650@640
Now, let's get a bar chart showing the number of countries in each continent. We use a Bag to get a count for each continent.
bag := Bag new.
(paisos column:#CONTINENT) do:[:each| bag add:each ].
b:= RTGrapher new.
ds := RTData new.
ds barShape width: 20.
ds interaction highlight.
ds points: bag valuesAndCounts associations.
ds y: [:each| each value].
ds barChartWithBarTitle: [:each|each key].
b add: ds.
b axisX
noTick;
noLabel.
self display openInJupyter: b extent:640@600
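The counting step behind this bar chart can be sketched with a plain collection. The continent names below are made up for illustration; `valuesAndCounts` answers a Dictionary from each distinct value to its count, and `associations` turns it into key -> value pairs, which is the shape the `y:` and bar title blocks above read from.

```smalltalk
"Made-up category values."
continents := #('Europe' 'Asia' 'Europe' 'Africa' 'Asia' 'Europe').

"A Bag keeps one counter per distinct value."
counts := continents asBag.

"Pairs of the form category -> count, in no particular order."
pairs := counts valuesAndCounts associations.

counts occurrencesOf: 'Europe'.  "3"
```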
We will finish this tutorial by getting basic statistics from the POPULATION column.
self display openInJupyter:(paisos column:#POPULATION) summary asStringTable