| Data ingestion |
Data ingestion is fundamental, and all programming systems provide it to varying degrees. |
Robust ingestion of JSON, CSV, and CBOR files; XML and other formats can be ingested, but not robustly. |
Various (improved) packages for working with JSON, CSV, markup, images, PDF, etc., plus an umbrella ingestion function for them. |
✓ |
★★★ (2.5) |
Data::Importers,
Data::Translators,
Omni-slurping with LLMing [post]
|
| Data wrangling facilitation |
Slicing, splitting, combining, aggregating, and summarizing data can be difficult and time-consuming. |
No serious efforts, especially in terms of streamlining data wrangling workflows. |
Two major efforts for streamlining data wrangling workflows: one uses "pure" Raku (good for exploration), and the other interfaces with "outside" systems. |
✓ |
★★★ (3.25) |
Data::Reshapers,
Dan,
Introduction to data wrangling with Raku [post]
|
| Statistics for data exploration |
This includes descriptive statistics (mean, median, 5-point summary), summarization, outlier identification, and statistical distribution functions. |
Various attempts; some are basic and "plain-Raku" (e.g. Stats), some connect to GSL (and do not work on macOS). |
A couple of major efforts exist: one is an all-in-one package, the other is spread out across various packages. |
✓ |
★★★ (2.5) |
Statistics,
Statistics::Distributions,
Statistics::OutlierIdentifiers,
Data::Summarizers,
Data::TypeSystem
|
| Machine Learning (ML) algorithms (both unsupervised and supervised) |
Unsupervised ML is often used for Exploratory Data Analysis (EDA); supervised ML is used to leverage data patterns in some way, but also for certain types of EDA. |
A few packages for unsupervised ML (like Text::Markov). |
At least one supervised ML package connecting (binding) to external systems, and a set of unsupervised ML packages for clustering, association rule learning, fitting, tries with frequencies, and Recommendation Systems (RS). The RS and tries with frequencies can be used as classifiers. |
✓ |
★★★ (2.5) |
Algorithm::XGBoost,
ML::* packages
|
| Data visualization facilitation |
Insightful plots over data are used in most Data Science work. |
A few small packages for plotting, at least one connecting to external systems (like GnuPlot); none of them is that useful for Data Science. |
There are two "solid" packages for Data Science visualizations: JavaScript::D3 and JavaScript::Google::Charts. There is also an ASCII-plots package, Text::Plot, which is useful when basic, coarse plots are sufficient. |
✓ |
★★★★ |
JavaScript::D3,
JavaScript::Google::Charts,
Text::Plot,
The Raku-ju hijack hack for D3.js [video],
Geographics data in Raku demo [video]
|
| Interactive computing environment(s) |
Data exploration is done interactively, with multiple changes to the data and to the analysis or pattern-finding workflows. |
The (basic) Raku REPL, the related Emacs major mode, and the notebook environment Jupyter::Kernel. |
In addition to the pre-2021 work, there are RakuMode for Wolfram Notebooks and Jupyter::Chatbook for seamless integration with LLMs. |
✓ |
★★★★ |
Connecting Mathematica and Raku [post],
Exploratory Data Analysis with Raku [video]
|
| Literate programming (LT) |
LT is very important for Data Science (DS) because DS requires Reproducible Research. |
None, except Jupyter::Kernel, but that was not useful because of the lack of good graphics. |
LT is fully supported thanks to multiple LT solutions, strong graphics capabilities, LLM integration, and computational document converters. |
✓ |
★★★★★ (4.5) |
Notebook transformations [post],
Raku Literate Programming via command line pipelines [video],
Conversion and evaluation of Raku files [video]
|
| Data generation and retrieval |
For didactic and development purposes, random data generation and retrieval of well-known datasets are needed. |
Nothing more than the built-in Raku random generators (pick and roll). |
Generators of random strings, words, pet names, date-times, distribution variates, tabular datasets. Popular datasets from the R-ecosystem can be downloaded and cached. |
𖧋 |
★★★★ (3.5) |
Data::Generators,
Data::ExampleDatasets,
Data::Geographics,
Geographics data in Raku demo [video]
|
| External Data Science (DS) and Machine Learning (ML) orchestration |
An effective way to do DS and ML and to easily move the developed computations to other systems. It allows reuse and gives confidence that the utilized DS or ML algorithms are properly implemented and fast. |
Various projects connecting to database systems (e.g. MySQL). |
The project "Dan" provides bindings to the data wrangling library Polars. The project "H2O::Client" aims at providing both data wrangling and ML orchestrations to H2O.ai. |
𖧋 |
★★★ (2.5) |
Dan,
Proc::ZMQed,
WWW::WolframAlpha,
H2O::Client
|
| Interactive interfaces to parameterized workflows (dashboards) |
Very useful for getting data insights by dynamically changing different statistics based on parameters. |
None. |
One effort, Air::Examples, brings interactivity via HTMX using the Cro package set and templates. Since Google Charts provides interactivity, JavaScript::Google::Charts can be extended to have those kinds of controls and dashboards. |
𖧋 |
★ |
Air::Examples,
JavaScript::Google::Charts
|