[Photo by jim gade on Unsplash, modified]

Data Science: a branch of computer science that studies how to use, store, and analyze data in order to derive information from it.

With this mini-series we are going to explore how to use some Rusty tools to accomplish the tasks that are the bread and butter of any Data Scientist.

The final goal is to show that Rust can be employed in this field, and how. Ultimately, our goal is also to spark interest in this field of application: the author is persuaded that Rust should prove very useful in the field of Data Science (as well as Machine Learning and ultimately AI).

You can find this article's code in the repo: github.com/davidedelpapa/rdatascience-tut1

## Setting the stage for this tutorial

There are a few crates we are going to cover in this tutorial. However, we are going to introduce them as we go.

Let's start our project the standard rusty way.

```
cargo new rdatascience-tut1 && cd rdatascience-tut1
cargo add ndarray ndarray-rand ndarray-stats noisy_float poloto
code .
```

I am currently using `cargo add` from the good cargo-edit (quick install: `cargo install cargo-edit`) to handle dependencies, and Visual Studio Code as dev IDE.

Feel free to handle *Cargo.toml* dependencies by hand, or use a different IDE.

### ndarray: what is it, and why use it?

ndarray is a Rust crate used to work with arrays.

It covers all the classic uses of an array handling framework (such as `numpy` for Python). Some use cases which are not covered by the main crate are covered by companion crates, such as ndarray-linalg for linear algebra, ndarray-rand to generate randomness, and ndarray-stats for statistics.

Additionally, `ndarray` has some nice extras, such as support for rayon for parallelization, or for the popular BLAS low-level specs through one of the working back-ends (using blas-src).

#### Why use ndarray?

In Rust there are already arrays (or lists), and also vectors, and the language itself allows for many different types of manipulation through powerful iterators.

Moreover, what the bare Rust language offers (together with `std`) is often faster than what other, more popular languages provide; still, `ndarray` is specialized to handle n-dimensional arrays with a mathematical end in view.

Thus `ndarray` builds on the power already provided by the language; Rust's power is one of the reasons why the author is persuaded that Rust will be the language of Data Science in the next few years.

## ndarray Quick-Start

At the top of our *src/main.rs* we are going to import as usual:

```
use ndarray::prelude::*;
```

We have almost everything we need in the prelude.

We can start to put stuff inside `fn main()`.

### Array creation

Let's start to see how we can create arrays:

```
let arr1 = array![1., 2., 3., 4., 5., 6.];
println!("1D array: {}", arr1);
```

`ndarray` provides the `array!` macro, which detects which type of `ArrayBase` is needed. In this case it is a 1-D, that is, a one-dimensional array. Notice that the underlying `ArrayBase` already implements the `std::fmt::Display` trait.

Compare it to the standard Rust array (let's call them *lists* in order not to confuse them with `ndarray`'s arrays) and Vec:

```
// 1D ndarray VS 1D list VS 1D Vec
let arr1 = array![1., 2., 3., 4., 5., 6.];
println!("1D array: \t{}", arr1);
let ls1 = [1., 2., 3., 4., 5., 6.];
println!("1D list: \t{:?}", ls1);
let vec1 = vec![1., 2., 3., 4., 5., 6.];
println!("1D vector: \t{:?}", vec1);
```

And the result:

```
1D array: [1, 2, 3, 4, 5, 6]
1D list: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
1D vector: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

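Notice too that the array, printed through `Display` (`{}`), shows the floats without a decimal part, since they are all `.0`; the list and the vector, printed through `Debug` (`{:?}`), keep it.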

### Array Sum

Let's try to sum 2 arrays element by element:

```
let arr2 = array![1., 2.2, 3.3, 4., 5., 6.];
let arr3 = arr1 + arr2;
println!("1D array: \t{}", arr3);
```

Let's see how it compares with standard arrays (lists) and vectors:

```
let arr2 = array![1., 2.2, 3.3, 4., 5., 6.];
let arr3 = arr1 + arr2;
println!("1D array: \t{}", arr3);
let ls2 = [1., 2.2, 3.3, 4., 5., 6.];
let mut ls3 = ls1.clone();
for i in 0..ls2.len() {
    ls3[i] = ls1[i] + ls2[i];
}
println!("1D list: \t{:?}", ls3);
let vec2 = vec![1., 2.2, 3.3, 4., 5., 6.];
let vec3: Vec<f64> = vec1.iter().zip(vec2.iter()).map(|(&e1, &e2)| e1 + e2).collect();
println!("1D vec: \t{:?}", vec3);
```

The result is:

```
1D array: [2, 4.2, 6.3, 8, 10, 12]
1D list: [2.0, 4.2, 6.3, 8.0, 10.0, 12.0]
1D vec: [2.0, 4.2, 6.3, 8.0, 10.0, 12.0]
```

As you can see, with Rust's standard tools things become more complicated very soon. To perform an element-by-element sum we need a `for` loop or (as we did for the Vec) iterators, which are powerful, but rather verbose to use in such a day-to-day Data Science scenario.
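For completeness, lists can also be summed element by element without an index loop. The following is only a sketch in plain Rust: it assumes Rust 1.63 or later (for `std::array::from_fn`), and `add_lists` is a hypothetical helper name that does not appear in the article's repo.

```rust
/// Element-wise sum of two fixed-size arrays (the "lists" of this article).
/// `add_lists` is a hypothetical helper, used only for illustration.
fn add_lists<const N: usize>(a: [f64; N], b: [f64; N]) -> [f64; N] {
    // Build the result by calling the closure once for every index 0..N.
    std::array::from_fn(|i| a[i] + b[i])
}

fn main() {
    let ls1 = [1., 2., 3., 4., 5., 6.];
    let ls2 = [1., 2.2, 3.3, 4., 5., 6.];
    let ls3 = add_lists(ls1, ls2);
    println!("1D list: \t{:?}", ls3);
}
```

It works, but we had to write the helper ourselves, and it only covers one operation on one kind of container, which is the point being made here.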

## 2D arrays & more

Let's quickly abandon the examples using Rust's standard constructs, since, as we have shown, they are more complex, and let us focus on `ndarray`.

`ndarray` offers various methods to create and instantiate (and use) 2D arrays.

Just look at this example:

```
let arr4 = array![[1., 2., 3.], [ 4., 5., 6.]];
let arr5 = Array::from_elem((2, 1), 1.);
let arr6 = arr4 + arr5;
println!("2D array:\n{}", arr6);
```

with its output:

```
2D array:
[[2, 3, 4],
[5, 6, 7]]
```

With the macro `array!` we need to specify all elements, while with `Array::from_elem` we need to offer a `Shape`, in this case `(2, 1)`, and an element to fill the array, in this case `1.0`: it will fill the whole shape for us with the selected element.
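Notice that in `arr4 + arr5` the two shapes differ, `(2, 3)` against `(2, 1)`: the single column of `arr5` is broadcast across the three columns of `arr4`, as the output above shows. What that amounts to can be sketched in plain Rust; `add_column` is a hypothetical helper, and this is not how `ndarray` implements broadcasting internally.

```rust
/// Adds one value per row to every element of that row, which is what
/// broadcasting a (rows, 1) column over a (rows, cols) matrix amounts to.
/// `add_column` is a hypothetical helper, used only for illustration.
fn add_column(matrix: &[Vec<f64>], column: &[f64]) -> Vec<Vec<f64>> {
    assert_eq!(matrix.len(), column.len(), "one column value per row");
    matrix
        .iter()
        .zip(column)
        .map(|(row, &c)| row.iter().map(|&x| x + c).collect())
        .collect()
}

fn main() {
    let arr4 = vec![vec![1., 2., 3.], vec![4., 5., 6.]];
    let arr5 = [1., 1.]; // the (2, 1) column, flattened
    println!("{:?}", add_column(&arr4, &arr5));
}
```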

```
let arr7 = Array::<f64, _>::zeros(arr6.raw_dim());
let arr8 = arr6 * arr7;
println!("\n{}", arr8);
```

Which outputs:

```
[[0, 0, 0],
[0, 0, 0]]
```

`Array::zeros(Shape)` creates an array of the given `Shape` filled with zeros.

Notice that sometimes the compiler cannot infer the type of zero to feed in (you almost forgot Rust has got a *nice* type system, didn't you?), so we help it with the annotation `Array::<f64, _>`, which gives the element type and lets the compiler infer the dimensionality (`_`).

The function `.raw_dim()`, as you can imagine, gives the shape of the matrix.

Let's create an identity matrix now (a 2-dimensional square array with ones on the diagonal and zeros everywhere else):

```
let identity: &Array2<f64> = &Array::eye(3);
println!("\n{}", identity);
```

Which outputs:

```
[[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]
```

We helped the compiler by providing the shape and type, but this time using a specialized form of `ArrayBase`, that is, `Array2`, which represents 2-dimensional arrays. Notice that we created a reference so that we can re-use the variable without incurring the ire of the borrow checker (yes, it is always working; did you forget that as well?).

Let's explore now the use of an identity matrix:

```
let arr9 = array![[1., 2., 3.], [ 4., 5., 6.], [7., 8., 9.]];
let arr10 = &arr9 * identity;
println!("\n{}", arr10);
```

Outputs:

```
[[1, 0, 0],
[0, 5, 0],
[0, 0, 9]]
```

From my math classes I remember that the identity matrix should give back the same matrix when multiplied...

Yes, of course, we are not doing *dot* multiplication! With the `*` operator it does not work.

In fact, with matrices there is an element-wise multiplication, which is what `&arr9 * identity` does, but there is also a matrix multiplication, which is done by:

```
let arr11 = arr9.dot(identity);
println!("\n{}", arr11);
```

which finally outputs:

```
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
```

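Of course, `ndarray` can also handle an array with a single element. Note that `array![2.]` is still a 1-D array of length 1 (a true 0-D array, holding just an element, can be created with `arr0`), as `ndim()` shows: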

```
println!("\n{}", array![2.]);
println!("Dimensions: {}", array![2.].ndim());
```

which correctly outputs:

```
[2]
Dimensions: 1
```

Likewise, we could go to 3D or more:

```
let arr12 = Array::<i8, _>::ones((2, 3, 2, 2));
println!("\nMULTIDIMENSIONAL\n{}", arr12);
```

Guessed its output?

```
MULTIDIMENSIONAL
[[[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]]],
[[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]],
[[1, 1],
[1, 1]]]]
```

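It is a vector of 2 elements, repeated 2 times to form a 2x2 block, which is repeated 3 times, and all of that is repeated 2 times again; just read the shape `(2, 3, 2, 2)` from right to left to unpack it from smaller to bigger (and vice-versa).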

If it is still unclear, don't worry: we are here for the programming more than for the math/stats behind it.
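If you prefer to think in code, reading the shape from right to left is the same as saying that the last index moves fastest. The sketch below assumes the default row-major layout, and `flat_index` is a hypothetical helper showing where an element of a `(2, 3, 2, 2)` array would sit in a flat buffer of 24 elements.

```rust
/// Position of a multi-dimensional index in a flat, row-major buffer.
/// `flat_index` is a hypothetical helper, used only for illustration.
fn flat_index(shape: &[usize], index: &[usize]) -> usize {
    assert_eq!(shape.len(), index.len(), "one index per dimension");
    // Horner-style accumulation: the last dimension varies fastest.
    shape
        .iter()
        .zip(index)
        .fold(0, |acc, (&dim, &i)| acc * dim + i)
}

fn main() {
    let shape = [2, 3, 2, 2];
    // Moving along the last axis advances by 1 element...
    println!("{}", flat_index(&shape, &[0, 0, 0, 1]));
    // ...moving along the second axis skips a whole 2x2 block...
    println!("{}", flat_index(&shape, &[0, 1, 0, 0]));
    // ...and moving along the first axis skips three such blocks.
    println!("{}", flat_index(&shape, &[1, 0, 0, 0]));
}
```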

## Let's add some randomness to the mess!

We also loaded `ndarray-rand`, which we briefly described earlier, into our *Cargo.toml*.

This package adds the power of the rand crate (which it re-exports as a sub-module) to your *ndarray ecosystem*.

In order to see some examples, let's add the following in the `use` section of our *src/main.rs*:

```
use ndarray_rand::{RandomExt, SamplingStrategy};
use ndarray_rand::rand_distr::Uniform;
```

Then we can get an array of shape `(2, 5)`, for example, filled with a uniform distribution between 0 and 10 (floats, though):

```
let arr13 = Array::random((2, 5), Uniform::new(0., 10.));
println!("{:5.2}", arr13);
```

Which results, for example, in:

```
[[ 2.04, 0.15, 6.66, 3.06, 0.91],
[ 8.18, 6.08, 6.99, 4.45, 5.27]]
```

Results should vary at each run, since the distribution is (pseudo)random.

We can also *"pick"* data from an array (sampling) in the following way:

```
let arr14 = array![1., 2., 3., 4., 5., 6.];
let arr15 = arr14.sample_axis(Axis(0), 2, SamplingStrategy::WithoutReplacement);
println!("\nSampling from:\t{}\nTwo elements:\t{}", arr14, arr15);
```

Which may result in:

```
Sampling from: [1, 2, 3, 4, 5, 6]
Two elements: [4, 2]
```

Let me show another way of sampling, which involves the use of the `rand` crate and the creation of an array from a vector.

We first need the following added to the `use` section:

```
use ndarray_rand::rand as rand;
use rand::seq::IteratorRandom;
```

So we use the `rand` crate as re-exported by `ndarray-rand`.

Then we can do the following (example in the rand docs, adapted):

```
let mut rng = rand::thread_rng();
let faces = "😀😎😐😕😠😢";
let arr16 = Array::from_shape_vec((2, 2), faces.chars().choose_multiple(&mut rng, 4)).unwrap();
println!("\nSampling from:\t{}", faces);
println!("Elements:\n{}", arr16);
```

We first define the `thread_rng` to be used, then we set a string containing the emoji we want to select from.

Then we create an array from a vector, giving a shape. The shape we chose is `(2, 2)`, and the vector is created using a particular `IteratorRandom` method, i.e., `choose_multiple`, which extracts 4 elements (chars) at random from the string.

The output could be, for example:

```
Sampling from: 😀😎😐😕😠😢
Elements:
[[😎, 😐],
[😢, 😠]]
```

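Beware though not to over-sample: if you ask for more elements than the iterator holds, `choose_multiple` simply returns fewer elements than requested, and the resulting vector no longer matches the shape.

That is why `Array::from_shape_vec` returns a `Result` stating whether it could create an array or not (a `Result` which we simply unwrap, so a mismatch would make our program panic).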
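The kind of check behind that `Result` can be sketched in plain Rust. Here `to_rows` is a hypothetical stand-in that builds nested vectors rather than an `ndarray` array, and its error type (`String`) is an assumption made for brevity; it only illustrates why a vector of the wrong length cannot be given the requested shape.

```rust
/// Splits a flat vector into `rows` rows of `cols` elements each,
/// failing when the length does not match the requested shape.
/// `to_rows` is a hypothetical helper, used only for illustration.
fn to_rows<T>(rows: usize, cols: usize, data: Vec<T>) -> Result<Vec<Vec<T>>, String> {
    if data.len() != rows * cols {
        return Err(format!(
            "shape ({}, {}) needs {} elements, got {}",
            rows,
            cols,
            rows * cols,
            data.len()
        ));
    }
    let mut out = Vec::with_capacity(rows);
    let mut it = data.into_iter();
    for _ in 0..rows {
        // Take the next `cols` elements as one row.
        out.push(it.by_ref().take(cols).collect());
    }
    Ok(out)
}

fn main() {
    // Four elements fit a (2, 2) shape...
    println!("{:?}", to_rows(2, 2, vec!['a', 'b', 'c', 'd']));
    // ...three do not, so we get an error instead of a panic.
    println!("{:?}", to_rows(2, 2, vec!['a', 'b', 'c']));
}
```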

## Let's do some stats and visualize something, shall we?

Before introducing visualization, let's introduce the crate ndarray-stats and also the crate noisy_float, which is a must when using `ndarray-stats`.

First of all, we start with a Standard Normal Distribution, randomly created.

First we change the earlier `rand_distr` import so that it also brings in `StandardNormal`:

```
use ndarray_rand::rand_distr::{Uniform, StandardNormal};
```

in its proper place, then:

```
let arr17 = Array::<f64, _>::random_using((10000,2), StandardNormal, &mut rand::thread_rng());
```

This way we have a 2D array with 10,000 pairs of elements.

Then we also add to the `use` section the imports we need to do statistics:

```
use ndarray_stats::HistogramExt;
use ndarray_stats::histogram::{strategies::Sqrt, GridBuilder};
use noisy_float::types::{N64, n64};
```

Now we need to transform each element from a float into a noisy float. I will not go into explaining noisy floats; just consider one as a float that can't silently fail (be a `NaN`). Besides, this way it is orderable, which is what `ndarray-stats` needs to create a histogram.
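Why does ordering matter? A plain `f64` only implements `PartialOrd`, because a `NaN` cannot be compared with anything, so it cannot be sorted with `sort()` or used where a total order is required. The idea can be sketched in plain Rust with a hypothetical `NotNan` wrapper; this is not how noisy_float is implemented, and its types are far more complete.

```rust
use std::cmp::Ordering;

/// A float that is guaranteed not to be NaN (hypothetical, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct NotNan(f64);

impl NotNan {
    /// Refuses to wrap a NaN, so every `NotNan` can be compared.
    fn new(value: f64) -> Option<Self> {
        if value.is_nan() {
            None
        } else {
            Some(NotNan(value))
        }
    }
}

// With NaN ruled out, equality and ordering are total.
impl Eq for NotNan {}

impl Ord for NotNan {
    fn cmp(&self, other: &Self) -> Ordering {
        self.partial_cmp(other).expect("NotNan never holds a NaN")
    }
}

fn main() {
    // Comparing against NaN gives no answer at all.
    println!("{:?}", 1.0_f64.partial_cmp(&f64::NAN));

    // The wrapper can be sorted with the ordinary `sort()`.
    let mut values: Vec<NotNan> = [2.5, -1.0, 0.0]
        .iter()
        .filter_map(|&v| NotNan::new(v))
        .collect();
    values.sort();
    println!("{:?}", values);
}
```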

In order to perform an operation by value on each element of the ndarray, we will use the function `mapv()`, which is akin to the standard `map()` for iterators.

```
let data = arr17.mapv(|e| n64(e));
```

At this point, we can create a grid for our histogram (a grid is needed to divide the data into bins); we try to infer the best way, using `strategies::Sqrt` (a strategy used by many programs, including MS Excel):

```
let grid = GridBuilder::<Sqrt<N64>>::from_array(&data).unwrap().build();
```

Now that we have a grid, that is, a way to divide our raw data to prepare our histogram, we can create such a histogram:

```
let histogram = data.histogram(grid);
```

In order to get the underlying counts matrix, we can simply state:

```
let histogram_matrix = histogram.counts();
```

The counts matrix just states how many observations fall into each cell of the grid; since each of our observations is a pair of values, the grid (and so the counts matrix) is two-dimensional.

Ok, now we have a histogram... but how could we visualize it?

Well, before visualizing our data we should prepare it for visualization.

The problem we face is that we have the counts of a grid, but to plot it we really need, for each bin, the total number of elements in that bin; that is, we should sum all the elements vertically.

In order to do so, we need to sum along `Axis(0)` of the ndarray:

```
let data = histogram_matrix.sum_axis(Axis(0));
```

Now we have a 1D ndarray containing all the sums of the grid. At this point we can establish that each sum is a different bin, and enumerate them. We will transform it all into a vector of tuples, in order to prepare it for the visualization tool: the first element of each tuple is the bin number, and the second is the height of the bin.

```
let his_data: Vec<(f32, f32)> = data.iter().enumerate().map(|(bin, count)| (bin as f32, *count as f32)).collect();
```
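The two preparation steps (summing the grid vertically, then pairing each sum with its bin number) can be sketched in plain Rust on a tiny counts grid. Both function names are hypothetical, and nested vectors stand in for the `ndarray` counts matrix.

```rust
/// Sums a 2D grid of counts over its rows, leaving one total per column
/// (the plain-Rust counterpart of summing along axis 0).
fn sum_over_rows(counts: &[Vec<usize>]) -> Vec<usize> {
    let n_cols = counts.first().map_or(0, |row| row.len());
    let mut sums = vec![0; n_cols];
    for row in counts {
        for (sum, &c) in sums.iter_mut().zip(row) {
            *sum += c;
        }
    }
    sums
}

/// Pairs every total with its bin number: (bin, height).
fn to_plot_points(sums: &[usize]) -> Vec<(f32, f32)> {
    sums.iter()
        .enumerate()
        .map(|(bin, &height)| (bin as f32, height as f32))
        .collect()
}

fn main() {
    // A toy 2x3 grid of counts.
    let counts = vec![vec![1, 2, 3], vec![4, 5, 6]];
    let sums = sum_over_rows(&counts);
    println!("{:?}", sums);
    println!("{:?}", to_plot_points(&sums));
}
```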

Remember: this is just a mock dataset, based on a pseudorandom generator of a standard normal distribution (i.e., a Gaussian distribution centered on `0.0`, with a standard deviation of `1`). Still, we should see a rough Gaussian on a histogram.

### DataViz

In order to visualize things we will use poloto, which is one of many plotting crates for Rust.

It is a simple one, meaning we do not need many lines of code to have something to see on our screen.

We will not import it in the `use` section, because it is very simple. Let me explain how to plot a histogram in three steps:

Step one - create a file to store our graph:

```
let file = std::fs::File::create("standard_normal_hist.svg").unwrap();
```

Step two - create a histogram out of the data:

```
let mut graph = poloto::plot("Histogram", "x", "y");
graph.histogram("Stand.Norm.Dist.", his_data).xmarker(0).ymarker(0);
```

We create a `Plotter` object, assigning it a title and a label for each axis.

Then, we plot our histogram on it, assigning its name in the legend (`"Stand.Norm.Dist."`).

Step three - write the graph on disk:

```
graph.simple_theme(poloto::upgrade_write(file));
```

As simple as that!

Let's admire our work of (random) art:

OK, let's try something different: let's view our data as a scatter plot. Since our mock data follows a Standard Normal Distribution, if we have N pairs of coordinates, the scatter plot should look like a cloud centered on the `0,0` coordinates.

Let's visualize it!

```
let arr18 = Array::<f64, _>::random_using((300, 2), StandardNormal, &mut rand::thread_rng());
let data: Vec<(f64, f64)> = arr18.axis_iter(Axis(0)).map(|e| {
let v = e.to_vec();
(v[0], v[1])
}).collect();
```

We created 300 pairs of random numbers centered around `(0, 0)`, according to a Standard Normal Distribution.

Then we transformed that array into a `Vec<(f64, f64)>`, because the `poloto` library only graphs `[f64; 2]` or whatever can be converted to an `AsF64`.

We will also add two lines to show the center of our graph:

```
let x_line = [[-3,0], [3,0]];
let y_line = [[0,-3], [0, 3]];
```

Next we create a file, plot, and save, just as we did for the histogram:

```
let file = std::fs::File::create("standard_normal_scatter.svg").unwrap(); // create file on disk
let mut graph = poloto::plot("Scatter Plot", "x", "y"); // create graph
graph.line("", &x_line);
graph.line("", &y_line);
graph.scatter("Stand.Norm.Dist.", data).ymarker(0);
graph.simple_theme(poloto::upgrade_write(file));
```

That's it! We can admire our random creation now:

## Conclusion

I think this should wrap it up for today.

We saw how to use `ndarray`

(in a basic form), and how it differs from Rust arrays and vectors.

We also saw some of its companion crates that complete the ecosystem, providing randomness and some statistics features.

We saw also a way to plot graphs with data, showing how to plot a histogram, a scatter plot, and some lines.

I hope this will be a good starting point to delve deeper into the use of Rust for Data Science.

That's all folks for today, see you next time!

## Discussion (5)

You had me worried for a minute. The article only uses 300 points, which produces a disappointing graph. The code in github uses 10,000 points, which is much more satisfying.

`let arr18 = Array::<f64, _>::random_using((300, 2), StandardNormal, &mut rand::thread_rng());`

Good catch... in fact I just forgot to update the numbers, but the image refers to the GitHub repo

Small correction: There is a snippet that says `use` twice, i.e.

Edited. Thank you!

I'm trying.