Standard Java structures distributed

---
background-image: url(background_deck.png)
class: middle

# Standard Java data structures distributed

### Prepare
* browser windows:
  * latencies scaling - https://static.dzone.com/dz1/dz-files/DZone_PerformanceAndMonitoring_1_infographic.pdf
  * https://infinispan.org/docs/dev/titles/configuring/configuring.html#cache_mode_comparison-configuring
* terminal - 6x docker
* set-up pointing device
* set-up 2 monitors

### Abstract

Java offers a wide set of data structure implementations ready for developers. Collections are a great and powerful example.

These standard data structures are limited by borders of a single JVM. They depend on available memory within one server. They don't scale during high loads.

In-memory data grids (IMDG) may help to solve this problem. They offer distributed versions of Java data structures. Data is spread across multiple servers. Data grids provide failover features and prevent data loss when a server crashes. And you can simply scale them up and down. Let's go through the most popular Java native IMDG implementations and compare distributed data structures provided.

---

## Who is Josef Cacek

* Security Engineer / Java Developer at Hazelcast
* Father
* Runner
* Open-source contributor

???

* Hazelcast - a company developing open-source in-memory data grid and also a stream processing engine as a second project
* 4 children
* enthusiastic runner /  finished several ultramarathons
* when time and family allow, maintaining several OpenSource projects

---

## Agenda

* About caching
* Why distributed
* In-memory data grids
* Standard data structures distributed

???
* why you should consider using a distributed data-grid

---

## Have you ever used the `HashMap` for caching?

???

Hands up, who never used the `HashMap` for caching?

Me too. It's OK when you know about its limitations and you accept them

---

# Story about the caching

* web application (CRUD)
* single SQL database for all systems in the company
* users want expensive reports to be generated

???

* I was working on web applications - like registers:
  * maintaining list of objects
  * querying them and making reports from them
* The Company had a central Oracle database
* some SQL queries in the application were really complex
* so we cached results of queries which were the most time consuming

### I can cache myself ... there is the `HashMap`

```java
Map<Long, User> cache = new HashMap<>();

public User getUserById(Long id) {
  User user = cache.get(id);
  if (user == null) {
    user = dbTool.loadUser(id);
    cache.put(id, user);
  }
  return user;
}
```

---
class: middle

## This works!

# Until it doesn't.

---

## What can go wrong?

* thread safety - concurrent access to the `HashMap`
* caching with no boundaries - no cleanup
* cache may contain stale entries - no expiration mechanism
* solution doesn't scale

???

* `Collections.synchronizedMap()`
    * From Java 5 you can use the `ConcurrentHashMap`
* solution limited by a memory of one running JVM

### How can we solve it "at home"?

* better synchronization - use a `ConcurrentMap`
* implement size limit and check it during inserts
* run a cleanup job periodically

???
* complexity growing:
  * tracking age of the entries
  * scheduling extra jobs

--
* **scaling is hard!**

???

* what we throw away if we reach the limit?

---
class: center, middle
background-image: url(star-wars-sword.png)

# Use .red[in-memory data grid], Luke!

#  
#

# It scales!

---

## What is the In-Memory Data Grid then?

.center.font150[
#

*Distributed system,*  
*which holds data structures in RAM*  
*among multiple servers.*
]

???
.center[*Distributed partitioned hash map with every cluster node owning a portion of the overall data*]
.right[-- Ignite]

---
class: middle

## Why distributed cache?

???
There are 3 main reasons ....

---
background-image: url(background_deck.png)
class: middle

#  Scalability

???
* Scalability - just start another server to increase the available memory;

---
background-image: url(background_deck.png)
class: middle

#  Robustness

???
* Robustness - in default configuration every stored piece of data has its primary location within the cluster and a backup on another server

---
background-image: url(background_deck.png)
class: middle

#  Performance

???
* Performance - data lives in memory, usually in the same data center (so network hops are cheap enough);

The scale of latencies:
* https://static.dzone.com/dz1/dz-files/DZone_PerformanceAndMonitoring_1_infographic.pdf
* https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Why IMDG:
* local cache, clustered cache, remote cache
* clustering for your application
* geographical backup

* Massive heap
* High availability
* Scalability
* Data distribution
* Clients for popular programming languages
* Distributed computing

---
background-image: url(background_imdgs.png)

## Bunch of Java friendly IMDGs is out there

* Apache Ignite
* Hazelcast IMDG
* Infinispan
* ...

???

* I've picked this three free ones, because they are popular enough and I have a personal biases
  * I like Apache Foundation and projets they are covering
  * I like Hazelcast the most of course - we are the fastest :)
  * I worked for Red Hat before and I used Infinispan (within WildFly AS) even before knowing anything about the Hazelcast
* sorted alphabetically here and not by my preference

---
class: middle

## Deployment types

???
What are the **Topologies / Deployment types** of IMDGs?

---

???

2 basic deployment types

* Embedded: Great for microservices and ops simplification
* Client-server: 
  * Great for scale-up / scale-out deployments 
  * cluster lifecycle decoupled from app servers
  * Can be used for non-Java clients.

---
class: middle

## Cache modes

???

we can also group the usage by locations where we actually store the data

---

???
## Data partitioning / cache modes

* Distributed
  * not so fast access as in replicated scenarios
  * if a member fails data has to be rebalanced
  * relatively fast write access
  * **scales well** - it's not so important if you start 5 or 50 members, they know where the data belongs
* Replicated
  * faster read access
  * zero cost of failing member
  * expensive write access
  * **doesn't scale** - every change has to propagate to all members

---

## Hybrid topology

???

* you can also have cluster members which don't hold any data
  * lite members in Hazelcast terminology
  * client nodes in Ignite terminology

---
class: middle

# Java data structures distributed

---

## java.util.Map<K, V>

* object that maps keys to values
* can't contain duplicate keys

### Example

```java
Map<String, Integer> cityInhabitants = new ConcurrentHashMap<>();

cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);

//...

System.out.println("London population: " + cityInhabitants.get("London"));

```

---

## Map in Embedded Infinispan

```java
DefaultCacheManager manager = new DefaultCacheManager(
        GlobalConfigurationBuilder.defaultClusteredBuilder().build());
Configuration configuration = new ConfigurationBuilder()
    .clustering()
    .cacheMode(CacheMode.DIST_SYNC)
    .build();
manager.defineConfiguration("cityInhabitants", configuration);

Map<String, Integer> cityInhabitants = manager.getCache("cityInhabitants");

cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);

//...

System.out.println("London population: " + cityInhabitants.get("London"));
```

???
*Enterprise-like configuration, but I still remember configuring EJBs on JBoss AS in version 3, so this one is easy-peasy*

A CacheManager is the primary mechanism for retrieving a Cache instance and is often used as a starting point to using the Cache.

CacheManagers are heavyweight objects, and we foresee no more than one CacheManager being used per JVM (unless specific configuration requirements require more than one; but either way, this would be a minimal and finite number of instances).

# Eviction and Expiration

* supported, but vendor-specific

---

## Map in Embedded Hazelcast IMDG

```java
HazelcastInstance hz = Hazelcast.newHazelcastInstance();

Map<String, Integer> cityInhabitants = hz.getMap("cityInhabitants");

cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);

//...

System.out.println("London population: " + cityInhabitants.get("London"));
```

???

The configuration is more straight forward here.

Each vendor provides its specific extension of the Map API.
Check the specific return types of methods used to retrieve the Map

---

## Cache in Embedded Apache Ignite

```java
Ignite ignite = Ignition.start();

IgniteCache<String, Integer> cityInhabitants = ignite.getOrCreateCache("cityInhabitants");

cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);

//...

System.out.println("London population: " + cityInhabitants.get("London"));
```

???
Maybe you've realized here, I don't use the java.util.Map as a cache type here.

--
* The `IgniteCache` .red[doesn't implement] `java.util.Map`!

???
### It doesn't implement Map, it implements JCache

--
* The `IgniteCache` .red[implements] `javax.cache.Cache`!

---

# JCache specification (JSR-107)

* Cache is a Map-like structure
* standard API - independent on vendor implementations

???

JCache uses the top-level package name of javax.cache, and defines the following five core interfaces:

* Cache. This defines access to the actual cache, which is a map-like data structure that stores key-value pairs.
* Entry. This defines access to the key-value pairs stored in the cache.
* CacheManager. This defines how caches are managed.
* CachingProvider. This defines how CachingManagers are managed.
* ExpiryPolicy. This defines how expiration is handled for entries.

### Maven dependency

```xml
<dependency>
    <groupId>javax.cache</groupId>
    <artifactId>cache-api</artifactId>
    <version>${version.jcache}</version>
</dependency>
```

---

???
There are two defined mechanisms in which an entry can be stored in a cache. The default mechanism is called store-by-value, in which the key-value pairs are stored in the cache, and new copies of the entries are made and returned when accessed from the cache. The other (optional) mechanism is store-by-reference, in which the cache stores and returns reference to application-provided key-value pairs. This lets updates to the application-provided key-value pairs to be seen in subsequent accesses without having to update the cache entries themselves.

---
background-image: url(summary_table.png)

## More data structures

???
Similar to creating or getting a distributed Map there are other standard data structures available in the distributed world.

In this table the colors have following meaning
* green - implements the standard API
* yellow - doesn't implement or extend the Java API, but it provides similar implementation
* red - the data structure is not implemented by the IMDG

* java.util.Set<V> - *A collection that contains no duplicate elements.*
* java.util.List<V> - *Ordered collection - a.k.a. sequence.*
* java.util.concurrent.locks.Lock - provides more fine-grained locking mechanisms than synchronized methods and statements
* java.util.concurrent.Semaphore - contains a set of permits; can be used to limit access to some resource just for given number of threads
* java.util.concurrent.CountDownLatch - allows to wait until a set of operations being performed in other threads completes.
* java.util.concurrent.BlockingQueue - data structure which fits to producer-consumer scenarios; if bounded then both sides may block, otherwise only consumers
* java.util.concurrent.AtomicLong - Long implementation with atomic updates

---

## And there is more ...

* Task execution
* Messaging
* Transactions
* Stream and Batch processing

---

# Demo time

* All 3 IMDGs offer Docker images, let's use them

```bash
docker run -it --rm apacheignite/ignite:2.7.6
docker run -it --rm hazelcast/hazelcast:3.12.4
# docker run -it --rm jboss/infinispan-server:10.0.1.Final

# Can you find a difference? (wink, wink)
docker images

REPOSITORY                TAG                 IMAGE ID            CREATED             SIZE
apacheignite/ignite       2.7.6               ce9ff5b69430        2 months ago        527MB
hazelcast/hazelcast       3.12.4              32c507fed571        5 weeks ago         116MB
jboss/infinispan-server   10.0.1.Final        596b3626a09b        5 weeks ago         366MB
```

---
background-image: url(summary_table.png)

## Summary

---
class: center, middle

# .font200[Thank you]

.grey.font150.center[
`github.com/kwart`  
`twitter.com/jckwart`  
`javlog.cacek.cz`
]