+ - 0:00:00
Notes for current slide

Prepare

Abstract

Java offers a wide set of data structure implementations ready for developers. Collections are a great and powerful example.

These standard data structures are limited by borders of a single JVM. They depend on available memory within one server. They don't scale during high loads.

In-memory data grids (IMDG) may help to solve this problem. They offer distributed versions of Java data structures. Data is spread across multiple servers. Data grids provide failover features and prevent data loss when a server crashes. And you can simply scale them up and down. Let's go through the most popular Java native IMDG implementations and compare distributed data structures provided.

Notes for next slide
  • Hazelcast - a company developing open-source in-memory data grid and also a stream processing engine as a second project
  • 4 children
  • enthusiastic runner / finished several ultramarathons
  • when time and family allow, maintaining several OpenSource projects

Standard Java data structures distributed

`Josef Cacek

Prepare

Abstract

Java offers a wide set of data structure implementations ready for developers. Collections are a great and powerful example.

These standard data structures are limited by borders of a single JVM. They depend on available memory within one server. They don't scale during high loads.

In-memory data grids (IMDG) may help to solve this problem. They offer distributed versions of Java data structures. Data is spread across multiple servers. Data grids provide failover features and prevent data loss when a server crashes. And you can simply scale them up and down. Let's go through the most popular Java native IMDG implementations and compare distributed data structures provided.

Who is Josef Cacek

  • Security Engineer / Java Developer at Hazelcast
  • Father
  • Runner
  • Open-source contributor
  • Hazelcast - a company developing open-source in-memory data grid and also a stream processing engine as a second project
  • 4 children
  • enthusiastic runner / finished several ultramarathons
  • when time and family allow, maintaining several OpenSource projects

Agenda

  • About caching
  • Why distributed
  • In-memory data grids
  • Standard data structures distributed
  • why you should consider using a distributed data-grid

Have you ever used the HashMap for caching?

Hands up, who never used the HashMap for caching?

Me too. It's OK when you know about its limitations and you accept them

Story about the caching

  • web application (CRUD)
  • single SQL database for all systems in the company
  • users want expensive reports to be generated
  • I was working on web applications - like registers:
    • maintaining list of objects
    • querying them and making reports from them
  • The Company had a central Oracle database
  • some SQL queries in the application were really complex
  • so we cached results of queries which were the most time consuming

Story about the caching

  • web application (CRUD)
  • single SQL database for all systems in the company
  • users want expensive reports to be generated

I can cache myself ... there is the HashMap

  • I was working on web applications - like registers:
    • maintaining list of objects
    • querying them and making reports from them
  • The Company had a central Oracle database
  • some SQL queries in the application were really complex
  • so we cached results of queries which were the most time consuming

Story about the caching

  • web application (CRUD)
  • single SQL database for all systems in the company
  • users want expensive reports to be generated

I can cache myself ... there is the HashMap

Map<Long, User> cache = new HashMap<>();
public User getUserById(Long id) {
User user = cache.get(id);
if (user == null) {
user = dbTool.loadUser(id);
cache.put(id, user);
}
return user;
}
  • I was working on web applications - like registers:
    • maintaining list of objects
    • querying them and making reports from them
  • The Company had a central Oracle database
  • some SQL queries in the application were really complex
  • so we cached results of queries which were the most time consuming

This works!

This works!

Until it doesn't.

What can go wrong?

What can go wrong?

  • thread safety - concurrent access to the HashMap
  • caching with no boundaries - no cleanup
  • cache may contain stale entries - no expiration mechanism
  • solution doesn't scale
  • Collections.synchronizedMap()
    • From Java 5 you can use the ConcurrentHashMap
  • solution limited by a memory of one running JVM

What can go wrong?

  • thread safety - concurrent access to the HashMap
  • caching with no boundaries - no cleanup
  • cache may contain stale entries - no expiration mechanism
  • solution doesn't scale

How can we solve it "at home"?

  • better synchronization - use a ConcurrentMap
  • implement size limit and check it during inserts
  • run a cleanup job periodically
  • Collections.synchronizedMap()
    • From Java 5 you can use the ConcurrentHashMap
  • solution limited by a memory of one running JVM
  • complexity growing:
    • tracking age of the entries
    • scheduling extra jobs

What can go wrong?

  • thread safety - concurrent access to the HashMap
  • caching with no boundaries - no cleanup
  • cache may contain stale entries - no expiration mechanism
  • solution doesn't scale

How can we solve it "at home"?

  • better synchronization - use a ConcurrentMap
  • implement size limit and check it during inserts
  • run a cleanup job periodically
  • scaling is hard!
  • Collections.synchronizedMap()
    • From Java 5 you can use the ConcurrentHashMap
  • solution limited by a memory of one running JVM
  • complexity growing:
    • tracking age of the entries
    • scheduling extra jobs
  • what we throw away if we reach the limit?

Use in-memory data grid, Luke!

 

 

It scales!

What is the In-Memory Data Grid then?

 

Distributed system,
which holds data structures in RAM
among multiple servers.

Distributed partitioned hash map with every cluster node owning a portion of the overall data -- Ignite

Why distributed cache?

There are 3 main reasons ....

Scalability

  • Scalability - just start another server to increase the available memory;

Robustness

  • Robustness - in default configuration every stored piece of data has its primary location within the cluster and a backup on another server

Performance

  • Performance - data lives in memory, usually in the same data center (so network hops are cheap enough);

The scale of latencies:

Why IMDG:

  • local cache, clustered cache, remote cache
  • clustering for your application
  • geographical backup

  • Massive heap

  • High availability
  • Scalability
  • Data distribution
  • Clients for popular programming languages
  • Distributed computing

Bunch of Java friendly IMDGs is out there

  • Apache Ignite
  • Hazelcast IMDG
  • Infinispan
  • ...
  • I've picked this three free ones, because they are popular enough and I have a personal biases
    • I like Apache Foundation and projets they are covering
    • I like Hazelcast the most of course - we are the fastest :)
    • I worked for Red Hat before and I used Infinispan (within WildFly AS) even before knowing anything about the Hazelcast
  • sorted alphabetically here and not by my preference

Deployment types

What are the Topologies / Deployment types of IMDGs?

2 basic deployment types

  • Embedded: Great for microservices and ops simplification
  • Client-server:
    • Great for scale-up / scale-out deployments
    • cluster lifecycle decoupled from app servers
    • Can be used for non-Java clients.

Cache modes

we can also group the usage by locations where we actually store the data

Data partitioning / cache modes

  • Distributed
    • not so fast access as in replicated scenarios
    • if a member fails data has to be rebalanced
    • relatively fast write access
    • scales well - it's not so important if you start 5 or 50 members, they know where the data belongs
  • Replicated
    • faster read access
    • zero cost of failing member
    • expensive write access
    • doesn't scale - every change has to propagate to all members

Hybrid topology

  • you can also have cluster members which don't hold any data
    • lite members in Hazelcast terminology
    • client nodes in Ignite terminology

Java data structures distributed

java.util.Map

  • object that maps keys to values
  • can't contain duplicate keys

Example

Map<String, Integer> cityInhabitants = new ConcurrentHashMap<>();
cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);
//...
System.out.println("London population: " + cityInhabitants.get("London"));

Map in Embedded Infinispan

DefaultCacheManager manager = new DefaultCacheManager(
GlobalConfigurationBuilder.defaultClusteredBuilder().build());
Configuration configuration = new ConfigurationBuilder()
.clustering()
.cacheMode(CacheMode.DIST_SYNC)
.build();
manager.defineConfiguration("cityInhabitants", configuration);
Map<String, Integer> cityInhabitants = manager.getCache("cityInhabitants");
cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);
//...
System.out.println("London population: " + cityInhabitants.get("London"));

Enterprise-like configuration, but I still remember configuring EJBs on JBoss AS in version 3, so this one is easy-peasy

A CacheManager is the primary mechanism for retrieving a Cache instance and is often used as a starting point to using the Cache.

CacheManagers are heavyweight objects, and we foresee no more than one CacheManager being used per JVM (unless specific configuration requirements require more than one; but either way, this would be a minimal and finite number of instances).

Eviction and Expiration

  • supported, but vendor-specific

Map in Embedded Hazelcast IMDG

HazelcastInstance hz = Hazelcast.newHazelcastInstance();
Map<String, Integer> cityInhabitants = hz.getMap("cityInhabitants");
cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);
//...
System.out.println("London population: " + cityInhabitants.get("London"));

The configuration is more straight forward here.

Each vendor provides its specific extension of the Map API. Check the specific return types of methods used to retrieve the Map

Cache in Embedded Apache Ignite

Ignite ignite = Ignition.start();
IgniteCache<String, Integer> cityInhabitants = ignite.getOrCreateCache("cityInhabitants");
cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);
//...
System.out.println("London population: " + cityInhabitants.get("London"));

Maybe you've realized here, I don't use the java.util.Map as a cache type here.

Cache in Embedded Apache Ignite

Ignite ignite = Ignition.start();
IgniteCache<String, Integer> cityInhabitants = ignite.getOrCreateCache("cityInhabitants");
cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);
//...
System.out.println("London population: " + cityInhabitants.get("London"));
  • The IgniteCache doesn't implement java.util.Map!

Maybe you've realized here, I don't use the java.util.Map as a cache type here.

It doesn't implement Map, it implements JCache

Cache in Embedded Apache Ignite

Ignite ignite = Ignition.start();
IgniteCache<String, Integer> cityInhabitants = ignite.getOrCreateCache("cityInhabitants");
cityInhabitants.put("Istanbul", 15_067_724);
cityInhabitants.put("London", 9_126_366);
cityInhabitants.put("Prague", 1_308_632);
//...
System.out.println("London population: " + cityInhabitants.get("London"));
  • The IgniteCache doesn't implement java.util.Map!
  • The IgniteCache implements javax.cache.Cache!

Maybe you've realized here, I don't use the java.util.Map as a cache type here.

It doesn't implement Map, it implements JCache

JCache specification (JSR-107)

  • Cache is a Map-like structure
  • standard API - independent on vendor implementations

JCache uses the top-level package name of javax.cache, and defines the following five core interfaces:

  • Cache. This defines access to the actual cache, which is a map-like data structure that stores key-value pairs.
  • Entry. This defines access to the key-value pairs stored in the cache.
  • CacheManager. This defines how caches are managed.
  • CachingProvider. This defines how CachingManagers are managed.
  • ExpiryPolicy. This defines how expiration is handled for entries.

JCache specification (JSR-107)

  • Cache is a Map-like structure
  • standard API - independent on vendor implementations

Maven dependency

<dependency>
<groupId>javax.cache</groupId>
<artifactId>cache-api</artifactId>
<version>${version.jcache}</version>
</dependency>

JCache uses the top-level package name of javax.cache, and defines the following five core interfaces:

  • Cache. This defines access to the actual cache, which is a map-like data structure that stores key-value pairs.
  • Entry. This defines access to the key-value pairs stored in the cache.
  • CacheManager. This defines how caches are managed.
  • CachingProvider. This defines how CachingManagers are managed.
  • ExpiryPolicy. This defines how expiration is handled for entries.

!=

There are two defined mechanisms in which an entry can be stored in a cache. The default mechanism is called store-by-value, in which the key-value pairs are stored in the cache, and new copies of the entries are made and returned when accessed from the cache. The other (optional) mechanism is store-by-reference, in which the cache stores and returns reference to application-provided key-value pairs. This lets updates to the application-provided key-value pairs to be seen in subsequent accesses without having to update the cache entries themselves.

More data structures

Similar to creating or getting a distributed Map there are other standard data structures available in the distributed world.

In this table the colors have following meaning

  • green - implements the standard API
  • yellow - doesn't implement or extend the Java API, but it provides similar implementation
  • red - the data structure is not implemented by the IMDG
  • java.util.Set - A collection that contains no duplicate elements.
  • java.util.List - Ordered collection - a.k.a. sequence.
  • java.util.concurrent.locks.Lock - provides more fine-grained locking mechanisms than synchronized methods and statements
  • java.util.concurrent.Semaphore - contains a set of permits; can be used to limit access to some resource just for given number of threads
  • java.util.concurrent.CountDownLatch - allows to wait until a set of operations being performed in other threads completes.
  • java.util.concurrent.BlockingQueue - data structure which fits to producer-consumer scenarios; if bounded then both sides may block, otherwise only consumers
  • java.util.concurrent.AtomicLong - Long implementation with atomic updates

And there is more ...

  • Task execution
  • Messaging
  • Transactions
  • Stream and Batch processing

Demo time

  • All 3 IMDGs offer Docker images, let's use them
docker run -it --rm apacheignite/ignite:2.7.6
docker run -it --rm hazelcast/hazelcast:3.12.4
# docker run -it --rm jboss/infinispan-server:10.0.1.Final
# Can you find a difference? (wink, wink)
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
apacheignite/ignite 2.7.6 ce9ff5b69430 2 months ago 527MB
hazelcast/hazelcast 3.12.4 32c507fed571 5 weeks ago 116MB
jboss/infinispan-server 10.0.1.Final 596b3626a09b 5 weeks ago 366MB

Summary

Thank you

github.com/kwart
twitter.com/jckwart
javlog.cacek.cz

Who is Josef Cacek

  • Security Engineer / Java Developer at Hazelcast
  • Father
  • Runner
  • Open-source contributor
  • Hazelcast - a company developing open-source in-memory data grid and also a stream processing engine as a second project
  • 4 children
  • enthusiastic runner / finished several ultramarathons
  • when time and family allow, maintaining several OpenSource projects
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow