Part 2. Twitter graph analysis with Gremlin

Mathieu Guglielmino
May 3, 2018 · 6 min read

In the previous article I explained how to build a graph database of the Twitter stream using twitter4j and Apache TinkerPop on HGraphDB. If you are not familiar with these concepts, I suggest you first read that piece.

Today, I will show how to compute a few statistics on the graph using Gremlin, the traversal language of Apache TinkerPop, and Gelly, the graph processing engine of Apache Flink, in a Java application.

Prerequisites: Java, the Gremlin console.

Part 1. What is Gremlin and how to use it?

On the website, we can read “Apache TinkerPop™ is a graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP)”. It is widely used and numerous graph databases are built on it (Titan, JanusGraph, HGraphDB, …). Gremlin is a traversal language, meaning it will “walk” along the graph given certain instructions.

Since we are doing experiments in this part, we will use the Gremlin console. To open the HGraphDB database, we first install the required plugins, then load the graph:

gremlin> :install org.apache.hbase hbase-client 1.4.1
gremlin> :install org.apache.hbase hbase-common 1.4.1
gremlin> :install org.apache.hadoop hadoop-common 2.5.1
gremlin> :install io.hgraphdb hgraphdb 1.0.5
gremlin> :plugin use io.hgraphdb
gremlin> graph = HBaseGraph.open("tweets_graph", "127.0.0.1", "/hbase")
gremlin> g = graph.traversal()

The last line instantiates a traversal source on the graph, which you can picture as a walker moving along vertices and edges.

Let’s remember the schema of our graph. If after streaming for a while I wanted to know how many types of vertices I have, I would do:

gremlin> g.V().groupCount().by(label)
==>[tweet:23912,source:285,user:16174,url:2819,hashtag:1882]
gremlin> g.E().groupCount().by(label)
==>[HAS_LINK:6694,QUOTED_STATUS:4094,POSTED_VIA:22303,POSTED:22303,HAS_TAG:18483,MENTIONS:22799,IN_REPLY_TO:1315,RETWEETED_STATUS:15994]

The groupCount() / by() steps group the vertices (or edges) by the given key and count them. It is equivalent to:

gremlin> g.V().group().by(label).by(count())
==>[tweet:23912,source:285,user:16174,url:2819,hashtag:1882]

For example, in this case I have 23,912 tweets, of which 15,994 are retweets and 22,799 mention another user, posted by 16,174 unique users.
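The semantics of groupCount() are easy to reproduce in plain Java, which helps when reasoning about what the traversal returns. A minimal stdlib sketch (the label list below is invented, not the real graph):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class GroupCountDemo {
    // Mimic g.V().groupCount().by(label) on a plain list of vertex labels.
    static Map<String, Long> groupCountByLabel(List<String> labels) {
        return labels.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Toy data for illustration only
        List<String> labels = List.of("tweet", "tweet", "user", "hashtag", "tweet", "user");
        Map<String, Long> counts = groupCountByLabel(labels);
        System.out.println(counts); // e.g. {tweet=3, user=2, hashtag=1} (map order unspecified)
    }
}
```

Like the Gremlin step, the result is a map from key to count; this is exactly the shape printed by the console above.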

To find the users with the highest number of followers, we look only at 'user' vertices, which we order() / by() on 'followers_count'. At this point we have a list of vertices, for which it would be nice to list the screen name and followers count together. With a simple .values(), each value is printed on its own line; wrapping the values in local(... fold()) groups them per vertex for nicer printing:

gremlin> g.V().hasLabel('user').order().by('followers_count', decr).local(values('screen_name', 'followers_count').fold()).limit(10)
==>[BarackObama,102068344]
==>[ladygaga,78435859]
==>[TheEllenShow,77806502]
==>[YouTube,72084849]
==>[Twitter,62938592]
==>[cnnbrk,54916606]
==>[realDonaldTrump,51220952]
==>[Oprah,42487869]
==>[nytimes,41947275]
==>[CNN,40284454]

The frequency of each hashtag is trickier: we have to count the in-degree of each hashtag vertex, since there is an edge between a tweet and a hashtag if and only if the tweet mentions the hashtag. The out / in steps walk from vertex to vertex along an edge, while outV / inV / outE / inE walk between a vertex and an edge.

I may want to know how many times the #MacronTrump hashtag has been used:

gremlin> g.V().has('hashtag', 'tag', 'MacronTrump').in('HAS_TAG').count()
==>1048

The same on the entire graph:

gremlin> g.V().hasLabel('hashtag').project("h", "in_degree").by('tag').by(inE('HAS_TAG').count()).order().by(select("in_degree"), decr).limit(10)
==>[h:Syrie,in_degree:1813]
==>[h:Russie,in_degree:1315]
==>[h:MacronTrump,in_degree:1048]
==>[h:Macron,in_degree:863]
==>[h:OIAC,in_degree:718]
==>[h:Daesh,in_degree:569]
==>[h:Poutine,in_degree:525]
==>[h:Décla,in_degree:463]
==>[h:Trump,in_degree:379]
==>[h:ArabieSaoudite,in_degree:307]

project followed by by() stores each step's result under the given key. Here, I store the tag under 'h' and the count of incoming HAS_TAG edges under 'in_degree', then order the results by 'in_degree' in decreasing order.
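The ordering part of that query, viewed on its own, is just "sort a tag-to-count map by value, descending, keep the top n". A stdlib sketch of that step (the counts below are taken from the output above, truncated):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopHashtags {
    // Mimic order().by(..., decr).limit(n) over a tag -> in-degree map.
    // A LinkedHashMap preserves the sorted insertion order.
    static Map<String, Long> topN(Map<String, Long> inDegrees, int n) {
        Map<String, Long> out = new LinkedHashMap<>();
        inDegrees.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .forEach(e -> out.put(e.getKey(), e.getValue()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> inDegrees = Map.of("Syrie", 1813L, "Russie", 1315L, "Trump", 379L);
        System.out.println(topN(inDegrees, 2)); // {Syrie=1813, Russie=1315}
    }
}
```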

The number of original tweets posted by a user is not available directly, but we can derive it:

gremlin> g.V().has('user', 'screen_name', 'NewsPolitiques_').out().out().hasLabel('tweet').count()
==>93
gremlin> g.V().has('user', 'screen_name', 'NewsPolitiques_').out().outE().hasLabel('RETWEETED_STATUS').count()
==>70
gremlin> g.V().has('user', 'screen_name', 'NewsPolitiques_').out().outE().hasLabel('QUOTED_STATUS').count()
==>23
gremlin> g.V().has('user', 'screen_name', 'NewsPolitiques_').out().outE().hasLabel('IN_REPLY_TO').count()
==>0

The user @NewsPolitiques_ posted 93 tweets in the period; 70 of them are retweets and 23 are quoted statuses. This is not a very creative user, and most likely a content aggregator.

Last but not least, we may want to know how users interact with each other, and how many times they retweet one another.

Part 2. User interactions

In such a graph we are interested in interactions between users: who retweets whom? Who mentions whom? Which users retweet another user the most? All of this can be done with traversal queries.

Let's consider the user PoutineFrance:

  • who does she retweet the most?
gremlin> g.V().has("user", "screen_name", "PoutineFrance").as("u1").out().out("RETWEETED_STATUS").in("POSTED").as("u2").select("u1", "u2").by("screen_name").groupCount().by(values).unfold().order().by(values, decr)
==>[PoutineFrance, sputnik_fr]=25
==>[PoutineFrance, RTenfrancais]=20
==>[PoutineFrance, Barberis888]=15
  • who retweets her the most?
gremlin> g.V().has("user", "screen_name", "PoutineFrance").as("u1").out("POSTED").in("RETWEETED_STATUS").in("POSTED").as("u2").select("u1", "u2").by("screen_name").groupCount().by(values).unfold().order().by(values, decr)
==>[PoutineFrance, fandetv]=3
==>[PoutineFrance, maralpoutine]=2
==>[PoutineFrance, FollowMeCalmly]=2
  • who mentions her the most?
gremlin> g.V().has("user", "screen_name", "PoutineFrance").in("MENTIONS").in("POSTED").groupCount().by("screen_name").unfold().order().by(values, decr)
==>FollowMeCalmly=11
==>fllx89fllx=11
==>VotezPoisson=11
  • what are the most used hashtags?
gremlin> g.V().has("user", "screen_name", "PoutineFrance").out().out("HAS_TAG").groupCount().by("tag").unfold().order().by(values, decr)
==>Syrie=14
==>Russie=5
==>OIAC=5

Part 3. From OLTP to OLAP

In graph processing, two families often pop up: OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing). The queries we have seen so far are OLTP: their execution is almost instantaneous because they explore only a fraction of the graph.

OLAP workloads are more complex, a PageRank computation being a typical example. In our case, the graph is too complex for the PageRank algorithm to be run on directly, so we need to simplify it. One way is to compute the retweet network, where each node is a user and the directed edge from user1 to user2 is weighted by the number of times user1 retweeted user2.

We have seen previously how to compute the number of retweets between two users; we only have to adapt the query a bit to run it on the entire graph:

gremlin> g.V().hasLabel("user").as("u1").
out().out("RETWEETED_STATUS").
in("POSTED").as("u2").
select("u1", "u2").by("screen_name").
groupCount().by(values).
unfold().order().by(values, decr).limit(20)
==>[JanJansemn, IntropaJacques]=28
==>[NewsPolitiques_, IntropaJacques]=26
==>[mikaeloff2, mikaeloff2]=26
==>[languillem, sputnik_fr]=26

The last line is for printing; the core ends at the groupCount, which yields a (key, value) map where the key is of the form [user1, user2] and the value is the number of times user1 retweeted user2.
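Such a map with a composite [user1, user2] key is easy to build in plain Java, because List equality is by content. A stdlib sketch with invented user names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RetweetPairs {
    // Build a [retweeter, retweeted] -> count map from observed pairs,
    // the same shape as the groupCount() result above.
    static Map<List<String>, Long> countPairs(List<List<String>> retweets) {
        Map<List<String>, Long> counts = new HashMap<>();
        for (List<String> pair : retweets) {
            counts.merge(pair, 1L, Long::sum); // increment, inserting 1 on first sight
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy data: alice retweeted bob twice, bob retweeted alice once
        List<List<String>> retweets = List.of(
                List.of("alice", "bob"),
                List.of("alice", "bob"),
                List.of("bob", "alice"));
        System.out.println(countPairs(retweets)); // [alice, bob] -> 2, [bob, alice] -> 1
    }
}
```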

What we want to do is to isolate these relationships in order to compute the PageRank and infer a hierarchy between the users. Let’s see how to do that from a Java application.
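As a preview of what that PageRank would look like, here is a weighted PageRank sketch in plain Java over a tiny hand-built retweet map, not the real data and not a TinkerPop or Gelly API (user names and weights are taken loosely from the output above):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RetweetPageRank {
    // edges.get(u1).get(u2) = number of times u1 retweeted u2.
    static Map<String, Double> pageRank(Map<String, Map<String, Integer>> edges,
                                        double damping, int iterations) {
        // Collect every node appearing as a source or a target
        Set<String> nodes = new HashSet<>(edges.keySet());
        edges.values().forEach(m -> nodes.addAll(m.keySet()));
        int n = nodes.size();
        Map<String, Double> rank = new HashMap<>();
        nodes.forEach(v -> rank.put(v, 1.0 / n)); // uniform start
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            nodes.forEach(v -> next.put(v, (1 - damping) / n)); // teleport term
            for (var src : edges.entrySet()) {
                // Each node spreads its rank proportionally to edge weights
                int outWeight = src.getValue().values().stream().mapToInt(Integer::intValue).sum();
                for (var dst : src.getValue().entrySet()) {
                    double share = rank.get(src.getKey()) * dst.getValue() / (double) outWeight;
                    next.merge(dst.getKey(), damping * share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> edges = Map.of(
                "JanJansemn", Map.of("IntropaJacques", 28),
                "NewsPolitiques_", Map.of("IntropaJacques", 26, "sputnik_fr", 5));
        System.out.println(pageRank(edges, 0.85, 30)); // IntropaJacques gets the highest rank
    }
}
```

This toy version ignores dangling-node handling, which a production PageRank (like Gelly's) takes care of; it is only meant to show the hierarchy the retweet weights induce.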

Part 4. TinkerPop in a Java application

The base graph is complex, and we may want to extract subgraphs such as the retweet graph (nodes are users, directed edges are weighted with the number of retweets between users), the co-hashtag graph (hashtags that occur together), etc.

When working with Gremlin in a Java application there are a few differences to be aware of, since the console simplifies the syntax. As a practical application, let's build the co-hashtag graph.

First, we want to compute the co-occurrence frequency of hashtags. For example, to find the hashtags that most often occur together with "syria", we would do:

gremlin> g.V().has("hashtag", "tag", "syria").as("h1").
in("HAS_TAG").out("HAS_TAG").where(neq("h1")).as("h2").
select("h1", "h2").by("tag").
groupCount().by(values)
==>[[syria,nordstream2]:1,[syria,sochi]:1,[syria,iranoutofsyria]:1,[syria,france]:3,[syria,frencharmy]:1,[syria,ukraine]:1,[syria,assad]:3]

We obtain a Map of type <Object, Long>. Let's look at the Java code:

TinkerGraph tg = TinkerGraph.open();
tg.io(IoCore.graphml()).readGraph(INPUT_FILE);
GraphTraversalSource g = tg.traversal();
// Map of co-occurring hashtags
Map<?, Long> co_hashtags = g.V().hasLabel("hashtag").as("h1").
        in("HAS_TAG").out("HAS_TAG").where(P.neq("h1")).as("h2").
        select("h1", "h2").by("tag").
        groupCount().by(Column.values).next();

Note that to access neq and values you need to explicitly write P.neq and Column.values, and call next() at the end to actually execute the traversal and retrieve the result. From this map we will build another graph containing only the hashtags and their co-occurrence frequencies.

TinkerGraph hash_graph = TinkerGraph.open();
GraphTraversalSource hash_g = hash_graph.traversal();
for (Map.Entry<?, Long> entry : co_hashtags.entrySet()) {
    ArrayList hashtags = (ArrayList) entry.getKey();
    Long weight = entry.getValue();
    // Traversals pointing at the corresponding hashtag vertices, if present
    GraphTraversal<Vertex, Vertex> h1 = hash_g.V().has("hashtag", "tag", hashtags.get(0));
    GraphTraversal<Vertex, Vertex> h2 = hash_g.V().has("hashtag", "tag", hashtags.get(1));
    // If the first hashtag does not exist yet, create it with its total frequency
    if (!h1.hasNext()) {
        hash_graph.addVertex(T.id, hashtags.get(0), T.label, "hashtag",
                "tag", hashtags.get(0),
                "n", g.V().has("hashtag", "tag", hashtags.get(0)).in("HAS_TAG").count().next());
    }
    // Same for the second hashtag
    if (!h2.hasNext()) {
        hash_graph.addVertex(T.id, hashtags.get(1), T.label, "hashtag",
                "tag", hashtags.get(1),
                "n", g.V().has("hashtag", "tag", hashtags.get(1)).in("HAS_TAG").count().next());
    }
    // Add the weighted co-occurrence edge between the two hashtag vertices
    hash_g.V(hashtags.get(0)).as("h1").
            V(hashtags.get(1)).as("h2").
            addE("CO_HASHTAG").property("weight", weight).
            from("h1").to("h2").
            iterate();
}

In this code, we iterate over the map and check whether the hashtags already exist using a GraphTraversal. If they don't, we add them with addVertex.

Here is what we get from my French graph, mainly about politics:

Hashtags with a frequency > 10 and co-occurrence frequency > 5. Node size proportional to PageRank.
