Link weights recovery in heterogeneous information networks

Hông-Lan Botterman, Robin Lamarche-Perrin

In Computational Social Network, 8 (15), 2021

Socio-technical systems usually consists of many intertwined networks, each connecting different types of objects (or actors) through a variety of means. As these networks are co-dependent, one can take advantage of this entangled structure to study interaction patterns in a particular network from the information provided by other related networks. A method is hence proposed and tested to recover the weights of missing or unobserved links in heterogeneous information networks (HIN) – abstract representations of systems composed of multiple types of entities and their relations. Given a pair of nodes in a HIN, this work aims at recovering the exact weight of the incident link to these two nodes, knowing some other links present in the HIN. To do so, probability distributions resulting from path-constrained random walks i.e., random walks where the walker is forced to follow only a specific sequence of node types and edge types, capable to capture specific semantics and commonly called a meta-path, are combined in a linearly fashion in order to approximate the desired result. This method is general enough to compute the link weight between any types of nodes. Experiments on Twitter and bibliographic data show the applicability of the method.

Download

[Re] Speedup Graph Processing by Graph Ordering

Fabrice Lécuyer, Maximilien Danisch, Lionel Tabourier

In ReScience C 7, 1 (3), 2021

Cache systems keep data close to the processor to access it faster than main memory would. Graph algorithms benefit from this when a cache line contains highly related nodes. Hao Wei extitet al. propose to reorder the nodes of a graph to optimise the proximity of nodes on a cache line. Their contribution, Gorder, creates such an ordering with a greedy procedure. In this replication, we implement ten different orderings and measure the execution time of nine standard graph algorithms on nine real-world datasets. We monitor cache performances to show that runtime variations are caused by cache management. We confirm that Gorder leads to the fastest execution in most cases due to cache-miss reductions. Our results show that simpler procedures are yet almost as efficient and much quicker to compute. This replication validates the initial results but highlights that generating a complex ordering like Gorder is time-consuming.

Download

Ranking Online Social Users by their Influence

Anastasios Giovanidis, Bruno Baynat, Clémence Magnien, and Antoine Vendeville

IEEE Transactions on Networking, 2021

We introduce an original mathematical model to analyse the diffusion of posts within a generic online social platform. The main novelty is that each user is not simply considered as a node on the social graph, but is further equipped with his/her own Wall and Newsfeed, and has his/her own individual self-posting and re-posting activity. As a main result using our developed model, we derive in closed form the probabilities that posts originating from a given user are found on the Wall and Newsfeed of any other. These are the solution of a linear system of equations, which can be resolved iteratively. In fact, our model is very flexible with respect to the modelling assumptions. Using the probabilities derived from the solution, we define a new measure of per-user influence over the entire network, the Ψ-score, which combines the user position on the graph with user (re-)posting activity. In the homogeneous case where all users (re-)post with the same rate, it is shown that a variant of the Ψ-score is equal to PageRank. Furthermore, we compare the new model and its Ψ-score against the empirical influence measured from very large data traces (Twitter, Weibo). The results illustrate that these new tools can accurately rank influencers with asymmetric (re-)posting activity for such real world applications.

Temporal Connectivity and Path Computation for Stream Graph

Léo Rannou

EDITE de Paris, LIP6, Thalès SIX – ThereSIS

Keywords:stream graphs, temporal networks, time-varying graphs, dynamic graphs,dynamic networks, interactions, graphs, networks, connected components, temporalpaths, algorithms, link streamsFor a long time, structured data and temporal data have been analysed separately. Many real world complex networks have a temporal dimension, such as contacts between individuals or financial transactions. Graph theory provides a wide set of tools to model and analyze static connections between entities. Unfortunately, this approach does not take into account the temporal nature of interactions.  Stream graph theory is a formalism to model highly dynamic networks in which nodes and/or links arrive and/or leave over time.  The number of applications of stream graph theory has risen rapidly, along with the number of theoretical concepts and algorithms to compute them. Several theoretical concepts such as connected components and temporal paths in stream graphs were defined recently, but no algorithm was provided to compute them.  Moreover, the algorithmic complexities of these problems are unknown, as well as the insight they may shed on real-world stream graphs of interest. In this thesis, we present several solutions to compute notions of connectivity and path concepts in stream graphs. We also present alternative representations – data structures designed to facilitate specific computations – of stream graphs. We provide implementations and experimentally compare our methods in a wide range of practical cases. We show that these concepts indeed give much insight on features of large-scale datasets. Straph, a python library, was developed in order to have a reliable library for manipulating, analysing and visualising stream graphs, to design algorithms and models, and to rapidly evaluate them.

Download

Measuring Diversity in Heterogeneous Information Networks

Pedro Ramaciotti Morales , Robin Lamarche-Perrin, Raphaël Fournier-S’Niehotta, Rémy Poulain, Lionel Tabourier,  Fabien Tarissan

In Theoretical Computer Science, 859, pp 80-115, 2021 

Diversity is a concept relevant to numerous domains of research varying from ecology, to information theory, andto economics, to cite a few. It is a notion that is steadily gaining attention in the information retrieval, networkanalysis, and artificial neural networks communities. While the use of diversity measures in network-structured datacounts a growing number of applications, no clear and comprehensive description is available for the different waysin which diversities can be measured. In this article, we develop a formal framework for the application of a largefamily of diversity measures to heterogeneous information networks (HINs), a flexible, widely-used network dataformalism. This extends the application of diversity measures, from systems of classifications and apportionments,to more complex relations that can be better modeled by networks. In doing so, we not only provide an effectiveorganization of multiple practices from different domains, but also unearth new observables in systems modeled byheterogeneous information networks. We illustrate the pertinence of our approach by developing different applicationsrelated to various domains concerned by both diversity and networks. In particular, we illustrate the usefulness of thesenew proposed observables in the domains of recommender systems and social media studies, among other fields.

Download

Presentation by Esteban Bautista on Fractional PageRank

Esteban Bautista

Lundi 25 Janvier 2021, à 14h Présentiel: salle 25-26-105,  Revoir La Présentation:

Slides

Graph-Based Semi-Supervised Learning (G-SSL) techniques learn from both labelled and unlabelled data to build better classifiers. This classification paradigm has received considerable attention since modern applications allow to collect large amounts of unlabelled but structured data, naturally encoded by a graph, in a relatively easy and inexpensive manner, while annotated data is expensive to obtain.  Despite successes, the performance of G-SSL techniques can still be improved, particularly in challenging data settings or unbalanced scenarios. To address such limitations, I will present a novel G-SSL method based on the positive γ -th powers of the graph Laplacian, referred to as the Lγ-PageRank method. I will present a theoretical analysis of the new method, showing that (i) for γ < 1, it extends the standard PageRank algorithm to Lévy processes: where random walkers can now perform far-distant jumps in a single step; and (ii) for γ > 1, it classifies data on signed graphs: where nodes belonging to one same class are more likely to share positive edges while nodes from different classes are more likely to be connected with negative edges. I will also show the existence of an optimal Laplacian power maximizing performance, for which I will propose an algorithm for its automatic estimation. Lastly, I will show the significant classification improvements allowed by the proposed approach on several real-world datasets commonly used for classification.  

How to Teach the Undecidability of Malware Detection Problem and Halting Problem

Matthieu Journault, Pascal Lafourcade, Malika More, Rémy Poulain, Léo Robert

WISE13: The 13th World Conference on Information Security Education, 2020

Malware detection is a term that is often associated to Computer Science Security. The underlying main problem is called Virus detection and consists in answering the following question: Is there a program that can always decide if a program is a virus or not? On the other hand, the undecidability of some problems is an important notion in Computer Science : an undecidable problem is a problem for which no algorithm exists to solve it. We propose an activity that demonstrates that virus detection is an undecidable problem. Hence we prove that the answer to the above question is no. We follow the proof given by Cohen in his PhD in 1983. The proof is close to the proof given by Turing in 1936 of the undecidability of the Halting problem. We also give an activity to prove the undecidability of the Halting problem. These proofs allow us to introduce two important ways of proving theorems in Computer Science : proof by contradiction and proof by case disjunction. We propose a simple way to present these notions to students using a maze. Our activity is unplugged, i.e. we use only a paper based model of computer, and is designed for high-school students. This is the reason why we use Scratch to write our « programs ».

Download

GraKeL: A Graph Kernel Library in Python

Giannis Siglidis, Giannis Nikolentzos, Stratis Limnios, Christos Giatsidis, Konstantinos Skianis, Michalis Vazirgiannis

Journal of Machine Learning Research 21, 2020

The problem of accurately measuring the similarity between graphs is at the core of many applications in a variety of disciplines. Graph kernels have recently emerged as a promising approach to this problem. There are now many kernels, each focusing on different structural aspects of graphs. Here, we present GraKeL, a library that unifies several graph kernels into a common framework. The library is written in Python and adheres to the scikit-learn interface. It is simple to use and can be naturally combined with scikit-learn’s modules to build a complete machine learning pipeline for tasks such as graph classification and clustering. The code is BSD licensed and is available at: https://github.com/ysig/GraKeL

Download

LouvainNE: Hierarchical Louvain Method for High Quality and Scalable Network Embedding

Ayan Kumar Bhowmick, Koushik Meneni, Maximilien Danisch, Jean-Loup Guillaume and Bivas Mitra

In Proceedings of the 13th ACM International WSDM conference, 2020

Network embedding, that aims to learn low-dimensional vector representation of nodes such that the network structure is preserved, has gained significant research attention in recent years. However, most state-of-the-art network embedding methods are computationally expensive and hence unsuitable for representing nodes in billion-scale networks. In this paper, we present LouvainNE, a hierarchical clustering approach to network embedding. Precisely, we employ Louvain, an extremely fast and accurate community detection method, to build a hierarchy of successively smaller subgraphs. We obtain representations of individual nodes in the original graph at different levels of the hierarchy, then we aggregate these representations to learn the final embedding vectors. Our theoretical analysis shows that our proposed algorithm has quasi-linear run-time and memory complexity. Our extensive experimental evaluation, carried out on multiple real-world networks of different scales, demonstrates both (i) the scalability of our proposed approach that can handle graphs containing tens of billions of edges, as well as (ii) its effectiveness in performing downstream network mining tasks such as network reconstruction and node classification.

Download

KClist++: A Simple Algorithm for Finding k-Clique Densest Subgraphs in Large Graphs

Bintao Sun, Maximilien Danisch, T-H. Hubert Chan and Mauro Sozio

In Proceedings of the VLDB Endowment, 2020

The problem of finding densest subgraphs has received increasing attention in recent years finding applications in biology, finance, as well as social network analysis. The k-clique densest subgraph problem is a generalization of the densest subgraph problem, where the objective is to find a subgraph maximizing the ratio between the number of k-cliques in the subgraph and its number of nodes. It includes as a special case the problem of finding subgraphs with largest average number of triangles (k=3), which plays an important role in social network analysis. Moreover, algorithms that deal with larger values of k can effectively find quasi-cliques. The densest subgraph problem can be solved in polynomial time with algorithms based on maximum flow, linear programming or a recent approach based on convex optimization. In particular, the latter approach can scale to graphs containing tens of billions of edges. While finding a densest subgraph in large graphs is no longer a bottleneck, the k-clique densest subgraph remains challenging even when k=3. Our work aims at developing near-optimal and exact algorithms for the k-clique densest subgraph problem on large real-world graphs. We give a surprisingly simple procedure that can be employed to find the maximal k-clique densest subgraph in large-real world graphs. By leveraging appealing properties of existing results, we combine it with a recent approach for listing all k-cliques in a graph and a sampling scheme, obtaining the state-of-the-art approaches for the aforementioned problem. Our theoretical results are complemented with an extensive experimental evaluation showing the effectiveness of our approach in large real-world graphs.

Download

Strongly Connected Components in StreamGraphs: Computation and Experimentations

Léo Rannou, Clémence Magnien, and Matthieu Latapy

In Proceedings of the 9th International Conference on Complex Networks and their Applications, 2020

Stream graphs model highly dynamic networks in which nodes and/or links arrive and/or leave over time. Strongly connected components in stream graphs were defined recently, but no algorithm was provided to compute them. We present here several solutions with polynomial time and space complexities, each with its own strengths and weaknesses. We provide an implementation and experimentally compare the algorithms in a wide variety of practical cases. In addition, we propose an approximation scheme that significantly reduces computation costs, and gives even more insight on the dataset.

Download

Testing the Impact of Semantics and Structure on Recommendation Accuracy and Diversity

Pedro Ramaciotti Morales, Lionel Tabourier, Raphaël Fournier-S’niehotta

In Proceedings of the IEEE/ACM International Conference on Advances in Social Network Analysis and Mining (ASONAM), 2020 (virtual)

The Heterogeneous Information Network (HIN) formalism is very flexible and enables complex recommendations models. We evaluate the effect of different parts of a HIN on the accuracy and the diversity of recommendations, then investigate if these effects are only due to the semantic content encoded in the network. We use recently-proposed diversity measures which are based on the network structure and better suited to the HIN formalism. Finally, we randomly shuffle the edges of some parts of the HIN, to empty the network from its semantic content, while leaving its structure relatively unaffected. We show that the semantic content encoded in the network data has a limited importance for the performance of a recommender system and that structure is crucial.

Download

Finding Top-k Nodes for Temporal Closeness in Large Temporal Graphs

Pierluigi Crescenzi, Clémence Magnien and Andrea Marino

In Algorithms, 13 (9), 211, 2020

The harmonic closeness centrality measure associates, to each node of a graph, the average of the inverse of its distances from all the other nodes (by assuming that unreachable nodes are at infinite distance). This notion has been adapted to temporal graphs (that is, graphs in which edges can appear and disappear during time) and in this paper we address the question of finding the top-k nodes for this metric. Computing the temporal closeness for one node can be done in O(m) time, where m is the number of temporal edges. Therefore computing exactly the closeness for all nodes, in order to find the ones with top closeness, would require O(nm) time, where n is the number of nodes. This time complexity is intractable for large temporal graphs. Instead, we show how this measure can be efficiently approximated by using a “backward” temporal breadth-first search algorithm and a classical sampling technique. Our experimental results show that the approximation is excellent for nodes with high closeness, allowing us to detect them in practice in a fraction of the time needed for computing the exact closeness of all nodes. We validate our approach with an extensive set of experiments.

Download

Temporal Cliques Admit Sparse Spanners

Jason Schoeters

Mardi 10 Novembre 2020, à 11h en salle 25-26/105, Jussieu

Let G=(V,E) be an undirected graph on n vertices and λ:E →2^N a mapping that assigns to every edge a non-empty set of positive integer labels. These labels can be seen as discrete times when the edge is present. Such a labeled graph {\cal G}=(G,λ) is said to be temporally connected if a path exists with non-decreasing times from every vertex to every other vertex. In a seminal paper, Kempe, Kleinberg, and Kumar (STOC 2000) asked whether, given such a temporal graph, a sparse subset of edges can always be found whose labels suffice to preserve temporal connectivity—a temporal spanner. Axiotis and Fotakis (ICALP 2016) answered negatively by exhibiting a family of Θ(n^2)-dense temporal graphs which admit no temporal spanner of density o(n^2). The natural question is then whether sparse temporal spanners can always be found in some classes of dense graphs. In this paper, we answer this question affirmatively, by showing that if the underlying graph G is a complete graph, then one can always find temporal spanners of density O(n log n). The best known result for complete graphs so far was that spanners of density (n choose 2)−⌊n/4⌋=O(n^2) always exist. Our result is the first positive answer as to the existence of o(n^2) sparse spanners in adversarial instances of temporal graphs since the original question by Kempe et al., focusing here on complete graphs. The proofs are constructive and directly adaptable as an algorithm.

Modèles et algorithmes pour les graphes dynamiques

Mathilde Vernet

Jeudi 22 Octobre 2020 à 11h en salle 25-26/105, Jussieu

Slides

Les problèmes de graphes ont été largement étudiés dans le cas des graphes statiques. Cependant, ces graphes ne permettent pas de prendre en compte la dimension temporelle, qui est souvent une donnée importante pour les situations à modéliser. Les graphes dynamiques viennent combler ces lacunes en permettant de modéliser des évolutions dans le temps. On peut alors s’interroger sur ces mêmes problèmes de graphes dans un contexte dynamique. Cela passe d’abord par la définition du modèle de graphes dynamiques le plus approprié et la modélisation précise du problème sur ces graphes. Lorsque le problème ne peut pas être résolu efficacement en appliquant directement des méthodes connues sur les graphes statiques, il faut alors concevoir un algorithme de résolution spécifique aux graphes dynamiques et l’analyser théoriquement et expérimentalement. En suivant cette démarche, l’objectif de cette thèse est de s’interroger sur l’extension aux graphes dynamiques des problèmes bien connus sur les graphes statiques. Ce travail s’intéresse à plusieurs problèmes de graphes en contexte dynamique en se focalisant sur les aspects algorithmiques et en s’abstrayant des domaines d’applications. Nous nous intéressons d’abord aux problèmes de flot dans les graphes dynamiques et proposons en particulier pour le problème du flot de cout minimum un algorithme polynomial permettant de résoudre le problème de façon optimale pour un modèle de graphe dynamique spécifique. Des problèmes liés à la connexité des graphes dynamiques sont aussi étudiés. Les composantes connexes persistantes, extension des composantes connexes aux graphes dynamiques, traduisent la connexité d’un ensemble de nœuds pendant un certain nombre de pas de temps consécutifs. De façon analogue à la notion de maximalité des composantes connexes dans un graphe statique, une notion de dominance entre composantes connexes persistantes est définie. Un algorithme polynomial permettant d’identifier toutes les composantes connexes persistantes non dominées est proposé. Plusieurs extensions à la définition de composantes connexes persistantes sont étudiées. Nous proposons enfin des extensions possibles du problème de Steiner aux graphes dynamiques. Nous nous concentrons sur un cas particulier et montrons la NP-complétude de ce problème.  

Do you trade with your friends or become friends with your trading partners? A case study in the G1 cryptocurrency

Nicolas Gensollen, Matthieu Latapy

In Applied Network Science, 5 (25), 2020

We study the interplay between social ties and financial transactions made through a recent cryptocurrency called G1. It has the particularity of combining the usual transaction record with a reliable network of identified users. This gives the opportunity to observe exactly who sent money to whom over a social network. This social network is a key piece of this cryptocurrency, which therefore puts much effort in ensuring that nodes correspond to unique, well identified, real living human users, linked together only if they met at least once in real world. Using this data, we study how social ties impact the structure of transactions and conversely. We show that users make transactions almost exclusively with people they are connected with in the social network. Instead, they tend to build social connections with people they will never make transactions with.

Download

The networks underlying collaborative learning and solving

Marc Santolini

Lundi 12 Octobre 2020 à 11h en salle 25-26/105, Jussieu

Slides

In this talk, I will review some of our recent work in understanding collaborative learning and solving using network approaches on large empirical datasets. First, using fine-grained quantitative data from digital lab notebooks of more than 2,000 teams who participated to the science and engineering iGEM competition in the past 10 years, I will exhibit shared aspects of team work, team structure and team dynamics, as well as features underlying team performance and team improvement throughout participations. I will then introduce our ongoing ‘iGEM TIES’ project aimed at mapping high-resolution team interactions in the lab using a bluetooth-enabled smartphone app. I will contrast these results with behavior observed in large, distributed open-source communities from GitHub. Finally, I will introduce our recent work on collaborative learning using fine-grain social data from online forums and phone call records, and show how interaction data can help predict learning outcomes and identify peer influence in performance and engagement. 

Exploration of Interactions for Influence Modelling in Online Social Networks

Monika Rakoczy

Vendredi 31 novembre 2020, à 14h, en salle 26-00/332, Jussieu

Slides

Online social networks are constantly growing in popularity. They enable users to interact with one another and mapping their relations to the virtual world. Users utilize social media platforms as a mean for a rich variety of activities, such as share and exchange information, create relations, and others. Such online human interactions take place within a dynamic environment where we can observe and distinguish many qualities related to relations between users, concerning influential, trusted or popular individuals. In this talk, we will focus on the qualities of users connected to four important concepts: influence, reputation, trust, and popularity, in the scope of SNA for influence modelling. We will examine some of the existing works utilizing these notions and emphasize the most important features that these concepts should include in order to measure them based on the SNA information. Using the notions, we will concentrate on a practical model for influence estimation, called Action-Reaction Influence Model (ARIM). This model considers the type, quality, quantity, and frequency of actions performed by users in SN, and is adaptive to different SN types. Furthermore, we will discuss a notion not yet explored much in SNA discipline — micro-influence, which targets new phenomena of users with a small but highly involved audience, who are observed to be still highly impactful. Finally, we will focus on the quantification of influence over time and representation of influence causal effect. Particularly, we will consider a specific kind of SN – citation network, which is highly time-sensitive. Accordingly, we will discuss another influence estimation model, which determines influence during a particular time period between communities within time-dependent citation networks.

Investigating the Lack of Diversity in User Behavior: the Case of Musical Content on Online Platforms.

Rémy Poulain, Fabien Tarissan

In Information Processing & Management, 57(2), Elsevier, 2020

Whether to deal with issues related to information ranking (e.g. search engines) or content recommendation (on social networks, for instance), algorithms are at the core of processes that select which information is made visible. Such algorithmic choices have a strong impact on users’ activity de facto, and therefore on their access to information. This raises the question of how to measure the quality of the choices algorithms make and their impact on users. As a first step in that direction, this paper presents a framework with which to analyze the diversity of information accessed by users in the context of musical content. The approach adopted centers on the representation of user activity through a tripartite graph that maps users to products and products to categories. In turn, conducting random walks in this structure makes it possible to analyze how categories catch users’ attention and how this attention is distributed. Building upon this distribution, we propose a new index referred to as the (calibrated) herfindahl diversity, which is aimed at quantifying the extent to which this distribution is diverse and representative of existing categories. To the best of our knowledge, this paper is the first to connect the output of random walks on graphs with diversity indexes. We demonstrate the benefit of such an approach by applying our index to two datasets that record user activity on online platforms involving musical content. The results are threefold. First, we show that our index can discriminate between different user behaviors. Second, we shed some light on a saturation phenomenon in the diversity of users’ attention. Finally, we show that the lack of diversity observed in the datasets derives from exogenous factors related to the heterogeneous popularity of music styles, as opposed to internal factors such as recurrent user behaviors.