Complexity of Exploring T-Interval-Connected Dynamic Graphs

Ahmed Wade

Thursday, October 2, 2014 at 11am, room 25-26/101

Slides

In this talk, I will discuss the study of T-interval-connected dynamic graphs from the point of view of the time needed for their exploration by a mobile entity (agent). A dynamic graph is T-interval-connected (T >= 1) if, for every window of T consecutive time units, there exists a stable connected spanning subgraph. This property of connectivity stability over time was introduced by Kuhn, Lynch and Oshman (STOC 2010).
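
As a concrete reading of the definition, here is a minimal Python sketch (not from the talk) that checks T-interval-connectivity of a dynamic graph given as a sequence of edge sets, using the fact that the edges present throughout a window form the candidate stable spanning subgraph:

```python
def _connected(nodes, edges):
    """Union-find connectivity check on an explicit node set."""
    parent = {v: v for v in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes}) <= 1

def is_t_interval_connected(edge_sets, T):
    """True iff every window of T consecutive steps shares a
    connected spanning subgraph (edges present at every step
    of the window, over all nodes ever seen)."""
    norm = [{frozenset(e) for e in es} for es in edge_sets]
    nodes = {v for es in norm for e in es for v in e}
    for start in range(len(norm) - T + 1):
        stable = set.intersection(*norm[start:start + T])
        if not _connected(nodes, (tuple(e) for e in stable)):
            return False
    return True
```

A graph that keeps a spanning path in every window of length 2 passes the test, while one whose windows share no connected spanning subgraph fails.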

Community detection and Role extraction in networks

Arnaud Browet

Thursday, September 18 at 11am, room 25-26/101

Slides

Network clustering, also known as graph partitioning, has attracted the attention of many researchers during the last decade due to the increasing number of potential applications of network science. However, the amount of collected data makes its analysis a real challenge, and dedicated algorithms must be developed in order to comprehend it. In this talk, I will first introduce recent developments in community detection and present a fast and highly parallelizable algorithm which outperforms existing methods in terms of computational time. Yet a community structure is not always representative of the actual distribution of the nodes in a network. For example, bipartite or cycle graphs often do not contain communities and are in fact more closely represented by what is known as a role structure. In the second part of the talk, I will present a pairwise node similarity measure which makes it possible to extract those role structures, and demonstrate its performance on synthetic and real graphs.

Website: http://perso.uclouvain.be/arnaud.browet/
Thesis: http://perso.uclouvain.be/arnaud.browet/files/thesis/thesis.pdf
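
As a point of comparison for fast community detection (this is the classic label-propagation baseline, not Browet's algorithm), each node repeatedly adopts the most frequent label among its neighbours until labels stabilise; the deterministic tie-break is an implementation choice of this sketch:

```python
import random

def label_propagation(adj, rng=None):
    """Community detection by asynchronous label propagation.
    adj maps each node to a list of neighbours; returns a
    node -> label dict, one label per detected community."""
    rng = rng or random.Random(0)
    labels = {v: v for v in adj}
    changed = True
    while changed:
        changed = False
        order = list(adj)
        rng.shuffle(order)  # random update order, seeded for reproducibility
        for v in order:
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            if not counts:
                continue
            # most frequent neighbour label; ties broken deterministically
            best = max(counts, key=lambda l: (counts[l], str(l)))
            if labels[v] != best:
                labels[v] = best
                changed = True
    return labels
```

On a graph made of a triangle plus a disjoint edge, the two components end up with two distinct labels.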

DNS Monitoring, looking out for anomalies using the time frame Name – IP association

Lautaro Dolberg

Thursday, October 9, 2014 at 10am, room 25-26/101

Slides

DNS is an essential service on the Internet, as it translates human-readable domain names into IP addresses. DNS traffic reflects user activities and behaviours. It is thus a helpful source of information in the context of large-scale network monitoring. In particular, passive DNS monitoring has garnered much interest from a security perspective, as it highlights the services that machines try to access. I am going to present a method for assessing the dynamics of the match between DNS names and IP subnetworks, using an efficient aggregation scheme combined with relevant steadiness metrics. The evaluation relies on real data collected over several months and is able to detect anomalies related to malicious domains.
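
To make the idea tangible, here is a toy sketch (function names and the /24 aggregation level are mine, not the talk's): aggregate the IPs seen for a domain into /24 prefixes per time window, then measure steadiness as the average Jaccard similarity between consecutive windows; a low value flags an unstable Name-IP association:

```python
def subnet24(ip):
    """Aggregate an IPv4 address into its /24 prefix."""
    return '.'.join(ip.split('.')[:3]) + '.0/24'

def steadiness(windows):
    """Average Jaccard similarity of a domain's /24 sets across
    consecutive time windows; 1.0 means a perfectly stable
    Name -> subnet mapping, values near 0 suggest an anomaly."""
    sets = [{subnet24(ip) for ip in w} for w in windows]
    if len(sets) < 2:
        return 1.0
    sims = []
    for a, b in zip(sets, sets[1:]):
        union = a | b
        sims.append(len(a & b) / len(union) if union else 1.0)
    return sum(sims) / len(sims)
```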

Modeling time-varying multilayer networks (and beyond?)

Artur Ziviani

Thursday, October 9, 2014 at 11am, room 25-26/101

Slides

The talk will present recent ongoing work on the modeling of time-varying multilayer networks, developed collaboratively between LNCC in Brazil and ENS-Lyon/INRIA in France. The proposed model has a strong relationship with traditional directed graphs, thus leading to a useful theoretical framework for the analysis of complex networked systems. In the specific case of time-varying graphs, we show that this theoretical framework is a unifying model able to represent several previous (classes of) models for dynamic networks found in the recent literature, which in general are unable to represent each other.
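
One standard way to ground time-varying graphs in plain directed graphs (an illustrative construction; the model presented in the talk differs in its details) is to unfold the graph over (node, time) pairs, with "wait" arcs between consecutive copies of the same node and "traversal" arcs following the edges present at each step:

```python
def tvg_to_digraph(edge_sets):
    """Unfold a time-varying graph, given as a list of undirected
    edge sets (one per time step), into a static directed graph
    over (node, time) pairs. Paths in the result correspond to
    time-respecting journeys in the dynamic graph."""
    arcs = set()
    nodes = {v for es in edge_sets for e in es for v in e}
    for t, es in enumerate(edge_sets):
        for v in nodes:
            arcs.add(((v, t), (v, t + 1)))   # waiting in place
        for u, v in es:
            arcs.add(((u, t), (v, t + 1)))   # traversing the edge
            arcs.add(((v, t), (u, t + 1)))
    return arcs
```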

Link Prediction and Threads in Email Networks

Qinna Wang

2014 International Conference on Data Science and Advanced Analytics (DSAA2014)

We tackle the problem of predicting future links in dynamic networks. For this, we work with the Debian Mailing Lists. In this dataset, a user can post a question to the Debian list and other users can reply to it by email, forming a thread. We show that the number of threads shared in the past between users is a better feature for predicting future email exchanges than classical features such as the number of common neighbors. We also show that the structure of a thread does not match the traditional definition of a community: in particular, a thread has few triangles and many outgoing connections. While the number of shared (detected) communities is also a better feature for predicting future email exchanges than traditional features, it is not as good as the number of shared threads. We believe our work should raise interest in characterizing and detecting thread-like structures in dynamic networks.
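
The thread-based feature can be sketched as follows (an illustrative reconstruction, not the paper's code): count, for every pair of users, how many threads they both participated in, and use that count as the link-prediction score for the pair:

```python
from collections import defaultdict
from itertools import combinations

def shared_thread_counts(threads):
    """Given threads as lists of participating users, return a dict
    mapping each (ordered) user pair to the number of threads they
    shared; higher counts predict future email exchanges."""
    score = defaultdict(int)
    for users in threads:
        for u, v in combinations(sorted(set(users)), 2):
            score[(u, v)] += 1
    return dict(score)
```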

Download

On the importance of considering social capitalism when measuring influence on Twitter

Maximilien Danisch, Nicolas Dugué and Anthony Perez

2014 International Conference on Behavioral, Economic, and Socio-Cultural Computing (BESC 2014)

Influence on Twitter has drawn a lot of attention these past few years, since this microblogging service is used to share, seek or debate about any kind of information. Several tools providing so-called influence scores have thus been proposed. However, the algorithms behind them are kept secret and it is not clear how they define influence. Yet many users rely on such tools to evaluate, and even try to improve, their influence in the Twitter network. In a recent work, it has been shown that automatic accounts can obtain high influence scores for no intuitive reason. Extending and completing this work, we show that such measures fail to distinguish so-called social capitalists from real, truthful users. This highlights the fact that current scores do not seem to consider the way followers and interactions are obtained on the network. To overcome this issue, we define a classifier that discriminates social capitalists from truthful users. To that aim, we crawled the Twitter network to gather examples of certified social capitalists and regular users, and obtained features related to the profile and behavior of each user. We then use this classifier to weight Klout’s score and adjust influence scores. We also developed an application that makes our classifier usable online. We believe our work should raise the question of the legitimacy of influence on Twitter, and lead to significant improvements in the way it is measured.
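
As an illustration of the idea, and emphatically not the paper's actual classifier, a toy score might combine two profile features often associated with social capitalists: massive following activity and a follower/friend ratio close to one:

```python
def social_capitalist_score(followers, friends):
    """Toy heuristic in [0, 1]: high when the account follows many
    users (mass-following) AND its follower/friend ratio is close
    to 1, two traits typical of social capitalists. Thresholds are
    illustrative assumptions, not values from the paper."""
    if friends == 0:
        return 0.0
    ratio = followers / friends
    mass_follow = min(friends / 10000.0, 1.0)   # saturates at 10k friends
    balance = 1.0 / (1.0 + abs(ratio - 1.0))    # peaks when ratio == 1
    return mass_follow * balance
```

An account following and followed by 20,000 users scores much higher than a celebrity-like account with a million followers and 150 friends.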

Download

Learning a Proximity Measure to Complete a Community

Maximilien Danisch, Jean-Loup Guillaume and Bénédicte Le Grand

2014 International Conference on Data Science and Advanced Analytics (DSAA2014)

In large-scale online complex networks (Wikipedia, Facebook, Twitter, etc.), finding nodes related to a specific topic is a strategic research subject. This article focuses on two central notions in this context: communities (groups of highly connected nodes) and proximity measures (indicating whether nodes are topologically close). We propose a parameterized proximity measure which, given a set of nodes belonging to a community, learns the optimal parameters and identifies the other nodes of this community, called a multi-ego-centered community as it is centered on a set of nodes. We validate our results on a large dataset of categorized Wikipedia pages and on benchmarks, and we show that our approach performs better than existing ones. Our main contributions are (i) a new ergonomic parameterized proximity measure, (ii) the automatic tuning of the proximity’s parameters and (iii) the unsupervised detection of community boundaries.
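
To illustrate what a parameterized proximity measure with learned parameters can look like (an illustrative stand-in, not the paper's measure), one can use a personalised-PageRank proximity whose damping parameter is chosen by leave-one-out evaluation on the known community members:

```python
def ppr(adj, seeds, alpha, iters=50):
    """Personalised-PageRank proximity to a seed set, by power
    iteration. alpha is the damping (walk-continuation) parameter."""
    p = {v: 0.0 for v in adj}
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    for _ in range(iters):
        nxt = {v: (1 - alpha) * restart[v] for v in adj}
        for v in adj:
            share = alpha * p[v] / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        p = nxt
    return p

def best_alpha(adj, known, candidates=(0.3, 0.5, 0.7, 0.85)):
    """Leave-one-out parameter tuning: seed the walk on one known
    member and reward parameters that score the remaining known
    members highly."""
    def quality(a):
        total = 0.0
        for s in known:
            p = ppr(adj, [s], a)
            total += sum(p[v] for v in known if v != s)
        return total
    return max(candidates, key=quality)
```

On two triangles joined by a single edge, the measure seeded on one triangle ranks the third member of that triangle above nodes of the other one.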

Download

Direct Generation of Random Graphs Exactly Realising a Prescribed Degree Sequence

Darko Obradović, Maximilien Danisch

The 6th International Conference on Computational Aspects of Social Networks (CASON2014)

This paper intends to extend the possibilities available to researchers for the evaluation of directed networks using randomly generated graphs. The direct generation of a simple network with a prescribed degree sequence still seems to be an open issue, since the prominent configuration model usually does not realise the degree distribution exactly. We propose such an algorithm, using a heuristic for node prioritisation, and demonstrate that it samples approximately uniformly. In comparison to switching Markov Chain Monte Carlo algorithms, the direct generation of edges allows an easy modification of the linking behaviour in the random graph, introducing for example degree correlations, mixing patterns or community structure. In this way, more specific random graphs can be generated (non-uniformly) in order to test whether specific network features are due to a specific linking behaviour only. The method can also be used to generate series of synthetic benchmark networks with a specific community structure, including hierarchies and overlaps.
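
The paper's directed algorithm is not reproduced here, but the underlying idea of building edges directly while prioritising nodes by residual degree can be sketched with the classic undirected Havel-Hakimi construction, which realises a graphical degree sequence exactly:

```python
def havel_hakimi_graph(degrees):
    """Directly build a simple undirected graph realising the degree
    sequence exactly: repeatedly take the node with the largest
    residual degree and connect it to the next-largest ones.
    Returns the edge set, or None if the sequence is not graphical."""
    residual = list(enumerate(degrees))  # (node, remaining degree)
    edges = set()
    while True:
        residual.sort(key=lambda p: -p[1])
        node, d = residual[0]
        if d == 0:
            return edges                  # all degrees satisfied
        if d > len(residual) - 1:
            return None                   # not enough partners
        residual[0] = (node, 0)
        for i in range(1, d + 1):
            other, r = residual[i]
            if r == 0:
                return None               # partner already saturated
            edges.add(frozenset((node, other)))
            residual[i] = (other, r - 1)
```

Randomising the priority among nodes of equal residual degree is what turns this deterministic construction into a sampler.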

Download

Mining bipartite graphs to improve semantic paedophile activity detection

Raphaël Fournier, Maximilien Danisch

IEEE International Conference on Research Challenges in Information Science 2014 (RCIS2014)

Peer-to-peer (P2P) networks are popular for exchanging large volumes of data over the Internet. Paedophile activity is a very important issue for our society, and some works have recently attempted to gauge the extent of paedophile exchanges on P2P networks. A key issue is to obtain an efficient detection tool, which can decide whether a sequence of keywords is related to the topic or not. We propose to use social network analysis on a large dataset from a P2P network to improve a state-of-the-art filter for paedophile queries. We obtain queries, and thus combinations of words, which are not tagged by the filter but should be. We also perform some experiments to explore whether the original four categories of paedophile queries can be recovered by topological measures only.
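
A minimal sketch of the bipartite idea (names and the 0.5 threshold are illustrative assumptions, not the paper's method): a query missed by the filter becomes a candidate when most of the users who sent it also sent an already-flagged query:

```python
def expand_filter(user_queries, flagged):
    """user_queries maps each query to the set of users who sent it;
    flagged is the set of queries already tagged by the filter
    (assumed to be keys of user_queries). Returns untagged queries
    whose senders mostly also sent a flagged query."""
    candidates = set()
    for q, users in user_queries.items():
        if q in flagged or not users:
            continue
        hits = sum(1 for u in users
                   if any(q2 != q and u in user_queries[q2] for q2 in flagged))
        if hits / len(users) >= 0.5:     # illustrative threshold
            candidates.add(q)
    return candidates
```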

Download

Comparing Pedophile Activity in Different P2P Systems

Raphaël Fournier, Thibault Cholez, Matthieu Latapy, Isabelle Chrisment, Clémence Magnien, Olivier Festor and Ivan Daniloff

Soc. Sci. 2014, 3(3), 314-325

Peer-to-peer (P2P) systems are widely used to exchange content over the Internet. Knowledge of pedophile activity in such networks remains limited, despite having important social consequences. Moreover, though there are different P2P systems in use, previous academic works on this topic focused on one system at a time and their results are not directly comparable. We design a methodology for comparing KAD and eDonkey, two P2P systems among the most prominent ones and with different anonymity levels. We monitor two eDonkey servers and the KAD network during several days and record hundreds of thousands of keyword-based queries. We detect pedophile-related queries with a previously validated tool and we propose, for the first time, a large-scale comparison of pedophile activity in two different P2P systems. We conclude that there are significantly fewer pedophile queries in KAD than in eDonkey (approximately 0.09% vs. 0.25%).

Download

In-Core Computation of Geometric Centralities with HyperBall: A Hundred Billion Nodes and Beyond

Sebastiano Vigna

Wednesday, May 28, 2014 at 11am, room 25-26/101

Slides

We approach the problem of computing geometric centralities, such as closeness and harmonic centrality, on very large graphs; traditionally this task requires an all-pairs shortest-path computation in the exact case, or a number of breadth-first traversals for approximated computations, but these techniques yield very weak statistical guarantees on highly disconnected graphs. We rather assume that the graph is accessed in a semi-streaming fashion, that is, that adjacency lists are scanned almost sequentially, and that a very small amount of memory (in the order of a dozen bytes) per node is available in core memory. We leverage the newly discovered algorithms based on HyperLogLog counters, making it possible to approximate a number of geometric centralities at a very high speed and with high accuracy. While the application of similar algorithms for the approximation of closeness was attempted in the MapReduce framework, our exploitation of HyperLogLog counters reduces exponentially the memory footprint, paving the way for in-core processing of networks with a hundred billion nodes using “just” 2TiB of RAM. Moreover, the computations we describe are inherently parallelizable, and scale linearly with the number of available cores.
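
The HyperBall iteration can be sketched as follows, with exact Python sets standing in for the HyperLogLog counters (the real algorithm replaces each set by a counter of a few dozen bytes, and set union by a register-wise maximum):

```python
def harmonic_centrality(adj):
    """HyperBall-style computation of harmonic centrality on an
    undirected graph. Each node keeps (an estimate of) its ball;
    nodes newly reached at distance t contribute 1/t."""
    ball = {v: {v} for v in adj}
    harm = {v: 0.0 for v in adj}
    t = 0
    changed = True
    while changed:
        t += 1
        changed = False
        new = {}
        for v in adj:
            b = set(ball[v])
            for u in adj[v]:
                b |= ball[u]        # union of neighbours' balls
            new[v] = b
            grown = len(b) - len(ball[v])
            if grown:
                harm[v] += grown / t
                changed = True
        ball = new
    return harm
```

Each round grows every node's ball by one hop, scanning only adjacency lists, which is what makes the semi-streaming, HyperLogLog-backed version scale to huge graphs.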

The Invention of Concepts in Computer Science

Baptiste Mélès

Thursday, June 12, 2014 at 11am, room 25-26/101

How do computer scientists invent concepts? Computer scientists are sometimes pictured as tinkerers of a new generation, whose concepts, in permanent compromise with the contingencies of matter, lack the purity of supposedly nobler sciences such as logic or mathematics. This is the view we will challenge. In his posthumous work Sur la logique et la théorie de la science (1947), the philosopher and resistance fighter Jean Cavaillès (1903-1944) proposed what can be seen as a grammar of the invention of mathematical concepts, whose two main devices are called paradigm and thematisation. Drawing on the history of the concepts of file and process from CTSS to Unix, we will show that these two devices, and perhaps others, are also at work in the invention of computer science concepts. The invention of computer science concepts would then be just as pure, from a rational point of view, as the invention of mathematical concepts. This will lead us to characterise computer science as an a priori science (that is, one independent of any experience) in its own right, whose contribution is irreducible to logic and mathematics: rather than an a posteriori science of computers, it can be seen as the a priori science of the inscription of rational processes in matter. We will conclude by showing through a few examples how computer science extends far beyond the science of computers.

Network Decomposition into Fixed Points of Degree Peeling

James Abello and François Queyroi

Social Network Analysis and Mining, Springer, 2014, 4 (1), pp.191

Degree peeling is used to study complex networks. It is a decomposition of the network into vertex groups of increasing minimum degree. However, the peeling value of a vertex is non-local in this context since it relies on the number of connections the vertex has to groups above it. We explore a different way to decompose a network into edge layers such that the local peeling value of the vertices on each layer does not depend on their non-local connections with the other layers. This corresponds to the decomposition of a graph into subgraphs that are invariant with respect to degree peeling, i.e. they are fixed points. We introduce a general method to partition the edges of an arbitrary graph into fixed points of degree peeling, called the iterative-edge-core decomposition. Information from this decomposition is used to formulate a novel notion of vertex diversity based on Shannon’s entropy. We illustrate the usefulness of this decomposition on a variety of social networks including weighted graphs. Our method can be used as a preprocessing step for community detection and graph visualization.
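
A minimal sketch of the iterative idea (simplified from the paper: here a layer is simply the edge set of the current maximum core, extracted and removed until no edge remains):

```python
def max_core(edges):
    """Return (k, edge set) of the maximum k-core, found by peeling
    minimum-degree vertices: when the minimum degree first reaches
    a new value m, the remaining graph is exactly the m-core."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    k, best = 0, set()
    while adj:
        m = min(len(nb) for nb in adj.values())
        if m > k:
            k = m
            best = {frozenset((u, w)) for u in adj for w in adj[u]}
        v = min(adj, key=lambda x: len(adj[x]))   # peel a min-degree vertex
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return k, best

def iterative_edge_core(edges):
    """Partition the edges into layers by repeatedly extracting the
    edges of the maximum core and recomputing on what remains."""
    edges = {frozenset(e) for e in edges}
    layers = []
    while edges:
        k, core_edges = max_core(edges)
        layers.append((k, core_edges))
        edges -= core_edges
    return layers
```

On a triangle with a pendant edge, the triangle comes out as a 2-core layer and the pendant edge as a 1-core layer.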

Download

Dynamic Networks: From Data to Models

Anh-Dung Nguyen

Friday, May 9, 2014 at 2pm, room 25-26/101

Slides

The science of dynamic networks is a very recent research area, yet it covers a wide range of applications, from biological and social networks to computer networks such as mobile ad hoc networks and DTNs. These networks are characterised by their spatio-temporal evolution: nodes and links appear and disappear over time, in contrast to static networks, where nodes and links are fixed. As a consequence, classical models for static networks are unsuitable, or fail to explain phenomena, properties or processes such as information diffusion in these networks. This seminar will present new models that faithfully capture two spatio-temporal characteristics of dynamic networks: the small-world phenomenon and the level of disorder in contacts. We will confront these models with extensive analyses of real traces. We will then show the impact of these characteristics on the information diffusion capacity of such networks. Finally, an application to efficient routing in opportunistic mobile networks will be presented.

Deepening Our Understanding of Social Media via Data Mining

Huan Liu

Friday, May 30, 2014 at 11am, room 25-26/101

Slides

Social media mining differs from traditional data mining in many ways, offering unique opportunities to advance data mining. Social media consists of massive amounts of user-generated content and extensive networked data. As detailed in our latest textbook “Social Media Mining: An Introduction”, we face novel challenges such as “the evaluation dilemma” and “the noise removal fallacy”. We will introduce these challenges and present some recent research issues we have encountered: a big-data paradox unique to social media, in which many social networking sites are present but only minimal information is available, and the question of whether distrust is relevant and useful in social media mining. We will exemplify the intricacies of social media data, and show how to exploit the unique characteristics of social media data in developing novel algorithms and tools for social media mining. The textbook’s PDF is freely available at http://dmml.asu.edu/smm/

Dr. Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He obtained his Ph.D. in Computer Science at the University of Southern California and his B.Eng. in CSEE at Shanghai JiaoTong University. At Arizona State University, he was recognized for excellence in teaching and research in Computer Science and Engineering and received the 2014 President’s Award for Innovation. His research interests are in data mining, machine learning, social computing, and artificial intelligence, investigating interdisciplinary problems that arise in many real-world, data-intensive applications with high-dimensional data of disparate forms such as social media. His well-cited publications include books, book chapters, encyclopedia entries as well as conference and journal papers. He is a co-author of Social Media Mining: An Introduction (Cambridge University Press). He serves on journal editorial boards and numerous conference program committees, and is an IEEE Fellow and a member of several professional societies. http://www.public.asu.edu/~huanliu

Dynamic Contact Network of a Hospital

Christophe Crespelle

Friday, May 9, 2014 at 11am, room 25-26/101

Slides

We analyse a fine-grained trace of contact data collected over 6 months on the entire population of a rehabilitation hospital. We investigate both the graph structure of the average daily contact network and the temporal structure of the evolution of contacts in the hospital. Our main results are to unveil striking properties of these two structures in the considered hospital, and to present a methodology that can be used for analysing any dynamic complex network whose nodes are classified into groups.

Information Diffusion in Complex Networks: Measurement-Based Analysis Applied to Modelling

Daniel Bernardes

Friday, March 21, 2014 at 2pm, room 25-26/105

Understanding information diffusion on complex networks is a key issue from both a theoretical and an applied perspective. Epidemiology-inspired SIR models have been proposed to model information diffusion. Recent papers have analyzed this question from a data-driven perspective, using online diffusion data. We follow this approach, investigating whether epidemic models, calibrated with a systematic procedure, are capable of reproducing key structural properties of spreading cascades. We first identified a large-scale, rich dataset from which we can reconstruct both the diffusion trail and the underlying network. Secondly, we examined the simple SIR model as a baseline and concluded that it is unable to generate structurally realistic spreading cascades. We extended this result to model variants which take into account heterogeneities observed in the data. In contrast, similar models which take into account temporal patterns (which can be estimated from the interaction data) generate qualitatively more realistic cascades. Although one key property was not reproduced by any model, this result highlights the importance of temporal patterns in modelling diffusion phenomena. We also analyzed the impact of the underlying network topology on the structure of synthetic spreading cascades. We simulated spreading cascades under the same conditions as the real cascades observed in our dataset, namely with the same time constraints and the same “seeds”. Using a sequence of uniformly random graphs derived from the real graph, with increasing structural complexity, we examined the impact of key topological properties for the models presented previously. We show that in our setting, the distribution of the number of neighbors of seed nodes is the most impactful property among those investigated.
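
The baseline SIR cascade generation can be sketched as follows (a generic SIR-on-a-network simulation with an illustrative infection probability beta; the paper's calibration procedure is not reproduced): each infected node gets one chance to infect each susceptible neighbour, then recovers, and the cascade is recorded as an infection tree:

```python
import random

def sir_cascade(adj, seed, beta, rng=None):
    """Simulate one SIR spreading cascade from a single seed node.
    adj maps each node to its neighbours; beta is the per-contact
    infection probability. Returns the infection tree as a
    {child: parent} dict (the seed's parent is None)."""
    rng = rng or random.Random(0)
    tree = {seed: None}
    frontier = [seed]
    while frontier:
        nxt = []
        for v in frontier:              # v is infectious this round
            for u in adj[v]:
                if u not in tree and rng.random() < beta:
                    tree[u] = v         # u is infected by v
                    nxt.append(u)
        frontier = nxt                  # previous frontier recovers
    return tree
```

Comparing the shape of such synthetic trees with the real cascades is exactly the kind of structural check the talk describes.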

“Going Viral” and the Structure of Online Diffusion

Sharad Goel

Thursday, March 20, 2014 at 2pm, room 25-26/101

New products, ideas, norms and behaviors are often thought to propagate through a person-to-person diffusion process analogous to the spread of an infectious disease. Until recently, however, it has been prohibitively difficult to directly observe this process, and thus to rigorously quantify or characterize the structure of information cascades. In one of the largest studies to date, we describe the diffusion structure of billions of events across several domains. We find that the vast majority of cascades are small, and are characterized by a handful of simple tree structures that terminate within one degree of an initial adopting “seed.” While large cascades are extremely rare, the scale of our data allows us to investigate even the one-in-a-million events. To study these rare, large cascades, we develop a formal measure of what we label “structural virality” that interpolates between two extremes: content that gains its popularity through a single, large broadcast, and content that grows via a multi-generational cascade in which any one individual is directly responsible for only a fraction of the total adoption. We find that the very largest observed events nearly always exhibit high structural virality, providing some of the first direct evidence that many of the most popular products and ideas grow through person-to-person diffusion. However, medium-sized events (having thousands of adopters) exhibit surprising structural diversity, and are seen to grow both through broadcast and viral means. Finally, we show that our empirical results are largely consistent with an SIR model of contagion on a scale-free network, reminiscent of previous work on the long-term persistence of computer viruses.
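
The “structural virality” measure is the average shortest-path distance over all pairs of nodes in the diffusion tree, which can be computed directly by BFS from every node:

```python
from collections import deque

def structural_virality(tree_adj):
    """Average pairwise shortest-path distance in a diffusion tree,
    given as an (undirected) adjacency dict. A one-hop broadcast
    star scores near its minimum; a long multi-generational chain
    scores much higher."""
    nodes = list(tree_adj)
    n = len(nodes)
    total = 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            v = q.popleft()
            for u in tree_adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    q.append(u)
        total += sum(dist.values())
    return total / (n * (n - 1))   # average over ordered pairs
```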

Impact of Network Dynamics on Some Distributed Algorithmic Problems, and a Classification of Dynamic Graphs

Arnaud Casteigts

Friday, May 2, 2014 at 11am, room 25-26/101

Slides

Starting from a few basic problems in distributed algorithmics (broadcast, election, counting, spanning trees), I will discuss the impact that network dynamics can have on these problems in terms of redefinitions, necessary conditions, sufficient conditions, and so on. This will allow us to review a few classes of dynamic graphs, which will then be placed in a broader context comprising about twenty classes. The discussion will then move beyond distributed algorithmics to consider the dynamics of graphs in general.

RankMerging: A Supervised Learning Method to Predict Links in a Social Network

Lionel Tabourier

Friday, April 11, 2014 at 11am, room 25-26/101

In this talk, I will present a supervised learning method for link prediction in social networks, and more precisely for detecting links that were missed during data collection. To illustrate the method, we use a CDR (Call Detail Record) covering about 1 million mobile phone users, and we simulate the situation of a phone operator: the operator knows about the calls between its own customers, and between its customers and the customers of competitors. But access to the interactions among its competitors' customers would also be valuable, because churn is closely related to the structure of a user's social network. This task is difficult, however: it amounts to predicting unobserved relationships in a context where the prediction classes are highly imbalanced: many links are possible, but few exist. This is why classical unsupervised methods, which rank node pairs using various structural features of the network, perform poorly in this context. I will describe RankMerging, a simple and computationally cheap supervised learning method that aggregates the rankings obtained from different sources of information to improve prediction performance. The operator learns the parameters on its own customers' data and then applies them to its competitors' customers. The method is well suited to our situation: we do not aim for very high precision on a small number of predictions, but rather for a good trade-off over a large part of the precision-recall space, allowing the operator to adjust its strategy. I will then discuss the case of ego-centred networks, for which this tool is also relevant. Indeed, when one only has access to the interactions of a node with its immediate neighbours, the structural information is very poor, and other sources of information must therefore be found and aggregated. Here, we will discuss how the temporal patterns of interactions can be exploited as a source of information to improve prediction performance.
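
The aggregation step can be illustrated with a simplified, weighted Borda-style merge (a stand-in for RankMerging's actual learned aggregation): each source ranking contributes points to a candidate pair according to its position, scaled by a weight learned on the operator's own customers:

```python
def merge_rankings(rankings, weights):
    """Merge several ranked lists of candidate node pairs into one
    ranking. Each list of length n gives its top item n points,
    its second item n-1 points, and so on, scaled by that list's
    weight; items are returned by decreasing total score."""
    score = {}
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for pos, item in enumerate(ranking):
            score[item] = score.get(item, 0.0) + w * (n - pos)
    return sorted(score, key=lambda item: -score[item])
```

With two source rankings and weights 2.0 and 1.0, the pair ranked first by the heavier source wins the merged ranking.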