VisualFacts

Democratizing Visual Analytics, A Self-Service Platform for Big Data Exploration

One of the major challenges of the Big Data era is that a great amount and variety of open big datasets has become available for analysis by non-corporate data analysts as well, such as research scientists, data journalists, policy makers, SMEs and individuals. Transforming a data-curious user into someone who can competently access, analyze and consume that data is even harder now, because many of these users have little or no support or expertise in data (pre)processing.

Self-service data analytics is a recent trend in visual analytics that enables corporate business users to access and work with data even though they do not have a background in statistical analysis, business intelligence (BI) or data mining.

Self-service visual analytics is a new paradigm, widely promoted in modern corporate environments, in which business users are enabled and encouraged to directly manipulate (explore, blend, analyze) underlying data in rich visual ways, in order to derive insights from business information as quickly and efficiently as possible. Allowing less tech-savvy end users to make decisions based on their own queries and analyses frees the organization’s business intelligence and information technology (IT) teams from the tedious work of data preparation.

The aim of the VisualFacts project is to develop a scalable platform that provides self-service visual analytic capabilities to a wide range of corporate and non-corporate users, allowing them to access, explore and analyze open and privately held data, and to collaborate on the analytic results of their work by sharing, annotating and reusing them in the form of visual facts.

Self-Service

Current visual platforms and solutions (such as Tableau, SAS, Spotfire and QlikView) do not target the above communities, but focus mainly on closed-world corporate environments. This is mainly due to:

  • the cost of business platforms and their hardware requirements, which is prohibitive for individuals, SMEs and non-corporate users,
  • their lack of openness and technological transparency, which makes them difficult or impossible to extend and customize for more specific application needs,
  • the deployment and customization effort, which obliges IT teams to be involved in the data preprocessing (data ingestion, transformation and modelling, cleansing), as well as in the setup, operation (implementation of analytic functions) and optimization (indexing, tuning) of these tools, and finally
  • the closed set of provided visual analysis techniques, which mostly focus on business-related visualizations (e.g., OLAP analysis) rather than on ad-hoc customizable visualization results (e.g., graph analytics) that can be produced in a more collaborative fashion, shared and reused in an open-world setting, as in the case of Open Science, where research communities now work on a global open information space of resources (publications, datasets, persons) to identify research challenges and advance the state of the art.

The main goal of VisualFacts is to democratize self-service visual analytics, enabling a greater number of data scientists with diverse analytic needs to perform data analysis seamlessly and collaboratively, in an intuitive and productive way, without the support of expert IT users in the data preparation, analysis and optimization phases. This involves innovative research work addressing the following questions:

  • How do we enable non-expert users to intuitively perform visual data exploration and analysis in rich visual ways, without requiring them to have data manipulation or complex analytic skills (e.g., SQL, SPARQL, R), thus leaving IT experts out of the loop?
  • How do we enable non-expert users to seamlessly integrate and analyze, at query time, heterogeneous datasets (e.g., with different schemas or from disparate data sources) that contain highly noisy data of varying quality, without the need for manual and tedious data preprocessing, cleansing and deduplication?
  • How do we achieve fast, user-responsive (within seconds) analysis of voluminous datasets available on the web, gaining rapid value out of raw data without compromising the performance of the system or the complexity of the provided analytic functions?
  • How do we promote an open collaborative way of sharing and gaining insights out of visually presented facts and findings rather than raw data sharing?

Research Objectives

VisualFacts is a three-year project funded by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the 1st Call for H.F.R.I. Research Projects for the support of Post-doctoral Researchers; the hosting organization is the ATHENA Research Center. Its main objectives are:

  • Objective 1: Novel visualization and exploration techniques for collaborative self-service analytics. To provide self-service analytics functionality, visualization techniques and a new way for non-expert users to visually interact with and analyze large raw datasets, enabling them to share and collaborate on visually presented insights (visual facts) rather than plain open data.
  • Objective 2: Distributed indexing techniques for scalable visual analysis. To develop a new distributed technique for data management, focusing on indexing and query processing methods that facilitate interactive complex visual analysis over data with emergent schemas.
  • Objective 3: Efficient entity resolution techniques for visual analysis over dirty data. To develop and incorporate an efficient query-driven entity resolution process within the aforementioned data management framework for minimizing the time to visual analysis over dirty data.
  • Objective 4: Use case driven development and evaluation. To maximize the uptake of the platform by following a use case driven methodology for the development and evaluation of the research and technological objectives, through two real-world use cases: one from the open science domain (scholarly visual analytics) and one from the data journalism domain (analytics on the Panama Papers).
  • Objective 5: Dissemination of research output and further exploitation of the platform. To broadly disseminate the scientific results in high-ranked conferences and journals (at least 6) of the data management and information visualization communities, and to further exploit the produced technology towards commercialization.
  • Key Objective: Self-service visual analytics platform. To deliver a self-service visual analytics platform, deployed as a cloud-based tool, that provides scalable techniques for interactive visual analytics, allowing a wide variety of end users (individuals, SMEs and business users) to work with huge volumes of heterogeneous web data and collaborate on visual results.

Research Activities

VisualFacts research activities are structured around the following areas:

Data Visualization & Visual Analytics

The main objective of VisualFacts is to provide an easy-to-use interface that allows the visual exploration of big, heterogeneous data, the visual application of analytics (e.g., trend analysis, visual recommendations and outlier detection) and the collaborative sharing of visual artefacts. First, VisualFacts will offer a variety of charts, such as bar charts, line charts, scatter plots, heat maps, network diagrams and tree maps, which can be organized in publicly available dashboards. VisualFacts will develop all the functionality and models required to support collaborative editing and publishing of dashboards. Second, even though data visualization can intuitively convey a lot of information about correlated variables, outlier values and existing trends, applying data analytics to enrich visualizations can further help reveal hidden insights. VisualFacts will therefore offer visual ways to perform analytic functions on the charts, such as regression analysis for trends or outlier detection in scatter diagrams, and will provide exploration assistance in the form of visual recommendations. The latter addresses a common problem when dealing with big data: potentially important parts may never be explored. Moreover, determining the most suitable visualization type (pie chart, map, etc.) for the scenario at hand usually proves to be a tedious task, as users might not know in advance the types of data under analysis. To jump-start the data exploration process and highlight areas with patterns of interest, VisualFacts will provide visualization recommendations based on the characteristics of the data (e.g., data types, statistical properties). Furthermore, it will support interactive visual operations (e.g., pan, zoom, filter) to address visual clutter and information overload in exploration scenarios.
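As an illustration of recommending a visualization from data characteristics, the following is a minimal sketch and not the VisualFacts recommender. The function names, the column-kind rules and the thresholds are all assumptions made for this example; only data types and a simple distinct-value count are used.

```python
from numbers import Number


def infer_kind(values):
    """Classify a column as 'numeric', 'categorical' or 'text' (illustrative rule)."""
    non_null = [v for v in values if v is not None]
    if non_null and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in non_null
    ):
        return "numeric"
    distinct = len(set(non_null))
    # Assumption: few distinct values relative to the row count means a category.
    if non_null and distinct <= max(10, len(non_null) // 20):
        return "categorical"
    return "text"


def recommend_chart(x_values, y_values=None):
    """Suggest a chart type for one column or a pair of columns."""
    kind_x = infer_kind(x_values)
    if y_values is None:
        return {"numeric": "histogram", "categorical": "bar"}.get(kind_x, "table")
    kind_y = infer_kind(y_values)
    if kind_x == "numeric" and kind_y == "numeric":
        return "scatter"
    if {kind_x, kind_y} == {"categorical", "numeric"}:
        return "bar"
    if kind_x == kind_y == "categorical":
        return "heat map"
    return "table"  # fall back to a plain listing when nothing fits
```

For example, two numeric columns yield a scatter plot suggestion, while a category paired with a measure yields a bar chart. A recommender driven by statistical properties would add further signals (e.g., correlation or skew) on top of such type rules.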
VisualFacts will allow the effective abstraction and summarization of the data under analysis by providing a) dynamically calculated statistical information about the profile of the data, and b) hierarchical approaches for multi-level navigation, which offer an intuitive way to find areas of interest in the dataset. These hierarchical views will be constructed on the fly by considering schema characteristics, as well as user preferences and environment parameters (e.g., screen resolution). Finally, to minimize the overall visual analysis time, VisualFacts will employ data caching and prefetching techniques. Using information about previous user interactions as well as statistics about the data, the system will attempt to identify which parts of the dataset are most likely to be requested by future user queries and bring them into the cache.
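The caching and prefetching idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the platform's design: the data is assumed to be split into tiles addressed by (x, y), `load_tile` is a hypothetical loader supplied by the caller, and the only interaction signal used is the direction of the user's last pan.

```python
from collections import OrderedDict


class PrefetchingTileCache:
    """LRU cache that also loads the tile the user is likely to request next."""

    def __init__(self, load_tile, capacity=64):
        self.load_tile = load_tile  # hypothetical loader: (x, y) -> tile data
        self.capacity = capacity
        self.cache = OrderedDict()  # least recently used entry comes first
        self.last = None            # previously requested tile coordinates
        self.hits = 0
        self.misses = 0

    def _put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used tile

    def get(self, x, y):
        key = (x, y)
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)
            value = self.cache[key]
        else:
            self.misses += 1
            value = self.load_tile(x, y)
            self._put(key, value)
        # Assumption: the user keeps panning in the same direction, so the
        # tile one step further along that direction is fetched in advance.
        if self.last is not None:
            dx, dy = x - self.last[0], y - self.last[1]
            nxt = (x + dx, y + dy)
            if (dx, dy) != (0, 0) and nxt not in self.cache:
                self._put(nxt, self.load_tile(*nxt))
        self.last = key
        return value
```

With a user panning steadily to the right, only the first two requests miss the cache; later tiles are already present when requested. A production system would prefetch asynchronously and could weigh data statistics as well as interaction history.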

Indexing and Query Processing

One of the objectives of VisualFacts is the timely delivery of smart visual analytics over dirty graph data. To address this objective, the visual analytics functionality must be backed by data structures and retrieval techniques that can support the specificities and complexity of visual analytics processing. This calls for novel data management techniques, as existing ones fall short because of technical drawbacks, mainly: (i) inability to address schema heterogeneity proactively, (ii) inability to process highly complex queries, especially over graph data, and (iii) a general lack of algorithms for integrating query processing with entity resolution. Furthermore, the functionality of VisualFacts requires data structures that are hierarchical, dynamic and metadata-rich, so that scenarios of hierarchical exploration, visual data profiling and visualization recommendation can be supported. For these reasons, VisualFacts will build on the Extended Characteristic Set (ECS) data structure, an indexing structure for graph data that targets heterogeneity and join-heavy query processing. ECS satisfies the aforementioned requirements for the following reasons: (i) it is decoupled from the explicit schema of the data and is instead built on the implicit, emergent schema, which lets it capture heterogeneity at its core, (ii) it is inherently hierarchical, as it supports generalizations and specializations in its structure, and (iii) it partitions the data into relatively small and semantically rich parts, and is thus a good candidate for pruning unnecessary records while performing entity resolution and other cleansing tasks at query time. However, in its present form, ECS indexing cannot efficiently address data diversity, as diversity leads to the creation of many sparse ECSs that are costly to fetch and process.
For this reason, VisualFacts will extend ECS indexing to detect and merge similar ECSs. We will design and implement techniques for detecting hierarchical relations between merged ECSs, for faster access to related data records, which will enable efficient real-time visual exploration scenarios. To address scalability, the indexes will be implemented in a distributed setting, over multiple nodes. To this end, VisualFacts will implement and deploy parallel query processing algorithms. The index will be adaptively updated in order to provide fast access to raw data. These updates will be driven by the incremental emergent schema detection process and by feedback from the entity resolution results, which will enable further linkage and merging of existing ECSs.
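The emergent-schema idea underlying this kind of index can be sketched in a few lines. The example below is a simplified illustration, not the ECS index itself: it computes plain characteristic sets (the set of predicates each subject appears with) from (subject, predicate, object) triples and then merges similar sets. The Jaccard similarity measure, the greedy merge strategy and the 0.6 threshold are assumptions chosen for the example.

```python
from collections import defaultdict


def characteristic_sets(triples):
    """Group subjects by the exact set of predicates they are described with."""
    predicates = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates[subject].add(predicate)
    groups = defaultdict(set)
    for subject, preds in predicates.items():
        groups[frozenset(preds)].add(subject)
    return dict(groups)


def jaccard(a, b):
    """Overlap between two non-empty predicate sets, in [0, 1]."""
    return len(a & b) / len(a | b)


def merge_similar(groups, threshold=0.6):
    """Greedily fold each characteristic set into the first similar merged set."""
    merged = []  # list of (union of predicates, set of subjects)
    # Visit larger sets first so that sparse variants fold into fuller ones.
    for cs in sorted(groups, key=lambda c: (-len(c), sorted(c))):
        subjects = groups[cs]
        for i, (m_cs, m_subjects) in enumerate(merged):
            if jaccard(cs, m_cs) >= threshold:
                merged[i] = (m_cs | cs, m_subjects | subjects)
                break
        else:
            merged.append((cs, set(subjects)))
    return merged


triples = [
    ("p1", "name", "A"), ("p1", "email", "a@x"), ("p1", "affiliation", "U"),
    ("p2", "name", "B"), ("p2", "email", "b@x"),
    ("d1", "title", "T"), ("d1", "year", "2019"),
]
groups = characteristic_sets(triples)
merged = merge_similar(groups)
```

Here the two person-like subjects start in different characteristic sets because one lacks an affiliation; merging places them in one group, while the document-like subject stays separate. This reduces the number of sparse partitions at the cost of admitting some missing values inside a merged group.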

Query-driven Entity Resolution

Delivering quality visual analytics depends directly on the quality of the data. However, the aggregation of data from remote sources often leads to inherent dirtiness, in both structure (i.e., heterogeneity) and content (e.g., duplicates, missing values). While VisualFacts addresses structural problems with the aforementioned emergent schema detection techniques, content dirtiness must be addressed by Entity Resolution (ER) algorithms. In keeping with the self-service approach, VisualFacts must ensure that these processes take place automatically, without user intervention. Thus, VisualFacts will develop and integrate Entity Resolution algorithms into its data management framework. These algorithms will focus on deduplication and record linkage, which are inherently quadratic problems, as they require comparisons between all pairs of entities in the data. To speed up these processes, VisualFacts will employ the entity blocking / Meta-blocking approach, a technique that groups closely related entities into a graph in order to prune redundant and unnecessary comparisons [Chri12, PKPN14]. These methods are traditionally applied independently of query evaluation, as a data preprocessing step; in contrast, VisualFacts will develop ER methods that are directly integrated, in the form of operators, within the query evaluation process. More specifically, VisualFacts will extend the Meta-blocking technique so that it is deployed over the ECS index, in order to detect probable duplicates at query time. This process will be efficient and fully automated. To address the high computational cost and complexity of the actual comparisons between similar data, VisualFacts will design and implement a parallelization algorithm for efficient block distribution among the available cloud nodes.
Each node will hold the essential information for executing the comparisons pertaining to each block locally, entailing minimal shuffling among nodes. Finally, VisualFacts will implement an update operator for reusing the ER results of each query and enriching the Meta-blocking graph structure with information about entity matchings, in order to improve the quality and performance of future queries.
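The blocking and Meta-blocking steps described above can be illustrated with a small, single-machine sketch. It is not the query-driven or distributed method that VisualFacts targets: it assumes token blocking (one block per token appearing in any attribute value), weights each candidate pair by the number of blocks the two entities share, and keeps only pairs whose weight reaches the mean edge weight. The function names and the sample entities are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(entities):
    """Build one block per token; entities: dict of id -> dict of attributes."""
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # Blocks holding a single entity produce no comparisons, so drop them.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def blocking_graph(blocks):
    """Weight each candidate pair by the number of blocks it co-occurs in."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune(weights):
    """Keep only pairs whose weight is at least the mean edge weight."""
    if not weights:
        return set()
    mean = sum(weights.values()) / len(weights)
    return {pair for pair, weight in weights.items() if weight >= mean}


entities = {
    "e1": {"name": "John Smith", "city": "Athens"},
    "e2": {"name": "J. Smith", "city": "Athens"},
    "e3": {"name": "Maria Smith", "city": "Patras"},
}
blocks = token_blocks(entities)
weights = blocking_graph(blocks)
candidates = prune(weights)
```

In this toy input all three entities share the token "smith", so blocking alone still proposes all three pairs; pruning on the blocking graph leaves only the pair that also shares "athens". Only the surviving pairs would then go through the expensive attribute-level comparison.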

VisualFacts Platform

The resulting components will be integrated into a cloud-based platform. The architecture of the platform comprises the Data Layer, the Core Platform Layer and the Presentation Layer. The Data Layer consists of the primary tools and modules related to the physical storage of input data and generated indexes. This layer will be responsible for the deployment of the database structures that store the indexed data along with their indexes on disk for future querying. The Core Platform Layer is responsible for all the core backend functionality of the platform, which covers the management of input data, the creation and update of indexes, query optimization and query evaluation, as well as all the methods related to the processing and generation of visual analytics. It consists of three main components: the Data Staging and Indexing component (responsible for all data preparation and indexing tasks), the Query Processing component (which processes and evaluates incoming queries), and the Visual Analytics Processing component (which handles incoming requests from the user interface, including the input of visual analytics queries and the generation of visualizations).

[Figure: architecture of the VisualFacts platform]