All software is available as open source under GPLv3. Please check our GitHub page, where all of our software is released. Feel free to download it, experiment, use it, and give us feedback. Also, join us in the development if you want to contribute!
VisualFacts is a prototype system: a self-service visual analytics platform for big geo-located data. It helps data explorers perform ad hoc analysis of raw data files collected from different sources of varying quality (e.g., with duplicates or missing values) in rich visual ways, even without a background in notebooks, data integration, or machine learning techniques. The VisualFacts platform offers in-situ visual exploration and analytics, as well as entity resolution analysis.
The VisualFacts platform allows users to open their own data file(s) and start visually interacting with the data via a map-centric dashboard UI, without loading or indexing the data in a database. The backbone of the platform is a visualization-aware in-memory index, constructed on the fly and adjusted to user interaction, together with a powerful deduplication engine that offers on-the-fly visual entity matching and clustering over dirty data. The platform scales visualization, interactive exploration, and analysis to millions of data points on a map using commodity hardware.
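To illustrate the general idea behind an in-memory index built on the fly for map interaction (hypothetical sketch only; the class and its parameters are illustrative, not the actual VisualFacts index), the following groups points into spatial tiles so that a map-viewport query touches only the relevant tiles:

```python
# Illustrative sketch: a tile-based in-memory spatial index built in one
# pass over raw points. Names (TileIndex, tile_size) are hypothetical.
from collections import defaultdict

class TileIndex:
    def __init__(self, points, tile_size=1.0):
        self.tile_size = tile_size
        self.tiles = defaultdict(list)  # (tile_x, tile_y) -> points
        for lon, lat in points:
            key = (int(lon // tile_size), int(lat // tile_size))
            self.tiles[key].append((lon, lat))

    def query(self, min_lon, min_lat, max_lon, max_lat):
        """Return all points inside the given map viewport (bounding box)."""
        ts = self.tile_size
        result = []
        # Visit only the tiles overlapping the viewport.
        for tx in range(int(min_lon // ts), int(max_lon // ts) + 1):
            for ty in range(int(min_lat // ts), int(max_lat // ts) + 1):
                for lon, lat in self.tiles.get((tx, ty), []):
                    if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
                        result.append((lon, lat))
        return result

points = [(-73.99, 40.73), (-73.97, 40.76), (-73.5, 40.6), (23.7, 37.98)]
idx = TileIndex(points, tile_size=0.5)
print(len(idx.query(-74.0, 40.7, -73.9, 40.8)))  # 2 points fall in this viewport
```

A real visualization-aware index would additionally refine tiles adaptively in the areas the user explores; this sketch shows only the core tile-lookup idea.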
The VisualFacts platform provides a REST API and a graphical user interface offering numerous functionalities.
The VisualFacts platform integrates functionality from the following software components, which are also released separately and can be used as standalone tools.
The RawVis system enables efficient in-situ visual exploration and analytics directly over large raw data files, without the need for an underlying DBMS or query engine. RawVis exhibits low response times over large datasets (e.g., 50 GB / 100M objects) using commodity hardware.
The basic functionality of RawVis is presented in this video.
Taxi Use Case [link] In this use case, each object refers to a specific taxi ride described by several attributes, such as the geographic pick-up location (Lat, Long), payment type, passenger count, tip amount, and trip distance. Each visualized point/cluster corresponds to the pick-up location of a taxi ride. The objects come from the NYC Yellow Taxi Trip dataset.
Telecommunication Use Case [link] The data come from an anonymized telecommunication dataset containing latency and signal-strength measurements. Each visualized point/cluster refers to a network latency and signal measurement, described by several numeric attributes, such as the geographic location (Lat, Long), latency, signal strength, and network bandwidth, as well as categorical attributes such as network type, network operator name, device manufacturer, roaming, etc.
QueryER is an SQL engine that integrates entity resolution (ER) operations into the planning and execution of select-project-join queries. Entity Resolution constitutes a fundamental task for data integration, aiming to match different representations of entities coming from various sources. Due to its quadratic complexity, it typically scales to large datasets through approximate methods, i.e., blocking. In traditional settings, it is a pre-processing step performed before "dirty" data are made available for analysis. With the increasing demand for real-time analytical applications, recent research considers new approaches for integrating Entity Resolution with Query Processing. QueryER executes analysis-aware deduplication by weaving ER operators into the query plan. It offers three novel query operators, which (1) identify and resolve duplicates within a table using a schema-agnostic resolution approach; (2) enable joins between two or more tables containing duplicate entities; and (3) group/merge deduplicated entities into a single representation.
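The schema-agnostic blocking idea mentioned above can be sketched as follows (an illustration of the general token-blocking technique, not QueryER's actual operators or code): every token from any attribute value becomes a block key, and only records sharing at least one block are compared, which avoids the quadratic all-pairs comparison.

```python
# Illustrative sketch of schema-agnostic token blocking for ER.
import re
from collections import defaultdict
from itertools import combinations

def token_blocking(records):
    """Map each token (from any attribute, regardless of schema) to record ids."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for value in rec.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(rid)
    return blocks

def candidate_pairs(blocks):
    """All record pairs sharing at least one block; only these are compared."""
    pairs = set()
    for rids in blocks.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs

# Records with heterogeneous schemas (hypothetical example data).
records = {
    1: {"name": "John Smith", "city": "New York"},
    2: {"full_name": "Smith, John", "location": "NYC"},
    3: {"name": "Jane Doe", "city": "Boston"},
}
print(candidate_pairs(token_blocking(records)))  # {(1, 2)}
```

Note that records 1 and 2 are paired despite having different attribute names, which is exactly what "schema-agnostic" buys; a match function would then decide whether each candidate pair is a true duplicate.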
Characteristic sets (CSs) organize RDF triples based on the set of properties associated with their subject nodes. This concept was recently used in indexing techniques, as it can capture the implicit schema of RDF data. While most CS-based approaches yield significant improvements in space and query performance, they fail to perform well when answering complex query workloads in the presence of schema heterogeneity, i.e., when the number of CSs becomes very large, resulting in a highly partitioned data organization. We address this problem by introducing a novel technique for merging CSs based on their hierarchical structure. Our method employs a lattice to capture the hierarchical relationships between CSs, identifies dense CSs, and merges them with their ancestors. RaxonDB employs these techniques on top of a relational backbone, where each merged CS is stored in a relational table; CS merging therefore reduces the number of tables required to host the source triples of a dataset.
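The basic notion of a characteristic set can be sketched in a few lines (a minimal illustration of CS extraction only; RaxonDB's lattice construction and dense-CS merging are more involved): group triples by subject and take each subject's set of properties.

```python
# Minimal sketch: extract characteristic sets (CSs) from RDF triples.
# A CS is the set of properties attached to a subject; subjects with the
# same property set share a CS (and, in RaxonDB, the same relational table).
from collections import defaultdict

def characteristic_sets(triples):
    props_by_subject = defaultdict(set)
    for s, p, o in triples:
        props_by_subject[s].add(p)
    cs_extent = defaultdict(set)  # CS (frozenset of properties) -> subjects
    for s, props in props_by_subject.items():
        cs_extent[frozenset(props)].add(s)
    return cs_extent

# Hypothetical toy triples.
triples = [
    ("alice", "name", "Alice"), ("alice", "age", 30),
    ("bob",   "name", "Bob"),   ("bob",   "age", 25),
    ("doc1",  "title", "RDF"),  ("doc1",  "author", "alice"),
]
cs = characteristic_sets(triples)
# alice and bob share CS {name, age}; doc1 has its own CS {title, author}.
for props, subjects in sorted(cs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(props), sorted(subjects))
```

In the hierarchical view used for merging, one CS is an ancestor of another when its property set is a subset of the other's, which is what the lattice in the text captures.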