INCT CID - National Institute of Science and Technology in Data Sciences
LNCC

Motivation

Motivation of the Institute

The INCT-CiD proposes to contribute to the science-industry-government axis through this new science. The challenge, and the main goal, of the INCT-CiD is to identify the fundamental principles, methods, and techniques for managing and analyzing large volumes of data, overcoming the inherent difficulties of large-scale data analysis. Motivated by challenges that the teams of the associated laboratories have previously addressed in areas as diverse as astronomy, biology, biodiversity, cyber defense, education, sports, the Internet, urban mobility, oil & gas, healthcare, information security, and mobile telecommunications, we propose to contribute to the scientific structuring of this area, both in the training of highly qualified human resources and in basic and applied research at the frontiers of knowledge in data sciences. In particular, the INCT-CiD will address three main research areas: (i) data management; (ii) data analysis; (iii) complex network analysis.


INCT CID motivation

The main motivation of the INCT-CiD proposal emerges from the previous experience of the proponent group in R&D activities for data management and analysis, as well as complex network analysis, related to application scenarios in several areas. Examples of such areas are astronomy, biology, biodiversity, cyber defense, sports, the Internet, urban mobility, oil & gas, health, information security, mobile communication, and education. The associated figure illustrates this research motivation cycle within the INCT-CiD, which drives basic and applied research motivated by application scenarios.


Examples of application scenarios

  • Astronomy: LNCC is a member of the Inter-Institutional Laboratory for e-Astronomy (LIneA), where data collected from large astronomy surveys are managed and processed. Such surveys produce data from telescope images taken by ground-based instruments. Based on these images, celestial bodies are identified and their features annotated, producing a database called an astronomical catalog. From a data management perspective, these catalogs may contain hundreds of billions of celestial bodies. Processing such a huge volume of data efficiently requires distributing the data across a cluster. Data partitioning and allocation strategies must meet the requirements of several data analysis workflows. The difficulty lies in determining adequate criteria that satisfy distinct applications, which is an open research problem. The integration of a set of catalogs generated by different surveys also raises the problem of entity resolution, given that the resolution of celestial objects is based on their position, which varies among different telescopes.
  • Bioinformatics: Large-scale DNA sequencing technologies have led to studies in the area of Systems Biology. In this area, biological phenomena are studied through dynamic models, since they are characterized by components with complex interactions, forming networks of complex systems. In particular, the study of gene interaction and metabolic networks is based on the inference and construction of interaction networks as a way of understanding intracellular dynamics. At LNCC, there are ongoing studies on the inference of complex gene interaction networks associated with the HIV infection process. The inference of the network takes into account gene expression throughout the infectious process and infers interactions with the help of information extracted from public gene interaction databases.
  • Biodiversity: To monitor changes in biodiversity, it is essential to collect, document, store, and analyze indicators of the spatio-temporal distribution of species, in addition to obtaining information about how they interact among themselves and with the environment in which they live. In this context, the Brazilian Biodiversity Information System (Sistema de Informação sobre a Biodiversidade Brasileira - SiBBr) aims at integrating and disseminating data collected and published by several Brazilian institutions. The computational infrastructure for the SiBBr is provided by the LNCC. SiBBr is also the Brazilian node for the Global Biodiversity Information Facility (GBIF). The functionalities provided by SiBBr include the aggregation of data by species and of occurrences made available by several academic institutions and governmental agencies. A first prototype of a scientific workflow for modeling species distribution has been developed, which allows scalable execution with provenance information.
  • Cyber Defense: In this area there is a demand for the analysis of large volumes of data extracted directly from the data link layer (of the OSI model) in the form of traces. The capture, classification, storage, retrieval, processing, analysis, and visualization of such data are still a research challenge in this area. A typical scenario is the search for suspicious behavior within the network traces in order to identify the need for installing a cyber security system. The creation of a cyber shield, as well as other security solutions, forms the main goal of the Cyber Defense Center of the Brazilian Army (Defense Ministry), which relies on the IME AL as an active participant, aligned with the National Defense Strategy (http://www.defesa.gov.br/projetosweb/estrategia/arquivos/estrategia_defesa_nacional_ingles.pdf).
  • Education: Extracting information from educational content is an interesting research area. In particular, such information can be used to improve students' learning skills in a variety of fields, including computer science. This research area comprises the development and usage of games and educational applications, data capture, pre-processing, and data analysis methods. This research line is of particular interest to CEFET/RJ, since this institution offers professional qualification ranging from high school to graduate-level programs. Such educational applications could provide integration opportunities among these diverse education program levels.
  • Sports: The use of knowledge extraction techniques to improve athletes' performance has recently gained a lot of attention. Since 2010, LNCC has taken part in a FINEP-funded initiative to establish the first Olympic Laboratory in Latin America, coordinated by the Brazilian Olympic Committee. The Olympic Lab aims at integrating data from different scientific disciplines, providing a holistic view of the condition of the athletes and giving insights into training preparation. In this context, LNCC is developing the SAHA system, which provides a single representation for data captured by the different disciplines in the lab. All elements of observation are uniformly described and quantified. A common trajectory data model enables expressing the follow-up of athletes over time and during different training states.
  • Internet: The Internet poses great challenges for the characterization of its structure and behavior. It presents itself as a set of independent complex networks, ranging from communication networks that provide a basic interconnection infrastructure up to online social networks and content delivery applications comprising billions of users. Modeling, characterizing, and analyzing such networks in the Internet also pose the particular challenge of preserving user privacy. Therefore, efficient collection and handling of large amounts of detailed personal data are required while conducting research studies. To face this challenge, several measurement-based approaches have been proposed for characterizing behavior and for analyzing and modeling Internet dynamics and its properties at several privacy levels. Partners of the INCT-CiD have particular experience with diverse aspects of the complex network challenges described above, both from and of the Internet, as well as with the impacts on its applications.
  • Urban Mobility: UTFPR has been conducting applied research on specific subjects of urban mobility. The research studies apply data handling techniques aimed at analyzing a massive amount of real data provided by the Curitiba City Integrated Public Transport System. Examples of ongoing work are: (i) defining models of electric bus operation in specific (georeferenced) zones or bus lines; (ii) defining methods and algorithms for data storage, distribution, and replication; (iii) defining algorithms based on map-reduce strategies aimed at bus traffic pattern characterization (e.g., driving profiles); (iv) optimizing the allocation of charging stations in electromobility scenarios; (v) defining algorithms for simulating and optimizing traffic light control through collective intelligence; and (vi) defining new video processing techniques for the existing video monitoring system.
  • Oil and Gas: Research on deepwater oil and gas exploitation is a big challenge in Brazil, including large fields more than 5 km below the surface. The investigation of these fields uses techniques based on capturing reflected seismic waves. These waves are sent underwater, reflected by rock layers at the sea bottom, and recaptured by sensors at the sea surface. Once captured and processed, data from seismic waves are combined into large datasets representing the region under survey. The analysis of seismic signals to profile features is called geophysical interpretation and has relevant economic value. In this sense, the development of new techniques that could support fault detection over large oil fields, such as the Brazilian pre-salt area, is a broad and new research problem. In addition to the basic problem of big data management, the inference of features of interest from raw seismic trace data is a huge research challenge.
  • Healthcare: The healthcare area usually handles massive volumes of data. This volume steadily increases due to the growing adoption of health information systems and patients’ electronic medical records. The LNCC team has expertise in information systems for health services. There are many challenges in managing and analyzing health-related data, such as aggregation, maintenance, interoperability, interpretation and, in particular, privacy preservation, given the high sensitivity of the data. The trend is a faster expansion of data volume due to the increasing adoption of monitoring devices and mobile devices for data collection in residential and pre-hospital environments. Another recent trend is the modeling of health-related problems as complex networks, whether related to diseases or to the coordination of health services for care and resource management.
  • Information Security: Critical infrastructures for intelligent societies (such as energy, water, telecommunications, transportation, banks, financial services, waste collection, health) must have efficient resource management. This should be done through intelligent networks and the processing of large volumes of data, guaranteeing data integrity and protection, as well as preserving the privacy of citizens, institutions, and enterprises. In order to assure the reliable operation of a critical infrastructure, the UTFPR team has recently proposed several works. These works include the use of appropriate technologies, such as cyber security, for the prevention, detection, and mitigation of failures. In addition, the identification and simulation of malicious denial-of-service accesses is the focus of study at CEFET/RJ through a partnership with the CLAVIS Information Security Company. Similar initiatives, in the context of the Brazilian Army, are under study by the IME-RJ.
  • Mobile Communication: Data collected from cellphones has a huge potential for providing valuable information about the dynamics of individuals or human mobility at a relatively low cost and at an unprecedented scale. The analysis of massive amounts of cellphone data may have an impact on several areas. These include the planning and dimensioning of telecommunication networks, as well as urban planning and studies of human mobility. The LNCC team has expertise in network load dynamics and urban mobility due to large-scale events. The challenges of these application scenarios are the management and analysis of massive amounts of data, as well as the analysis of the complex network relations that typically arise from them.

Beyond the areas in which the INCT-CiD partners already have working experience applying large-scale data analytics, we are also discussing the extension to other areas of knowledge with different partners.