Bridging the Gap: Integrating Big Data Analytics and Cloud-Based Supercomputing

  • Updated on August 3, 2023
  • By Alex Lesser

    Experienced and dedicated integrated hardware solutions evangelist for effective HPC platform deployments for the last 30+ years.


    The increasing reliance of mainstream companies on analytics and artificial intelligence (AI) for business intelligence and business process management is driving the need for systems that integrate Big Data analytics and cloud-based supercomputing capabilities.

    Assembling a system for one or the other (Big Data analytics or cloud-based supercomputing) is challenging enough; getting all the needed elements into one optimized cloud-based system is even harder. To put the issue into perspective, consider the characteristics of systems optimized for one versus the other.

    Key features of cloud-based supercomputing systems include:

    • Scalable computing with high-bandwidth, low-latency, global memory architectures
    • Tightly integrated instances, memory, interconnect, and network storage
    • Minimal data movement (accomplished by loading data into memory)

    In contrast, key characteristics of a Big Data analytics system typically include:

    • Distributed computing
    • Service-Oriented Architecture
    • Lots of data movement (processes include sorting or streaming all the data all the time)
    • Low-cost instances, memory, interconnect, and local storage

Factors Impacting Architectural Choices

    One of the biggest challenges is matching compute, storage, memory, and interconnect capabilities to workloads. Unfortunately, the only constant is that AI and Big Data analytics workloads are highly variable. Some factors to consider include:

AI Training

    Companies need enormous compute and storage capacity to train AI models. The demand for such capabilities is exploding. Since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially, with a 3.4-month doubling time.

    Given that under Moore’s Law compute power doubles roughly every 18 months, how is it possible to deliver such compute capacity and keep pace as requirements grow? The answer, as most readers likely know, is to accelerate these workloads with field-programmable gate arrays (FPGAs) and graphics processing units (GPUs) that complement a cloud-based system’s CPUs. Intel® and Nvidia® are both staking their claims with new FPGA and GPU products that are becoming de facto standards. Nvidia® Tesla and RTX series GPUs are synonymous with AI and machine learning, and Intel®’s acquisition of Altera has jump-started its entry into this market.
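    To put those two growth rates side by side, here is a minimal back-of-the-envelope sketch in plain Python (no external dependencies). The 3.4-month and 18-month doubling times come from the figures above; the 24-month horizon is an arbitrary choice for illustration.

```python
# Back-of-the-envelope comparison: AI training compute demand vs. Moore's Law.
# Both curves follow 2 ** (elapsed_months / doubling_time_months).

def growth_factor(months: float, doubling_time_months: float) -> float:
    """Return the multiplicative growth after `months`, given a doubling time."""
    return 2 ** (months / doubling_time_months)

horizon = 24  # months; an arbitrary horizon chosen for illustration

ai_demand = growth_factor(horizon, 3.4)    # 3.4-month doubling (largest AI training runs)
moores_law = growth_factor(horizon, 18.0)  # 18-month doubling (Moore's Law)

print(f"AI training compute demand: ~{ai_demand:,.0f}x in {horizon} months")
print(f"Moore's Law growth:         ~{moores_law:.1f}x in {horizon} months")
print(f"Gap left for accelerators and scale-out: ~{ai_demand / moores_law:,.0f}x")
```

    Over two years, a 3.4-month doubling time implies roughly 130x growth in demand versus roughly 2.5x from Moore’s Law alone, which is the gap that accelerators and scale-out cloud instances are being asked to close.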


    Where do these elements play a role? Deep learning powers many AI scenarios, including autonomous cars, cancer diagnosis, computer vision, speech recognition, and many other intelligent use cases. GPUs accelerate deep learning algorithms that are used in AI and cognitive computing applications.

    GPUs help by offering thousands of cores capable of performing millions of mathematical operations in parallel. Just as in graphics rendering, GPUs used for deep learning deliver an enormous number of matrix multiplication operations per second.
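    As a concrete illustration of the kind of operation being accelerated, the following is a minimal sketch of a large matrix multiplication offloaded to a GPU. It assumes PyTorch and a CUDA-capable GPU are available; the framework and the 4096 x 4096 matrix size are choices made for this example, not something prescribed here.

```python
# Minimal sketch: offloading a large matrix multiplication to a GPU with PyTorch.
# Assumes the `torch` package is installed; falls back to the CPU if no GPU is present.
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Two 4096 x 4096 matrices of random 32-bit floats (sizes chosen for illustration).
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

start = time.perf_counter()
c = a @ b                      # thousands of GPU cores work on this in parallel
if device.type == "cuda":
    torch.cuda.synchronize()   # wait for the asynchronous GPU kernel to finish
elapsed = time.perf_counter() - start

# A 4096^3 matrix multiply performs roughly 2 * 4096^3 floating-point operations.
flops = 2 * 4096 ** 3
print(f"{device.type}: {elapsed:.4f} s, ~{flops / elapsed / 1e12:.2f} TFLOP/s")
```

    On a modern data center GPU this typically runs orders of magnitude faster than the same multiplication on a CPU, which is exactly the throughput deep learning training depends on.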

Big Data Ingestion and Analysis

    System requirements can vary greatly depending on the type of data being analyzed. Running SQL queries against a structured database has vastly different system requirements than, say, running real-time analytics or cognitive computing algorithms on streaming data from smart sensors, social media threads, or clickstreams.
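    To make that contrast concrete, here is a minimal sketch of both workload styles using PySpark, chosen only as one common engine (no specific tool is prescribed here). The table name, Parquet path, and the synthetic "rate" stream are hypothetical placeholders.

```python
# Minimal sketch contrasting a batch SQL query with streaming analytics in PySpark.
# The Parquet path and the synthetic "rate" source are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

# Batch: SQL over a structured dataset that has already landed in storage.
orders = spark.read.parquet("/data/warehouse/orders")   # hypothetical path
orders.createOrReplaceTempView("orders")
top_regions = spark.sql(
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region ORDER BY revenue DESC"
)
top_regions.show()

# Streaming: continuous aggregation over data that never stops arriving.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()
per_window = (
    events.groupBy(F.window("timestamp", "10 seconds"))
          .agg(F.count("*").alias("events"))
)
query = per_window.writeStream.outputMode("update").format("console").start()
query.awaitTermination(30)   # run the streaming query for 30 seconds, then stop
```

    The batch query reads data that has already landed and then terminates; the streaming query runs continuously, which is why the two place such different demands on storage, memory, and interconnect.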

    Depending on the data and the analytics routines, a solution might include a data warehouse, a NoSQL database, graph analysis capabilities, or Hadoop or MapReduce processing, or some combination of these elements.

    Beyond architecting a system to meet the demands of the database and analytic processing algorithms, attention must be paid to data movement. Issues to consider include how to ingest the data from its original source, how to stage it in preparation for analysis, and how to keep analysis processes fed with data so that the cloud-based instances are used efficiently.
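    As a simple illustration of keeping compute fed, here is a minimal sketch of chunked ingestion with a background reader thread that overlaps I/O with analysis. The file path, chunk size, and use of pandas are hypothetical choices for this example; production pipelines would normally rely on a dedicated ingestion framework.

```python
# Minimal sketch: ingest a large CSV in chunks and read ahead in a background thread
# so the analysis step is never starved for data.
# The path and chunk size are placeholders chosen for illustration.
import queue
import threading

import pandas as pd

CHUNK_ROWS = 100_000
SOURCE = "/data/staging/clickstream.csv"   # hypothetical staged file

def producer(q: queue.Queue) -> None:
    """Read chunks from the source and hand them to the analysis stage."""
    for chunk in pd.read_csv(SOURCE, chunksize=CHUNK_ROWS):
        q.put(chunk)
    q.put(None)   # sentinel: no more data

def consumer(q: queue.Queue) -> None:
    """Analyze chunks as they arrive; ingestion overlaps with computation."""
    total_rows = 0
    while (chunk := q.get()) is not None:
        total_rows += len(chunk)           # stand-in for real analytics work
    print(f"processed {total_rows} rows")

buffer: queue.Queue = queue.Queue(maxsize=4)   # bounded buffer keeps memory in check
threading.Thread(target=producer, args=(buffer,), daemon=True).start()
consumer(buffer)
```

    The bounded queue is the key design choice: it lets ingestion run ahead of analysis without letting staged data overwhelm instance memory.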

Optimized, Integrated Solutions

    Vendors and the cloud computing community are working to address the Big Data challenge in a variety of ways, especially given the general acceptance of AI and its dependence on large data sets.

    A suitable cloud-based system must be able to be dynamically provisioned by users to handle different data workflows, including databases (both relational and NoSQL), Hadoop/HDFS-based workflows (including MapReduce and Spark), and more custom workflows that perhaps leverage a flash-based parallel file system.
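    Purely as an illustration, the sketch below shows how such workflow profiles might be expressed declaratively so that instances can be sized and provisioned on demand. The profile names, resource figures, and provision helper are hypothetical and do not correspond to any particular cloud API.

```python
# Hypothetical workflow profiles for dynamic provisioning. None of these names map
# to a real cloud API; they only illustrate matching resources to workload types.
WORKFLOW_PROFILES = {
    "relational_db": {"vcpus": 16, "memory_gb": 128, "storage": "network-ssd", "gpus": 0},
    "nosql":         {"vcpus": 32, "memory_gb": 256, "storage": "local-nvme",  "gpus": 0},
    "hadoop_spark":  {"vcpus": 64, "memory_gb": 512, "storage": "hdfs-local",  "gpus": 0},
    "ai_training":   {"vcpus": 48, "memory_gb": 384, "storage": "parallel-fs", "gpus": 8},
}

def provision(workflow: str, nodes: int) -> list[dict]:
    """Return per-node resource requests for a given workflow (illustrative only)."""
    profile = WORKFLOW_PROFILES[workflow]
    return [dict(profile, node=i) for i in range(nodes)]

# Example: request a four-node Spark cluster sized from its profile.
for node in provision("hadoop_spark", nodes=4):
    print(node)
```

    Mapping each workflow to an explicit profile like this is one way to keep compute, memory, and storage choices aligned with the workload rather than settling for a one-size-fits-all instance.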

    The challenge for most companies is that they might have system expertise in cloud-based supercomputing or Big Data, but often not in both. As such, there may be a skills gap that makes it hard to bring together the essential elements and fine-tune the cloud-based system’s performance for critical workloads.

    To overcome such issues, many companies are looking for turnkey solutions that combine the needed processing, storage, memory, and interconnect technologies for Big Data analytics and cloud-based supercomputing into a tightly integrated cloud-based system.

    Delivering such a solution requires expertise and real-world best practices across both cloud-based supercomputing and Big Data domains, plus deep industry knowledge about the specific data sources and analytics applications.

    NZO Cloud has a more than 30-year history of delivering cloud-based systems that meet the most demanding workloads across industry, government, and academia.

    Its offerings include the PowerServe Uniti Server and NZO Cloud + HPC Instance lines, which leverage the latest components from Intel® and Nvidia®. These servers are ideal for applications including AI and deep learning; design and engineering; life sciences, including genomics and drug discovery and development; and computational and data analysis for the chemical and physical sciences.

    NZO Cloud also offers CloudOOP Big Data Instances and CloudOOP Rax Big Data Instances, which deliver the highest level of performance one would expect from an enterprise server combined with the cost-effectiveness of direct-attached storage for Big Data applications. The instances are certified compatible with Cloudera, MapR, Hortonworks, Cassandra, and HBase, and deliver sustained I/O speeds of 200+ MB/sec per hard drive, 30%+ faster than other OEMs.

    These solutions and other NZO Cloud systems are designed to meet the requirements of modern Big Data analytics in today's enterprises. Such systems will only become more important as companies explore new ways to use rapidly growing volumes of data to improve operations, better engage customers, and react quickly to new business opportunities.

    The bottom line is that the world is entering an era that requires extreme-scale cloud-based supercomputing coupled with data-intensive analytics. Companies that succeed will need cloud-based systems that bring the essential compute, storage, memory, and interconnect elements together in a manner where each element complements the others to deliver peak performance at a nominal cost.
