AI & Big Data Research & Technology Commercialization
Researchers at COMStar Tech and The George Washington University have addressed the challenges of AI & Big Data by developing a groundbreaking approach to information storage, information retrieval, knowledge extraction, feature clustering, and pattern prediction based on a new type of computational brain model. It moves away from the von Neumann computational model that has dominated the development of computers since the 1940s, toward a design that mimics known characteristics of the human brain’s processing of vast amounts of data. In contrast to traditional information systems aimed at exact results or approximate facts, the new technology strives for knowledge extraction.
The idea of the computational brain model was first presented in the paper “On Clusterization of "Big Data" Streams”, published by ACM. The paper has already been recognized and has raised substantial interest, and it is a foundation for many future developments.
The objective of this research is to bring forward a new, simple, and efficacious tool for some of the most demanding operations of “Big Data” methodology: storing, searching, clustering, and predicting diverse information items in a data stream mode, and retrieving them quickly.
Current Research
The following research areas cover different aspects of this work and show how it can be applied to the design of the brain model to cope with Big Data problems.
Combined, these techniques substantially improve the speed of Big Data systems while still achieving adequate levels of accuracy. Specifically, the approximate search can achieve “speed-up” ratios of 500x to 6000x over a baseline naïve search technique. The fields that would benefit most from this innovation include “Big Data” processing, machine learning, database management, artificial intelligence, and web applications. All algorithms were implemented in both C/C++ and Python.
Intelligent Software-Defined Storage (SDS)
Intelligent SDS refers to storage infrastructure that is managed and automated by intelligent software rather than by the storage hardware itself. It enables access to very large collections of diverse files and improves the speed and efficiency of storage for various kinds of data.
Intelligent Scalable Clustering
This new approach can find models, discover useful knowledge and insights, and uncover hidden features or characteristics that naturally divide the cases. We provide a fast and noise-robust pattern prediction and classification algorithm that uses characteristics derived from our novel intelligent scalable clustering scheme. The average complexity of clustering each input pattern is O(1).
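The text does not describe how the clustering scheme reaches constant average cost per pattern. As a rough illustration only, the sketch below uses grid hashing, an assumed stand-in technique and not the scheme referred to above: each pattern is quantized to a grid cell, and the cell key serves as the cluster identifier, so each pattern costs one hash-table operation on average. The names `cell_key`, `cluster_stream`, and `cell_size` are hypothetical.

```python
from collections import defaultdict


def cell_key(point, cell_size):
    """Quantize a numeric point to the integer coordinates of its grid cell."""
    return tuple(int(x // cell_size) for x in point)


def cluster_stream(points, cell_size=1.0):
    """Group points by grid cell in a single pass.

    Each point needs one key computation and one dictionary update, so the
    average cost per point does not grow with the number of points seen.
    """
    clusters = defaultdict(list)
    for p in points:
        clusters[cell_key(p, cell_size)].append(p)
    return clusters


if __name__ == "__main__":
    pts = [(0.1, 0.2), (0.4, 0.9), (5.2, 5.1)]
    for key, members in cluster_stream(pts).items():
        print(key, members)
```

The cost per pattern still depends on the number of dimensions; it is constant only with respect to the number of patterns already processed.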
Cyber-Physical Stream Processing
This is a novel approach that works with very high probability and is efficient in its space requirements. Cyber-Physical Stream Processing (CPS) can extract the most frequent items even when their appearance frequency is as low as 2%. The algorithm processes data streams using single-pass (on-the-fly) algorithms to provide up-to-the-moment analysis and statistics on currently arriving streams.
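To illustrate what single-pass extraction of frequent items can look like, the sketch below uses the classic Misra-Gries summary. This is a standard textbook technique chosen for illustration, not the CPS algorithm itself, which the text does not specify. The function name `misra_gries` and the `threshold` parameter are assumptions.

```python
import math


def misra_gries(stream, threshold=0.02):
    """Single-pass summary of candidate frequent items.

    Keeps at most k - 1 counters, where k = ceil(1 / threshold). Any item
    whose count exceeds len(stream) / k is guaranteed to remain in the
    returned dictionary. The stored counts are underestimates, and some
    infrequent items may also remain as false positives.
    """
    k = math.ceil(1 / threshold)
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # Hypothetical stream: "a" is 10% of the items, the rest are unique.
    stream = ["a" if i % 10 == 0 else f"x{i}" for i in range(1000)]
    print("a" in misra_gries(stream, threshold=0.02))
```

With a 2% threshold the summary holds at most 49 counters regardless of stream length. A second pass, or exact counting of the surviving candidates, would be needed to remove false positives.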
A New Multi-Core Pipelined Architecture for Executing Sequential Programs for Parallel Computing
This promising architecture is designed to process Big Data streams efficiently on-the-fly while also executing sequential programs on a parallel-pipelined model. The new architecture offers several advantages over conventional models: it reduces the complexity, data dependency, latency, and cost overhead of parallel computing.
Multi-core Pipelining with FPGAs
Experiments with the pipeline prototype have further confirmed a noticeable improvement from pipelining on the reconfigurable FPGA design.
Extracting Parallelism in Mobile GPUs/CPUs with Combinatorial Architectures
Advantages
- Intelligent systems fed by “Big Data” streams
- Artificial intelligence systems applied to diagnosis, decision-making, and gaming
- Produces huge cost savings by reducing storage and processing costs for very large (more than terabytes) “Big Data” systems
- Provides highly efficient data access that surpasses current approaches limited by regular computational models
- Improves response time with “speed-up” ratios of 100x-6000x
- High accuracy and fast speed
- Linear algorithms and approaches for on-the-fly streaming processing
- Both structured data and unstructured data can be processed and analyzed efficiently
- High dimensional data can be processed
Patents
- Intelligent Scalable Clustering (being filed by COMStar Tech in 2016)
- Multi-core Pipelining (being filed by COMStar Tech in 2016)
- Intelligent Software Defined Storage (filed by GWU in 2014)
- Cyber-Physical Stream Processing (filed by GWU in 2014)
- Multi-Layer Multi-Processor Information Conveyor with Periodic Transferring of Processor's States for On-The-Fly Transformation of Continuous Information Flows and Operating Method Therefor, US Patent No. 6145071, owned by The George Washington University.
Applications
These inventions have numerous applications in our Big Data intelligent system. We have developed several demo systems and verified them using real-world data for the important applications described below. Other applications that could benefit from these systems are mentioned below as well.
Fuzzy Searching, Text Mining, Voice Search
Searching through a large volume of data is critical for companies, scientists, and search engine applications, and it is constrained by time complexity and memory complexity. We have developed a new approach that uses fuzzy techniques to search through big data. The method has linear time complexity for generating the dictionary and constant time complexity for accessing and updating an item, so updating with a new data set takes time linear in the number of new data points. The demo system is based on searching English letters, and it can be used for any other language as well. The searching speed is very fast. Potentially, this technique can be used for speech recognition, image searching, etc.
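As a rough illustration of a fuzzy dictionary with a linear-time build and hash-based lookups, the sketch below indexes every word under itself and under each of its one-character deletions. This deletion-neighborhood index is an assumed stand-in, not the method used in the demo system, and the names `deletions`, `build_index`, `add_word`, and `fuzzy_lookup` are hypothetical.

```python
from collections import defaultdict


def deletions(word):
    """Return the word itself plus every string with one character removed."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def add_word(index, word):
    """Insert one word; the cost depends on its length, not on index size."""
    for key in deletions(word):
        index[key].add(word)


def build_index(words):
    """Build the index in one pass over the words (linear in their number)."""
    index = defaultdict(set)
    for w in words:
        add_word(index, w)
    return index


def fuzzy_lookup(index, query):
    """Return stored words that share a deletion variant with the query.

    This covers every stored word within one insertion, deletion, or
    substitution of the query; a few candidates two edits away can also
    appear and may be filtered with an exact edit-distance check.
    """
    matches = set()
    for key in deletions(query):
        matches |= index.get(key, set())
    return matches


if __name__ == "__main__":
    idx = build_index(["search", "stream", "cluster"])
    print(fuzzy_lookup(idx, "serch"))
```

Each lookup performs a number of hash probes proportional to the query length and independent of how many words are stored, and new words can be added with `add_word` without rebuilding the index.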
Biomedical applications and searching for gene disorders in genome databases
We have developed a demo system for such biomedical applications. The demo shows how a new disease could be recognized in synthetic data. The system receives a patient record based on answers to tens of yes-or-no questions and then assigns the disorder that the patient might have, with a probability (chance) of occurrence. The assignment is computed by collecting the majority vote in each cluster. Our system has the following advantages:
• Accumulates several million patient records
• Presents biomarker data in a specific form that is clustered automatically
• For a new biomarker case entering the system, its diagnostic chances are assigned immediately
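The majority-vote assignment described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the demo system's implementation: records are tuples of yes/no answers encoded as 1/0, each cluster is represented by a centroid and the known disorder labels of its members, and the nearest cluster is chosen by Hamming distance. The function names and all data are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions where two equal-length answer vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def diagnose(record, clusters):
    """Assign a disorder by majority vote inside the nearest cluster.

    clusters is a list of (centroid, labels) pairs. Returns the winning
    disorder and its share of the votes in that cluster.
    """
    _, labels = min(clusters, key=lambda c: hamming(record, c[0]))
    disorder, votes = Counter(labels).most_common(1)[0]
    return disorder, votes / len(labels)


if __name__ == "__main__":
    # Hypothetical clusters over four yes/no questions.
    clusters = [
        ((1, 1, 0, 0), ["A", "A", "B"]),
        ((0, 0, 1, 1), ["C", "C", "C", "D"]),
    ]
    print(diagnose((1, 0, 0, 0), clusters))
```

The returned share of votes plays the role of the probability mentioned above; a real system would also need a rule for ties and for records that are far from every cluster.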
Disease Diagnosis
We have developed a diagnosis system using our new approach. Experimental results on real-world disease data, the Breast Cancer Wisconsin (Diagnostic) Dataset and the SPECT Heart Diagnosis Dataset, verify that our new technology increases the quality of clustering and prediction and reduces the overall computational cost. Compared with conventional machine learning algorithms such as K-NN, Naive Bayes, and other statistical models, our new intelligent algorithm outperforms them, reaching very high accuracy (around 90%) at very fast speed.
Protein Prediction
Progress in bioinformatics has resulted in the accumulation of information about genomic sequences and protein structures, and this information is expected to be applied to drug discovery. We are working toward this objective and aim to build a technology that predicts changes in the activity of proteins (gain or loss of function) based on information about genomic variation, in particular mutations involving amino acid changes. Our new approach has been verified to have the ability to:
• precisely predict changes in protein activity (gain or loss of function) based on information about genomic variations in cancer cases
• experimentally verify the predicted activity changes
• make highly versatile and accurate predictions not restricted to single-family genes
• improve the technology based on the results
Fraud Detection
Fighting fraud continues to rank high among strategic business drivers in retail banking. Banking institutions commonly rely on after-the-fact manual reports and threshold rules, and they try to derive knowledge from structured data. However, most of the knowledge must be obtained from unstructured data. Furthermore, fraud needs to be detected in real time: the transaction should be evaluated, and an authorization or decline decision should take place, prior to funds movement. Analytics is the only way to detect fraudulent patterns of behavior efficiently, but most of the related data that need to be analyzed are unstructured, such as customer account information, credit score, behavior, transaction type, payment, history records, geographical locations, and comparisons. Both good and bad behavior change unpredictably over time, so detection cannot rely on simply setting thresholds. We have recently developed a novel approach based on our new big data technology to handle bank fraud detection, identity protection, etc., in collaboration with some banks in the U.S.
Computer security
Security is becoming an increasingly important concern as applications become more frequently accessible over networks and, as a result, vulnerable to a wide variety of threats. This system can be used as an Intrusion Prevention System (IPS) or Intrusion Detection System (IDS) for both knowledge-based and behavior-based cases. The 23-Questions will be used to examine the behavior of a given user, program, or piece of malware, and the system issues an alarm if that entity might cause harm. It is very useful for tracking unauthorized intruders in many areas, such as bank fraud detection.
New-Generation Multi-core Architectures
The pipelined and combinatorial architectures we have proposed could be used for new-generation unconventional multi-core architectures, heterogeneous computing, and streaming computing.