Tutorial #8 - Generative AI, Large Language Model, Deep Learning and Transformer

About the tutorial
This tutorial aims to give readers a good understanding of what generative AI, large language models (LLMs), deep learning and Transformers are, how they are related, and how they work individually and together.

This tutorial is for learners at the level of University Year 3.


Generative AI

Generative AI, sometimes called gen AI, is artificial intelligence (AI) that can create original content—such as text, images, video, audio or software code—in response to a user’s prompt or request.

Generative AI relies on sophisticated machine learning models called deep learning models—algorithms that simulate the learning and decision-making processes of the human brain. These models work by identifying and encoding the patterns and relationships in huge amounts of data, and then using that information to understand users' natural language requests or questions and respond with relevant new content.

Good examples of generative AI include ChatGPT, GPT-3 and GPT-4 from OpenAI.

How Generative AI works

To build a generative AI system, the following three steps need to be taken:

Training

The training step creates a foundation model—a deep learning model that can serve as the basis for multiple different types of generative AI applications. The most common foundation models today are large language models (LLMs), created for text generation applications, but there are also foundation models for image generation, video generation, and sound and music generation—as well as multimodal foundation models that can support several kinds of content generation.

To create a foundation model, practitioners train a deep learning algorithm on huge volumes of raw, unstructured, unlabeled data—e.g., terabytes of data culled from the internet or some other huge data source. During training, the algorithm performs and evaluates millions of ‘fill in the blank’ exercises, trying to predict the next element in a sequence—e.g., the next word in a sentence, the next element in an image, the next command in a line of code—and continually adjusting itself to minimize the difference between its predictions and the actual data (or ‘correct’ result).

The result of this training is a neural network of parameters—encoded representations of the entities, patterns and relationships in the data—that can generate content autonomously in response to inputs, or prompts.
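To make the objective concrete, here is a minimal sketch of next-element ("fill in the blank") training in PyTorch. The TinyLanguageModel class, the random token batch and all sizes are illustrative stand-ins, not a real foundation model:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, seq_len = 1000, 64, 32

class TinyLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)   # token ids -> vectors
        x, _ = self.lstm(x)      # contextualize the sequence
        return self.head(x)      # scores for every possible next token

model = TinyLanguageModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, seq_len))  # stand-in training batch
inputs, targets = tokens[:, :-1], tokens[:, 1:]      # predict the NEXT element

logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()    # adjust parameters to shrink the prediction error
optimizer.step()
```

A real foundation model repeats this step over terabytes of data, but the objective—minimize the gap between predicted and actual next elements—is the same.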

This training process is compute-intensive, time-consuming and expensive: it requires thousands of clustered graphics processing units (GPUs) and weeks of processing, all of which can cost millions of dollars. Open-source foundation model projects, such as Meta's Llama 2, enable gen AI developers to avoid this step and its costs.

Tuning

The tuning step tailors the foundation model to a specific gen AI application. Metaphorically speaking, a foundation model is a generalist: It knows a lot about a lot of types of content, but often can't generate specific types of output with the desired accuracy or fidelity. For that, the model must be tuned to a specific content generation task. This can be done in a variety of ways; the following two are the most commonly used:

Fine-tuning

Fine-tuning involves feeding the model labeled data specific to the content generation application—questions or prompts the application is likely to receive, and corresponding correct answers in the desired format. For example, if a development team is trying to create a customer service chatbot, it would create hundreds or thousands of documents containing labeled customer service questions and correct answers, and then feed those documents to the model.
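For illustration, here is a minimal sketch of what such labeled data might look like, assuming a hypothetical customer-service chatbot; the file name and field names are illustrative, since the exact format depends on the tooling used:

```python
import json

# Hypothetical labeled examples: a prompt the chatbot might receive,
# paired with the correct answer in the desired format.
examples = [
    {"prompt": "How do I reset my password?",
     "completion": "Go to Settings > Account > Reset password and follow the emailed link."},
    {"prompt": "What is your refund policy?",
     "completion": "Purchases can be refunded within 30 days with proof of payment."},
]

# One JSON object per line is a common format for fine-tuning datasets.
with open("customer_service_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```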

Fine-tuning is labor-intensive. Developers often outsource the task to companies with large data-labeling workforces.

Reinforcement learning with human feedback (RLHF)

In RLHF, human users respond to generated content with evaluations the model can use to update itself for greater accuracy or relevance. Often, RLHF involves people 'scoring' different outputs in response to the same prompt. But it can be as simple as having people type or talk back to a chatbot or virtual assistant, correcting its output.
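The 'scoring' idea can be sketched as follows. In a common RLHF setup, a reward model is trained so that outputs humans preferred score higher than the outputs they rejected; the score tensors below are stand-ins for real reward-model outputs:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for a batch of 4 prompts: one score for
# the output humans preferred ("chosen"), one for the output they rejected.
chosen_scores = torch.tensor([2.1, 0.7, 1.5, 0.2], requires_grad=True)
rejected_scores = torch.tensor([1.0, 0.9, -0.3, 0.1])

# Pairwise preference loss: low when the human-preferred output scores higher.
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
loss.backward()   # gradients would update the reward model's parameters
print(loss.item())
```

The trained reward model then guides further updates of the generative model so its outputs earn higher human-preference scores.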

Generation, evaluation and retuning

Developers and users continually assess the outputs of their generative AI apps, and further tune the model—even as often as once a week—for greater accuracy or relevance. (In contrast, the foundation model itself is updated much less frequently, perhaps every year or 18 months.)

Another option for improving a gen AI app's performance is retrieval augmented generation (RAG). RAG is a framework for extending the foundation model to use relevant sources outside of the training data, to supplement and refine the parameters or representations in the original model. RAG can ensure that a generative AI app always has access to the most current information. As a bonus, the additional sources accessed via RAG are transparent to users in a way that the knowledge in the original foundation model is not.
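Here is a minimal sketch of the RAG idea, assuming a stand-in embed() function (in practice an embedding model would be used) and a small in-memory document store:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding: hashes characters into a unit vector. A real RAG
    # system would call an embedding model here.
    vec = np.zeros(64)
    for i, ch in enumerate(text.encode()):
        vec[i % 64] += ch
    return vec / (np.linalg.norm(vec) + 1e-9)

# Hypothetical sources outside the model's training data.
documents = [
    "Our support line is open 9am-5pm on weekdays.",
    "Premium plans include priority support.",
    "Password resets are handled via the account settings page.",
]
doc_vectors = np.stack([embed(d) for d in documents])

query = "When can I call support?"
scores = doc_vectors @ embed(query)          # similarity of each document
best_doc = documents[int(scores.argmax())]   # most relevant outside source

# The retrieved text is prepended to the prompt, so the model can draw on
# current information and the user can see exactly which source was used.
augmented_prompt = f"Context: {best_doc}\n\nQuestion: {query}\nAnswer:"
print(augmented_prompt)
```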

Gen AI foundation models

Truly generative AI models—deep learning models that can autonomously create content on demand—have evolved over the last dozen years or so. The milestone model architectures during that period include:

Variational autoencoders (VAEs), which drove breakthroughs in image recognition, natural language processing and anomaly detection.

Generative adversarial networks (GANs) and diffusion models, which improved the accuracy of previous applications and enabled some of the first AI solutions for photo-realistic image generation.

Transformers, the deep learning model architecture behind the foremost foundation models and generative AI solutions today.

Variational autoencoders (VAEs)

An autoencoder is a deep learning model comprising two connected neural networks: One that encodes (or compresses) a huge amount of unstructured, unlabeled training data into parameters, and another that decodes those parameters to reconstruct the content. Technically, autoencoders can generate new content, but they’re more useful for compressing data for storage or transfer, and decompressing it for use, than they are for high-quality content generation.
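A minimal PyTorch sketch of the autoencoder idea; the layer sizes and random input batch are illustrative:

```python
import torch
import torch.nn as nn

# Two connected networks: one compresses, the other reconstructs.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))

x = torch.rand(32, 784)            # stand-in for a batch of 28x28 images
code = encoder(x)                  # compressed representation
reconstruction = decoder(code)     # attempt to rebuild the original input
loss = nn.functional.mse_loss(reconstruction, x)  # reconstruction error
loss.backward()                    # training minimizes this error
```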

Introduced in 2013, variational autoencoders (VAEs) can encode data like an autoencoder, but decode multiple new variations of the content. By training a VAE to generate variations toward a particular goal, it can ‘zero in’ on more accurate, higher-fidelity content over time. Early VAE applications included anomaly detection (e.g., medical image analysis) and natural language generation.
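The key change in a VAE can be sketched as follows: the encoder outputs a distribution (a mean and a variance) instead of a single code, and each sample drawn from that distribution decodes to a new variation. Sizes and data below are illustrative:

```python
import torch
import torch.nn as nn

latent_dim = 16
encoder = nn.Linear(784, 2 * latent_dim)  # predicts mean and log-variance
decoder = nn.Linear(latent_dim, 784)

x = torch.rand(32, 784)                   # stand-in input batch
mu, log_var = encoder(x).chunk(2, dim=-1)

# Reparameterization trick: sample a latent code while keeping gradients.
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
variation = decoder(z)   # each fresh sample of z decodes to a new variation
```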

Generative adversarial networks (GANs)

GANs, introduced in 2014, also comprise two neural networks: A generator, which generates new content, and a discriminator, which evaluates the accuracy and quality of the generated data. These adversarial algorithms encourage the model to generate increasingly high-quality outputs.
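A minimal sketch of one adversarial training step; the layer sizes and random data are illustrative stand-ins:

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))
discriminator = nn.Sequential(nn.Linear(784, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())
bce = nn.BCELoss()

real = torch.rand(32, 784)              # stand-in real data
fake = generator(torch.randn(32, 16))   # generated samples

# The discriminator learns to label real data as 1 and fakes as 0...
d_loss = (bce(discriminator(real), torch.ones(32, 1)) +
          bce(discriminator(fake.detach()), torch.zeros(32, 1)))

# ...while the generator learns to make fakes the discriminator calls real.
g_loss = bce(discriminator(fake), torch.ones(32, 1))
```

Training alternates between the two losses, and the competition pushes the generator toward ever more realistic output.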

GANs are commonly used for image and video generation, but can generate high-quality, realistic content across various domains. They've proven particularly successful at tasks such as style transfer (altering the style of an image from, say, a photo to a pencil sketch) and data augmentation (creating new, synthetic data to increase the size and diversity of a training data set).

Diffusion models

Introduced in 2015, diffusion models work by first adding noise to the training data until it's random and unrecognizable, and then training the algorithm to iteratively remove that noise to reveal a desired output.
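A minimal sketch of the diffusion training objective, using a single simplified noising step; the denoiser below is a stand-in for the much larger networks used in practice:

```python
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.rand(32, 784)          # stand-in training images
noise = torch.randn_like(x)
noise_level = 0.7                # one simplified step of the noising schedule
noisy_x = (1 - noise_level) * x + noise_level * noise

# The network learns to predict the noise that was added, so it can later
# remove noise step by step and reveal a clean output.
predicted_noise = denoiser(noisy_x)
loss = nn.functional.mse_loss(predicted_noise, noise)
loss.backward()
```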

Diffusion models take more time to train than VAEs or GANs, but ultimately offer finer-grained control over output, particularly for high-quality image generation. DALL-E, OpenAI's image-generation tool, is driven by a diffusion model.

Transformers

First documented in the 2017 paper "Attention Is All You Need" by Ashish Vaswani and others, transformers evolve the encoder-decoder paradigm to enable a big step forward in the way foundation models are trained, and in the quality and range of content they can produce. These models are at the core of most of today's headline-making generative AI tools, including ChatGPT and GPT-4, Copilot, BERT, Bard, and Midjourney, to name a few.

Transformers use a concept called attention—determining and focusing on what's most important about data within a sequence—to do three things (a minimal sketch of the attention computation follows the list):

  • process entire sequences of data—e.g., sentences instead of individual words—simultaneously;

  • capture the context of the data within the sequence;

  • encode the training data into embeddings (vector representations) that represent the data and its context.
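Here is the promised minimal sketch of scaled dot-product attention, the core computation behind the mechanism; the query, key and value matrices are random stand-ins for learned projections of a token sequence:

```python
import math
import torch

seq_len, dim = 8, 64
q = torch.rand(seq_len, dim)   # queries: what each token is looking for
k = torch.rand(seq_len, dim)   # keys: what each token offers
v = torch.rand(seq_len, dim)   # values: the content to be mixed together

scores = q @ k.T / math.sqrt(dim)        # relevance of token j to token i
weights = torch.softmax(scores, dim=-1)  # "focus": each row sums to 1
attended = weights @ v                   # context-aware representation
```

Because every token attends to every other token at once, the whole sequence is processed simultaneously rather than word by word.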

In addition to enabling faster training, transformers excel at natural language processing (NLP) and natural language understanding (NLU), and can generate longer sequences of data—e.g., not just answers to questions, but poems, articles or papers—with greater accuracy and higher quality than other deep generative AI models. Transformer models can also be trained or tuned to use tools—e.g., a spreadsheet application, HTML, a drawing program—to output content in a particular format.

What Gen AI can do for us

Generative AI can create many types of content across many different domains. Types of content it can generate include:

Text

Generative models, especially those based on transformers, can generate coherent, contextually relevant text—everything from instructions and documentation to brochures, emails, web site copy, blogs, articles, reports, papers, and even creative writing. They can also perform repetitive or tedious writing tasks (e.g., drafting summaries of documents or meta descriptions of web pages), freeing writers' time for more creative, higher-value work.

Images and video

Image generation tools such as DALL-E, Midjourney and Stable Diffusion can create realistic images or original art, and can perform style transfer, image-to-image translation and other image editing or image enhancement tasks. Emerging gen AI video tools can create animations from text prompts, and can apply special effects to existing video more quickly and cost-effectively than other methods.

Sound, speech and music

Generative models can synthesize natural-sounding speech and audio content for voice-enabled AI chatbots and digital assistants, audiobook narration and other applications. The same technology can generate original music that mimics the structure and sound of professional compositions.

Software code

Gen AI can generate original code, autocomplete code snippets, translate between programming languages and summarize code functionality. It enables developers to quickly prototype, refactor, and debug applications while offering a natural language interface for coding tasks.

Design and art

Generative AI models can generate unique works of art and design, or assist in graphic design. Applications include dynamic generation of environments, characters or avatars, and special effects for virtual simulations and video games.

Simulations and synthetic data

Generative AI models can be trained to generate synthetic data, or synthetic structures based on real or synthetic data. For example, generative AI is applied in drug discovery to generate molecular structures with desired properties, aiding in the design of new pharmaceutical compounds.

Large Language Models (LLMs)

LLMs are a class of foundation models, which are trained on enormous amounts of data to provide the foundational capabilities needed to drive multiple use cases and applications, as well as resolve a multitude of tasks.

LLMs are designed to understand and generate text like a human, in addition to other forms of content, based on the vast amount of data used to train them. They have the ability to infer from context, generate coherent and contextually relevant responses, translate to languages other than English, summarize text, answer questions (general conversation and FAQs) and even assist in creative writing or code generation tasks.

How LLMs work

LLMs operate by leveraging deep learning techniques and vast amounts of textual data. These models are typically based on a transformer architecture, like the generative pre-trained transformer (GPT), which excels at handling sequential data like text input. LLMs consist of multiple layers of neural networks, each with parameters that can be fine-tuned during training, enhanced further by the attention mechanism, which dials in on specific parts of the data.

During the training process, these models learn to predict the next word in a sentence based on the context provided by the preceding words. The model does this by attributing a probability score to words that have been tokenized—broken down into smaller sequences of characters. These tokens are then transformed into embeddings, which are numeric representations of this context.
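A minimal sketch of tokenization and embedding, using a toy whitespace tokenizer; real LLMs use subword tokenizers (such as byte pair encoding) and much larger embedding tables:

```python
import torch
import torch.nn as nn

text = "the cat sat on the mat"

# Toy tokenizer: split on whitespace and assign each word an integer id.
vocab = {word: i for i, word in enumerate(sorted(set(text.split())))}
token_ids = torch.tensor([vocab[w] for w in text.split()])
print(token_ids)     # tensor([4, 0, 3, 2, 4, 1])

# Each token id is mapped to a learned numeric vector (its embedding).
embedding = nn.Embedding(len(vocab), 8)   # 8-dimensional embeddings
vectors = embedding(token_ids)            # one vector per token
print(vectors.shape)                      # torch.Size([6, 8])
```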

To ensure accuracy, this process involves training the LLM on massive corpora of text (in the billions of pages), allowing it to learn grammar, semantics and conceptual relationships through zero-shot and self-supervised learning. Once trained on this data, LLMs can generate text by autonomously predicting the next word based on the input they receive, drawing on the patterns and knowledge they've acquired. The result is coherent and contextually relevant language generation that can be harnessed for a wide range of NLU and content generation tasks.

Model performance can also be increased through prompt engineering, prompt-tuning, fine-tuning and other tactics like reinforcement learning with human feedback (RLHF) to remove the biases, hateful speech and factually incorrect answers known as “hallucinations” that are often unwanted byproducts of training on so much unstructured data. This is one of the most important aspects of ensuring enterprise-grade LLMs are ready for use and do not expose organizations to unwanted liability, or cause damage to their reputation.

What LLMs can do for us

LLMs are redefining an increasing number of business processes and have proven their versatility across a myriad of use cases and tasks in various industries. They augment conversational AI in chatbots and virtual assistants (like IBM watsonx Assistant and Google's Bard) to enhance the interactions that underpin excellence in customer care, providing context-aware responses that mimic interactions with human agents.

LLMs also excel in content generation, automating content creation for blog articles, marketing or sales materials and other writing tasks. In research and academia, they aid in summarizing and extracting information from vast datasets, accelerating knowledge discovery. LLMs also play a vital role in language translation, breaking down language barriers by providing accurate and contextually relevant translations. They can even be used to write code, or “translate” between programming languages.

Moreover, they contribute to accessibility by assisting individuals with disabilities, including text-to-speech applications and generating content in accessible formats. From healthcare to finance, LLMs are transforming industries by streamlining processes, improving customer experiences and enabling more efficient and data-driven decision making.

Most excitingly, all of these capabilities are easy to access, in some cases literally an API integration away.

Here is a list of some of the most important areas where LLMs benefit organizations:

Text generation: language generation abilities, such as writing emails, blog posts or other mid-to-long form content in response to prompts that can be refined and polished. An excellent example is retrieval-augmented generation (RAG).

Content summarization: summarize long articles, news stories, research reports, corporate documentation and even customer history into thorough texts tailored in length to the output format.

AI assistants: chatbots that answer customer queries, perform backend tasks and provide detailed information in natural language as a part of an integrated, self-serve customer care solution.

Code generation: assists developers in building applications, finding errors in code and uncovering security issues in multiple programming languages, even “translating” between them.

Sentiment analysis: analyze text to determine the customer's tone in order to understand customer feedback at scale and aid in brand reputation management.

Language translation: provides wider coverage to organizations across languages and geographies with fluent translations and multilingual capabilities.

LLMs stand to impact every industry, from finance to insurance, human resources to healthcare and beyond, by automating customer self-service, accelerating response times on an increasing number of tasks as well as providing greater accuracy, enhanced routing and intelligent context gathering.

Deep learning

Deep learning is a machine learning technology that uses deep neural networks (DNNs). A deep neural network is a neural network with three or more layers, trained on large amounts of data. The deep learning models behind GPT-3, GPT-4 and others in the same rank usually consist of billions of parameters.
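A minimal sketch of a deep neural network in PyTorch, that is, a network with three or more layers; the sizes are illustrative and tiny compared with a GPT-class model:

```python
import torch.nn as nn

deep_net = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),   # input layer -> first hidden layer
    nn.Linear(64, 64), nn.ReLU(),    # second hidden layer
    nn.Linear(64, 10),               # output layer
)

n_params = sum(p.numel() for p in deep_net.parameters())
print(n_params)   # about 11,000 parameters here, vs. billions in an LLM
```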

Artificial Neural Network

An artificial neural network, or neural network for simplicity, is a mathematical model used for machine learning.