AI Data Engineering Project for Beginners

Hello, dears! Today we are going to build an AI data engineering pet project for beginners using LangChain and the Vertex AI PaLM API on BigQuery.

This hands-on tutorial will show you how you can add generative AI features to your data warehouse with just a few lines of code using LangChain and LLMs on Google Cloud.

Together we will build a sample Python application that can understand and respond to human-language queries about the relational data stored in your data warehouse.

  • 🏆 This could be a great feature if you want to show your management how to talk to the data in a natural way and make their lives easier.
  • 🦾 It's basically creating your own AI data assistant.

Github repo: https://github.com/nataindata/ai-data-engineering-project

After completing the steps:

  • You will get hands-on experience with the open-source LangChain framework for developing applications powered by large language models. LangChain also keeps your application vendor-agnostic.
  • You will learn about the powerful features of the Google PaLM models made available through Vertex AI and apply them to your BigQuery dataset.

Dataset:

This notebook uses TheLook data as an example: a fictitious eCommerce clothing site developed by the Looker team.

  • It's public and can be found at bigquery-public-data.thelook_ecommerce.inventory_items
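If you want to peek at the table before wiring anything up, a preview query is a good first step. This is only a minimal sketch: `preview_query` is a hypothetical helper (it is not part of the repo or of any library), and it just builds the SQL string that you could paste into the BigQuery console or pass to a client.

```python
# Fully qualified table name from the tutorial: project.dataset.table
TABLE = "bigquery-public-data.thelook_ecommerce.inventory_items"


def preview_query(table: str, limit: int = 5) -> str:
    """Build a BigQuery Standard SQL query that returns a few rows of `table`.

    Hypothetical helper for illustration only.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Backticks are needed because the project ID contains hyphens.
    return f"SELECT * FROM `{table}` LIMIT {limit}"


sql = preview_query(TABLE)
print(sql)
```

Keeping the LIMIT small is a cheap habit for a first look, although it does not necessarily reduce the amount of data BigQuery scans.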

Before you begin

⚠️ Running this codelab will incur Google Cloud charges. You may also be billed for Vertex AI API usage. FYI, it cost me peanuts to create this tutorial.

[GCP link https://cloud.google.com/free/docs/free-cloud-features]

But you can create a new Cloud project with free trial cloud credits.

So:

  • You need to have an active Google Cloud account to complete this tutorial.
  • Make a copy of the notebook and save it to your Google Drive.
  • The notebook uses the same account as Google Cloud, so it connects to your Google Cloud project; you need nothing other than that project.
  • At the end of the tutorial, you can optionally clean up these resources to avoid further charges.

A short and sweet explanation of LangChain:

It's an open-source framework that allows AI developers to combine LLMs like PaLM with external sources of computation and data.

Large language models (LLMs), such as the ones behind ChatGPT or the ones available through Vertex AI, can answer questions about a lot of topics. But an LLM in isolation knows only what it was trained on. That doesn't include your personal or company data, such as proprietary documents that are not on the internet, or data and articles written after the LLM was trained.

So wouldn't it be useful if you or your colleagues could have a conversation with your data and get answers from it?

LangChain is an open-source developer framework for building LLM applications. LangChain consists of several modular components as well as more end-to-end templates. The modular components in LangChain are:

  • prompts,
  • models,
  • indexes,
  • chains,
  • and agents

Now let’s talk about the components we are going to use here:

  • Prompt Template

An object that helps create prompts based on a combination of user input, other non-static information, and a fixed template string. Think of it as an f-string in Python, but for prompts.

You simply declare the variables, pass them into the PromptTemplate class, and enjoy the output.
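To make the f-string comparison concrete, here is a minimal sketch of the idea in plain Python. `SimplePromptTemplate` is a hypothetical stand-in written for this example, not LangChain's PromptTemplate class; it only shows the mechanics of a fixed template string with named variables filled in at call time.

```python
from string import Formatter


class SimplePromptTemplate:
    """Hypothetical stand-in that mimics the idea of a prompt template."""

    def __init__(self, template: str):
        self.template = template
        # Collect the placeholder names found in the template, e.g. {question}.
        self.input_variables = sorted(
            {name for _, name, _, _ in Formatter().parse(template) if name}
        )

    def format(self, **values: str) -> str:
        # Fail early with a clear message if a variable was not supplied.
        missing = [v for v in self.input_variables if v not in values]
        if missing:
            raise KeyError(f"missing prompt variables: {missing}")
        return self.template.format(**values)


template = SimplePromptTemplate(
    "You are a data assistant. Table: {table}\nQuestion: {question}\nSQL:"
)
prompt = template.format(
    table="inventory_items",
    question="How many items are in stock?",
)
print(prompt)
```

The fixed part of the prompt lives in one place, and only the table name and the question change between calls, which is the point of using a template instead of building strings by hand.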

  • Language Model

A model that does text in ➡️ text out!

  • SQLDatabaseChain

Querying tabular data: the most common type of data in the world sits in tabular form, so it is super powerful to be able to query this data with LangChain and pass it through to an LLM.

SQLDatabaseChain is a built-in chain that allows you to interact with SQL databases. It essentially bridges the gap between natural language and structured data stored in SQL databases.

Function:

  • Allows you to query databases using natural language, similar to asking questions in plain English.
  • Can be used to build chatbots, dashboards, and other applications that interact with SQL data.
  • Supports various SQL dialects through SQLAlchemy, including MySQL, PostgreSQL, and Oracle.
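The flow behind a chain like this can be sketched in a few steps: read the schema, fill a prompt, let the model write SQL, run it, and return the result. The sketch below is a conceptual illustration only, under loud assumptions: it uses an in-memory SQLite table as a stand-in for BigQuery, `fake_llm` is a stub that returns a hard-coded query instead of calling PaLM, and `answer_question` is a hypothetical function, not LangChain's implementation.

```python
import sqlite3

PROMPT = (
    "Given the table schema below, write one SQL query that answers the question.\n"
    "Schema: {schema}\nQuestion: {question}\nSQL:"
)


def fake_llm(prompt: str) -> str:
    """Stub for a text-in, text-out model; a real LLM would generate this SQL."""
    return "SELECT COUNT(*) FROM inventory_items WHERE product_category = 'Jeans'"


def answer_question(conn: sqlite3.Connection, question: str, llm=fake_llm) -> str:
    # 1. Describe the table so the model knows which columns exist.
    schema = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = 'inventory_items'"
    ).fetchone()[0]
    # 2. Fill the prompt template and ask the model for SQL.
    sql = llm(PROMPT.format(schema=schema, question=question))
    # 3. Run the generated SQL against the database.
    rows = conn.execute(sql).fetchall()
    # 4. Report the question, the SQL, and the raw result together.
    return f"Question: {question}\nSQL: {sql}\nResult: {rows}"


# Toy table with invented rows, standing in for the BigQuery dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory_items (id INTEGER, product_category TEXT, cost REAL)"
)
conn.executemany(
    "INSERT INTO inventory_items VALUES (?, ?, ?)",
    [(1, "Jeans", 25.0), (2, "Jeans", 30.0), (3, "Socks", 4.5)],
)
answer = answer_question(conn, "How many jeans do we have in inventory?")
print(answer)
```

Passing the model in as the `llm` argument keeps the sketch vendor-agnostic in the same spirit as LangChain: swapping the stub for a real model call would not change the rest of the flow. A real chain would typically also ask the model to phrase the final answer in natural language, which this sketch skips.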

We've built an AI data engineering pet project for beginners using LangChain and the Vertex AI PaLM API on BigQuery.

Please comment if you liked this video and want me to continue! Your feedback is important because it motivates me to create :) We can build more fun projects and explore GenAI together. Until then, stay curious!