Building a Cryptocurrency RAG System with a Large Language Model
In this tutorial, we will implement a retrieval-augmented generation system for an LLM using Wikipedia's cryptocurrency articles, and the Python notebook code for the tutorial is provided.
Large language models are amazing tools that can help people find answers to questions, summarize large amounts of text, translate documents from one language to another, help us write code, and much more.
However, LLMs have one major problem: hallucinations. A hallucination happens when an LLM outputs a random fact from its training data even though it may have no real connection to the user's prompt. Large language models struggle to say "I don't know" to questions they have no answer for.
Retrieval-augmented generation (RAG) is an AI framework with two main goals: to improve the quality of generated responses by connecting the model to an external knowledge source, and to give users access to the model's sources so that they can fact-check the accuracy of its answers.
With RAG, we can also make sure a large language model has access to proprietary data by connecting it to a custom data source from which it can retrieve information.
The diagram below makes it clear how RAG works. First, the user asks the LLM a question. Before reaching the model, the question goes to a retriever. The retriever is responsible for finding and fetching relevant documents from the knowledge base to answer the question. The question, together with the relevant documents, is then sent to the LLM, which can generate an answer grounded in the sources it received.
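Before reaching for the actual libraries, here is a minimal sketch of that flow in Python. The retrieve and answer_with_rag helpers are hypothetical stand-ins for the retriever and the LLM call, not the implementation we build later in this tutorial:
# A purely illustrative sketch of the RAG flow (hypothetical helpers, not the real implementation)
def retrieve(question, knowledge_base, k=4):
    # Stand-in for a vector similarity search: rank documents by keyword overlap
    words = set(question.lower().split())
    ranked = sorted(knowledge_base, key=lambda doc: -len(words & set(doc.lower().split())))
    return ranked[:k]

def answer_with_rag(question, knowledge_base, llm):
    # The question plus the retrieved documents are sent to the LLM
    context = "\n\n".join(retrieve(question, knowledge_base))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(prompt)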
In this tutorial, we will implement a retrieval-augmented generation system for an LLM using Wikipedia's cryptocurrency articles (a dataset I uploaded to Kaggle a few days ago). The Kaggle notebook for this tutorial is also available.
1. Setting Up the Development Environment
Before we start writing code, let's install a few packages:
- Chromadb: an open-source vector embedding database that lets us connect the LLM to a knowledge base. It allows us to store and query embeddings along with their metadata.
- LangChain: a framework for developing applications powered by LLMs.
- Sentence Transformers: a framework that provides an easy way to compute dense vector representations of sentences, paragraphs, and images using pretrained Transformer models.
- bitsandbytes: a library designed to optimize the training and deployment of large models through 4-bit quantization of model weights, reducing memory footprint and improving memory efficiency.
# Auto DataViz tool
!pip install ydata-profiling
# Chromadb
!pip install chromadb
# LangChain
!pip install langchain
# Sentence Transformers
!pip install sentence_transformers
# bitsandbytes
!pip install bitsandbytes
The initial code is as follows:
# Importing libs
# Data Handling
import pandas as pd
import numpy as np
# Auto EDA
from ydata_profiling import ProfileReport
# Torch and Transformers
import torch
from torch import bfloat16
import transformers
from transformers import AutoTokenizer
# LangChain
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
# Hiding warnings
import warnings
warnings.filterwarnings("ignore")
# Checking if GPU is available
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(torch.cuda.current_device())
    total_memory = torch.cuda.get_device_properties(0).total_memory
    total_memory_gb = total_memory / (1024**3) # Converting memory to Gb
    print("GPU is available. \nUsing GPU")
    print("\nGPU Name:", gpu_name)
    print(f"Total GPU Memory: {total_memory_gb:.2f} GB")
    device = torch.device('cuda')
else:
    print("GPU is not available. \nUsing CPU")
    device = torch.device('cpu')
Running the code above produces the following output:
GPU is available.
Using GPU
GPU Name: Tesla T4
Total GPU Memory: 14.75 GB
2. Inspecting the Data
As I mentioned earlier, we will use the Wikipedia crypto articles dataset as the model's knowledge source.
We will use YData Profiling, an automated visualization tool, to extract some information from the dataset with just a few lines of code.
# Loading dataframe
df = pd.read_csv('/kaggle/input/wikipedia-crypto-articles/Wikipedia Crypto Articles.csv')
# Generating report
report = ProfileReport(df, title = 'Wikipedia Crypto Articles')
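The notebook renders the report inline; assuming the standard ydata-profiling API, you can display or export it like this:
# Displaying the report in the notebook (or exporting it to an HTML file)
report.to_notebook_iframe()
report.to_file('wikipedia_crypto_articles_report.html')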
The dataset consists of two columns: title and article. Nine of the entries have empty articles; we will drop those rows.
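The cleaning step itself is not shown in the original code; a minimal version, assuming the empty articles are stored as missing values in the article column, would be:
# Dropping rows with empty articles
print(f"Dataframe Length: {len(df)} rows")
df = df.dropna(subset = ['article']).reset_index(drop = True)
print(f"Length After Dropping Empty Values: {len(df)} rows")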
Dataframe Length: 227 rows
Length After Dropping Empty Values: 218 rows
Let's take a look at the dataframe's contents. I will print the text in the title and article columns for the last entry.
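The printing code is not included in the post; a straightforward version looks like this:
# Printing the title and article text of the last entry
last_entry = df.iloc[-1]
print("Title:", last_entry['title'])
print("\n" + last_entry['article'])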
Title: NEO (cryptocurrency)
Neo is a blockchain-based cryptocurrency and application platform used to run smart contracts and decentralized applications. The project, originally named Antshares, was founded in 2014 by Da HongFei and Erik Zhang and rebranded as Neo in 2017. In 2017 and 2018, the cryptocurrency maintained some success in the Chinese market despite the recently-enacted prohibition on cryptocurrency in that country.
== Technical specifications ==
The Neo network runs on a proof-of-stake decentralized Byzantine fault tolerant (dBFT) consensus mechanism between a number of centrally approved nodes, and can support up to 10,000 transactions per second. The base asset of the Neo blockchain is the non-divisible Neo token which generates GAS tokens. These GAS tokens, a separate asset on the network, can be used to pay for transaction fees, and are divisible with the smallest unit of 0.00000001. The inflation rate of GAS is controlled with a decaying half-life algorithm that is designed to release 100 million GAS over approximately 22 years.X.509 Digital Identities allow developers to tie tokens to real-world identities which aid in complying with KYC/AML and other regulatory requirements.
== History ==
In 2014, Antshares was founded by Da Hongfei and Erik Zhang. In the following year, it was open-sourced on GitHub and by September 2015, the white paper was released.A total of 100 million Neo were created in the genesis block. 50 million Neo were sold to early investors through an initial coin offering in 2016 that raised US 4.65 million, with the remaining 50 million Neo locked into a smart contract. Each year, a maximum of 15 million Neo tokens are unlocked which are used by the Neo development team to fund long-term development goals.Neo was officially rebranded from Antshares in June 2017, with the idea of combining the past and the future.Neo3 or N3 was first announced by Erik Zhang in 2018 as an upgrade to the previous Neo protocol (now known as Neo Legacy). Certain new features do not have backward compatibility with the Neo Legacy blockchain. N3 was implemented and launched with a new genesis block.In March 2018, Neo's parent company Onchain distributed 1 Ontology token (ONT) for every 5 NEO held in a user's cryptocurrency wallet. These tokens were intended to be used to vote on system upgrades, identity verification mechanisms, and other governance issues on the Neo platform.
== References ==
== External links ==
Official Website
3. Storing the Data
LangChain has a tool called document loaders, which lets us load many types of data from a source as documents. A document contains text and its associated metadata. We will use the DataFrameLoader class to load data from a pandas DataFrame.
# Loading dataframe content into a document
articles = DataFrameLoader(df,
                           page_content_column = "title")
# Loading entire dataframe into document format
document = articles.load()
Before creating embeddings from the document, we have to split it into smaller chunks. We do this for several reasons.
First, embedding models may have a maximum token limit, and splitting the data ensures each chunk stays within it.
Second, smaller chunks are more memory-efficient, which lowers computational cost.
Third, embedding smaller, more coherent pieces of text can yield higher accuracy and more meaningful representations.
We will use LangChain's RecursiveCharacterTextSplitter to split the data.
# Splitting document into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size = 1000,
                                          chunk_overlap = 20)
splitted_texts = splitter.split_documents(document)
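As a quick sanity check (not shown in the original notebook), you can look at how many chunks were produced and peek at one of them:
# Inspecting the result of the split (output omitted)
print("Number of chunks:", len(splitted_texts))
print(splitted_texts[0].page_content[:200])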
We will use HuggingFaceEmbeddings to load a model from SentenceTransformers 🤗. More specifically, we will load the sentence-transformers/all-MiniLM-L6-v2 model, which maps sentences and paragraphs to a 384-dimensional dense vector space.
# Loading model to create the embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
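To confirm the 384-dimensional output mentioned above, you can embed a sample sentence (a quick check that is not part of the original code):
# Embedding a sample sentence to verify the vector dimensionality (expected: 384)
sample_vector = embedding_model.embed_query("Bitcoin is a decentralized digital currency.")
print("Vector dimensions:", len(sample_vector))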
With Chroma, we will create an indexed database of the embedded data. We use the .from_documents() method of the Chroma class, passing in the text chunks splitted_texts and the embedding model embedding_model, and finally specifying the directory where the indexed database will be persisted.
# Creating an indexed database
chroma_database = Chroma.from_documents(splitted_texts,
                                        embedding_model,
                                        persist_directory = 'chroma_db')
You can see below that chroma_database is a vector store:
# Visualizing the database
chroma_database
<langchain_community.vectorstores.chroma.Chroma at 0x7bc54e8b2ec0>
We use the vector store to store the embedded data. When we ask the model something, the chain embeds the query and uses it to retrieve embedded vectors from the vector store based on the "similarity" between the stored vectors and the embedded query. The vector store is only responsible for storing the embedded data and performing vector search so that we can get high-quality answers to our questions.
Below, we define a retriever. A retriever is responsible for fetching documents from the vector store for a given query. It takes a string query as input and returns a list of documents as output.
# Defining a retriever
retriever = chroma_database.as_retriever()
VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7bc54e8b2ec0>)
You can see above that the retriever object is an instance of VectorStoreRetriever and that it is linked to the vector store chroma_database.
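As an illustration (not in the original notebook), the retriever can be queried directly and returns a list of documents:
# Retrieving documents for a sample query (returns a list of Document objects)
sample_docs = retriever.get_relevant_documents("Who created Bitcoin?")
print("Documents retrieved:", len(sample_docs))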
4. Loading the Mistral 7B Model
We can load the large language model using the HuggingFacePipeline class, which gives us access to more than 120,000 open-source models publicly available on huggingface.co. The model we will load is mistralai/Mistral-7B-v0.1, used for the text-generation task.
Mistral 7B is an open-source model with 7.3B parameters that outperforms Meta's Llama 2 13B across benchmarks. As of January 2024, it was one of the strongest open-source models available.
Because this is a large model, we will use the bitsandbytes library to create a quantization_config variable, which loads the model in a 4-bit quantized format and enables double quantization. We also set the compute data type to bfloat16. These settings optimize the model's size and performance and help avoid running out of memory.
# Configuring BitsAndBytesConfig for loading model in an optimal way
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit = True,
                                                      bnb_4bit_quant_type = 'nf4',
                                                      bnb_4bit_use_double_quant = True,
                                                      bnb_4bit_compute_dtype = bfloat16)
We finally load the model into the llm variable. With the model_kwargs dictionary, we define some of the model's behavior. For example, temperature is a parameter ranging from 0.0 to 1.0 that controls how "creative" the model is; lower values make responses more predictable. max_length sets the maximum length of the generated output, and quantization_config applies the quantization configuration defined earlier to optimize the model:
# Loading Mistral 7b model
llm = HuggingFacePipeline.from_model_id(model_id='/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1',
                                        task = 'text-generation',
                                        model_kwargs={'temperature': .3,
                                                      'max_length': 1024,
                                                      'quantization_config': quantization_config},
                                        device_map = "auto")
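As a quick smoke test (not part of the original code), you can call the pipeline directly with a prompt before wiring it into a chain:
# Smoke test: generating text directly from the loaded model (output will vary)
print(llm("What is a blockchain?"))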
In LangChain, we have modules called Chains, which are sequences of calls to an LLM. One of these chains is RetrievalQA, which fetches relevant documents from the indexed database and then passes them to the LLM to generate a response. Let's define a question-answering chain.
# Defining a QnA chain
QnA = RetrievalQA.from_chain_type(llm = llm,
                                  chain_type = 'stuff',
                                  retriever = retriever,
                                  verbose = False)
5. Querying the Model
Below, I define the get_answers function, which takes the QnA chain we created above and a query, the question we want to ask the LLM.
# Defining function to fetch documents according to a query
def get_answers(QnA, query):
    answer = QnA.run(query)
    print(f"\033[1mQuery:\033[0m {query}\n")
    print(f"\033[1mAnswer:\033[0m ", answer)
We can now ask the model questions! Let's start with a few examples.
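The calls that produced the outputs below all follow the same pattern; only the first one is shown here:
# Asking the model a question through the QnA chain
get_answers(QnA, "Who created the Bitcoin? When was it created?")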
Query: Who created the Bitcoin? When was it created?
Answer: Bitcoin was created by an unknown person or group of people using the name Satoshi Nakamoto. It was created in 2009.
Query: What was the biggest scam in the history of cryptocurrencies?
Answer: The biggest scam in the history of cryptocurrencies was the 2021 Squid Game cryptocurrency scam.
Query: How much will one Bitcoin cost in 2030?
Answer: I don't know.
Query: Cite the names of five relevant people in crypto?
Answer: 1. Andreas Antonopolous 2. Brian Armstrong 3. Changpeng Zhao 4. Andreas Antonopolous 5. Andreas Antonopolous
Query: What exchanges can I use to buy crypto?
Answer: Crypto.com is a cryptocurrency exchange that offers an initial exchange offering (IEO) for various cryptocurrencies. It is available in Europe and other parts of the world.
Query: Who conceived Ethereum?
Answer: Vitalik Buterin conceived Ethereum.
When asked who created Bitcoin and when it was created, the model answered correctly. Honestly, I don't know whether the Squid Game scam was the biggest one in crypto history, but it is a scam that took place around 2021, and you can read about it on Wikipedia.
When asked how much one Bitcoin will cost in 2030, the model answered "I don't know", which is good. We don't want the model guessing about information it has no access to.
When asked to name five relevant people in crypto, the model repeated Andreas Antonopoulos's name several times and even misspelled his last name. There is clearly room for improvement here, but the model still correctly named people connected to cryptocurrency.
In the code below, we use a query to retrieve documents from the vector store and display the number of documents retrieved and their sources, i.e. the article titles as they appear on Wikipedia. I will also print the first 350 characters of the article text from each document's metadata so you can read part of it.
# Obtaining the source and documents searched
query = "Who conceived Ethereum?"  # reusing the last query from the examples above
docs = chroma_database.similarity_search(query)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    details = doc.to_json()['kwargs']
    print("\nSource (Article Title):", details['page_content'])
    print("\nText", details['metadata']['article'][:350] + ". . .")
    print('\n\n\n')
Query: Who conceived Ethereum?
Retrieved documents: 4
Source (Article Title): Ethereum
Text Ethereum is a decentralized blockchain with smart contract functionality. Ether (Abbreviation: ETH; sign: Ξ) is the native cryptocurrency of the platform. Among cryptocurrencies, ether is second only to bitcoin in market capitalization. It is open-source software.
Ethereum was conceived in 2013 by programmer Vitalik Buterin. Additional founders of. . .
Source (Article Title): History of bitcoin
Text Bitcoin is a cryptocurrency, a digital asset that uses cryptography to control its creation and management rather than relying on central authorities. Originally designed as a medium of exchange, Bitcoin is now primarily regarded as a store of value. The history of bitcoin started with its invention and implementation by Satoshi Nakamoto, who integ. . .
Source (Article Title): Ethereum Classic
Text Ethereum Classic is a blockchain-based distributed computing platform that offers smart contract (scripting) functionality. It is open source and supports a modified version of Nakamoto consensus via transaction-based state transitions executed on a public Ethereum Virtual Machine (EVM).
Ethereum Classic maintains the original, unaltered history o. . .
Source (Article Title): The Rise and Rise of Bitcoin
Text The Rise and Rise of Bitcoin is a 2014 American documentary film directed by Nicholas Mross. The film interviews multiple companies and people that have played important roles in the history expansion of Bitcoin. It first premiered at the Tribeca Film Festival in New York on April 23, 2014. The film was nominated for the “Best International Documen. . .
# Trying a different query
query = """What exchanges can I use to buy crypto?"""
docs = chroma_database.similarity_search(query)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    details = doc.to_json()['kwargs']
    print("\nSource (Article Title):", details['page_content'])
    print("\nText", details['metadata']['article'][:350] + ". . .")
    print('\n\n\n')
Query: What exchanges can I use to buy crypto?
Retrieved documents: 4
Source (Article Title): Cryptocurrency exchange
Text A cryptocurrency exchange, or a digital currency exchange (DCE), is a business that allows customers to trade cryptocurrencies or digital currencies for other assets, such as conventional fiat money or other digital currencies. Exchanges may accept credit card payments, wire transfers or other forms of payment in exchange for digital currencies or . . .
Source (Article Title): Crypto.com
Text Crypto.com is a cryptocurrency exchange company based in Singapore. As of June 2023, the company reportedly had 80 million customers and 4,000 employees. The exchange issues its own exchange token named Cronos (CRO).
== History ==
The company was initially founded in Hong Kong by Bobby Bao, Gary Or, Kris Marszalek, and Rafael Melo in 2016 as "Mon. . .
Source (Article Title): Initial exchange offering
Text An Initial exchange offering (IEO) is the cryptocurrency exchange equivalent to a stock launch or Initial public offering (IPO). An IEO is the process of digital asset (e.g. coins or tokens) procurement through an established exchange for the purpose of raising capital for start-up companies. Exchanges act as a middleman between investors and the s. . .
Source (Article Title): Cryptocurrencies in Europe
Text The general notion of cryptocurrencies in Europe denotes the processes of legislative regulation, distribution, circulation, and storage of cryptocurrencies in Europe. In April 2023, the EU Parliament passed the Markets in Crypto Act (MiCA) unified legal framework for crypto-assets within the European Union.
== The legality of cryptocurrencies in. . .
6. Conclusion
In this tutorial, we explored retrieval-augmented generation (RAG) as a way to improve LLMs by letting them access external data through an indexed, vectorized database to retrieve information.
We used the Mistral 7B model, one of the most capable open-source models as of January 2024. For testing, we used the Wikipedia Crypto Articles dataset, which I created by collecting cryptocurrency-related articles from Wikipedia.