Adobe PDF Extract API
Adobe PDF Extract API is a machine-learning based service that extracts content from PDF documents.
The current implementation of a loader using Adobe PDF Extract API can either incorporate content as one document in JSON format, split it into chunks (optimized for RAG), or extract only figures and tables.
Figures are represented as placeholders inside the generated chunks, with the base64-encoded image contained in the metadata. To use the chunks with an LLM that supports vision, you can split the message at each placeholder and insert the image as a separate message.
Extracted tables are formatted in markdown, and figures are extracted as base64 encoded images.
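The placeholder splitting described above could be sketched roughly as follows. The placeholder pattern `<figure_N>`, the `images` mapping, and the PNG MIME type are assumptions made for illustration only; inspect the chunks and metadata your parser actually produces and adapt the pattern accordingly. The output uses the OpenAI-style content-part dictionaries that many vision-capable chat models accept.

```python
import re

# Hypothetical placeholder format; check what your chunks really contain.
PLACEHOLDER = re.compile(r"<figure_(\d+)>")


def build_multimodal_content(text, images):
    """Split text at figure placeholders and interleave the images.

    `images` maps a placeholder id (as a string) to a base64-encoded image.
    Returns a list of content parts: text parts and image_url parts.
    """
    parts = []
    position = 0
    for match in PLACEHOLDER.finditer(text):
        before = text[position:match.start()]
        if before.strip():
            parts.append({"type": "text", "text": before})
        image_b64 = images.get(match.group(1))
        if image_b64 is not None:
            # Assumes PNG data; change the MIME type if your images differ.
            url = f"data:image/png;base64,{image_b64}"
            parts.append({"type": "image_url", "image_url": {"url": url}})
        position = match.end()
    tail = text[position:]
    if tail.strip():
        parts.append({"type": "text", "text": tail})
    return parts
```

The resulting list can be used as the content of a single multimodal message, or each part can be sent as its own message, depending on what the target model expects.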
Setup
Adobe PDF Extract API credentials: follow this document to get them if you don't have any. You will pass <client_id> and <client_secret> as parameters to the parser used by the loader.
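Rather than hard-coding the credentials, they can be read from the environment. The sketch below is one possible approach; the variable names `ADOBE_CLIENT_ID` and `ADOBE_CLIENT_SECRET` and the helper `read_adobe_credentials` are hypothetical and not defined by Adobe or LangChain.

```python
import os


def read_adobe_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names used here are hypothetical; pick any names you like.
    """
    env = os.environ if env is None else env
    names = ("ADOBE_CLIENT_ID", "ADOBE_CLIENT_SECRET")
    values = {name: env.get(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values["ADOBE_CLIENT_ID"], values["ADOBE_CLIENT_SECRET"]
```

The returned pair can then be supplied as `client_id` and `client_secret` in the examples that follow.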
%pip install --upgrade --quiet langchain langchain-community adobe-pdf-extraction
"chunk" mode
The first example uses a local file which will be sent to Adobe PDF Extract API.
With the credentials in hand, we can create an AdobePDFExtractParser and pass it to a GenericLoader:
from langchain_community.document_loaders import GenericLoader
from langchain_community.document_loaders.parsers import AdobePDFExtractParser
file_path = "<filepath>"
client_id = "<client_id>"
client_secret = "<client_secret>"
parser = AdobePDFExtractParser(
    client_id=client_id,
    client_secret=client_secret,
    mode="chunk",
)
loader = GenericLoader.from_filesystem(file_path, parser=parser)
documents = loader.load()
The output contains Documents with the extracted chunks.
documents
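For long PDFs the raw list can be hard to read. A small helper like the hypothetical `summarize_documents` below prints one line per chunk. It relies only on the `page_content` and `metadata` attributes that LangChain Documents expose; the metadata keys themselves depend on the parser and are not assumed here.

```python
def summarize_documents(documents, width=60):
    """Return one summary line per document: a text preview and metadata keys."""
    lines = []
    for index, document in enumerate(documents):
        # Collapse whitespace so multi-line chunks fit on one line.
        text = " ".join(document.page_content.split())
        preview = text if len(text) <= width else text[:width] + "..."
        keys = sorted(document.metadata)
        lines.append(f"[{index}] {preview} | metadata keys: {keys}")
    return lines
```

For example, `print("\n".join(summarize_documents(documents)))` gives a compact overview of the loaded chunks.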
"json" modeโ
The extraction result can also be returned in raw JSON format.
from langchain_community.document_loaders import GenericLoader
from langchain_community.document_loaders.parsers import AdobePDFExtractParser
file_path = "<filepath>"
client_id = "<client_id>"
client_secret = "<client_secret>"
parser = AdobePDFExtractParser(
    client_id=client_id,
    client_secret=client_secret,
    mode="json",
)
loader = GenericLoader.from_filesystem(file_path, parser=parser)
documents = loader.load()
documents
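To work with the result programmatically, the JSON can be parsed into Python objects. The sketch below assumes that each document's `page_content` holds the JSON as a string, which the text above suggests but does not state explicitly; `parse_json_document` is a hypothetical helper, not part of the parser's API.

```python
import json


def parse_json_document(document):
    """Parse page_content as JSON. Assumes page_content is a JSON string."""
    try:
        return json.loads(document.page_content)
    except json.JSONDecodeError as error:
        raise ValueError(f"page_content is not valid JSON: {error}") from error
```

A call such as `parse_json_document(documents[0])` would then return the parsed structure, and listing its top-level keys is a quick way to explore what the service returned.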
"data" modeโ
To extract only figures and tables from the PDF, set the mode to data.
from langchain_community.document_loaders import GenericLoader
from langchain_community.document_loaders.parsers import AdobePDFExtractParser
file_path = "<filepath>"
client_id = "<client_id>"
client_secret = "<client_secret>"
parser = AdobePDFExtractParser(
    client_id=client_id,
    client_secret=client_secret,
    mode="data",
)
loader = GenericLoader.from_filesystem(file_path, parser=parser)
documents = loader.load()
The resulting output will be LangChain Documents containing the extracted figures and tables.
for document in documents:
    if document.metadata["content_type"] == "markdown":
        print(f"Table Content: {document.page_content}")
    elif document.metadata["content_type"] == "base64":
        print(f"Figure Content: {document.page_content}")
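Printing base64 strings is rarely useful, so the figures can instead be decoded and written to disk. The following is a minimal sketch that uses the `content_type` values shown above. The helper name `save_extracted_items`, the file naming scheme, and the `.png` suffix are assumptions; the actual image format depends on what the service returns.

```python
import base64
from pathlib import Path


def save_extracted_items(documents, output_dir, figure_suffix=".png"):
    """Write tables as .md files and figures as decoded binary files.

    The figure suffix is an assumption; adjust it to the real image format.
    Returns the list of paths that were written.
    """
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written = []
    for index, document in enumerate(documents):
        content_type = document.metadata.get("content_type")
        if content_type == "markdown":
            path = output / f"table_{index}.md"
            path.write_text(document.page_content, encoding="utf-8")
        elif content_type == "base64":
            path = output / f"figure_{index}{figure_suffix}"
            path.write_bytes(base64.b64decode(document.page_content))
        else:
            continue  # skip anything that is neither a table nor a figure
        written.append(path)
    return written
```

Calling `save_extracted_items(documents, "extracted")` would place one file per table or figure in an `extracted` directory.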
Related
- Document loader conceptual guide
- Document loader how-to guides