Building a Custom PDF Parser with PyPDF and LangChain

June 14, 2025

Image by Author | Canva

PDF files are everywhere. You’ve probably seen them in many places: college papers, electricity bills, office contracts, product manuals, and more. They’re very common, but working with them is not as easy as it looks. Say you want to extract useful information from a PDF, such as reading the text, splitting it into sections, or getting a quick summary. This may sound simple, but you’ll see it’s not so smooth once you try.

Unlike Word or HTML files, PDFs don’t store content in a neat, readable way. Instead, they’re designed to look good, not to be read by programs. The text can be all over the place: split into odd blocks, scattered across the page, or mixed up with tables and images. This makes it hard to get clean, structured data from them.

In this article, we’re going to build something that can handle this mess. We’ll create a custom PDF parser that can:

Extract and clean text from PDFs at the page level, with optional layout preservation for better formatting
Handle image metadata extraction
Remove unwanted headers and footers by detecting repeated lines across pages to reduce noise
Retrieve detailed document and page-level metadata, such as author, title, creation date, rotation, and page size
Chunk the content into manageable pieces for further NLP or LLM processing

Let’s get began.

Folder Structure

Before starting, it’s good to organize your project files for clarity and scalability.

custom_pdf_parser/
│
├── parser.py
├── langchain_loader.py
├── pipeline.py
├── example.py
├── requirements.txt     # Dependencies list
└── __init__.py          # (Optional) to mark directory as Python package

You can leave the __init__.py file empty, as its main purpose is simply to indicate that this directory should be treated as a Python package. I’ll explain the purpose of each of the remaining files step by step.
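As a minimal sketch of one way to set up this layout, the script below creates the folder and its empty placeholder files with pathlib. The helper name scaffold_project and the temporary directory used in the demo are assumptions made for illustration; any other way of creating the files works just as well.

```python
import tempfile
from pathlib import Path

# File names taken from the folder structure shown above.
PROJECT_FILES = [
    "parser.py",
    "langchain_loader.py",
    "pipeline.py",
    "example.py",
    "requirements.txt",
    "__init__.py",
]


def scaffold_project(root: Path) -> Path:
    """Create custom_pdf_parser/ under root with empty placeholder files.

    scaffold_project is a hypothetical helper, not part of the article's code.
    """
    project_dir = root / "custom_pdf_parser"
    project_dir.mkdir(parents=True, exist_ok=True)
    for name in PROJECT_FILES:
        # touch() creates the file if it is missing and leaves existing content alone
        (project_dir / name).touch(exist_ok=True)
    return project_dir


# Demo in a throwaway directory so nothing in the working tree is modified.
with tempfile.TemporaryDirectory() as tmp:
    created_dir = scaffold_project(Path(tmp))
    created_names = sorted(p.name for p in created_dir.iterdir())
    print(created_names)
```

To create the real project, call scaffold_project(Path(".")) from the directory where the package should live.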

Tools Required (requirements.txt)

The required libraries are:

PyPDF: A pure Python library to read and write PDF files. It will be used to extract the text from PDF files.
LangChain: A framework to build context-aware applications with language models. It will be used to process and organize the extracted text.

Install them with:

pip install pypdf langchain

If you want to manage dependencies neatly, create a requirements.txt file with:

pypdf
langchain

And run:

pip install -r requirements.txt
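After installing, a quick check that both packages are importable can save some debugging later. The snippet below is a small sketch using only the standard library; check_dependencies is a hypothetical helper name, and the module names pypdf and langchain are the import names matching the packages installed above.

```python
import importlib.util


def check_dependencies(module_names):
    """Return {module_name: bool} telling whether each top-level module can be found."""
    # find_spec returns None when a top-level module is not installed
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


status = check_dependencies(["pypdf", "langchain"])
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

If either line reports missing, rerun the pip command in the same environment that runs your scripts.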

Step 1: Set Up the PDF Parser (parser.py)

The core class CustomPDFParser uses PyPDF to extract text and metadata from each PDF page. It also includes methods to clean text, extract image information (optional), and remove repeated headers or footers that often appear on each page.

It supports preserving layout formatting
It extracts metadata like page number, rotation, and media box dimensions
It can filter out pages with too little content
Text cleaning removes excessive whitespace while preserving paragraph breaks

The logic that implements all of these is:
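To make the last point concrete, here is a small standalone sketch of that cleaning rule applied to a sample string. It uses a regular expression instead of a line-by-line loop, so it is an alternative formulation of the same idea and not the parser's own method; the function name clean_text and the sample text are made up for illustration.

```python
import re


def clean_text(raw: str) -> str:
    """Strip each line and collapse runs of blank lines into one paragraph break."""
    # Remove leading/trailing whitespace on every line
    lines = [line.strip() for line in raw.split("\n")]
    text = "\n".join(lines)
    # Three or more newlines in a row mean two or more blank lines: keep just one
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "  Title line   \n\n\n\n  First paragraph.  \n   still first.\n\n Second paragraph. \n\n"
cleaned = clean_text(sample)
print(repr(cleaned))
```

The single blank line between paragraphs survives, which matters later because the text splitter treats a double newline as its preferred split point.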

import os
import logging
from pathlib import Path
from typing import List, Dict, Any

import pypdf
from pypdf import PdfReader

# Configure logging to show info and above messages
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class CustomPDFParser:
    def __init__(
        self,
        extract_images: bool = False,
        preserve_layout: bool = True,
        remove_headers_footers: bool = True,
        min_text_length: int = 10,
    ):
        """
        Initialize the parser with options to extract images, preserve layout,
        remove repeated headers/footers, and set a minimum text length for pages.
        Args:
            extract_images: Whether to extract image info from pages
            preserve_layout: Whether to keep layout spacing in text extraction
            remove_headers_footers: Whether to detect and remove headers/footers
            min_text_length: Minimum length of text for a page to be considered valid
        """
        self.extract_images = extract_images
        self.preserve_layout = preserve_layout
        self.remove_headers_footers = remove_headers_footers
        self.min_text_length = min_text_length

    def extract_text_from_page(self, page: pypdf.PageObject, page_num: int) -> Dict[str, Any]:
        """
        Extract text and metadata from a single PDF page.
        Args:
            page: PyPDF page object
            page_num: zero-based page number
        Returns:
            dict with keys:
                - 'text': extracted and cleaned text string,
                - 'metadata': page metadata dict,
                - 'word_count': number of words in extracted text
        """
        try:
            # Extract text, optionally preserving the layout for better formatting
            if self.preserve_layout:
                text = page.extract_text(extraction_mode="layout")
            else:
                text = page.extract_text()
            # Clean text: remove extra whitespace and normalize paragraphs
            text = self._clean_text(text)
            # Gather page metadata (page number, rotation angle, mediabox)
            metadata = {
                "page_number": page_num + 1,  # 1-based numbering
                "rotation": getattr(page, "rotation", 0),
                "mediabox": str(getattr(page, "mediabox", None)),
            }
            # Optionally, extract image info from the page if requested
            if self.extract_images:
                metadata["images"] = self._extract_image_info(page)
            # Return dictionary with text and metadata for this page
            return {
                "text": text,
                "metadata": metadata,
                "word_count": len(text.split()) if text else 0,
            }
        except Exception as e:
            # Log error and return empty data for problematic pages
            logger.error(f"Error extracting page {page_num}: {e}")
            return {
                "text": "",
                "metadata": {"page_number": page_num + 1, "error": str(e)},
                "word_count": 0,
            }

    def _clean_text(self, text: str) -> str:
        """
        Clean and normalize extracted text, preserving paragraph breaks.
        Args:
            text: raw text extracted from PDF page
        Returns:
            cleaned text string
        """
        if not text:
            return ""
        lines = text.split('\n')
        cleaned_lines = []
        for line in lines:
            line = line.strip()  # Remove leading/trailing whitespace
            if line:
                # Non-empty line; keep it
                cleaned_lines.append(line)
            elif cleaned_lines and cleaned_lines[-1]:
                # Preserve a paragraph break: keep one empty line, and only
                # if the previously kept line is non-empty
                cleaned_lines.append("")
        cleaned_text = "\n".join(cleaned_lines)
        # Collapse any run of three or more newlines into two (one blank line)
        while '\n\n\n' in cleaned_text:
            cleaned_text = cleaned_text.replace('\n\n\n', '\n\n')
        return cleaned_text.strip()

    def _extract_image_info(self, page: pypdf.PageObject) -> List[Dict[str, Any]]:
        """
        Extract basic image metadata from page, if available.
        Args:
            page: PyPDF page object
        Returns:
            List of dictionaries with image info (index, name, width, height)
        """
        images = []
        try:
            # PyPDF pages can have an 'images' attribute listing embedded images
            if hasattr(page, 'images'):
                for i, image in enumerate(page.images):
                    images.append({
                        "image_index": i,
                        "name": getattr(image, 'name', f"image_{i}"),
                        "width": getattr(image, 'width', None),
                        "height": getattr(image, 'height', None),
                    })
        except Exception as e:
            logger.warning(f"Image extraction failed: {e}")
        return images

    def _remove_headers_footers(self, pages_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Remove repeated headers and footers that appear on many pages.
        This is done by identifying lines that appear on over 50% of pages
        at the beginning or end of the page text, then removing those lines.
        Args:
            pages_data: List of dictionaries representing each page's extracted data.
        Returns:
            Updated list of pages with headers/footers removed
        """
        # Only attempt removal if there are enough pages and the option is enabled
        if len(pages_data) < 3 or not self.remove_headers_footers:
            return pages_data
        # Collect first and last lines from each page's text for analysis
        first_lines = [page["text"].split('\n')[0] if page["text"] else "" for page in pages_data]
        last_lines = [page["text"].split('\n')[-1] if page["text"] else "" for page in pages_data]
        threshold = len(pages_data) * 0.5  # More than 50% of pages
        # Identify candidate headers and footers appearing frequently
        potential_headers = [line for line in set(first_lines)
                             if first_lines.count(line) > threshold and line.strip()]
        potential_footers = [line for line in set(last_lines)
                             if last_lines.count(line) > threshold and line.strip()]
        # Remove identified headers and footers from each page's text
        for page_data in pages_data:
            lines = page_data["text"].split('\n')
            # Remove header if it matches a frequent header
            if lines and potential_headers:
                for header in potential_headers:
                    if lines[0].strip() == header.strip():
                        lines = lines[1:]
                        break
            # Remove footer if it matches a frequent footer
            if lines and potential_footers:
                for footer in potential_footers:
                    if lines[-1].strip() == footer.strip():
                        lines = lines[:-1]
                        break

            page_data["text"] = '\n'.join(lines).strip()
        return pages_data

    def _extract_document_metadata(self, pdf_reader: PdfReader, pdf_path: str) -> Dict[str, Any]:
        """
        Extract metadata from the PDF document itself.
        Args:
            pdf_reader: PyPDF PdfReader instance
            pdf_path: path to PDF file
        Returns:
            Dictionary of metadata including file info and PDF document metadata
        """
        metadata = {
            "file_path": pdf_path,
            "file_name": Path(pdf_path).name,
            "file_size": os.path.getsize(pdf_path) if os.path.exists(pdf_path) else None,
        }
        try:
            if pdf_reader.metadata:
                # Extract common PDF metadata keys if available
                metadata.update({
                    "title": pdf_reader.metadata.get('/Title', ''),
                    "author": pdf_reader.metadata.get('/Author', ''),
                    "subject": pdf_reader.metadata.get('/Subject', ''),
                    "creator": pdf_reader.metadata.get('/Creator', ''),
                    "producer": pdf_reader.metadata.get('/Producer', ''),
                    "creation_date": str(pdf_reader.metadata.get('/CreationDate', '')),
                    "modification_date": str(pdf_reader.metadata.get('/ModDate', '')),
                })
        except Exception as e:
            logger.warning(f"Metadata extraction failed: {e}")
        return metadata

    def parse_pdf(self, pdf_path: str) -> Dict[str, Any]:
        """
        Parse the entire PDF file. Opens the file, extracts text and metadata
        page by page, removes headers/footers if configured, and aggregates results.
        Args:
            pdf_path: Path to the PDF file
        Returns:
            Dictionary with keys:
                - 'full_text': combined text from all pages,
                - 'pages': list of page-wise dicts with text and metadata,
                - 'document_metadata': file and PDF metadata,
                - 'total_pages': total pages in PDF,
                - 'processed_pages': number of pages kept after filtering,
                - 'total_words': total word count of parsed text
        """
        try:
            with open(pdf_path, 'rb') as file:
                pdf_reader = PdfReader(file)
                doc_metadata = self._extract_document_metadata(pdf_reader, pdf_path)
                pages_data = []
                # Iterate over all pages and extract data
                for i, page in enumerate(pdf_reader.pages):
                    page_data = self.extract_text_from_page(page, i)
                    # Only keep pages with sufficient text length
                    if len(page_data["text"]) >= self.min_text_length:
                        pages_data.append(page_data)
                # Remove repeated headers and footers
                pages_data = self._remove_headers_footers(pages_data)
                # Combine all page texts with a double newline as a separator
                full_text = "\n\n".join(page["text"] for page in pages_data if page["text"])
                # Return final structured data
                return {
                    "full_text": full_text,
                    "pages": pages_data,
                    "document_metadata": doc_metadata,
                    "total_pages": len(pdf_reader.pages),
                    "processed_pages": len(pages_data),
                    "total_words": sum(page["word_count"] for page in pages_data),
                }
        except Exception as e:
            logger.error(f"Failed to parse PDF {pdf_path}: {e}")
            raise
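The header and footer step is the least obvious part of the parser, so here is a condensed sketch of the same idea on toy data, using only the standard library. The function names, the min_fraction parameter, and the sample pages are invented for illustration; the sketch uses collections.Counter where the parser uses list.count, but applies the same "more than 50% of pages" rule.

```python
from collections import Counter


def find_repeated_edge_lines(page_texts, min_fraction=0.5):
    """Return (headers, footers): first/last lines seen on more than min_fraction of pages."""
    first_lines = Counter()
    last_lines = Counter()
    for text in page_texts:
        lines = text.split("\n")
        first_lines[lines[0].strip()] += 1
        last_lines[lines[-1].strip()] += 1
    threshold = len(page_texts) * min_fraction
    # Ignore empty lines and anything at or below the threshold
    headers = {line for line, n in first_lines.items() if line and n > threshold}
    footers = {line for line, n in last_lines.items() if line and n > threshold}
    return headers, footers


def strip_edge_lines(text, headers, footers):
    """Drop the first/last line of a page if it is a known header/footer."""
    lines = text.split("\n")
    if lines and lines[0].strip() in headers:
        lines = lines[1:]
    if lines and lines[-1].strip() in footers:
        lines = lines[:-1]
    return "\n".join(lines).strip()


# Made-up pages: three of four share a header, all four share a footer.
pages = [
    "ACME Annual Report\nIntroduction text.\nConfidential",
    "ACME Annual Report\nRevenue grew this year.\nConfidential",
    "Appendix\nRaw tables.\nConfidential",
    "ACME Annual Report\nClosing remarks.\nConfidential",
]
headers, footers = find_repeated_edge_lines(pages)
cleaned_pages = [strip_edge_lines(p, headers, footers) for p in pages]
print(headers, footers)
print(cleaned_pages)
```

Because the comparison is an exact match, footers that change from page to page (for example "1 of 4", "2 of 4") are not detected by this rule. Catching those would need some normalization of digits first, which the parser above does not attempt.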

Step 2: Integrate with LangChain (langchain_loader.py)

The LangChainPDFLoader class wraps the custom parser and converts parsed pages into LangChain Document objects, which are the building blocks for LangChain pipelines.

It allows chunking of documents into smaller pieces using LangChain’s RecursiveCharacterTextSplitter
You can customize chunk sizes and overlap for downstream LLM input
This loader provides clean integration between raw PDF content and LangChain’s document abstraction

The logic behind this is:
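To build some intuition for what chunk size and overlap mean, the following sketch cuts a string into fixed-size windows that share a few characters with their neighbors. This is a deliberately simplified stand-in, not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break at separators such as paragraph and line breaks, so its chunks will differ. The function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int):
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one. With real text, that shared region is what keeps a sentence that straddles a boundary visible in both chunks.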

from typing import List, Optional, Dict, Any
from langchain.schema import Document
from langchain.document_loaders.base import BaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from parser import CustomPDFParser  # import the parser defined above


class LangChainPDFLoader(BaseLoader):
    def __init__(
        self,
        file_path: str,
        parser_config: Optional[Dict[str, Any]] = None,
        chunk_size: int = 500,
        chunk_overlap: int = 50,
    ):
        """
        Initialize the loader with the PDF file path, parser configuration, and chunking parameters.
        Args:
            file_path: path to PDF file
            parser_config: dictionary of parser options
            chunk_size: chunk size for splitting long texts
            chunk_overlap: chunk overlap for splitting
        """
        self.file_path = file_path
        self.parser_config = parser_config or {}
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.parser = CustomPDFParser(**self.parser_config)

    def load(self) -> List[Document]:
        """
        Load PDF, parse pages, and convert each page to a LangChain Document.
        Returns:
            List of Document objects with page text and combined metadata.
        """
        parsed_data = self.parser.parse_pdf(self.file_path)
        documents = []
        # Convert each page dict to a LangChain Document
        for page_data in parsed_data["pages"]:
            if page_data["text"]:
                # Merge document-level and page-level metadata
                metadata = {**parsed_data["document_metadata"], **page_data["metadata"]}
                doc = Document(page_content=page_data["text"], metadata=metadata)
                documents.append(doc)
        return documents

    def load_and_split(self) -> List[Document]:
        """
        Load the PDF and split large documents into smaller chunks.
        Returns:
            List of Document objects after splitting large texts.
        """
        documents = self.load()
        # Initialize a text splitter with the desired chunk size and overlap
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=["\n\n", "\n", " ", ""],  # hierarchical splitting
        )
        # Split documents into smaller chunks
        split_docs = text_splitter.split_documents(documents)
        return split_docs
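One detail in load() worth spelling out is how the two metadata dictionaries are merged. With the double-star unpacking syntax, keys from the dictionary written last win when both contain the same key, so page-level values take precedence over document-level ones. The sample values below are made up, and the page_number key in the document-level dict is included only to show a collision; the parser's own document metadata does not contain it.

```python
def merge_metadata(document_metadata: dict, page_metadata: dict) -> dict:
    """Merge two metadata dicts; on duplicate keys the page-level value wins."""
    return {**document_metadata, **page_metadata}


# Hypothetical values for illustration only
document_metadata = {"file_name": "report.pdf", "title": "Annual Report", "page_number": None}
page_metadata = {"page_number": 3, "rotation": 0}

merged = merge_metadata(document_metadata, page_metadata)
print(merged)
```

Every resulting Document therefore carries the file-level fields plus its own page number and rotation, which is handy later for showing where a chunk came from.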

Step 3: Build a Processing Pipeline (pipeline.py)

The PDFProcessingPipeline class provides a higher-level interface for:

Processing a single PDF
Selecting the output format (raw dict, LangChain documents, or plain text)
Enabling or disabling chunking with configurable chunk sizes
Handling errors and logging

This abstraction allows easy integration into larger applications or workflows. The logic behind this is:

from typing import List, Optional, Dict, Any
from langchain.schema import Document
from parser import CustomPDFParser
from langchain_loader import LangChainPDFLoader
import logging

logger = logging.getLogger(__name__)


class PDFProcessingPipeline:
    def __init__(self, parser_config: Optional[Dict[str, Any]] = None):
        """
        Args:
            parser_config: dictionary of options passed to CustomPDFParser
        """
        self.parser_config = parser_config or {}

    def process_single_pdf(
        self,
        pdf_path: str,
        output_format: str = "langchain",
        chunk_documents: bool = True,
        chunk_size: int = 500,
        chunk_overlap: int = 50,
    ) -> Any:
        """
        Args:
            pdf_path: path to PDF file
            output_format: "raw" (dict), "langchain" (Documents), or "text" (string)
            chunk_documents: whether to split LangChain documents into chunks
            chunk_size: chunk size for splitting
            chunk_overlap: chunk overlap for splitting
        Returns:
            Parsed content in the requested format
        """
        if output_format == "raw":
            # Use raw CustomPDFParser output
            parser = CustomPDFParser(**self.parser_config)
            return parser.parse_pdf(pdf_path)
        elif output_format == "langchain":
            # Use LangChain loader, optionally chunked
            loader = LangChainPDFLoader(pdf_path, self.parser_config, chunk_size, chunk_overlap)
            if chunk_documents:
                return loader.load_and_split()
            else:
                return loader.load()
        elif output_format == "text":
            # Return combined plain text only
            parser = CustomPDFParser(**self.parser_config)
            parsed_data = parser.parse_pdf(pdf_path)
            return parsed_data.get("full_text", "")
        else:
            raise ValueError(f"Unknown output_format: {output_format}")
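The pipeline above handles one file per call and lets any parser exception propagate to the caller. If you need to run it over several files, one possible approach is a thin wrapper that logs failures and keeps going. The sketch below is an assumption layered on top of the article's code: process_many is a hypothetical helper, and it takes the per-file function as an argument so it can be demonstrated here with a stand-in instead of a real PDF.

```python
import logging
from typing import Any, Callable, Dict, Iterable, Tuple

logger = logging.getLogger(__name__)


def process_many(
    paths: Iterable[str],
    process_one: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run process_one on each path; collect results and error messages separately."""
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = process_one(path)
        except Exception as exc:
            # Log and continue so one bad file does not stop the whole batch
            logger.error("Failed to process %s: %s", path, exc)
            failures[path] = str(exc)
    return results, failures


def fake_process(path: str) -> str:
    """Stand-in for pipeline.process_single_pdf, used only for this demo."""
    if not path.endswith(".pdf"):
        raise ValueError("not a PDF")
    return f"text of {path}"


results, failures = process_many(["a.pdf", "notes.txt", "b.pdf"], fake_process)
print(results)
print(failures)
```

In real use you would pass pipeline.process_single_pdf (or a lambda that fixes the output format) in place of fake_process.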

Step 4: Test the Parser (example.py)

Let’s test the parser as follows:

from pathlib import Path
from pipeline import PDFProcessingPipeline  # the pipeline class defined in Step 3


def main():
    print("👋 Welcome to the Custom PDF Parser!")
    print("What would you like to do?")
    print("1. View full parsed raw data")
    print("2. Extract full plain text")
    print("3. Get LangChain documents (no chunking)")
    print("4. Get LangChain documents (with chunking)")
    print("5. Show document metadata")
    print("6. Show per-page metadata")
    print("7. Show cleaned page text (header/footer removed)")
    print("8. Show extracted image metadata")
    choice = input("Enter the number of your choice: ").strip()
    if choice not in {'1', '2', '3', '4', '5', '6', '7', '8'}:
        print("❌ Invalid option.")
        return
    file_path = input("Enter the path to your PDF file: ").strip()
    if not Path(file_path).exists():
        print("❌ File not found.")
        return
    # Initialize pipeline
    pipeline = PDFProcessingPipeline({
        "preserve_layout": False,
        "remove_headers_footers": True,
        "extract_images": True,
        "min_text_length": 20
    })
    # Raw data is needed for most options
    parsed = pipeline.process_single_pdf(file_path, output_format="raw")
    if choice == '1':
        print("\nFull Raw Parsed Output:")
        for k, v in parsed.items():
            print(f"{k}: {str(v)[:300]}...")
    elif choice == '2':
        print("\nFull Cleaned Text (truncated preview):")
        print("Previewing the first 1000 characters:\n" + parsed["full_text"][:1000], "...")
    elif choice == '3':
        docs = pipeline.process_single_pdf(file_path, output_format="langchain", chunk_documents=False)
        print(f"\nLangChain Documents: {len(docs)}")
        print("Previewing the first 500 characters:\n", docs[0].page_content[:500], "...")
    elif choice == '4':
        docs = pipeline.process_single_pdf(file_path, output_format="langchain", chunk_documents=True)
        print(f"\nLangChain Chunks: {len(docs)}")
        print("Sample chunk content (first 500 chars):")
        print(docs[0].page_content[:500], "...")
    elif choice == '5':
        print("\nDocument Metadata:")
        for key, value in parsed["document_metadata"].items():
            print(f"{key}: {value}")
    elif choice == '6':
        print("\nPer-page Metadata:")
        for i, page in enumerate(parsed["pages"]):
            print(f"Page {i+1}: {page['metadata']}")
    elif choice == '7':
        print("\nCleaned Text After Header/Footer Removal.")
        print("Showing the first 3 pages and first 500 characters of the text from each page.")
        for i, page in enumerate(parsed["pages"][:3]):  # First 3 pages
            print(f"\n--- Page {i+1} ---")
            print(page["text"][:500], "...")
    elif choice == '8':
        print("\nExtracted Image Metadata (if available):")
        found = False
        for i, page in enumerate(parsed["pages"]):
            images = page["metadata"].get("images", [])
            if images:
                found = True
                print(f"\n--- Page {i+1} ---")
                for img in images:
                    print(img)
        if not found:
            print("No image metadata found.")


if __name__ == "__main__":
    main()

Run this and you will be prompted to enter the number of your choice and the path to the PDF. The PDF I’m using is publicly available, and you can download it using the link.

👋 Welcome to the Custom PDF Parser!
What would you like to do?
1. View full parsed raw data
2. Extract full plain text
3. Get LangChain documents (no chunking)
4. Get LangChain documents (with chunking)
5. Show document metadata
6. Show per-page metadata
7. Show cleaned page text (header/footer removed)
8. Show extracted image metadata
Enter the number of your choice: 4
Enter the path to your PDF file: /content/articles.pdf

Output:
LangChain Chunks: 16
First chunk preview:
San José State University Writing Center
www.sjsu.edu/writingcenter
Written by Ben Aldridge

Articles (a/an/the), Spring 2014.                                                                                   1 of 4
Articles (a/an/the)

There are three articles in the English language: a, an, and the. They are placed before nouns
and show whether a given noun is general or specific.

Examples of Articles

Conclusion

In this guide, you’ve learned how to build a flexible and powerful PDF processing pipeline using only open-source tools. Because it’s modular, you can easily extend it: add a search bar using Streamlit, store chunks in a vector database like FAISS for smarter lookups, or plug it into a chatbot. You don’t need to rebuild anything; you just connect the next piece.

PDFs don’t have to feel like locked boxes anymore. With this approach, you can turn any document into something you can read, search, and understand on your terms.

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
