Python Image Extraction Sequence From Pdf

June 29, 2023

I was trying to extract images from a PDF using PyMuPDF (fitz). My PDF has multiple images on a single page. I am maintaining a proper sequence number while saving my images. I saw

Solution 1:

I have the same problem. I've used the following code:

import io

import fitz  # PyMuPDF
from PIL import Image

file = "file_path"  # path to the PDF
pdf_file = fitz.open(file)

for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    # newer PyMuPDF releases call this get_images(); older ones used getImageList()
    image_list = page.get_images()
    # print the number of images found on this page
    if image_list:
        print(f"[+] Found {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on the given pdf page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        print(img)
        print(image_index)
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes (extract_image() was formerly extractImage())
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk; passing a path lets PIL open and close the file
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")

The most likely fix is to find where each 'img' entry sits on the page and order the images by that position. I'd love to hear any further suggestions, or whether you found a better idea or solution.
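As a rough sketch of ordering the images by where they sit on the page, the snippet below looks up each image's bounding box and sorts top to bottom, then left to right. The helper names (sort_by_position, images_in_reading_order, save_images_in_order) are made up for illustration. It assumes a PyMuPDF version that provides Page.get_image_rects() and relies on PyMuPDF page coordinates starting at the top-left corner. I have not tried it on the PDF from the question, so treat it as a starting point rather than a finished solution.

```python
def sort_by_position(placements):
    """Sort (xref, (x0, y0, x1, y1)) pairs top-to-bottom, then left-to-right.

    Assumes a top-left origin, so a smaller y0 means higher up on the page.
    """
    return sorted(placements, key=lambda p: (p[1][1], p[1][0]))


def images_in_reading_order(page):
    """Return (xref, bbox) pairs for one page, ordered by position (hypothetical helper)."""
    placements = []
    seen = set()
    for img in page.get_images(full=True):
        xref = img[0]
        if xref in seen:
            continue  # the same image object can be listed more than once
        seen.add(xref)
        rects = page.get_image_rects(xref)  # every spot where this image is drawn
        if not rects:
            continue  # image is referenced but never drawn on this page
        # if the image is drawn several times, keep its topmost placement
        top = min(rects, key=lambda r: (r.y0, r.x0))
        placements.append((xref, (top.x0, top.y0, top.x1, top.y1)))
    return sort_by_position(placements)


def save_images_in_order(pdf_path):
    """Write every image as image<page>_<sequence>.<ext>, numbered in reading order."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            ordered = images_in_reading_order(page)
            for seq, (xref, _bbox) in enumerate(ordered, start=1):
                base_image = doc.extract_image(xref)
                name = f"image{page_index + 1}_{seq}.{base_image['ext']}"
                with open(name, "wb") as fh:
                    fh.write(base_image["image"])  # raw bytes, no PIL round trip
```

Calling save_images_in_order("file_path") would then number the files by position instead of by the order in which get_images() happens to list them. Sorting strictly on the top edge can misorder images that sit side by side with slightly different heights, so a small tolerance on y0 may be worth adding for such layouts.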

