Extract the top half of multiple pages and add them to new pages

Hello, I’ve got a request I believe should be simple, but I can’t get it to work properly. I want to take a series of USPS label pages, cut out the top half (the section with the label), and then combine those halves two to a page. So if there are 8 pages of labels, I want to condense them into 4 pages, each with two labels stacked one on top of the other. How would I go about doing this? Thank you very much.

Hi @provscon and welcome to the forum!
So I guess if you know the exact rect where each label sits on each page, then you just need to grab that rect, create a new page for it, and put that page in a new document.

Something like this (let’s call the file “extract.py”):

import pymupdf
import sys
import os

def extract_top_half_pages(input_pdf_path, output_pdf_path):
    """Extract the top half of each page from a PDF and combine them into a new document.

    Args:
        input_pdf_path (str): Path to the input PDF file
        output_pdf_path (str): Path for the output PDF file
    """
    try:
        # Open the input PDF
        input_doc = pymupdf.open(input_pdf_path)

        # Create a new PDF document
        output_doc = pymupdf.open()

        print(f"Processing {len(input_doc)} pages...")

        for page_num in range(len(input_doc)):
            # Get the current page
            page = input_doc[page_num]

            # Get the page dimensions
            page_rect = page.rect
            page_width = page_rect.width
            page_height = page_rect.height

            # Define the crop rectangle for the top half (this should be the rectangle where your label will be)
            # pymupdf.Rect(x0, y0, x1, y1) where (x0, y0) is top-left, (x1, y1) is bottom-right
            top_half_rect = pymupdf.Rect(0, 0, page_width, page_height / 2)

            # Create a new page in the output document with the top half dimensions
            new_page = output_doc.new_page(width=page_width, height=page_height / 2)

            # Copy the top half content to the new page
            new_page.show_pdf_page(new_page.rect, input_doc, page_num, clip=top_half_rect)

            print(f"Processed page {page_num + 1}/{len(input_doc)}")

        # Save the output document
        output_doc.save(output_pdf_path)

        # Close documents
        input_doc.close()
        output_doc.close()

        print(f"Successfully created '{output_pdf_path}' with top halves of all pages.")

    except Exception as e:
        print(f"Error processing PDF: {str(e)}")
        return False

    return True


def main():
    """Main function to handle command line arguments and execute the extraction."""
    if len(sys.argv) != 3:
        print("Usage: python script.py <input_pdf> <output_pdf>")
        print("Example: python script.py document.pdf document_top_half.pdf")
        sys.exit(1)

    input_pdf = sys.argv[1]
    output_pdf = sys.argv[2]

    # Check if input file exists
    if not os.path.exists(input_pdf):
        print(f"Error: Input file '{input_pdf}' does not exist.")
        sys.exit(1)

    # Check if input file is a PDF
    if not input_pdf.lower().endswith('.pdf'):
        print("Error: Input file must be a PDF.")
        sys.exit(1)

    # Ensure output has .pdf extension
    if not output_pdf.lower().endswith('.pdf'):
        output_pdf += '.pdf'

    # Extract top halves
    success = extract_top_half_pages(input_pdf, output_pdf)

    if success:
        print("Operation completed successfully!")
    else:
        print("Operation failed!")
        sys.exit(1)


if __name__ == "__main__":
    main()

Usage would be, e.g. python extract.py input.pdf output.pdf

extract.py (2.9 KB)
Just attaching the Python file as it has come out a bit strangely there in the code above!

Hm, it looks like it’s working, but it’s giving me the wrong rectangle location. I can’t upload any of the labels I have since they have people’s addresses on them, but I did find an example picture that is nearly identical to what I’m working with.

Here’s essentially what I’m attempting to do: grab the top half of each page across a number of pages and place 2 labels on a single page. I used an image editor to mock this up:

double_label.pdf (130.8 KB)

Try this attached code.

extract2.py (2.9 KB)

Inspired by docs here: The Basics - PyMuPDF 1.26.3 documentation

Let me know how it works for you!
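
For anyone reading without the attachment: this is not the contents of extract2.py, just a minimal sketch of the stacking step described above, where the top halves of two consecutive input pages land on one full-size output page. The names slot_rects and stack_top_halves are made up for illustration, and the sketch assumes the pages carry no rotation and that the label fills exactly the top half.

```python
def slot_rects(page_width, page_height):
    """Return the (top, bottom) target slots on an output page as (x0, y0, x1, y1) tuples."""
    half = page_height / 2
    return (0, 0, page_width, half), (0, half, page_width, page_height)


def stack_top_halves(input_pdf_path, output_pdf_path):
    """Place the top halves of two consecutive input pages on each output page."""
    import pymupdf  # imported here so slot_rects can be used without PyMuPDF

    src = pymupdf.open(input_pdf_path)
    out = pymupdf.open()
    out_page = None

    for i, page in enumerate(src):
        width, height = page.rect.width, page.rect.height

        # Region to copy from the input page: its top half
        clip = pymupdf.Rect(0, 0, width, height / 2)

        # Start a new full-size output page for every even-indexed input page
        if i % 2 == 0:
            out_page = out.new_page(width=width, height=height)

        # Even-indexed pages go in the top slot, odd-indexed pages in the bottom slot
        slot = pymupdf.Rect(slot_rects(width, height)[i % 2])
        out_page.show_pdf_page(slot, src, i, clip=clip)

    out.save(output_pdf_path)
    out.close()
    src.close()
```

Calling stack_top_halves("input.pdf", "output.pdf") on 8 label pages should give 4 output pages. With an odd number of input pages, the bottom slot of the last output page simply stays empty.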

Strange, it’s like the orientation of the rect is wrong when it captures the input. It’s capturing a portion of the label and a portion of the instructions at the bottom rather than only the label itself. I feel like the rect is capturing the top half in landscape orientation when I need the top half in portrait orientation. Not sure how to describe it. I’ve attached a picture of what it sort of looks like.

It is very likely that some rotation is applied to the pages, which is causing the issue (this can happen without being visible in a viewer, because the viewer compensates for the rotation).

Try: The Basics - PyMuPDF 1.26.3 documentation
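
A rough sketch of how one might check for rotation and neutralise it before cropping, assuming the cause really is the page rotation. The helper names normalize_rotation and open_derotated are hypothetical; page.rotation and page.remove_rotation() are the PyMuPDF calls I would reach for here, and remove_rotation() is meant to set the rotation to 0 while keeping the page looking the same.

```python
def normalize_rotation(angle):
    """Map any rotation value into the range 0..359 (for example -90 becomes 270)."""
    return angle % 360


def open_derotated(input_pdf_path):
    """Open a PDF and reset every rotated page to rotation 0.

    Returns the open document; the caller is responsible for closing it.
    """
    import pymupdf  # imported here so normalize_rotation can be used without PyMuPDF

    doc = pymupdf.open(input_pdf_path)
    for page in doc:
        rotation = normalize_rotation(page.rotation)
        if rotation != 0:
            print(f"Page {page.number + 1}: rotation {rotation}, removing it")
            # Set the rotation to 0 while keeping the page's visible appearance
            page.remove_rotation()
    return doc
```

The returned document can then be used in place of pymupdf.open(input_pdf_path) in the extraction loop, so that a clip of Rect(0, 0, width, height / 2) refers to the top half as it appears in a viewer.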