I am trying to run entity extraction over a number of PDF files in a folder, but the code seems to get stuck after a random number of PDFs have been processed: the process is still running, but no new output appears (I also waited 10 minutes to make sure it wasn't just lagging). I have already debugged some possible problems. Here is the code:
```python
import os
import json
import logging

from langchain_community.document_loaders import PyPDFLoader

# `pdf_folder_path`, `chain`, `output_file`, and `extracted_data`
# are defined earlier in the script.
for filename in os.listdir(pdf_folder_path):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder_path, filename)
        print(f"Processing: {filename}")
        try:
            # Load the entire PDF (PyPDFLoader returns one Document per page)
            loader = PyPDFLoader(pdf_path)
            documents = loader.load()

            # Skip this PDF if nothing could be loaded
            if not documents:
                print(f"Warning: No content found in {filename}. Skipping.")
                continue

            # Concatenate the content of all pages into one string
            full_text = " ".join(doc.page_content for doc in documents if doc.page_content.strip())

            # Skip this PDF if extraction produced only whitespace
            if not full_text.strip():
                print(f"Warning: Empty content after extraction in {filename}. Skipping.")
                continue

            # Run entity extraction over the full document text
            chain_result = chain.invoke({"input": full_text})
            print(f"Entities extracted from {filename}:\n{chain_result}\n")

            extracted_data.append({
                "filename": filename,
                "entities": chain_result,
            })
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            logging.error(f"Error processing {filename}: {e}")

with open(output_file, "w", encoding="utf-8") as f:
    json.dump(extracted_data, f, ensure_ascii=False, indent=4)

print(f"All PDFs processed successfully. Results saved to {output_file}.")
```
I have added exception handling so that a PDF that cannot be read, cannot be parsed, or contains nothing is skipped. I have also tried both the GPT-4o and GPT-4o-mini models. I cannot use chunking, because chunking risks splitting a single file into two sets of entities.
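Since I suspect the hang is inside `chain.invoke` rather than in `PyPDFLoader`, my next step is to force a timeout around that call, so the stuck file at least gets logged and skipped instead of blocking the loop forever. A minimal sketch using only the standard library (`invoke_with_timeout` and the 120-second limit are my own placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def invoke_with_timeout(chain, text, timeout_s=120):
    # Run the suspect call in a worker thread so a hang surfaces here as a
    # TimeoutError instead of blocking the whole loop.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(chain.invoke, {"input": text})
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Don't wait for a hung worker; it keeps running in the background,
        # but the loop can move on to the next PDF.
        pool.shutdown(wait=False)

# In the loop, instead of chain.invoke({"input": full_text}):
# try:
#     chain_result = invoke_with_timeout(chain, full_text)
# except FutureTimeout:
#     logging.error(f"Timed out on {filename}; skipping.")
#     continue
```

If the chain ends in `ChatOpenAI`, passing its `timeout` (a.k.a. `request_timeout`) and `max_retries` arguments might be a cleaner version of the same idea, but I haven't verified that yet.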
Does anyone know what might be happening, and how I can debug it?