Extract all text #460

Open
opened 2025-01-03 16:46:53 +08:00 by goodboychen · 1 comment

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def extract_text_from_pdf(filename, page_numbers=None, min_line_length=1):
    '''Extract text from a PDF file (optionally only the given page numbers).'''
    paragraphs = []
    buffer = ''
    full_text = ''
    # Extract all of the text
    for i, page_layout in enumerate(extract_pages(filename)):
        # If page numbers were given (0-based indices), skip pages outside them
        if page_numbers is not None and i not in page_numbers:
            continue
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                full_text += element.get_text() + '\n'
    # Split on blank (or too-short) lines and regroup the text into paragraphs
    lines = full_text.split('\n')
    for text in lines:
        if len(text) >= min_line_length:
            # Drop a trailing hyphen left by a word broken across lines
            buffer += (' ' + text) if not text.endswith('-') else text.rstrip('-')
        elif buffer:
            paragraphs.append(buffer)
            buffer = ''
    if buffer:
        paragraphs.append(buffer)
    return paragraphs

paragraphs = extract_text_from_pdf("oracle_dba.pdf", page_numbers=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100, 101, 102], min_line_length=10)

How do I extract the content of the entire document?


If this code can extract the text, just extract it page by page and then put it into a vector database.
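A minimal sketch of the page-by-page approach, assuming pdfminer.six is installed. The names `lines_to_paragraphs` and `extract_paragraphs_by_page` are hypothetical and not part of the course code. Leaving `page_numbers` as `None` processes every page, which is one way to cover the whole document. Unlike the function in the issue, this version also rejoins a word that was hyphenated across a line break. Writing the result into a vector database is left out, since that depends on which database is used.

```python
from typing import Dict, List, Optional


def lines_to_paragraphs(text: str, min_line_length: int = 1) -> List[str]:
    """Group lines into paragraphs; a line shorter than min_line_length ends one."""
    paragraphs: List[str] = []
    buffer = ''
    for line in text.split('\n'):
        line = line.strip()
        if len(line) >= min_line_length:
            if buffer.endswith('-'):
                # Rejoin a word that was hyphenated across a line break.
                buffer = buffer[:-1] + line
            elif buffer:
                buffer += ' ' + line
            else:
                buffer = line
        elif buffer:
            paragraphs.append(buffer)
            buffer = ''
    if buffer:
        paragraphs.append(buffer)
    return paragraphs


def extract_paragraphs_by_page(
    filename: str,
    page_numbers: Optional[List[int]] = None,
    min_line_length: int = 1,
) -> Dict[int, List[str]]:
    """Return {0-based page index: paragraphs}; page_numbers=None means all pages."""
    # Imported here so lines_to_paragraphs can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages: Dict[int, List[str]] = {}
    for i, page_layout in enumerate(extract_pages(filename)):
        if page_numbers is not None and i not in page_numbers:
            continue
        page_text = ''.join(
            element.get_text() + '\n'
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )
        pages[i] = lines_to_paragraphs(page_text, min_line_length)
    return pages
```

Calling `extract_paragraphs_by_page("oracle_dba.pdf", min_line_length=10)` with no `page_numbers` argument would then return the paragraphs of every page, keyed by page index, so each page can be embedded and stored separately.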

Reference: HswOAuth/llm_course#460