Python BeautifulSoup: Extracting content blocks after specific subheadings within a larger section

Anonymous · Post by **Anonymous** » 29 Nov 2025, 15:53

I am scraping the Dead by Daylight Fandom wiki (specifically the TOME pages, e.g. https://deadbydaylight.fandom.com/wiki/ ... _Awakening) to extract memory logs.
The goal is to extract each Memory Title (the `mw-headline` span) and its corresponding Memory Body (the text contained in the elements that follow it) as separate records, while strictly ignoring the main TOME introduction text at the top of the page.
The Problem
My current script successfully identifies all memory titles, but the function that extracts the body content often wrongly includes the general TOME introduction text (the large overview paragraph at the very top of the article) in the body of the first extracted memory log. This results in duplicated, incorrect body text for many of the subsequent memory records.
The core problem is bounding the content extraction correctly: when searching for a memory's body after its title, I need to make sure that only the elements up to the next memory title are considered.
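The bounding rule described above can be sketched on a small, made-up HTML fragment. The markup, the helper name `collect_until_next_heading`, and the assumption that memory titles are `h3` headings whose bodies are direct siblings of the heading are all illustrative and may not match the actual Fandom page structure.

```python
from bs4 import BeautifulSoup, Tag

HEADING_TAGS = ("h2", "h3", "h4")

# Hypothetical markup, only loosely modelled on a MediaWiki page.
SAMPLE_HTML = """
<div class="mw-parser-output">
<p>General TOME introduction.</p>
<h2><span class="mw-headline">Memories and Logs</span></h2>
<h3><span class="mw-headline">Memory 1</span></h3>
<p>First memory body.</p>
<h3><span class="mw-headline">Memory 2</span></h3>
<p>Second memory body.</p>
</div>
"""

def collect_until_next_heading(heading):
    """Return the text of sibling tags after `heading`, stopping at the next heading."""
    blocks = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace and other text nodes
        if sibling.name in HEADING_TAGS:
            break  # the next title starts here, so this memory's body ends
        text = sibling.get_text(separator=" ", strip=True)
        if text:
            blocks.append(text)
    return blocks

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = {}
for h3 in soup.find_all("h3"):
    span = h3.find("span", class_="mw-headline")
    if span is not None:
        records[span.get_text(strip=True)] = collect_until_next_heading(h3)

print(records)
```

Because the walk only ever moves forward from a memory's own heading, text that appears before the first memory title cannot end up in any record, at least under the sibling layout assumed here.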
My current approach (simplified)
I have two main functions:

`crawl_and_extract_tags`: Finds the main "Memories and Logs" section and iterates over the individual memory titles (`mw-headline` spans).
`extract_content_after_headline`: Takes a memory title tag and walks its next siblings to collect the text content up to the next major heading.

The main logic of `extract_content_after_headline` (where the problem probably lies):

```python
def extract_content_after_headline(headline_tag):
    body_content = []

    # Find the immediate parent heading (h2, h3, or h4) of the specific memory title span
    parent_heading = headline_tag.find_parent(['h2', 'h3', 'h4'])
    if not parent_heading:
        return "Parent tag not found", ""

    # Start searching from the next sibling of the parent heading
    current_element = parent_heading.next_sibling

    # Loop until the next major heading (h2, h3, h4) is found
    while current_element and current_element.name not in ['h2', 'h3', 'h4']:
        if current_element.name in ['td', 'p', 'div', 'blockquote', 'li']:
            element_text = current_element.get_text(separator=' ', strip=True)
            if element_text:
                body_content.append(element_text)

        current_element = current_element.next_sibling

    return "\n\n".join(body_content), ""  # Omitted italics content for brevity
```
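Before changing the function, it may help to check what it actually sees for each headline. The following is a small diagnostic sketch, not part of the original script: the helper name `describe_following_siblings` and the sample markup are hypothetical. It reports the parent heading of each `mw-headline` span and the tag names of the siblings that follow it.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup; the real page may nest these elements differently.
SAMPLE_HTML = """
<div class="mw-parser-output">
<h2><span class="mw-headline">Overview</span></h2>
<p>General TOME introduction.</p>
<h2><span class="mw-headline">Memories and Logs</span></h2>
<h3><span class="mw-headline">Memory 1</span></h3>
<p>First memory body.</p>
</div>
"""

def describe_following_siblings(headline_span, limit=5):
    """Return (parent heading name, names of up to `limit` following sibling tags)."""
    heading = headline_span.find_parent(["h2", "h3", "h4"])
    if heading is None:
        return None, []
    names = []
    for sibling in heading.next_siblings:
        if isinstance(sibling, Tag):
            names.append(sibling.name)
            if len(names) == limit:
                break
    return heading.name, names

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
report = {
    span.get_text(strip=True): describe_following_siblings(span)
    for span in soup.find_all("span", class_="mw-headline")
}
for title, (parent, following) in report.items():
    print(f"{title!r}: parent={parent}, next tags={following}")
```

If a non-memory headline such as an overview section shows up in this report with a paragraph directly after it, that headline is being treated as a memory title, which would explain introduction text appearing as a memory body.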

The Request

How can I change `extract_content_after_headline` so that it reliably captures only the content belonging to that specific memory log, without including the general page introduction?
Is there a better way to structure the extraction flow (e.g. finding the boundaries of the main "Memories and Logs" section more strictly) so that the introduction text is not treated as the body of the first memory?
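One detail that may be worth checking: in the full script, the walk that collects memory headlines stops at the first `h2` or `h3`. If the memory titles are themselves `h3` headings that sit next to the section heading, that loop would end before collecting anything, and the code would fall back to every `mw-headline` on the page, introduction sections included. A possible restructuring is a single pass that starts at the section heading, stops at the next `h2`, and opens a new record at each subheading. This is only a sketch: the sample markup and the function names `find_section_heading` and `split_section_into_memories` are hypothetical, and it assumes titles and bodies are siblings of the section heading.

```python
from bs4 import BeautifulSoup, Tag

SECTION_TITLES = {"Memories", "Memories and Logs"}

# Hypothetical markup, assuming h2 sections with h3 memory titles as siblings.
SAMPLE_HTML = """
<div class="mw-parser-output">
<h2><span class="mw-headline">Overview</span></h2>
<p>General TOME introduction.</p>
<h2><span class="mw-headline">Memories and Logs</span></h2>
<h3><span class="mw-headline">Memory 1</span></h3>
<p>First memory body.</p>
<h3><span class="mw-headline">Memory 2</span></h3>
<p>Second memory body.</p>
<h2><span class="mw-headline">Trivia</span></h2>
<p>Unrelated trivia.</p>
</div>
"""

def find_section_heading(soup):
    """Return the h2 whose headline text is one of SECTION_TITLES, or None."""
    for h2 in soup.find_all("h2"):
        span = h2.find("span", class_="mw-headline")
        if span is not None and span.get_text(strip=True) in SECTION_TITLES:
            return h2
    return None

def split_section_into_memories(section_heading):
    """Walk the section once and return a list of (title, [body blocks])."""
    memories = []
    for node in section_heading.next_siblings:
        if not isinstance(node, Tag):
            continue  # ignore whitespace and other text nodes
        if node.name == "h2":
            break  # the next top-level section begins, so the section is over
        if node.name in ("h3", "h4"):
            span = node.find("span", class_="mw-headline")
            title = (span or node).get_text(strip=True)
            memories.append((title, []))
        elif memories:  # content before the first subheading is skipped
            text = node.get_text(separator=" ", strip=True)
            if text:
                memories[-1][1].append(text)
    return memories

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
section = find_section_heading(soup)
memories = split_section_into_memories(section) if section is not None else []
print(memories)
```

With this layout there is no fallback to a page-wide search: if the section is not found, the result is simply empty, which makes a structural mismatch visible instead of silently producing rows from the introduction.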

Any suggestions for using BeautifulSoup's selectors or traversal methods (`find_next_sibling`, etc.) more effectively on this Fandom wiki structure would be much appreciated.
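Sibling-based methods such as `find_next_sibling` only see elements that share a parent with the heading. If the page wraps each memory in its own container, a different traversal may fit better: `find_all_next` walks the rest of the document in order, regardless of nesting. The sketch below uses it and stops at the next heading; the wrapper markup, the `LEAF_BLOCKS` list, and the helper name are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h2", "h3", "h4"]
LEAF_BLOCKS = ["p", "li", "dd"]  # assumed set of text-bearing tags

# Hypothetical markup where each memory is wrapped in its own container.
SAMPLE_HTML = """
<p>General TOME introduction.</p>
<h2><span class="mw-headline">Memories and Logs</span></h2>
<div class="wrapper">
  <h3><span class="mw-headline">Memory 1</span></h3>
  <div><p>First memory body.</p></div>
</div>
<div class="wrapper">
  <h3><span class="mw-headline">Memory 2</span></h3>
  <ul><li>Second memory, <i>entry</i> one.</li></ul>
</div>
"""

def collect_in_document_order(heading):
    """Collect leaf-block text after `heading` until the next heading tag."""
    blocks = []
    for tag in heading.find_all_next(True):  # True matches every tag
        if tag.name in HEADING_TAGS:
            break
        # Skip blocks nested inside another leaf block to avoid duplicated text.
        if tag.name in LEAF_BLOCKS and tag.find_parent(LEAF_BLOCKS) is None:
            text = tag.get_text(separator=" ", strip=True)
            if text:
                blocks.append(text)
    return blocks

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = {}
for h3 in soup.find_all("h3"):
    results[h3.get_text(strip=True)] = collect_in_document_order(h3)

print(results)
```

The trade-off is that document-order traversal has no natural end, so the last memory needs some boundary of its own (the next `h2`, or the end of the content container); in this sample the document simply ends.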
Full code:

```python
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
# pandas uses openpyxl to read and write Excel files.

# --- Constants ---
# File extension changed to .xlsx
DEFAULT_TEXT_FILENAME = "DbD_TOME_Extracted_Data.xlsx"

def extract_content_after_headline(headline_tag):
    """
    Extracts the memory body (all text elements) and the italic content (i tags)
    that follow the given mw-headline span, until the next headline appears.

    Key improvement: content that follows the tag is also included in the body
    along with text nodes.
    """
    body_content = []
    italics_content = []

    # 1. Find the headline's parent heading tag (h2, h3, or h4).
    parent_heading = headline_tag.find_parent(['h2', 'h3', 'h4'])
    if not parent_heading:
        return "본문 태그 찾기 실패", ""

    # 2. Walk the following sibling elements. (Next Siblings)
    # Repeat until the next h2, h3, or h4 tag appears
    current_element = parent_heading.next_sibling

    while current_element and current_element.name not in ['h2', 'h3', 'h4', 'script', 'style']:
        if current_element.name:
            # 2.1. Extract the content of i tags (italic text)
            i_tags = current_element.find_all('i')
            for i_tag in i_tags:
                i_text = i_tag.get_text(separator=' ', strip=True)
                if i_text:
                    italics_content.append(i_text)

            # 2.2. Extract regular text content:
            # add the major block elements listed below to the memory body.
            if current_element.name in ['td', 'p', 'div', 'blockquote', 'li', 'dd', 'dt']:
                element_text = current_element.get_text(separator=' ', strip=True)
                if element_text:
                    # Add only the text, with surplus whitespace stripped
                    body_content.append(element_text)

            # 2.3. Also add image captions (figure/figcaption) to the body (DbD wiki structure)
            figcaption_tags = current_element.find_all('figcaption')
            for figcaption in figcaption_tags:
                caption_text = figcaption.get_text(separator=' ', strip=True)
                if caption_text:
                    body_content.append(caption_text)

        current_element = current_element.next_sibling

    # Combine and return the results
    # Block elements are joined into the body, separated by two newlines
    return "\n\n".join(body_content), "\n---\n".join(italics_content)

def crawl_and_extract_tags(url):
    """
    Connects to the web page, extracts data per memory block keyed on each
    mw-headline span inside the 'Memories and Logs' section, and returns a
    list with one row per memory.
    """

    tome_title = "N/A"
    list_of_data_rows = []

    try:
        print(f"\n[작업 시작] URL: {url} 크롤링을 시작합니다...")

        # 1. Send the HTTP request and parse the response
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        # 2. Extract the TOME title (H1)
        h1_tag = soup.find('h1', {'id': 'firstHeading'})
        if h1_tag:
            tome_title = h1_tag.get_text(strip=True)

        # 3. **Set the search scope: find the 'Memories and Logs' section**
        # Going by the table of contents of the Fandom wiki, there is a
        # 'Memories and Logs' or 'Memories' section.

        # 3.1. Find the span of the 'Memories' or 'Memories and Logs' headline.
        memory_section_span = soup.find('span', class_='mw-headline', string=['Memories', 'Memories and Logs'])

        # 3.2. Use that section's parent heading tag (h2, h3, or h4) as the starting point.
        search_start_element = None
        if memory_section_span:
            search_start_element = memory_section_span.find_parent(['h2', 'h3', 'h4'])

        # 3.3. If there is no starting element (no Memories section), search the whole document.
        if not search_start_element:
            print("⚠️ 'Memories and Logs' 섹션을 찾을 수 없습니다. 전체 문서에서 메모리 헤드라인을 검색합니다.")

        # 4. **Collect the list of memory headlines (mw-headline spans)**
        # Restrict the search scope to the inside of the 'Memories and Logs' section.

        memory_headline_tags = []
        if search_start_element:
            # Walk every element after the 'Memories' section and collect the mw-headline spans inside
            current_element = search_start_element.next_sibling
            while current_element and current_element.name not in ['h2', 'h3']:  # walk until the next major section
                if current_element.name:
                    # Look for mw-headline in the current element or its descendants.
                    mw_headlines = current_element.find_all('span', class_='mw-headline')
                    memory_headline_tags.extend(mw_headlines)

                current_element = current_element.next_sibling

            # If no memory headlines were found inside the section, fall back to all headline tags.
            if not memory_headline_tags:
                memory_headline_tags = soup.find_all('span', class_='mw-headline')
        else:
            # If the Memories section was not found, search the whole document.
            memory_headline_tags = soup.find_all('span', class_='mw-headline')

        # 5. Iterate over each memory headline, extract the data, and build the rows
        if memory_headline_tags:
            for span_tag in memory_headline_tags:
                span_content = span_tag.get_text(strip=True)

                # 5.1. Extract the body and the italic content that follow this span
                memo_body, memo_italics = extract_content_after_headline(span_tag)

                # 5.2. Build the row data
                row = {
                    'TOME 제목': tome_title,
                    'mw-headline  제목': span_content,
                    '메모리 본문 (주요 텍스트)': memo_body,
                    '메모리 이탤릭체 내용 ()': memo_italics
                }
                list_of_data_rows.append(row)
        else:
            # No memory headlines were found (previous logic kept)
            td_content = ""
            td_tags = soup.find_all('td')
            if td_tags:
                td_content = td_tags[0].get_text(separator=' ', strip=True)

            i_content = "\n---\n".join([i.get_text(separator=' ', strip=True) for i in soup.find_all('i') if i.get_text(separator=' ', strip=True)])

            row = {
                'TOME 제목': tome_title,
                'mw-headline  제목': "mw-headline  없음",
                '메모리 본문 (주요 텍스트)': td_content,
                '메모리 이탤릭체 내용 ()': i_content
            }
            list_of_data_rows.append(row)

        print(f"✅ TOME: {tome_title} 데이터 추출 완료. 생성된 행 수: {len(list_of_data_rows)}")
        return list_of_data_rows  # returns a list of dicts

    except requests.exceptions.RequestException as e:
        print(f"❌ 웹사이트 접속 또는 데이터 요청 실패: {e}")
        return None
    except Exception as e:
        print(f"❌ 크롤링 중 알 수 없는 오류 발생: {e}")
        return None

def append_to_excel_file(new_data, output_filename):
    """
    Appends new data to an existing Excel file, or creates a new file.
    (new_data may be a list of dicts.)
    """
    if not new_data:
        print("⚠️ 추출된 데이터가 없으므로 저장하지 않습니다.")
        return

    if not output_filename.lower().endswith('.xlsx'):
        output_filename += '.xlsx'

    # Convert to a pandas DataFrame for saving to Excel
    new_df = pd.DataFrame(new_data)

    try:
        # Check whether the file already exists
        if os.path.exists(output_filename):
            # Load the existing data and append the new data as rows.
            existing_df = pd.read_excel(output_filename, engine='openpyxl')
            # pandas aligns the data by column name when appending rows.
            combined_df = pd.concat([existing_df, new_df], ignore_index=True)
        else:
            # If the file does not exist, use the new DataFrame.
            combined_df = new_df

        # Save the combined DataFrame to the Excel file. (index=False drops the unneeded index column)
        combined_df.to_excel(output_filename, index=False, engine='openpyxl')

        print(f"⭐ 데이터가 엑셀 파일 '{output_filename}'에 성공적으로 추가되었습니다.")

    except FileNotFoundError:
        print(f"❌ 오류: 엑셀 파일 '{output_filename}'을 찾을 수 없습니다. (경로 확인 필요)")
    except Exception as e:
        print(f"❌ 오류: 엑셀 파일 저장 중 문제가 발생했습니다. ({e})")

def main():
    print("\n=======================================================")
    print(" 🔗 DbD TOME (mw-headline 기준 일대일 매칭) 엑셀 크롤러")
    print("=======================================================")

    # 1. Library installation notice (pandas and openpyxl added)
    print("💡 이 코드는 'requests', 'bs4', 'pandas', 'openpyxl'을 사용합니다. ")
    print("   설치: pip install requests beautifulsoup4 pandas openpyxl")

    # 2. File name input
    filename_input = input(f"\n[필수] 저장할 엑셀 파일 이름 (기본값: {DEFAULT_TEXT_FILENAME}): ").strip()
    text_filename = filename_input if filename_input else DEFAULT_TEXT_FILENAME

    # 3. Manual URL input and repeat loop
    while True:
        # Show an example URL in the prompt for convenience
        # The example URL stays a DbD TOME.
        url_input = input(f"\n크롤링할 TOME URL을 입력하세요 (예: https://deadbydaylight.fandom.com/wiki/Tome_1_-_Awakening | 종료하려면 '종료' 입력): ")

        if url_input.strip().lower() == '종료':
            print("\n프로그램을 종료합니다. 감사합니다. 👋")
            break

        if not url_input.strip():
            print("❌ 유효한 URL을 입력해 주세요.")
            continue

        # 4. Crawl and append to the Excel file
        # extracted_data may now be a list of dicts.
        extracted_data = crawl_and_extract_tags(url_input)

        if extracted_data:
            append_to_excel_file(extracted_data, text_filename)

if __name__ == "__main__":
    main()
```

With this approach, the body text does not line up with the correct rows in Excel.
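A quick way to confirm the symptom before writing to Excel is to look for rows that share the same body text. This is a minimal sketch in plain Python; the keys `title` and `body` and the sample rows are placeholders rather than the column names the script actually writes.

```python
from collections import defaultdict

def find_duplicate_bodies(rows, title_key="title", body_key="body"):
    """Map each non-empty body that occurs more than once to the titles using it."""
    titles_by_body = defaultdict(list)
    for row in rows:
        body = row.get(body_key, "")
        if body:
            titles_by_body[body].append(row.get(title_key, ""))
    return {body: titles for body, titles in titles_by_body.items() if len(titles) > 1}

# Made-up rows imitating the reported symptom.
rows = [
    {"title": "Overview", "body": "General TOME introduction."},
    {"title": "Memory 1", "body": "General TOME introduction."},
    {"title": "Memory 2", "body": "Second memory body."},
]
duplicates = find_duplicate_bodies(rows)
print(duplicates)
```

Any title listed here received a body that also belongs to another row, which narrows down where the extraction boundary went wrong.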
