I need help with data cleaning and preprocessing in my project [closed]

Anonymous · Post by **Anonymous** » 14 Dec 2025, 01:15

Code: Select all

import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_excel('Online_Retail.xlsx')
# Alternative method to introduce missing values: using numpy.random.rand
# Work on a copy so the original DataFrame 'df' is not modified
df_10 = df.copy()

# Define the columns where missing values will be introduced
columns_to_corrupt = ['Quantity', 'UnitPrice', 'CustomerID']

# Define the percentage of missing values to introduce (e.g., 20%)
missing_percentage = 0.20

for col in columns_to_corrupt:
    # Generate a boolean array where True indicates where to place a NaN
    # The size of the array matches the number of rows in the DataFrame
    # We compare np.random.rand with missing_percentage to get the desired proportion
    mask = np.random.rand(len(df_10)) < missing_percentage

    # Apply the mask to the column to set values to NaN
    df_10.loc[mask, col] = np.nan

print("DataFrame with missing values introduced using np.random.rand:")
print(df_10.head())

print("\nNumber of missing values per column in df_10:")
print(df_10[columns_to_corrupt].isnull().sum())

Code: Select all

# The dataframe being used in this cell is 'df_10'
df_10['InvoiceDate'] = pd.to_datetime(df_10['InvoiceDate'])
df_10['InvoiceDate'] = df_10['InvoiceDate'].dt.strftime('%Y-%m-%d')

Code: Select all

# Encode the 'Country' column to numerical format using a loop for mapping

# Method 1: For loop with dictionary mapping
country_mapping = {}
current_code = 1
country_codes = []

# Iterate through each country
for country in df_10['Country']:
    if country not in country_mapping:
        country_mapping[country] = current_code
        current_code += 1
    country_codes.append(country_mapping[country])

# Add new column
df_10['Country_Code_ForLoop'] = country_codes

print("\nUsing For Loop:")
print(df_10)
print(f"\nMapping Dictionary: {country_mapping}")

Code: Select all

###### stockcode to integer value
# 1) Read Excel file
# This line is commented out as df_10 is already defined earlier.
# df_10 = pd.read_excel("Online_Retail.xlsx", sheet_name="Online Retail")

# 2) Convert StockCode to string using FOR LOOP (your requested way)
new_list = []
for x in df_10["StockCode"]:
    new_list.append(str(x))

df_10["StockCode"] = new_list

# 3) Manual Label Encoding (NO inbuilt encoders)
stock_map = {}       # stores StockCode -> number
encoded_list = []    # stores encoded values
next_id = 1

for code in df_10["StockCode"]:
    if code not in stock_map:
        stock_map[code] = next_id
        next_id += 1
    encoded_list.append(stock_map[code])

# 4) Add encoded column
df_10["StockCode_encoded"] = encoded_list
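A quick way to cross-check the manual encoding, assuming a built-in helper is acceptable for verification only: `pd.factorize` assigns codes in order of first appearance, starting at 0, so adding 1 should reproduce the loop's numbering. The sample values below are hypothetical and merely stand in for the real `StockCode` column.

```python
import pandas as pd

# Hypothetical stand-in for df_10["StockCode"] after the str() conversion.
stock_codes = pd.Series(["85123A", "71053", "85123A", "22633", "71053"])

# Manual label encoding, same idea as the loop in the post.
stock_map = {}
manual_codes = []
for code in stock_codes:
    if code not in stock_map:
        stock_map[code] = len(stock_map) + 1  # next unused id, starting at 1
    manual_codes.append(stock_map[code])

# Built-in equivalent: codes start at 0 in order of first appearance.
codes, uniques = pd.factorize(stock_codes)
factorized_codes = (codes + 1).tolist()

print(manual_codes == factorized_codes)
```

Note that `pd.factorize` marks missing values with -1, so the comparison only holds for a column without NaN, which is the case here once every value has been converted to a string.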

Code: Select all

df_10.to_csv('dff.csv', index=False)

Problem statement:
· Handle missing values with suitable imputation strategies (mean, median, mode, or more advanced techniques).
· Identify and handle duplicate or erroneous transactions (e.g., cancellations, which are marked by invoice numbers starting with "c").
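A minimal sketch of how the first task could be approached, under stated assumptions: the small `sample` frame and its values are made up to stand in for `df_10`, the helper name `impute_missing` is hypothetical, and -1 is assumed to be unused as a real `CustomerID`. The median is used for the numeric columns because it is less sensitive to outliers than the mean; whether that suits the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df_10; the values are illustrative only.
sample = pd.DataFrame({
    "Quantity": [6.0, np.nan, 8.0, 2.0, np.nan],
    "UnitPrice": [2.55, 3.39, np.nan, 7.65, 4.25],
    "CustomerID": [17850.0, np.nan, 13047.0, 13047.0, np.nan],
})

def impute_missing(frame):
    """Return a copy with missing values filled in the three corrupted columns."""
    out = frame.copy()
    # Numeric measurements: fill with the column median.
    for col in ["Quantity", "UnitPrice"]:
        out[col] = out[col].fillna(out[col].median())
    # CustomerID is an identifier, so a mean or median would invent a customer.
    # Assumption: -1 never occurs as a real ID and can act as an "unknown" marker.
    out["CustomerID"] = out["CustomerID"].fillna(-1).astype(int)
    return out

imputed = impute_missing(sample)
print(imputed)
```

Using the mode instead would only require swapping `out[col].median()` for `out[col].mode().iloc[0]`.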
Here is also an image of my dataset, for reference alongside the problem statement:

I would be really grateful if someone could help me.
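For the duplicate and cancellation task, one possible approach is sketched below. The rows in `sample` are hypothetical, the helper name `split_transactions` is made up for illustration, and the prefix check is case-insensitive on the assumption that cancellations may appear with either "c" or "C". Whether cancellations should be dropped or kept in a separate table depends on the analysis; this sketch only separates them.

```python
import pandas as pd

# Hypothetical rows standing in for df_10; column names follow the post.
sample = pd.DataFrame({
    "InvoiceNo": [536365, 536365, "C536379", 536366, "c536380"],
    "StockCode": ["85123A", "85123A", "D", "22633", "22633"],
    "Quantity": [6, 6, -1, 6, -2],
})

def split_transactions(frame):
    """Drop exact duplicate rows, then separate cancellations from regular sales."""
    deduped = frame.drop_duplicates()
    # InvoiceNo can mix integers and strings, so cast to str before testing the prefix.
    is_cancel = deduped["InvoiceNo"].astype(str).str.upper().str.startswith("C")
    return deduped[~is_cancel], deduped[is_cancel]

sales, cancellations = split_transactions(sample)
print(len(sales), len(cancellations))
```

`drop_duplicates()` with no arguments removes only rows that match in every column; passing `subset=[...]` would be needed if a narrower definition of "duplicate" is intended.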
