Hier ist mein Code:
from itertools import combinations
selected_columns = [
"Message_MessageBody_Header_",
"Message_FileSequenceNo",
"Message_MessageBody_ArticleInfo_BNo",
"EnvDate",
"EnvTime"
] # Replace with actual column names
total_count = df.count()
print(f"Total records in DataFrame: {total_count}")
missing_columns = [col for col in selected_columns if col not in df.columns]
if missing_columns:
print(f"Error: The following columns are missing in the DataFrame: {missing_columns}")
else:
print(f"Selected columns exist in the DataFrame: {selected_columns}")
found_primary_key = False
for r in range(2, len(selected_columns) + 1):
print(f"\nChecking {r}-column combinations...")
for combo in combinations(selected_columns, r):
print(f"\n
unique_count = df.select(*combo).distinct().count()
print(f"Unique count for {combo}: {unique_count}")
df.select(*combo).distinct().show(5, truncate=False)
if unique_count == total_count:
print(f"\n
found_primary_key = True
break # Stop once a valid key is found
if found_primary_key:
break
if not found_primary_key:
print("\n
Problem:
Dieser Ansatz ist rechnerisch teuer, da wiederholt `.distinct().count()`-Operationen auf großen Daten ausgeführt werden. Gibt es eine schnellere Möglichkeit?