The setup (which I cannot influence) is:
- 0 GPU
- 4000 CPU
- 15.0 Gi memory
I started with a subset of 500,000 rows, but that crashed the kernel. I tried 250,000 with the same result. I am now down to 100,000 and it still crashes.
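I have not actually measured how much memory one chunk of this data needs in relation to the 15 Gi limit; I assume something like the following would show it (untested sketch; query and connection stand for the objects used in the functions below):
Code: Select all
import pandas as pd

# Rough estimate of how much RAM one chunk occupies (deep=True also counts string contents)
sample = next(pd.read_sql_query(query, connection, chunksize=10_000))
print(f"{sample.memory_usage(deep=True).sum() / 1024**2:.1f} MiB for 10,000 rows")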
Based on the company rules I have to make the initial connection to the database as shown below, and that part works:
Code: Select all
# Connection to MSSQL with Kerberos + pyodbc
import os
import pyodbc

def mssql_conn_kerberos(server, driver, trusted_connection, trust_server_certificate, kerberos_cmd):
    # Run the Kerberos command (kinit) for authentication
    os.system(kerberos_cmd)
    try:
        # First connection attempt
        c_conn = pyodbc.connect(
            f'DRIVER={driver};'
            f'SERVER={server};'
            f'Trusted_Connection={trusted_connection};'
            f'TrustServerCertificate={trust_server_certificate}'
        )
    except pyodbc.Error:
        # Re-run Kerberos and retry the connection
        os.system(kerberos_cmd)
        c_conn = pyodbc.connect(
            f"DRIVER={driver};"
            f"SERVER={server};"
            f"Trusted_Connection={trusted_connection};"
            f"TrustServerCertificate={trust_server_certificate}"
        )
    c_cursor = c_conn.cursor()  # cursor is currently unused
    print("Pyodbc connection ready.")
    return c_conn  # Connection to the database
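A call looks roughly like this (server, driver and Kerberos command are placeholders here, not my real values):
Code: Select all
conn = mssql_conn_kerberos(
    server="MYSERVER",                          # placeholder
    driver="{ODBC Driver 17 for SQL Server}",   # placeholder
    trusted_connection="yes",
    trust_server_certificate="yes",
    kerberos_cmd="kinit myuser@EXAMPLE.COM",    # placeholder
)
The function that actually runs the query and loads the data in chunks is this one: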
Code: Select all
import os
import time
import pandas as pd

def call_my_query(path_to_query, query_name, chunk, connection):
    # Build the path to the .sql file and read the query text
    file_path = os.path.join(path_to_query, query_name)
    with open(file_path, "r") as file:
        query = file.read()
    # SQL processing in chunks + timing
    chunks = []
    start_time = time.time()
    for x in pd.read_sql_query(query, connection, chunksize=chunk):
        chunks.append(x)
    # Concatenating the chunks - joining all the chunks together
    df = pd.concat(chunks, ignore_index=True)
    # Process end time
    end_time = time.time()
    print("Data loaded successfully!")
    print(f'Processed {len(df)} rows in {end_time - start_time:.2f} seconds')
    return df
This is the error I get when it crashes:
Code: Select all
The Kernel crashed while executing code in the current cell or a previous cell.
Please review the code in the cell(s) to identify a possible cause of the failure.
Click here for more info.
View Jupyter log for further details.
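One thing I am wondering about: instead of concatenating all chunks in memory, would writing each chunk to disk and only working on the files afterwards avoid the crash? A sketch of what I mean (untested; out_dir is a placeholder and to_parquet needs pyarrow or fastparquet):
Code: Select all
import os
import pandas as pd

def query_to_parquet(path_to_query, query_name, chunk, connection, out_dir):
    # Stream the result set and write every chunk to its own Parquet file,
    # so only one chunk is held in memory at any time.
    with open(os.path.join(path_to_query, query_name), "r") as file:
        query = file.read()
    os.makedirs(out_dir, exist_ok=True)
    for i, part in enumerate(pd.read_sql_query(query, connection, chunksize=chunk)):
        part.to_parquet(os.path.join(out_dir, f"part_{i:05d}.parquet"), index=False)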
My variation of "call_my_query" for Dask:
Code: Select all
import os
import time
import sqlalchemy
import dask.dataframe as dd

def call_my_query_dask(query_name, chunk, connection, index_col):
    # Load query from file (path_to_query is defined elsewhere in the notebook)
    file_path = os.path.join(path_to_query, query_name)
    with open(file_path, "r") as file:
        query_original = file.read()
    # Convert the SQL string/text
    query = sqlalchemy.select(query_original)
    # Start timing the process
    start_time = time.time()
    # Use Dask to read the SQL query in chunks
    print("Executing query and loading data with Dask...")
    df_dask = dd.read_sql_query(
        sql=query,
        con=connection_url,
        npartitions=10,
        index_col=index_col
    )
    # Process end time
    end_time = time.time()
    print("Data loaded successfully!")
    print(f"Processed approximately {df_dask.shape[0].compute()} rows in {end_time - start_time:.2f} seconds")
    return df_dask
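The connection_url used above is not shown; I build it roughly like this (sketch, assuming SQLAlchemy 1.4+; host, database and driver values are placeholders, and I am not sure this is the cleanest way to combine it with the Kerberos setup):
Code: Select all
import sqlalchemy as sa

# SQLAlchemy URL for mssql+pyodbc with integrated (Kerberos/trusted) authentication;
# host, database and driver values are placeholders.
connection_url = sa.engine.URL.create(
    "mssql+pyodbc",
    host="MYSERVER",
    database="MYDATABASE",
    query={
        "driver": "ODBC Driver 17 for SQL Server",
        "Trusted_Connection": "yes",
        "TrustServerCertificate": "yes",
    },
)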
This attempt fails with the following SQLAlchemy message:
Textual column expression 'SELECT\n\t[COL1]\n\t, [COL...' should be explicitly declared with text('SELECT\n\t[COL1]\n\t, [COL...'), or use literal_column('SELECT\n\t[COL1]\n\t, [COL...') for more specificity
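From what I can tell, dd.read_sql_query expects a SQLAlchemy Selectable rather than a raw SQL string, so I assume the fix is to build the statement from column/table objects instead of wrapping the file contents in select(). An untested sketch along the lines of the Dask documentation (table and column names are placeholders, SQLAlchemy 1.4+ positional select):
Code: Select all
import sqlalchemy as sa
import dask.dataframe as dd

# Placeholder table and column names; "ID" stands for an indexed column
# that Dask can use to split the query into partitions.
id_col = sa.column("ID")
stmt = (
    sa.select(id_col, sa.column("COL1"), sa.column("COL2"))
    .select_from(sa.table("MY_TABLE"))
)

df_dask = dd.read_sql_query(
    sql=stmt,
    con=connection_url,
    index_col=id_col,
    npartitions=10,
)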
Thanks to everyone for any help.