Cell Line Biology → Drug Response Learning  
Patient Biology → Drug Response Prediction


# =========================
# 1. BASIC UTILITIES
# =========================
import os
import numpy as np
import pandas as pd

# =========================
# 2. VISUALIZATION
# =========================
import matplotlib.pyplot as plt
import seaborn as sns

# =========================
# 3. SKLEARN (ESSENTIAL ML TOOLS)
# =========================
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.impute import SimpleImputer

# =========================
# 4. PYTORCH (DEEP LEARNING)
# =========================
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# =========================
# 5. WARNING CONTROL
# =========================
import warnings
warnings.filterwarnings("ignore")


import shap
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests
from tqdm import tqdm


# =========================
# TCGA MULTI-OMICS DATA
# =========================

tcga_data = pd.read_csv("/kaggle/input/datasets/samdemharter/brca-multiomics-tcga/data.csv")
tcga_subtypes = pd.read_csv("/kaggle/input/datasets/samdemharter/brca-multiomics-tcga/brca_data_w_subtypes.csv")

print("TCGA DATA SHAPE:", tcga_data.shape)
display(tcga_data.head())

print("\nSUBTYPES DATA SHAPE:", tcga_subtypes.shape)
display(tcga_subtypes.head())

TCGA DATA SHAPE: (705, 1937)

SUBTYPES DATA SHAPE: (705, 1941)


# =========================
# GDSC MAIN DATA
# =========================

gdsc_main = pd.read_csv("/kaggle/input/datasets/samiraalipour/genomics-of-drug-sensitivity-in-cancer-gdsc/GDSC_DATASET.csv")
gdsc2 = pd.read_csv("/kaggle/input/datasets/samiraalipour/genomics-of-drug-sensitivity-in-cancer-gdsc/GDSC2-dataset.csv")
compounds = pd.read_csv("/kaggle/input/datasets/samiraalipour/genomics-of-drug-sensitivity-in-cancer-gdsc/Compounds-annotation.csv")

print("GDSC MAIN SHAPE:", gdsc_main.shape)
display(gdsc_main.head())

print("\nGDSC2 SHAPE:", gdsc2.shape)
display(gdsc2.head())

print("\nCOMPOUNDS SHAPE:", compounds.shape)
display(compounds.head())

GDSC MAIN SHAPE: (242035, 19)

GDSC2 SHAPE: (242036, 19)

COMPOUNDS SHAPE: (621, 6)


# =========================
# CELL LINE DETAILS (EXCEL)
# =========================

excel_path = "/kaggle/input/datasets/samiraalipour/genomics-of-drug-sensitivity-in-cancer-gdsc/Cell_Lines_Details.xlsx"

# Load all sheets
xls = pd.ExcelFile(excel_path)

print("Available Sheets:", xls.sheet_names)

Available Sheets: ['Cell line details', 'COSMIC tissue classification', 'Decode']


cell_line_details = pd.read_excel(xls, sheet_name="Cell line details")
cosmic_classification = pd.read_excel(xls, sheet_name="COSMIC tissue classification")
decode = pd.read_excel(xls, sheet_name="Decode")

print("Cell Line Details:", cell_line_details.shape)
display(cell_line_details.head())

print("\nCOSMIC Classification:", cosmic_classification.shape)
display(cosmic_classification.head())

print("\nDecode:", decode.shape)
display(decode.head())

Cell Line Details: (1002, 13)

COSMIC Classification: (1029, 4)

Decode: (50, 2)


def quick_check(df, name):
    print(f"\n===== {name} =====")
    print(df.shape)
    print(df.columns[:10])
    print(df.isnull().sum().head())

quick_check(tcga_data, "TCGA")
quick_check(gdsc_main, "GDSC MAIN")
quick_check(cell_line_details, "CELL LINE DETAILS")

===== TCGA =====
(705, 1937)
Index(['rs_CLEC3A', 'rs_CPB1', 'rs_SCGB2A2', 'rs_SCGB1D2', 'rs_TFF1',
       'rs_MUCL1', 'rs_GSTM1', 'rs_PIP', 'rs_ADIPOQ', 'rs_ADH1B'],
      dtype='object')
rs_CLEC3A     0
rs_CPB1       0
rs_SCGB2A2    0
rs_SCGB1D2    0
rs_TFF1       0
dtype: int64

===== GDSC MAIN =====
(242035, 19)
Index(['COSMIC_ID', 'CELL_LINE_NAME', 'TCGA_DESC', 'DRUG_ID', 'DRUG_NAME',
       'LN_IC50', 'AUC', 'Z_SCORE', 'GDSC Tissue descriptor 1',
       'GDSC Tissue descriptor 2'],
      dtype='object')
COSMIC_ID            0
CELL_LINE_NAME       0
TCGA_DESC         1067
DRUG_ID              0
DRUG_NAME            0
dtype: int64

===== CELL LINE DETAILS =====
(1002, 13)
Index(['Sample Name', 'COSMIC identifier', 'Whole Exome Sequencing (WES)',
       'Copy Number Alterations (CNA)', 'Gene Expression', 'Methylation',
       'Drug\nResponse', 'GDSC\nTissue descriptor 1',
       'GDSC\nTissue\ndescriptor 2', 'Cancer Type\n(matching TCGA label)'],
      dtype='object')
Sample Name                      0
COSMIC identifier                1
Whole Exome Sequencing (WES)     0
Copy Number Alterations (CNA)    0
Gene Expression                  0
dtype: int64


cell_line_details["Gene Expression"].unique()

array(['Y', 'N', 968], dtype=object)


print("TCGA Shape:", tcga_data.shape)
display(tcga_data.head())

print("\nGDSC Shape:", gdsc_main.shape)
display(gdsc_main.head())

TCGA Shape: (705, 1937)

GDSC Shape: (242035, 19)


gdsc_brca = gdsc_main[
    gdsc_main["TCGA_DESC"].str.contains("BRCA", na=False)
].copy()

print("Filtered GDSC BRCA Shape:", gdsc_brca.shape)
gdsc_brca["TCGA_DESC"].value_counts()

Filtered GDSC BRCA Shape: (13106, 19)

TCGA_DESC
BRCA    13106
Name: count, dtype: int64


cols = [
    "CELL_LINE_NAME",
    "DRUG_NAME",
    "LN_IC50",
    "TCGA_DESC",
    "GDSC Tissue descriptor 1",
    "GDSC Tissue descriptor 2",
    "Microsatellite instability Status (MSI)",
    "Growth Properties"
]

gdsc_brca = gdsc_brca[cols].copy()
gdsc_brca.head()


print(gdsc_brca.isnull().sum())

# Drop rows without target
gdsc_brca = gdsc_brca.dropna(subset=["LN_IC50"])

# Fill categorical missing values
for col in gdsc_brca.columns:
    if gdsc_brca[col].dtype == "object":
        gdsc_brca[col] = gdsc_brca[col].fillna("Unknown")

CELL_LINE_NAME                               0
DRUG_NAME                                    0
LN_IC50                                      0
TCGA_DESC                                    0
GDSC Tissue descriptor 1                     0
GDSC Tissue descriptor 2                     0
Microsatellite instability Status (MSI)    283
Growth Properties                            0
dtype: int64


# Check missing values again
print(gdsc_brca.isnull().sum())

# Handle MSI missing values explicitly
gdsc_brca["Microsatellite instability Status (MSI)"] = \
    gdsc_brca["Microsatellite instability Status (MSI)"].fillna("Unknown")

# Final check
print("\nAfter handling missing:")
print(gdsc_brca.isnull().sum())

CELL_LINE_NAME                             0
DRUG_NAME                                  0
LN_IC50                                    0
TCGA_DESC                                  0
GDSC Tissue descriptor 1                   0
GDSC Tissue descriptor 2                   0
Microsatellite instability Status (MSI)    0
Growth Properties                          0
dtype: int64

After handling missing:
CELL_LINE_NAME                             0
DRUG_NAME                                  0
LN_IC50                                    0
TCGA_DESC                                  0
GDSC Tissue descriptor 1                   0
GDSC Tissue descriptor 2                   0
Microsatellite instability Status (MSI)    0
Growth Properties                          0
dtype: int64


X = pd.get_dummies(gdsc_brca.drop(columns=["LN_IC50"]))
y = gdsc_brca["LN_IC50"]

print("Feature shape:", X.shape)

Feature shape: (13106, 345)


# For now, DO NOT scale X
X_model = X.copy()


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_model, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)

(10484, 345) (2622, 345)


from sklearn.preprocessing import StandardScaler

X_tcga = tcga_data.drop(columns=["vital.status"], errors="ignore")
y_tcga = tcga_data.get("vital.status")

scaler_tcga = StandardScaler()
X_tcga_scaled = scaler_tcga.fit_transform(X_tcga)


drug_counts = gdsc_brca["DRUG_NAME"].value_counts()
gdsc_brca = gdsc_brca[
    gdsc_brca["DRUG_NAME"].isin(drug_counts[drug_counts > 30].index)
]


from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

model = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
import numpy as np
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("RMSE:", rmse)
print("R2 Score:", r2)

RMSE: 1.4684609512258178
R2 Score: 0.7114470729457797


drug_effectiveness = gdsc_brca.groupby("DRUG_NAME")["LN_IC50"].mean()

top_drugs = drug_effectiveness.sort_values().head(10)

top_drugs

DRUG_NAME
Romidepsin             -5.065256
Bortezomib             -4.935499
Sepantronium bromide   -3.856885
Dactinomycin           -3.040432
Eg5_9814               -2.877072
Docetaxel              -2.781732
Paclitaxel             -2.602020
SN-38                  -2.521838
Vinblastine            -2.495477
Dinaciclib             -2.391435
Name: LN_IC50, dtype: float64


import pandas as pd

# Create base template
base_input = X_train.iloc[0:1].copy()
base_input[:] = 0

# Set BRCA context
if "TCGA_DESC_BRCA" in base_input.columns:
    base_input["TCGA_DESC_BRCA"] = 1

drug_columns = [col for col in X.columns if col.startswith("DRUG_NAME_")]

results = []

for drug_col in drug_columns:
    temp = base_input.copy()
    temp[drug_col] = 1
    
    pred = model.predict(temp)[0]
    drug_name = drug_col.replace("DRUG_NAME_", "")
    
    results.append((drug_name, pred))

# Rank drugs
ranked_drugs = sorted(results, key=lambda x: x[1])

# Top 10 predicted drugs
ranked_drugs[:10]

[('Bortezomib', np.float64(-5.7079588)),
 ('Romidepsin', np.float64(-5.028690395000005)),
 ('Docetaxel', np.float64(-4.495570110166665)),
 ('Daporinad', np.float64(-4.130855989999993)),
 ('Eg5_9814', np.float64(-3.9013913650000025)),
 ('Dactinomycin', np.float64(-3.8376358526071463)),
 ('Sepantronium bromide', np.float64(-3.815396625000005)),
 ('SN-38', np.float64(-3.690362534999999)),
 ('Vinorelbine', np.float64(-3.534576590000001)),
 ('Paclitaxel', np.float64(-3.3584968999999982))]

Observed Drug Sensitivity ≈ Model Predictions


print("TCGA Prepared Shape:", X_tcga_scaled.shape)

TCGA Prepared Shape: (705, 1936)


import torch
import torch.nn as nn
import torch.optim as optim

# Convert to tensor
X_tcga_tensor = torch.tensor(X_tcga_scaled, dtype=torch.float32)

# Autoencoder
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return z, x_recon

model_ae = Autoencoder(input_dim=X_tcga_scaled.shape[1])
optimizer = optim.Adam(model_ae.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Training
for epoch in range(50):
    optimizer.zero_grad()
    z, recon = model_ae(X_tcga_tensor)
    loss = criterion(recon, X_tcga_tensor)
    loss.backward()
    optimizer.step()

    if epoch % 5 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

Epoch 0, Loss: 1.0042
Epoch 5, Loss: 0.8649
Epoch 10, Loss: 0.7471
Epoch 15, Loss: 0.6563
Epoch 20, Loss: 0.5779
Epoch 25, Loss: 0.5179
Epoch 30, Loss: 0.4717
Epoch 35, Loss: 0.4333
Epoch 40, Loss: 0.4047
Epoch 45, Loss: 0.3779


with torch.no_grad():
    tcga_embeddings, _ = model_ae(X_tcga_tensor)

tcga_embeddings = tcga_embeddings.numpy()
print("Latent TCGA shape:", tcga_embeddings.shape)

Latent TCGA shape: (705, 128)


import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

# -----------------------------
# 1. Train-validation split
# -----------------------------
X_train_tcga, X_val_tcga = train_test_split(
    X_tcga_scaled, test_size=0.2, random_state=42
)

X_train_tcga = torch.tensor(X_train_tcga, dtype=torch.float32)
X_val_tcga = torch.tensor(X_val_tcga, dtype=torch.float32)

# -----------------------------
# 2. Improved Autoencoder
# -----------------------------
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=64):  # reduced latent size
        super().__init__()
        
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            
            nn.Linear(256, latent_dim)
        )
        
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            
            nn.Linear(256, 512),
            nn.ReLU(),
            
            nn.Linear(512, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return z, x_recon

# -----------------------------
# 3. Model setup
# -----------------------------
input_dim = X_tcga_scaled.shape[1]
model_ae = Autoencoder(input_dim=input_dim, latent_dim=64)

optimizer = optim.Adam(model_ae.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# -----------------------------
# 4. Training with validation
# -----------------------------
epochs = 100
best_val_loss = float("inf")
patience = 10
counter = 0

for epoch in range(epochs):
    
    # ---- TRAIN ----
    model_ae.train()
    optimizer.zero_grad()
    
    z_train, recon_train = model_ae(X_train_tcga)
    train_loss = criterion(recon_train, X_train_tcga)
    
    train_loss.backward()
    optimizer.step()
    
    # ---- VALIDATION ----
    model_ae.eval()
    with torch.no_grad():
        z_val, recon_val = model_ae(X_val_tcga)
        val_loss = criterion(recon_val, X_val_tcga)
    
    # ---- EARLY STOPPING ----
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        counter = 0
    else:
        counter += 1
    
    if counter >= patience:
        print(f"Early stopping at epoch {epoch}")
        break
    
    # ---- LOGGING ----
    if epoch % 10 == 0:
        print(f"Epoch {epoch} | Train Loss: {train_loss.item():.4f} | Val Loss: {val_loss.item():.4f}")

Epoch 0 | Train Loss: 1.0111 | Val Loss: 0.9603
Epoch 10 | Train Loss: 0.8737 | Val Loss: 0.8258
Epoch 20 | Train Loss: 0.7851 | Val Loss: 0.7579
Epoch 30 | Train Loss: 0.7079 | Val Loss: 0.7008
Epoch 40 | Train Loss: 0.6535 | Val Loss: 0.6644
Epoch 50 | Train Loss: 0.6133 | Val Loss: 0.6453
Epoch 60 | Train Loss: 0.5781 | Val Loss: 0.6331
Epoch 70 | Train Loss: 0.5464 | Val Loss: 0.6262
Epoch 80 | Train Loss: 0.5174 | Val Loss: 0.6228
Epoch 90 | Train Loss: 0.4891 | Val Loss: 0.6228
Early stopping at epoch 94


print(model_ae)

Autoencoder(
  (encoder): Sequential(
    (0): Linear(in_features=1936, out_features=512, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.3, inplace=False)
    (3): Linear(in_features=512, out_features=256, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.3, inplace=False)
    (6): Linear(in_features=256, out_features=64, bias=True)
  )
  (decoder): Sequential(
    (0): Linear(in_features=64, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=1936, bias=True)
  )
)


import torch
import torch.nn as nn
import torch.optim as optim

# Convert data
X_tensor = torch.tensor(X.values, dtype=torch.float32)
y_tensor = torch.tensor(y.values, dtype=torch.float32).view(-1, 1)

# Define model
class DrugModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            
            nn.Linear(256, 128),
            nn.ReLU(),
            
            nn.Linear(128, 1)
        )

    def forward(self, x):
        return self.net(x)

# Initialize
drug_model = DrugModel(X.shape[1])

optimizer = optim.Adam(drug_model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Training
epochs = 100

for epoch in range(epochs):
    drug_model.train()
    
    optimizer.zero_grad()
    preds = drug_model(X_tensor)
    loss = criterion(preds, y_tensor)
    
    loss.backward()
    optimizer.step()
    
    if epoch % 5 == 0:
        print(f"Epoch {epoch} | Loss: {loss.item():.4f}")

Epoch 0 | Loss: 16.8086
Epoch 5 | Loss: 15.4064
Epoch 10 | Loss: 13.3671
Epoch 15 | Loss: 10.1954
Epoch 20 | Loss: 6.9651
Epoch 25 | Loss: 6.5886
Epoch 30 | Loss: 6.3395
Epoch 35 | Loss: 5.4245
Epoch 40 | Loss: 5.2448
Epoch 45 | Loss: 4.8101
Epoch 50 | Loss: 4.3130
Epoch 55 | Loss: 4.0003
Epoch 60 | Loss: 3.5914
Epoch 65 | Loss: 3.2783
Epoch 70 | Loss: 2.9976
Epoch 75 | Loss: 2.7949
Epoch 80 | Loss: 2.6352
Epoch 85 | Loss: 2.5112
Epoch 90 | Loss: 2.4155
Epoch 95 | Loss: 2.3368


!pip install torchinfo
from torchinfo import summary

summary(
    drug_model,
    input_size=(1, X.shape[1]),  # batch size 1
    col_names=["input_size", "output_size", "num_params", "trainable"],
)

Requirement already satisfied: torchinfo in /usr/local/lib/python3.12/dist-packages (1.8.0)

============================================================================================================================================
Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Trainable
============================================================================================================================================
DrugModel                                [1, 345]                  [1, 1]                    --                        True
├─Sequential: 1-1                        [1, 345]                  [1, 1]                    --                        True
│    └─Linear: 2-1                       [1, 345]                  [1, 256]                  88,576                    True
│    └─ReLU: 2-2                         [1, 256]                  [1, 256]                  --                        --
│    └─Dropout: 2-3                      [1, 256]                  [1, 256]                  --                        --
│    └─Linear: 2-4                       [1, 256]                  [1, 128]                  32,896                    True
│    └─ReLU: 2-5                         [1, 128]                  [1, 128]                  --                        --
│    └─Linear: 2-6                       [1, 128]                  [1, 1]                    129                       True
============================================================================================================================================
Total params: 121,601
Trainable params: 121,601
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.12
============================================================================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.49
Estimated Total Size (MB): 0.49
============================================================================================================================================


import torch

# Ensure model is in eval mode
drug_model.eval()

# Create base input (same shape as training features)
base = torch.zeros((1, X.shape[1]), dtype=torch.float32)

# Set BRCA context if present
if "TCGA_DESC_BRCA" in X.columns:
    base[0, X.columns.get_loc("TCGA_DESC_BRCA")] = 1

# Get drug columns
drug_indices = [i for i, col in enumerate(X.columns) if col.startswith("DRUG_NAME_")]

results = []

for i in drug_indices:
    temp = base.clone()
    temp[0, i] = 1  # activate drug feature
    
    pred = drug_model(temp).item()
    drug_name = X.columns[i].replace("DRUG_NAME_", "")
    
    results.append((drug_name, pred))

# Rank drugs (lower IC50 = better)
ranked_drugs = sorted(results, key=lambda x: x[1])

# Top 10 drugs
ranked_drugs[:10]

[('Dactinomycin', 0.09183281660079956),
 ('Camptothecin', 0.09219810366630554),
 ('Eg5_9814', 0.09471762180328369),
 ('Sepantronium bromide', 0.0961800068616867),
 ('SN-38', 0.09625014662742615),
 ('Romidepsin', 0.09637980163097382),
 ('Rapamycin', 0.09649278223514557),
 ('Paclitaxel', 0.09678377211093903),
 ('Dactolisib', 0.09703224897384644),
 ('Bortezomib', 0.09705592691898346)]


#import shap

# smaller sample
#X_sample = X_test.sample(20, random_state=42)

#explainer = shap.TreeExplainer(model)
#shap_values = explainer.shap_values(X_sample)

#shap.summary_plot(shap_values, X_sample)

Patient Biology → Drug Sensitivity Prediction → Therapy Recommendation


import torch
import pandas as pd

drug_model.eval()

# Prepare base vector (same structure as GDSC input)
base = torch.zeros((1, X.shape[1]), dtype=torch.float32)

# Set BRCA context
if "TCGA_DESC_BRCA" in X.columns:
    base[0, X.columns.get_loc("TCGA_DESC_BRCA")] = 1

# Drug indices
drug_indices = [i for i, col in enumerate(X.columns) if col.startswith("DRUG_NAME_")]

# -----------------------------
# PATIENT-LEVEL PREDICTION
# -----------------------------
patient_drug_predictions = {}

for patient_id in range(len(tcga_embeddings)):
    
    patient_results = []
    
    for i in drug_indices:
        temp = base.clone()
        temp[0, i] = 1
        
        pred = drug_model(temp).item()
        drug_name = X.columns[i].replace("DRUG_NAME_", "")
        
        patient_results.append((drug_name, pred))
    
    ranked = sorted(patient_results, key=lambda x: x[1])
    
    # Store top 5 drugs per patient
    patient_drug_predictions[f"Patient_{patient_id}"] = ranked[:5]

# Example output
list(patient_drug_predictions.items())[:2]

[('Patient_0',
  [('Dactinomycin', 0.09183281660079956),
   ('Camptothecin', 0.09219810366630554),
   ('Eg5_9814', 0.09471762180328369),
   ('Sepantronium bromide', 0.0961800068616867),
   ('SN-38', 0.09625014662742615)]),
 ('Patient_1',
  [('Dactinomycin', 0.09183281660079956),
   ('Camptothecin', 0.09219810366630554),
   ('Eg5_9814', 0.09471762180328369),
   ('Sepantronium bromide', 0.0961800068616867),
   ('SN-38', 0.09625014662742615)])]


results_df = []

for patient, drugs in patient_drug_predictions.items():
    for rank, (drug, score) in enumerate(drugs):
        results_df.append({
            "Patient": patient,
            "Rank": rank + 1,
            "Drug": drug,
            "Predicted_IC50": score
        })

results_df = pd.DataFrame(results_df)

results_df.head(20)


top_drugs = results_df.groupby("Drug")["Rank"].mean().sort_values()

top_drugs.head(10)

Drug
Dactinomycin            1.0
Camptothecin            2.0
Eg5_9814                3.0
Sepantronium bromide    4.0
SN-38                   5.0
Name: Rank, dtype: float64

TCGA embeddings are not used as input to the drug model


import numpy as np
import pandas as pd

# -----------------------------
# Create drug one-hot mapping
# -----------------------------
drug_names = sorted(gdsc_brca['DRUG_NAME'].unique())
drug_to_idx = {drug: i for i, drug in enumerate(drug_names)}

num_drugs = len(drug_names)

# -----------------------------
# Create training dataset
# -----------------------------
X_final = []
y_final = []

for _, row in gdsc_brca.iterrows():
    
    drug = row['DRUG_NAME']
    if drug not in drug_to_idx:
        continue
    
    # Randomly assign a TCGA embedding (proxy alignment)
    patient_idx = np.random.randint(0, len(tcga_embeddings))
    patient_embed = tcga_embeddings[patient_idx]
    
    # Drug one-hot
    drug_vec = np.zeros(num_drugs)
    drug_vec[drug_to_idx[drug]] = 1
    
    # Combine
    combined = np.concatenate([patient_embed, drug_vec])
    
    X_final.append(combined)
    y_final.append(row['LN_IC50'])

X_final = np.array(X_final)
y_final = np.array(y_final)

print("Final Input Shape:", X_final.shape)

Final Input Shape: (12974, 401)


import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

# Split
X_train, X_val, y_train, y_val = train_test_split(
    X_final, y_final, test_size=0.2, random_state=42
)

X_train = torch.tensor(X_train, dtype=torch.float32)
X_val = torch.tensor(X_val, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
y_val = torch.tensor(y_val, dtype=torch.float32).view(-1, 1)

# Model
class PatientDrugModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            
            nn.Linear(512, 256),
            nn.ReLU(),
            
            nn.Linear(256, 64),
            nn.ReLU(),
            
            nn.Linear(64, 1)
        )
    
    def forward(self, x):
        return self.net(x)

model_pd = PatientDrugModel(X_train.shape[1])

optimizer = optim.Adam(model_pd.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Training
for epoch in range(100):
    
    model_pd.train()
    optimizer.zero_grad()
    
    preds = model_pd(X_train)
    loss = criterion(preds, y_train)
    
    loss.backward()
    optimizer.step()
    
    # Validation
    model_pd.eval()
    with torch.no_grad():
        val_preds = model_pd(X_val)
        val_loss = criterion(val_preds, y_val)
    
    if epoch % 5 == 0:
        print(f"Epoch {epoch} | Train: {loss.item():.4f} | Val: {val_loss.item():.4f}")

Epoch 0 | Train: 16.9719 | Val: 15.2478
Epoch 5 | Train: 7.4599 | Val: 8.3644
Epoch 10 | Train: 7.2087 | Val: 7.1292
Epoch 15 | Train: 7.7485 | Val: 7.2843
Epoch 20 | Train: 7.1278 | Val: 7.0564
Epoch 25 | Train: 6.8889 | Val: 6.6874
Epoch 30 | Train: 6.9184 | Val: 6.7122
Epoch 35 | Train: 6.7295 | Val: 6.6401
Epoch 40 | Train: 6.5845 | Val: 6.4637
Epoch 45 | Train: 6.4856 | Val: 6.3642
Epoch 50 | Train: 6.3632 | Val: 6.2806
Epoch 55 | Train: 6.2202 | Val: 6.1688
Epoch 60 | Train: 6.0499 | Val: 6.0461
Epoch 65 | Train: 5.8617 | Val: 5.8867
Epoch 70 | Train: 5.6617 | Val: 5.7253
Epoch 75 | Train: 5.3950 | Val: 5.5325
Epoch 80 | Train: 5.1299 | Val: 5.3125
Epoch 85 | Train: 4.8510 | Val: 5.0726
Epoch 90 | Train: 4.5225 | Val: 4.7919
Epoch 95 | Train: 4.1631 | Val: 4.4673


model_pd.eval()

patient_predictions = {}

for p_idx in range(len(tcga_embeddings)):
    
    patient_embed = tcga_embeddings[p_idx]
    results = []
    
    for drug, d_idx in drug_to_idx.items():
        
        drug_vec = np.zeros(num_drugs)
        drug_vec[d_idx] = 1
        
        combined = np.concatenate([patient_embed, drug_vec])
        combined_tensor = torch.tensor(combined, dtype=torch.float32).unsqueeze(0)
        
        pred = model_pd(combined_tensor).item()
        results.append((drug, pred))
    
    ranked = sorted(results, key=lambda x: x[1])
    patient_predictions[f"Patient_{p_idx}"] = ranked[:5]

# Show first 3 patients
list(patient_predictions.items())[:3]

[('Patient_0',
  [('Dactinomycin', 1.3400644063949585),
   ('Dinaciclib', 1.356196641921997),
   ('Docetaxel', 1.3603168725967407),
   ('Vinorelbine', 1.3685065507888794),
   ('Staurosporine', 1.3775458335876465)]),
 ('Patient_1',
  [('Dactinomycin', 1.9531712532043457),
   ('Rapamycin', 1.9948554039001465),
   ('Bortezomib', 2.029646396636963),
   ('Eg5_9814', 2.0298521518707275),
   ('Dactolisib', 2.0301706790924072)]),
 ('Patient_2',
  [('Dactinomycin', 2.3876516819000244),
   ('Romidepsin', 2.4177968502044678),
   ('Dinaciclib', 2.4332621097564697),
   ('Vinorelbine', 2.4362428188323975),
   ('Epirubicin', 2.44201397895813)])]


# Keep only BRCA in GDSC
gdsc_brca = gdsc_main[gdsc_main['TCGA_DESC'] == 'BRCA'].copy()

print("Before:", gdsc_main.shape, "After:", gdsc_brca.shape)

Before: (242035, 19) After: (13106, 19)


import numpy as np

drug_names = sorted(gdsc_brca['DRUG_NAME'].unique())
drug_to_idx = {d: i for i, d in enumerate(drug_names)}
num_drugs = len(drug_names)

X_final, y_final = [], []

for i, row in gdsc_brca.reset_index(drop=True).iterrows():
    # deterministic mapping
    patient_idx = i % len(tcga_embeddings)
    patient_embed = tcga_embeddings[patient_idx]

    drug_vec = np.zeros(num_drugs)
    drug_vec[drug_to_idx[row['DRUG_NAME']]] = 1

    X_final.append(np.concatenate([patient_embed, drug_vec]))
    y_final.append(row['LN_IC50'])

X_final = np.array(X_final)
y_final = np.array(y_final)

print(X_final.shape, y_final.shape)

(13106, 414) (13106,)

[Patient Embedding] + [Drug Info] + [Pathway Info] → Drug Response


# One-hot encode TARGET and TARGET_PATHWAY
path_df = pd.get_dummies(gdsc_brca[['TARGET', 'TARGET_PATHWAY']].fillna('UNK'))

# Align with iteration order
path_mat = path_df.values

X_final = np.hstack([X_final, path_mat])
print("With pathway features:", X_final.shape)

With pathway features: (13106, 624)


from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim

X_tr, X_va, y_tr, y_va = train_test_split(X_final, y_final, test_size=0.2, random_state=42)

X_tr = torch.tensor(X_tr, dtype=torch.float32)
X_va = torch.tensor(X_va, dtype=torch.float32)
y_tr = torch.tensor(y_tr, dtype=torch.float32).view(-1,1)
y_va = torch.tensor(y_va, dtype=torch.float32).view(-1,1)

class PatientDrugModel(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1)
        )
    def forward(self, x): return self.net(x)

model_pd = PatientDrugModel(X_tr.shape[1])
opt = optim.Adam(model_pd.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for ep in range(100):
    model_pd.train()
    opt.zero_grad()
    pred = model_pd(X_tr)
    loss = loss_fn(pred, y_tr)
    loss.backward(); opt.step()

    model_pd.eval()
    with torch.no_grad():
        v = loss_fn(model_pd(X_va), y_va)

    if ep % 5 == 0:
        print(f"Epoch {ep} | Train {loss.item():.4f} | Val {v.item():.4f}")

Epoch 0 | Train 16.9250 | Val 16.2017
Epoch 5 | Train 7.5643 | Val 8.1778
Epoch 10 | Train 6.9079 | Val 7.6112
Epoch 15 | Train 7.4528 | Val 7.8514
Epoch 20 | Train 6.6583 | Val 7.4754
Epoch 25 | Train 6.3645 | Val 6.9829
Epoch 30 | Train 6.3209 | Val 6.8925
Epoch 35 | Train 5.9946 | Val 6.6570
Epoch 40 | Train 5.6950 | Val 6.3197
Epoch 45 | Train 5.4224 | Val 6.0141
Epoch 50 | Train 5.0946 | Val 5.6969
Epoch 55 | Train 4.7399 | Val 5.3449
Epoch 60 | Train 4.3659 | Val 4.9809
Epoch 65 | Train 3.9926 | Val 4.6135
Epoch 70 | Train 3.6443 | Val 4.2810
Epoch 75 | Train 3.3499 | Val 3.9859
Epoch 80 | Train 3.1009 | Val 3.7410
Epoch 85 | Train 2.8723 | Val 3.5414
Epoch 90 | Train 2.6927 | Val 3.3866
Epoch 95 | Train 2.5518 | Val 3.2780


model_pd.eval()

patient_predictions = {}

# Precompute pathway encoding per drug (IMPORTANT)
drug_pathway_map = {}

for drug in drug_to_idx.keys():
    sub = gdsc_brca[gdsc_brca['DRUG_NAME'] == drug]
    
    if len(sub) == 0:
        continue
    
    # Take most frequent pathway row
    idx = sub.index[0]
    pathway_vector = path_df.loc[idx].values.astype(np.float32)
    
    drug_pathway_map[drug] = pathway_vector


# -----------------------------
# PREDICTION LOOP
# -----------------------------
for p_idx, emb in enumerate(tcga_embeddings):
    
    results = []
    emb = emb.astype(np.float32)
    
    for drug, d_idx in drug_to_idx.items():
        
        if drug not in drug_pathway_map:
            continue
        
        # Drug one-hot
        drug_vec = np.zeros(num_drugs, dtype=np.float32)
        drug_vec[d_idx] = 1.0
        
        # Pathway vector (already numeric)
        path_vec = drug_pathway_map[drug]
        
        # Combine safely
        x = np.concatenate([emb, drug_vec, path_vec]).astype(np.float32)
        
        # Convert to tensor
        x = torch.from_numpy(x).unsqueeze(0)
        
        # Predict
        yhat = model_pd(x).item()
        results.append((drug, yhat))
    
    patient_predictions[f"Patient_{p_idx}"] = sorted(results, key=lambda z: z[1])[:5]

# Show output
list(patient_predictions.items())[:3]

[('Patient_0',
  [('Docetaxel', 0.13874110579490662),
   ('Paclitaxel', 0.14748284220695496),
   ('Vinblastine', 0.17387035489082336),
   ('Vinorelbine', 0.1762823760509491),
   ('Bortezomib', 0.1781226396560669)]),
 ('Patient_1',
  [('Paclitaxel', 0.08392270654439926),
   ('Vinblastine', 0.10867483168840408),
   ('Vinorelbine', 0.11527568846940994),
   ('Docetaxel', 0.11705417186021805),
   ('Sepantronium bromide', 0.18155643343925476)]),
 ('Patient_2',
  [('Vinorelbine', 0.47331511974334717),
   ('Docetaxel', 0.4963309168815613),
   ('Vinblastine', 0.510337233543396),
   ('Paclitaxel', 0.5202227234840393),
   ('MG-132', 0.6121956706047058)])]


# collect top drugs across patients
top_drugs = []
for p, lst in patient_predictions.items():
    top_drugs.extend([d for d,_ in lst])

top_drugs = pd.Series(top_drugs)

# map to pathways
drug_path = gdsc_brca[['DRUG_NAME','TARGET_PATHWAY']].drop_duplicates()
merged = pd.DataFrame({'DRUG_NAME': top_drugs}).merge(drug_path, on='DRUG_NAME', how='left')

path_counts = merged['TARGET_PATHWAY'].value_counts().head(10)
path_counts

TARGET_PATHWAY
Mitosis                              2805
Protein stability and degradation     643
Chromatin histone acetylation          41
Apoptosis regulation                   26
Cell cycle                             10
Name: count, dtype: int64


# measure diversity: how many unique drugs appear across patients
unique_top = set()
for v in patient_predictions.values():
    unique_top.update([d for d,_ in v])

len(unique_top), list(unique_top)[:10]

(13,
 ['CDK9_5038',
  'Docetaxel',
  'Vinorelbine',
  'Sepantronium bromide',
  'Luminespib',
  'Tanespimycin',
  'Elesclomol',
  'Bortezomib',
  'Romidepsin',
  'Paclitaxel'])

Lower LN(IC50) → Higher predicted drug sensitivity


from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
emb_2d = pca.fit_transform(tcga_embeddings)

# Cluster patients
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(tcga_embeddings)

plt.figure(figsize=(8,6))
for i in range(3):
    plt.scatter(
        emb_2d[labels==i,0],
        emb_2d[labels==i,1],
        label=f"Cluster {i}",
        alpha=0.7
    )

plt.legend()
plt.title("Patient Clusters (Latent Space)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pivot = results_df.pivot(index="Patient", columns="Drug", values="Predicted_IC50")

# 🔥 MIN-MAX scaling per patient (row-wise)
pivot_scaled = pivot.copy()

for i in range(pivot.shape[0]):
    row = pivot.iloc[i]
    min_v = row.min()
    max_v = row.max()
    
    if max_v - min_v > 1e-6:
        pivot_scaled.iloc[i] = (row - min_v) / (max_v - min_v)
    else:
        pivot_scaled.iloc[i] = 0  # avoid divide-by-zero

# 🔥 take subset for visualization
pivot_subset = pivot_scaled.head(50)

plt.figure(figsize=(10,8))
sns.heatmap(pivot_subset, cmap="coolwarm")
plt.title("(B) Relative Drug Sensitivity (Scaled per Patient)")
plt.show()


# compute variability AFTER scaling
drug_var = pivot_scaled.std().sort_values(ascending=False)

plt.figure(figsize=(8,5))
drug_var.plot(kind='bar')
plt.title("(C) Drug Variability (Scaled)")
plt.ylabel("Std Dev")
plt.xticks(rotation=45)
plt.show()


drug_rank_mean = results_df.groupby("Drug")["Rank"].mean().sort_values()

plt.figure(figsize=(10,5))
drug_rank_mean.plot(kind='bar')
plt.title("Average Drug Rank Across Patients")
plt.ylabel("Average Rank")
plt.show()


top_drugs_series = results_df[results_df["Rank"] <= 3]["Drug"]

top_drugs_series.value_counts().plot(kind='bar')
plt.title("Most Frequently Recommended Drugs")
plt.show()


import seaborn as sns

top = results_df[results_df["Rank"] <= 3]

plt.figure(figsize=(10,6))
sns.countplot(data=top, x="Drug", hue="Patient")
plt.xticks(rotation=45)
plt.title("Top Drug Preferences Across Patients")
plt.show()


# use only top 8 drugs
import networkx as nx
top_drugs_clean = list(set(top_drugs))[:8]

G = nx.DiGraph()

for drug in top_drugs_clean:
    path = drug_path[drug_path["DRUG_NAME"] == drug]["TARGET_PATHWAY"].values[0]
    G.add_edge(path, drug)

plt.figure(figsize=(8,6))
pos = nx.spring_layout(G, k=1.2)

nx.draw(
    G, pos,
    with_labels=True,
    node_size=2500,
    font_size=9
)

plt.title("(E) Clean Drug–Pathway Network")
plt.show()


import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import networkx as nx

# -----------------------------
# DATA PREP
# -----------------------------
pivot = results_df.pivot(index="Patient", columns="Drug", values="Predicted_IC50")

# 🔥 Min-max scaling per patient
pivot_scaled = pivot.copy()

for i in range(pivot.shape[0]):
    row = pivot.iloc[i]
    min_v = row.min()
    max_v = row.max()
    
    if max_v - min_v > 1e-6:
        pivot_scaled.iloc[i] = (row - min_v) / (max_v - min_v)
    else:
        pivot_scaled.iloc[i] = 0

# Subset for visualization
pivot_subset = pivot_scaled.head(50)

# Drug variability
drug_var = pivot_scaled.std().sort_values(ascending=False)

# Top drugs
top = results_df[results_df["Rank"] <= 3]

# PCA + clustering
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(tcga_embeddings)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(tcga_embeddings)

# -----------------------------
# FIGURE SETUP
# -----------------------------
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3)

# -----------------------------
# (A) PCA CLUSTER
# -----------------------------
ax1 = fig.add_subplot(gs[0, 0])

for i in range(3):
    ax1.scatter(
        emb_2d[labels == i, 0],
        emb_2d[labels == i, 1],
        label=f"C{i}",
        alpha=0.7
    )

ax1.set_title("(A) Patient Clusters")
ax1.set_xlabel("PC1")
ax1.set_ylabel("PC2")
ax1.legend()

# -----------------------------
# (B) HEATMAP
# -----------------------------
ax2 = fig.add_subplot(gs[0, 1:])

sns.heatmap(
    pivot_subset,
    cmap="coolwarm",
    ax=ax2,
    cbar=True
)

ax2.set_title("(B) Relative Drug Sensitivity (Scaled)")

# -----------------------------
# (C) DRUG VARIABILITY
# -----------------------------
ax3 = fig.add_subplot(gs[1, 0])

drug_var.plot(kind='bar', ax=ax3)

ax3.set_title("(C) Drug Variability")
ax3.set_ylabel("Std Dev")
ax3.tick_params(axis='x', rotation=45)

# -----------------------------
# (D) TOP DRUG FREQUENCY
# -----------------------------
ax4 = fig.add_subplot(gs[1, 1])

sns.countplot(data=top, x="Drug", ax=ax4)

ax4.set_title("(D) Top Drug Frequency")
ax4.tick_params(axis='x', rotation=45)

# -----------------------------
# (E) NETWORK GRAPH
# -----------------------------
ax5 = fig.add_subplot(gs[1:, 2])

top_drugs_clean = list(set(top_drugs))[:8]

G = nx.DiGraph()

for drug in top_drugs_clean:
    path = drug_path[drug_path["DRUG_NAME"] == drug]["TARGET_PATHWAY"].values[0]
    G.add_edge(path, drug)

pos = nx.spring_layout(G, k=1.2)

nx.draw(
    G,
    pos,
    with_labels=True,
    node_size=2500,
    font_size=8,
    ax=ax5
)

ax5.set_title("(E) Drug–Pathway Network")

# -----------------------------
# FINAL LAYOUT
# -----------------------------
plt.tight_layout()
plt.show()

The model exhibits low variability in drug response predictions across patients.

	rs_CLEC3A	rs_CPB1	rs_SCGB2A2	rs_SCGB1D2	rs_TFF1	rs_MUCL1	rs_GSTM1	rs_PIP	rs_ADIPOQ	rs_ADH1B	...	pp_p27.pT198	pp_p38.MAPK	pp_p38.pT180.Y182	pp_p53	pp_p62.LCK.ligand	pp_p70S6K	pp_p70S6K.pT389	pp_p90RSK	pp_p90RSK.pT359.S363
0	0.892818	6.580103	14.123672	10.606501	13.189237	6.649466	10.520335	10.338490	10.248379	10.229970	...	-0.043330	-0.002598	0.449228	-0.375230	-0.691766	-0.337863	-0.178503	0.011638	-0.207257
1	0.000000	3.691311	17.116090	15.517231	9.867616	9.691667	8.179522	7.911723	1.289598	1.818891	...	-0.220764	0.220809	1.035115	-0.074136	0.279067	0.292925	-0.155242	-0.089365	0.267530
2	3.748150	4.375255	9.658123	5.326983	12.109539	11.644307	10.517330	5.114925	11.975349	11.911437	...	0.010615	-0.133214	0.344969	-0.351936	0.219910	0.308110	-0.190794	-0.222150	-0.198518
3	0.000000	18.235519	18.535480	14.533584	14.078992	8.913760	10.557465	13.304434	8.205059	9.211476	...	0.064070	-0.384008	0.678042	0.096329	-0.266554	-0.079871	-0.463237	0.522998	-0.046902
4	0.000000	4.583724	15.711865	12.804521	8.881669	8.430028	12.964607	6.806517	4.294341	5.385714	...	-0.065488	0.209858	0.920408	0.042210	-0.441542	-0.152317	0.511386	-0.096482	0.037473

	rs_CLEC3A	rs_CPB1	rs_SCGB2A2	rs_SCGB1D2	rs_TFF1	rs_MUCL1	rs_GSTM1	rs_PIP	rs_ADIPOQ	rs_ADH1B	...	pp_p62.LCK.ligand	pp_p70S6K	pp_p70S6K.pT389	pp_p90RSK	pp_p90RSK.pT359.S363	PR.Status	ER.Status	HER2.Final.Status	histological.type
0	0.892818	6.580103	14.123672	10.606501	13.189237	6.649466	10.520335	10.338490	10.248379	10.229970	...	-0.691766	-0.337863	-0.178503	0.011638	-0.207257	Positive	Positive	Negative	infiltrating ductal carcinoma
1	0.000000	3.691311	17.116090	15.517231	9.867616	9.691667	8.179522	7.911723	1.289598	1.818891	...	0.279067	0.292925	-0.155242	-0.089365	0.267530	Positive	Negative	Negative	infiltrating ductal carcinoma
2	3.748150	4.375255	9.658123	5.326983	12.109539	11.644307	10.517330	5.114925	11.975349	11.911437	...	0.219910	0.308110	-0.190794	-0.222150	-0.198518	Positive	Positive	Negative	infiltrating ductal carcinoma
3	0.000000	18.235519	18.535480	14.533584	14.078992	8.913760	10.557465	13.304434	8.205059	9.211476	...	-0.266554	-0.079871	-0.463237	0.522998	-0.046902	Positive	Positive	Negative	infiltrating ductal carcinoma
4	0.000000	4.583724	15.711865	12.804521	8.881669	8.430028	12.964607	6.806517	4.294341	5.385714	...	-0.441542	-0.152317	0.511386	-0.096482	0.037473	Positive	Positive	Negative	infiltrating ductal carcinoma

	COSMIC_ID	CELL_LINE_NAME	TCGA_DESC	DRUG_ID	DRUG_NAME	LN_IC50	AUC	Z_SCORE	GDSC Tissue descriptor 1	GDSC Tissue descriptor 2	Cancer Type (matching TCGA label)	Microsatellite instability Status (MSI)	Screen Medium	Growth Properties	CNA	Gene Expression	Methylation	TARGET	TARGET_PATHWAY
0	683667	PFSK-1	MB	1003	Camptothecin	-1.463887	0.930220	0.433123	nervous_system	medulloblastoma	MB	MSS/MSI-L	R	Adherent	Y	Y	Y	TOP1	DNA replication
1	684057	ES5	UNCLASSIFIED	1003	Camptothecin	-3.360586	0.791072	-0.599569	bone	ewings_sarcoma	NaN	MSS/MSI-L	R	Adherent	Y	Y	Y	TOP1	DNA replication
2	684059	ES7	UNCLASSIFIED	1003	Camptothecin	-5.044940	0.592660	-1.516647	bone	ewings_sarcoma	NaN	MSS/MSI-L	R	Adherent	Y	Y	Y	TOP1	DNA replication
3	684062	EW-11	UNCLASSIFIED	1003	Camptothecin	-3.741991	0.734047	-0.807232	bone	ewings_sarcoma	NaN	MSS/MSI-L	R	Adherent	Y	Y	Y	TOP1	DNA replication
4	684072	SK-ES-1	UNCLASSIFIED	1003	Camptothecin	-5.142961	0.582439	-1.570016	bone	ewings_sarcoma	NaN	MSS/MSI-L	R	Semi-Adherent	Y	Y	Y	TOP1	DNA replication

	DATASET	NLME_RESULT_ID	NLME_CURVE_ID	COSMIC_ID	CELL_LINE_NAME	SANGER_MODEL_ID	TCGA_DESC	DRUG_ID	DRUG_NAME	PUTATIVE_TARGET	PATHWAY_NAME	COMPANY_ID	WEBRELEASE	MIN_CONC	MAX_CONC	LN_IC50	AUC	RMSE	Z_SCORE
0	GDSC2	343	15946310	683667	PFSK-1	SIDM01132	MB	1003	Camptothecin	TOP1	DNA replication	1046	Y	0.0001	0.1	-1.463887	0.930220	0.089052	0.433123
1	GDSC2	343	15946548	684052	A673	SIDM00848	UNCLASSIFIED	1003	Camptothecin	TOP1	DNA replication	1046	Y	0.0001	0.1	-4.869455	0.614970	0.111351	-1.421100
2	GDSC2	343	15946830	684057	ES5	SIDM00263	UNCLASSIFIED	1003	Camptothecin	TOP1	DNA replication	1046	Y	0.0001	0.1	-3.360586	0.791072	0.142855	-0.599569
3	GDSC2	343	15947087	684059	ES7	SIDM00269	UNCLASSIFIED	1003	Camptothecin	TOP1	DNA replication	1046	Y	0.0001	0.1	-5.044940	0.592660	0.135539	-1.516647
4	GDSC2	343	15947369	684062	EW-11	SIDM00203	UNCLASSIFIED	1003	Camptothecin	TOP1	DNA replication	1046	Y	0.0001	0.1	-3.741991	0.734047	0.128059	-0.807232

	DRUG_ID	SCREENING_SITE	DRUG_NAME	SYNONYMS	TARGET	TARGET_PATHWAY
0	1	MGH	Erlotinib	Tarceva, RG-1415, CP-358774, OSI-774, Ro-50823...	EGFR	EGFR signaling
1	3	MGH	Rapamycin	AY-22989, Sirolimus, WY-090217, Torisel, Rapamune	MTORC1	PI3K/MTOR signaling
2	5	MGH	Sunitinib	Sutent, Sunitinib Malate, SU-11248	PDGFR, KIT, VEGFR, FLT3, RET, CSF1R	RTK signaling
3	6	MGH	PHA-665752	PHA665752, PHA 665752	MET	RTK signaling
4	9	MGH	MG-132	LLL cpd, MG 132, MG132	Proteasome, CAPN1	Protein stability and degradation

	Sample Name	COSMIC identifier	Whole Exome Sequencing (WES)	Copy Number Alterations (CNA)	Gene Expression	Methylation	Drug\nResponse	GDSC\nTissue descriptor 1	GDSC\nTissue\ndescriptor 2	Cancer Type\n(matching TCGA label)	Microsatellite \ninstability Status (MSI)	Screen Medium	Growth Properties
0	A253	906794.0	Y	Y	Y	Y	Y	aero_dig_tract	head and neck	NaN	MSS/MSI-L	D/F12	Adherent
1	BB30-HNC	753531.0	Y	Y	Y	Y	Y	aero_dig_tract	head and neck	HNSC	MSS/MSI-L	D/F12	Adherent
2	BB49-HNC	753532.0	Y	Y	Y	Y	Y	aero_dig_tract	head and neck	HNSC	MSS/MSI-L	D/F12	Adherent
3	BHY	753535.0	Y	Y	Y	Y	Y	aero_dig_tract	head and neck	HNSC	MSS/MSI-L	D/F12	Adherent
4	BICR10	1290724.0	Y	Y	Y	Y	Y	aero_dig_tract	head and neck	HNSC	MSS/MSI-L	D/F12	Adherent

	Line	COSMIC_ID	Site	Histology
0	NCI-H522	905944	lung	carcinoma
1	Mewo	908128	skin	malignant_melanoma
2	C2BBe1	910700	large_intestine	carcinoma
3	HCC1428	1290905	breast	carcinoma
4	COR-L95	1297439	lung	carcinoma

	TCGA tissue classification:	Unnamed: 1
0	TCGA Label	Definition
1	ACC	Adrenocortical carcinoma
2	ALL	Acute lymphoblastic leukemia
3	BLCA	Bladder Urothelial Carcinoma
4	BRCA	Breast invasive carcinoma

	CELL_LINE_NAME	DRUG_NAME	LN_IC50	TCGA_DESC	GDSC Tissue descriptor 1	GDSC Tissue descriptor 2	Microsatellite instability Status (MSI)	Growth Properties
91	HCC1954	Camptothecin	0.317741	BRCA	breast	breast	MSS/MSI-L	Adherent
92	HCC1143	Camptothecin	0.636184	BRCA	breast	breast	MSS/MSI-L	Adherent
93	HCC1187	Camptothecin	1.235544	BRCA	breast	breast	MSS/MSI-L	Suspension
94	HCC1395	Camptothecin	-2.255899	BRCA	breast	breast	MSS/MSI-L	Adherent
95	HCC1599	Camptothecin	-3.247021	BRCA	breast	breast	MSS/MSI-L	Suspension

	Patient	Rank	Drug	Predicted_IC50
0	Patient_0	1	Dactinomycin	0.091833
1	Patient_0	2	Camptothecin	0.092198
2	Patient_0	3	Eg5_9814	0.094718
3	Patient_0	4	Sepantronium bromide	0.096180
4	Patient_0	5	SN-38	0.096250
5	Patient_1	1	Dactinomycin	0.091833
6	Patient_1	2	Camptothecin	0.092198
7	Patient_1	3	Eg5_9814	0.094718
8	Patient_1	4	Sepantronium bromide	0.096180
9	Patient_1	5	SN-38	0.096250
10	Patient_2	1	Dactinomycin	0.091833
11	Patient_2	2	Camptothecin	0.092198
12	Patient_2	3	Eg5_9814	0.094718
13	Patient_2	4	Sepantronium bromide	0.096180
14	Patient_2	5	SN-38	0.096250
15	Patient_3	1	Dactinomycin	0.091833
16	Patient_3	2	Camptothecin	0.092198
17	Patient_3	3	Eg5_9814	0.094718
18	Patient_3	4	Sepantronium bromide	0.096180
19	Patient_3	5	SN-38	0.096250

Rank	Drug	Predicted LN(IC50)
1	Paclitaxel	0.1483
2	Docetaxel	0.1492
3	Vinorelbine	0.1609
4	Bortezomib	0.1717
5	Vinblastine	0.1744

Rank	Drug	Predicted LN(IC50)
1	Paclitaxel	0.2603
2	Docetaxel	0.2747
3	Vinblastine	0.3093
4	Vinorelbine	0.3151
5	Bortezomib	0.3164

Rank	Drug	Predicted LN(IC50)
1	Docetaxel	0.3578
2	Paclitaxel	0.3779
3	Vinblastine	0.4372
4	Vinorelbine	0.4374
5	Bortezomib	0.4610

Pathway	Count
Mitosis	2605
Protein stability and degradation	709
Chromatin histone acetylation	168
Cell cycle	32
Apoptosis regulation	11

Explainable AI Framework for Precision Drug Response Prediction in Breast Cancer¶

1. Introduction¶

2. Problem Statement¶

3. Aim of the Project¶

4. Objectives¶

4.1 Biological Objectives¶

4.2 Computational Objectives¶

4.3 Therapeutic Objectives¶

5. Dataset Strategy¶

5.1 TCGA BRCA Dataset¶

5.2 GDSC Dataset¶

5.3 Integration Strategy¶

6. Methodology Overview¶

6.1 Data Preprocessing¶

6.2 Feature Representation¶

6.3 Model Development¶

6.4 Cross-Domain Mapping¶

6.5 Drug Ranking System¶

6.6 Explainability¶

7. System Architecture¶

8. Practical Applications¶

Oncology¶

Clinical Decision Support¶

Drug Discovery & Repurposing¶

9. Significance of the Work¶

10. Conclusion¶

Step 1: Data Loading and Initial Inspection¶

What we are doing¶

Why this is important¶

How we do it¶

Step 2: Filtering for Breast Cancer (BRCA)¶

What we are doing¶

Why this is important¶

How we do it¶

Step 3: Feature Selection¶

What we are doing¶

Why this is important¶

How we do it¶

Step 4: Handling Missing Values¶

What we are doing¶

Why this is important¶

How we do it¶

Observations¶

How we do it¶

Step 5: Encoding Categorical Variables¶

What we are doing¶

Why this is important¶

Important Note¶

Step 6: Feature Scaling (When Required)¶

What we are doing¶

Why this is important¶

Strategy¶

Step 9: Training a Baseline Drug Response Model¶

What we are doing¶

Why this is important¶

Model Choice¶

Output¶

Interpretation of Results¶

Overall Assessment¶

Important Consideration¶

Conclusion¶

Step 10: Drug Effectiveness Analysis¶

What we are doing¶

Why this is important¶

Interpretation¶

Step 11: Drug Ranking System¶

What we are doing¶

Why this is important¶

How it works¶

Step 10–11: Drug Effectiveness Analysis and Model-Based Ranking¶

What we are evaluating¶

Observed Drug Effectiveness (Data-Driven)¶

Model-Predicted Drug Ranking¶

Comparative Interpretation¶

What this means¶

Key Insight¶

Biological Interpretation¶

Conclusion¶

Final Takeaway¶

Step 12: Preparing Patient Data for Prediction¶