Spaces: openadmet/OpenADMET-ExpansionRx-Challenge

Maria Castellanos committed on Sep 29

Commit b2070e0

1 Parent(s): bcf1817

Add links

Files changed (3)

about.py +0 -1
app.py +10 -11
evaluate.py +12 -7

about.py CHANGED

@@ -9,7 +9,6 @@ ENDPOINTS = ["LogD",
              "Caco-2 Permeability Papp A>B",
              "MPPB",
              "MBPB",
-             "RLM CLint",
              "MGMB"]
 LB_COLS0 = ["endpoint",

              "Caco-2 Permeability Papp A>B",
              "MPPB",
              "MBPB",
              "MGMB"]
 LB_COLS0 = ["endpoint",

app.py CHANGED

@@ -107,7 +107,7 @@ def gradio_interface():
         Participants will be tasked with solving real-world ADMET prediction problems ExpansionRx faced during lead optimization.
         Specifically, you will be asked to predict the ADMET properties of late-stage molecules based on earlier-stage data from the same campaigns.
-        For this challenge we selected ten (10) crucial endpoints for the community to predict:
         - LogD
         - Kinetic Solubility **KSOL**: uM
@@ -117,14 +117,13 @@ def gradio_interface():
         - Caco-2 Papp A>B (10^-6 cm/s)
         - Mouse Plasma Protein Binding (**MPPB**): % Unbound
         - Mouse Brain Protein Binding (**MBPB**): % Unbound
-        - Rat Liver Microsomal (**RLM**) *Clint*: mL/min/kg
         - Mouse Gastrocnemius Muscle Binding (**MGMB**): % Unbound
         Find more information about these endpoints on our [blog](https://openadmet.org/community/blogs/challenge_announcement2/).
         ## ✅ How to Participate
         1. **Register**: Create an account with Hugging Face.
-        2. **Download the Public Dataset**: Clone the ExpansionRx dataset [link]
         3. **Train Your Model**: Use the provided training data for each ADMET property of your choice.
         4. **Submit Predictions**: Follow the instructions in the *Submit* tab to upload your predictions.
         5. Join the discussion on the [Challenge Discord](https://discord.gg/MY5cEFHH3D)!
@@ -145,10 +144,9 @@ def gradio_interface():
         | Caco-2 Permeability Papp A>B | 10^-6 cm/s  |   float   | Caco-2 Permeability Papp A>B |
         | MPPB                         | % Unbound   |   float   | Mouse Plasma Protein Binding |
         | MBPB                         | % Unbound   |   float   | Mouse Brain Protein Binding |
-        | RLM CLint                    | mL/min/kg   |   float   | Rat Liver Microsomal Stability |
         | MGMB                         | % Unbound   |   float   | Mouse Gastrocnemius Muscle Binding |
-        You can download the training data from the [Hugging Face dataset](https://huggingface.co/datasets/OpenADMET/openadmet-challenge-training-set).
         The test set will remain blinded until the challenge submission deadline. You will be tasked with predicting the same set of ADMET endpoints for the test set molecules.
         ## 📝 Evaluation
@@ -156,7 +154,7 @@ def gradio_interface():
         - We welcome submissions of any kind, including machine learning and physics-based approaches. You can also employ pre-training approaches as you see fit,
         as well as incorporate data from external sources into your models and submissions.
         - In the spirit of open science and open source we would love to see code showing how you created your submission if possible, in the form of a GitHub repository.
-        If not possible due to IP or other constraints you must at a minimum provide a short written report of your methodology based on the template [here](link to google doc).
         **Make sure your last submission before the deadline includes a link to a report or to a GitHub repository.**
         - Each participant can submit as many times as they like, up to a limit of 5 times/day. **Only your latest submission will be considered for the final leaderboard.**
         - The endpoints will be judged individually by mean absolute error (**MAE**), while an overall leaderboard will be judged by the macro-averaged relative absolute error (**MA-RAE**).
@@ -165,7 +163,7 @@ def gradio_interface():
         📅 **Timeline**:
         - **September 16:** Challenge announcement
-        - **September XX:** Sample data release
         - **October 27:** Challenge starts
         - **October-November:** Online Q&A sessions and support via the Discord channel
         - **January 19, 2026:** Submission closes
@@ -334,15 +332,16 @@ def gradio_interface():
                         gr.Markdown(
                             """
                             ## Submission Instructions
-                            Upload a single CSV file containing your predictions for all ligands in the test set.
                             Only your latest submission will be considered.
-                            You can download a CSV template with the ligands in the test set here.
                             """
                         )
                         download_btn = gr.DownloadButton(
-                            label="📥 Download Test Set Template",
-                            value="data/test_set-example.csv",
                             variant="secondary",
                             )
                     with gr.Column():

         Participants will be tasked with solving real-world ADMET prediction problems ExpansionRx faced during lead optimization.
         Specifically, you will be asked to predict the ADMET properties of late-stage molecules based on earlier-stage data from the same campaigns.
+        For this challenge we selected nine (9) crucial endpoints for the community to predict:
         - LogD
         - Kinetic Solubility **KSOL**: uM
         - Caco-2 Papp A>B (10^-6 cm/s)
         - Mouse Plasma Protein Binding (**MPPB**): % Unbound
         - Mouse Brain Protein Binding (**MBPB**): % Unbound
         - Mouse Gastrocnemius Muscle Binding (**MGMB**): % Unbound
         Find more information about these endpoints on our [blog](https://openadmet.org/community/blogs/challenge_announcement2/).
         ## ✅ How to Participate
         1. **Register**: Create an account with Hugging Face.
+        2. **Download the Public Dataset**: Download the ExpansionRx dataset.
         3. **Train Your Model**: Use the provided training data for each ADMET property of your choice.
         4. **Submit Predictions**: Follow the instructions in the *Submit* tab to upload your predictions.
         5. Join the discussion on the [Challenge Discord](https://discord.gg/MY5cEFHH3D)!
         | Caco-2 Permeability Papp A>B | 10^-6 cm/s  |   float   | Caco-2 Permeability Papp A>B |
         | MPPB                         | % Unbound   |   float   | Mouse Plasma Protein Binding |
         | MBPB                         | % Unbound   |   float   | Mouse Brain Protein Binding |
         | MGMB                         | % Unbound   |   float   | Mouse Gastrocnemius Muscle Binding |
+        You can download the training data from the [Hugging Face dataset](https://huggingface.co/datasets/openadmet/openadmet-challenge-train-data).
         The test set will remain blinded until the challenge submission deadline. You will be tasked with predicting the same set of ADMET endpoints for the test set molecules.
         ## 📝 Evaluation
         - We welcome submissions of any kind, including machine learning and physics-based approaches. You can also employ pre-training approaches as you see fit,
         as well as incorporate data from external sources into your models and submissions.
         - In the spirit of open science and open source we would love to see code showing how you created your submission if possible, in the form of a GitHub repository.
+        If not possible due to IP or other constraints you must at a minimum provide a short written report of your methodology based on the template [here](https://docs.google.com/document/d/1bttGiBQcLiSXFngmzUdEqVchzPhj-hcYLtYMszaOqP8/edit?usp=sharing).
         **Make sure your last submission before the deadline includes a link to a report or to a GitHub repository.**
         - Each participant can submit as many times as they like, up to a limit of 5 times/day. **Only your latest submission will be considered for the final leaderboard.**
         - The endpoints will be judged individually by mean absolute error (**MAE**), while an overall leaderboard will be judged by the macro-averaged relative absolute error (**MA-RAE**).
         📅 **Timeline**:
         - **September 16:** Challenge announcement
+        - **October XX:** Second announcement and sample data release
         - **October 27:** Challenge starts
         - **October-November:** Online Q&A sessions and support via the Discord channel
         - **January 19, 2026:** Submission closes
                         gr.Markdown(
                             """
                             ## Submission Instructions
+                            After training your model with the [ExpansionRx training set](https://huggingface.co/datasets/openadmet/openadmet-challenge-train-data),
+                            please upload a single CSV file containing your predictions for all compounds in the test set.
                             Only your latest submission will be considered.
+                            Download a CSV file with the compounds in the test set here:
                             """
                         )
                         download_btn = gr.DownloadButton(
+                            label="📥 Download Test Set Compounds",
+                            value="data/expansion_data_test_blinded.csv",
                             variant="secondary",
                             )
                     with gr.Column():
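
Note on the metrics named in the evaluation text above: app.py mentions per-endpoint MAE and an overall MA-RAE but does not spell out either formula. The following is a minimal sketch in plain Python, not taken from evaluate.py. It assumes the usual definition of relative absolute error (sum of absolute errors divided by the sum of absolute deviations of the true values from their mean) and an unweighted mean across endpoints. The function names `mae`, `rae`, and `ma_rae` are hypothetical.

```python
from statistics import fmean


def mae(y_true, y_pred):
    """Mean absolute error for one endpoint."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def rae(y_true, y_pred):
    """Relative absolute error: model error relative to a predict-the-mean baseline.

    Assumed definition: sum|y - y_hat| / sum|y - mean(y)|.
    Returns NaN when all true values are identical (baseline error is zero).
    """
    mean_true = fmean(y_true)
    baseline = sum(abs(t - mean_true) for t in y_true)
    if baseline == 0:
        return float("nan")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / baseline


def ma_rae(per_endpoint):
    """Macro-average: unweighted mean of the per-endpoint RAE values.

    `per_endpoint` maps an endpoint name to a (y_true, y_pred) pair.
    """
    return fmean(rae(t, p) for t, p in per_endpoint.values())
```

Under this definition an RAE below 1 means the predictions beat a constant guess of the endpoint mean, and macro-averaging gives every endpoint equal weight regardless of its units or number of measurements.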

evaluate.py CHANGED

@@ -255,18 +255,23 @@ def calculate_metrics(
     # Do some checks
     # 1) Check all columns are present
-    _check_required_columns(results_dataframe, "Results file", ["Molecule Name"] + ENDPOINTS)
-    _check_required_columns(test_dataframe, "Test file", ["Molecule Name"] + ENDPOINTS)
     # 2) Check all Molecules in the test set are present in the predictions
-    merged_df = pd.merge(test_dataframe, results_dataframe, on=['Molecule Name'], how='left', indicator=True)
     if not (merged_df['_merge'] == 'both').all():
         raise gr.Error("The predictions file is missing some molecules present in the test set. Please ensure all molecules are included.")
     # TODO: What to do when a molecule is duplicated in the Predictions file?
     df_results = pd.DataFrame(columns=["endpoint", "MAE", "RAE", "R2", "Spearman R", "Kendall's Tau"])
     for i, measurement in enumerate(ENDPOINTS):
-        df_pred = results_dataframe[['Molecule Name', measurement]].copy()
-        df_true = test_dataframe[['Molecule Name', measurement]].copy()
         # coerce numeric columns
         df_pred[measurement] = pd.to_numeric(df_pred[measurement], errors="coerce")
         df_true[measurement] = pd.to_numeric(df_true[measurement], errors="coerce")
@@ -280,7 +285,7 @@ def calculate_metrics(
             df_pred.rename(columns={measurement: f"{measurement}_pred"})
                 .merge(
                     df_true.rename(columns={measurement: f"{measurement}_true"}),
-                    on="Molecule Name",
                     how="inner",
                 )
                 .dropna(subset=[f"{measurement}_pred", f"{measurement}_true"])
@@ -288,7 +293,7 @@ def calculate_metrics(
         n_total = merged[f"{measurement}_true"].notna().sum()     # Valid test set points
         n_pairs = len(merged)                         # actual pairs with predictions
         coverage = (n_pairs / n_total * 100.0) if n_total else 0.0
-        merged = merged.sort_values("Molecule Name", kind="stable")
         # validate pairs
         if n_pairs < 10:

     # Do some checks
     # 1) Check all columns are present
+    if "Molecule Name" in results_dataframe.columns: # Temporary check so old version of results doesn't fail
+        # DataFrame.rename() relabels the index by default, so target the columns explicitly
+        results_dataframe.rename(columns={"Molecule Name": "Name"}, inplace=True)
+    _check_required_columns(results_dataframe, "Results file", ["Name"] + ENDPOINTS)
+    _check_required_columns(test_dataframe, "Test file", ["Name"] + ENDPOINTS)
     # 2) Check all Molecules in the test set are present in the predictions
+    merged_df = pd.merge(test_dataframe, results_dataframe, on=['Name'], how='left', indicator=True)
     if not (merged_df['_merge'] == 'both').all():
         raise gr.Error("The predictions file is missing some molecules present in the test set. Please ensure all molecules are included.")
     # TODO: What to do when a molecule is duplicated in the Predictions file?
     df_results = pd.DataFrame(columns=["endpoint", "MAE", "RAE", "R2", "Spearman R", "Kendall's Tau"])
     for i, measurement in enumerate(ENDPOINTS):
+        df_pred = results_dataframe[['Name', measurement]].copy()
+        # Only use data with operator "="
+        mask = test_dataframe[f"op_{measurement}"] != '='
+        test_dataframe.loc[mask, measurement] = np.nan
+        df_true = test_dataframe[['Name', measurement]].copy()
         # coerce numeric columns
         df_pred[measurement] = pd.to_numeric(df_pred[measurement], errors="coerce")
         df_true[measurement] = pd.to_numeric(df_true[measurement], errors="coerce")
             df_pred.rename(columns={measurement: f"{measurement}_pred"})
                 .merge(
                     df_true.rename(columns={measurement: f"{measurement}_true"}),
+                    on="Name",
                     how="inner",
                 )
                 .dropna(subset=[f"{measurement}_pred", f"{measurement}_true"])
         n_total = merged[f"{measurement}_true"].notna().sum()     # Valid test set points
         n_pairs = len(merged)                         # actual pairs with predictions
         coverage = (n_pairs / n_total * 100.0) if n_total else 0.0
+        merged = merged.sort_values("Name", kind="stable")
         # validate pairs
         if n_pairs < 10:
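
Note on the new column handling in evaluate.py: the added lines rename a legacy "Molecule Name" column, check that every test molecule has a prediction, and blank out test values whose operator is not "=". The sketch below shows one way to do the same three steps without modifying the input frames in place. It is an illustration only, assuming the column names shown in the diff ("Name", "Molecule Name", `op_<endpoint>`); the helper names are hypothetical and do not exist in the repository.

```python
import numpy as np
import pandas as pd


def normalize_name_column(df):
    """Return a copy in which a legacy "Molecule Name" column is called "Name"."""
    if "Molecule Name" in df.columns and "Name" not in df.columns:
        # columns= is required: a bare mapping would relabel index entries instead
        return df.rename(columns={"Molecule Name": "Name"})
    return df.copy()


def missing_molecules(test_df, pred_df):
    """List test-set names that have no row in the predictions."""
    merged = test_df[["Name"]].merge(
        pred_df[["Name"]].drop_duplicates(),
        on="Name",
        how="left",
        indicator=True,
    )
    return merged.loc[merged["_merge"] != "both", "Name"].tolist()


def exact_measurements(test_df, endpoint):
    """Return Name and endpoint columns, with non-"=" measurements set to NaN.

    Works on a copy, so test_df keeps its original values.
    """
    out = test_df[["Name", endpoint]].copy()
    out[endpoint] = pd.to_numeric(out[endpoint], errors="coerce")
    out.loc[test_df[f"op_{endpoint}"] != "=", endpoint] = np.nan
    return out
```

Working on copies keeps the masking idempotent if `calculate_metrics` is called more than once with the same test frame, at the cost of a small extra allocation per endpoint.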