Gigi committed on
Commit
6096583
·
1 Parent(s): 60710c8

add dataset link

  1. README.md +26 -0
README.md CHANGED
@@ -122,6 +122,32 @@ It performs relative comparison only.

  Training performed using Fireworks AI.

+ ## Training Data
+
+ This model is fine-tuned via supervised fine-tuning (SFT) with LoRA on pairwise privacy-preference comparisons.
+
+ Training labels are generated using a teacher model (OpenAI o3) on [ShareGPT90K](https://huggingface.co/datasets/liyucheng/ShareGPT90K)-derived privacy-variant pairs.
+ As described in the paper, o3 was selected based on its alignment with human ground truth under high-consensus cases.
+
+ In addition, we release a human-labeled evaluation set of 150 A/B pairs.
+ Each pair is annotated by at least 5 qualified participants (52 unique participants total), with provided `consensus` labels and `consensus_ratio`.
+
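As a rough illustration of how a `consensus` label and a `consensus_ratio` could be derived from a pair's votes, the sketch below takes the majority vote and its share of all votes. The helper name `summarize_votes`, the `"A"`/`"B"` vote encoding, and the tie handling are assumptions made for this example; the paper and dataset card remain the authority on how the released values were actually computed.

```python
from collections import Counter


def summarize_votes(votes):
    """Return (majority_label, share_of_votes_for_it).

    Hypothetical helper: the vote encoding and tie rule are assumptions.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    # most_common() sorts by count, highest first
    (top_label, top_count), *rest = counts.most_common()
    # Assumption: an exact tie for first place yields no consensus label
    if rest and rest[0][1] == top_count:
        top_label = None
    return top_label, top_count / len(votes)


# Five made-up votes for one A/B pair
label, ratio = summarize_votes(["A", "A", "B", "A", "B"])
```

Under these assumptions, a ratio close to 1.0 marks a high-consensus pair, and a ratio near 0.5 marks a contested one.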
+ For details on data construction, model selection, and annotation procedures, please refer to the paper.
+
+ ---
+ ## Released Dataset (Human Ground Truth)
+
+ We release a human-labeled [dataset](https://github.com/PEACH-Research-Lab/Operationalize-Data-Minimization/blob/main/human_labeled_datasets/DATASET_CARD.md) of 150 pairwise privacy-preference comparisons.
+
+ Each JSONL entry contains:
+ - `survey_id`, `conversation_id`, `pair_index`
+ - `answers`: anonymized participant votes (`participant_1`, `participant_2`, ...)
+ - `consensus`, `consensus_ratio`
+ - `message_A`, `message_B`
+
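The fields listed above could be read and checked with a few lines of Python. This is a minimal sketch: the helper name `read_pairs` is hypothetical, the sample record is invented for illustration (it is not taken from the released data), and the value types, such as `answers` being a mapping from participant key to `"A"` or `"B"`, are assumptions.

```python
import io
import json

# Field names as listed in the README; value types are assumptions
REQUIRED_FIELDS = {
    "survey_id", "conversation_id", "pair_index",
    "answers", "consensus", "consensus_ratio",
    "message_A", "message_B",
}


def read_pairs(stream):
    """Parse JSONL text from a file-like object, one entry per non-empty line."""
    pairs = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise ValueError(f"line {line_number}: missing {sorted(missing)}")
        pairs.append(entry)
    return pairs


# Invented sample record, for illustration only
sample = json.dumps({
    "survey_id": "s1", "conversation_id": "c1", "pair_index": 0,
    "answers": {"participant_1": "A", "participant_2": "A", "participant_3": "B"},
    "consensus": "A", "consensus_ratio": 0.67,
    "message_A": "first variant", "message_B": "second variant",
})
pairs = read_pairs(io.StringIO(sample + "\n"))
```

With a real file, the same function can be called on `open(path, encoding="utf-8")`, where `path` points to a local copy of the released JSONL.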
+ ### Participant Privacy
+ All participant identifiers are anonymized. No Prolific IDs or direct participant identifiers are released.
+
  ---

  ## Model Outputs