Add link to paper #2
by nielsr (HF Staff) - opened
    	
README.md CHANGED
    
--- a/README.md
+++ b/README.md
@@ -1,22 +1,22 @@
 ---
-license: apache-2.0
 datasets:
 - togethercomputer/RedPajama-Data-1T
 language:
 - en
-pipeline_tag: text-generation
 library_name: transformers
+license: apache-2.0
+pipeline_tag: text-generation
 ---
 
 ## PDS-160M
 
-[paper](https://
+[paper](https://huggingface.co/papers/2410.07064) | [code](https://github.com/microsoft/LMOps/tree/main/data_selection)
 
 **PDS-160M** is a 160M-parameter model with the [Mistral](https://arxiv.org/abs/2310.06825) architecture, pre-trained from scratch on data selected from the CC split of [RedPajama](https://github.com/togethercomputer/RedPajama-Data) using the PDS framework.
 
 The PDS framework is based on [Pontryagin's maximum principle](https://en.wikipedia.org/wiki/Pontryagin%27s_maximum_principle#:~:text=Pontryagin's%20maximum%20principle%20is%20used,the%20state%20or%20input%20controls.) for optimal pre-training data selection; it not only enjoys strong theoretical support but is also scalable to training large language models.
 
-Please refer to our [paper](https://arxiv.org/abs/2410.07064) for more details.
+Please refer to our [paper](https://huggingface.co/papers/2410.07064) for more details.
 
 ### Overview of the theory:
 
@@ -32,7 +32,7 @@
 
 ### Evaluation
 
-PDS-selected data improves the performance of language models pre-trained from scratch and saves pre-training
+PDS-selected data improves the performance of language models pre-trained from scratch and saves pre-training computation. The improvement scales up to large model sizes.
 
 <p align='left'>
     <img src="https://cdn-uploads.huggingface.co/production/uploads/624ac662102fcdff87be51b9/6undIr37d10qD73TDiPDK.png" width="600">
@@ -51,4 +51,4 @@
   journal={arXiv preprint arXiv:2410.07064},
   year={2024}
 }
-```
+```
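
For context on the theory the card references: Pontryagin's maximum principle (PMP) gives necessary conditions for an optimal control in terms of a state, a costate, and a Hamiltonian. The block below is the generic textbook statement for a minimization problem, not the paper's exact PDS formulation; the mapping in the closing comment (parameters as state, data weights as control) is a rough paraphrase of the setup and should be checked against the linked paper.

```latex
% Generic (textbook) statement of Pontryagin's maximum principle for a
% minimization problem -- NOT the paper's exact PDS formulation.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
State $x(t)$, control $u(t)$, horizon $T$:
\begin{align}
  \min_{u(\cdot)}\; J &= \int_0^T L\bigl(x(t), u(t)\bigr)\,dt
    \quad \text{s.t.} \quad \dot{x}(t) = f\bigl(x(t), u(t)\bigr)\\
  % Hamiltonian with costate \lambda(t):
  H(x, u, \lambda) &= L(x, u) + \lambda^{\top} f(x, u)\\
  % Costate dynamics and pointwise optimality of the control:
  \dot{\lambda}(t) &= -\frac{\partial H}{\partial x}, \qquad
    u^{*}(t) = \arg\min_{u} H\bigl(x^{*}(t), u, \lambda(t)\bigr)
\end{align}
% In PDS, roughly: the state is the model parameters evolving under
% gradient descent on weighted data, the control is the per-example data
% weights, and the cost is a downstream loss (see the paper for details).
\end{document}
```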
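
Separately, since the card's metadata declares `library_name: transformers` and `pipeline_tag: text-generation`, loading the checkpoint should follow the standard transformers pattern. A minimal sketch, assuming the checkpoint works with the Auto classes; the repo id below is a placeholder, not verified against the Hub:

```python
# Minimal usage sketch based on the card's library_name/pipeline_tag
# metadata. "Data-Selection/PDS-160M" is a placeholder repo id; replace
# it with this model's actual Hugging Face Hub id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Data-Selection/PDS-160M"  # placeholder, not verified
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short greedy continuation to sanity-check the checkpoint.
inputs = tokenizer("Data selection for language model pre-training",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```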