RichardErkhov committed on
Commit 637571e · verified · 1 Parent(s): fe328c2

uploaded readme

Files changed (1): README.md (+95 -0)
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


nova-1.3b-bcr - bnb 8bits
- Model creator: https://huggingface.co/lt-asset/
- Original model: https://huggingface.co/lt-asset/nova-1.3b-bcr/

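This repository is an 8-bit bitsandbytes ("bnb 8bits") quantization of the original checkpoint. Below is a minimal sketch of how such an 8-bit load looks with the standard transformers + bitsandbytes API; this is an assumption, not taken from the original card, and if the checkpoint does not expose its custom class via `trust_remote_code`, load `NovaForCausalLM` as in the Usage section further down and pass the same `quantization_config`.

```python
# Sketch (not from the original card): an 8-bit bitsandbytes load of the original checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "lt-asset/nova-1.3b-bcr"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit bitsandbytes weights
    device_map="auto",       # place weights on available devices
    trust_remote_code=True,  # Nova ships custom modeling code
)
```
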

Original model description:
---
license: bsd-3-clause-clear
---
# Nova: Generative Language Model For Assembly Code

## Abstract
Binary code analysis is the foundation of crucial tasks in the security domain, so building effective binary analysis techniques is more important than ever. Although large language models (LLMs) have brought impressive improvements to source code tasks, they do not directly generalize to assembly code because of the unique challenges of assembly: (1) the low information density of assembly and (2) the diverse optimizations in assembly code. To overcome these challenges, this work proposes a hierarchical attention mechanism that builds attention summaries to capture the semantics more effectively, and designs contrastive learning objectives to train LLMs to learn assembly optimization. Equipped with these techniques, this work develops Nova, a generative LLM for assembly code. Nova outperforms existing techniques on binary code decompilation by up to 14.84 -- 21.58% higher Pass@1 and Pass@10, and outperforms the latest binary code similarity detection techniques by up to 6.17% in Recall@1, showing promising abilities on both assembly generation and understanding tasks.

## Introduction of Nova
Nova is pre-trained with the language modeling objective, starting from DeepSeek-Coder checkpoints, using disassembly code from [AnghaBench](https://github.com/albertan017/LLM4Decompile) and C/C++ programs compiled from [The-Stack](https://huggingface.co/datasets/bigcode/the-stack).

This is the repository of the instruction-tuned Nova model for binary code recovery, with 1.3B parameters.
The other models in this series:
- [Nova-1.3b](https://huggingface.co/lt-asset/nova-1.3b): Foundation model for binary code with 1.3B parameters.
- [Nova-6.7b](https://huggingface.co/lt-asset/nova-6.7b): Foundation model for binary code with 6.7B parameters.
- [Nova-6.7b-bcr](https://huggingface.co/lt-asset/nova-6.7b-bcr): Nova-6.7b further instruction-tuned for binary code recovery.

## Usage
```python
import json

import torch
from transformers import AutoTokenizer
from modeling_nova import NovaTokenizer, NovaForCausalLM

tokenizer = AutoTokenizer.from_pretrained('lt-asset/nova-1.3b-bcr', trust_remote_code=True)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print('Vocabulary:', len(tokenizer.get_vocab()))  # 32280
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
nova_tokenizer = NovaTokenizer(tokenizer)

model = NovaForCausalLM.from_pretrained('lt-asset/nova-1.3b-bcr', torch_dtype=torch.bfloat16).eval()
model = model.cuda()  # generation below feeds CUDA tensors

# load the humaneval-decompile dataset
data = json.load(open('humaneval_decompile_nova_1.3b.json', 'r'))
for item in data:
    print(item['task_id'], item['type'])

    # build the prompt around the normalized assembly of the function to recover
    prompt_before = f'# This is the assembly code with {item["type"]} optimization:\n<func0>:'
    asm = item['normalized_asm'].strip()
    assert asm.startswith('<func0>:')
    asm = asm[len('<func0>:'):]
    prompt_after = '\nWhat is the source code?\n'

    inputs = prompt_before + asm + prompt_after
    # '0' for non-assembly characters and '1' for assembly characters, required by the Nova tokenizer
    char_types = '0' * len(prompt_before) + '1' * len(asm) + '0' * len(prompt_after)

    tokenizer_output = nova_tokenizer.encode(inputs, '', char_types)
    input_ids = torch.LongTensor(tokenizer_output['input_ids'].tolist()).unsqueeze(0)
    nova_attention_mask = torch.LongTensor(tokenizer_output['nova_attention_mask']).unsqueeze(0)

    # sample 20 candidate decompilations per function
    outputs = model.generate(
        inputs=input_ids.cuda(), max_new_tokens=512, temperature=0.2, top_p=0.95,
        num_return_sequences=20, do_sample=True, nova_attention_mask=nova_attention_mask.cuda(),
        no_mask_idx=torch.LongTensor([tokenizer_output['no_mask_idx']]).cuda(),
        pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id
    )
    item['infer_c_func'] = []
    for output in outputs:
        item['infer_c_func'].append({
            'c_func': tokenizer.decode(output[input_ids.size(1):], skip_special_tokens=True, clean_up_tokenization_spaces=True)
        })

json.dump(data, open('humaneval_decompile_nova_1.3b.json', 'w'), indent=2)
```
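
For a single, user-supplied assembly function (rather than the HumanEval-Decompile JSON above), the same pipeline can be wrapped into a small helper. The sketch below reuses the `model`, `tokenizer`, and `nova_tokenizer` objects built above; the helper name and its arguments are illustrative and not part of the original card.

```python
import torch

def decompile_one(asm_body, opt_level='O0', n_samples=1):
    """Sketch: decompile one normalized assembly function (body only, without the
    leading '<func0>:' label), reusing the objects created in the Usage snippet."""
    prompt_before = f'# This is the assembly code with {opt_level} optimization:\n<func0>:'
    prompt_after = '\nWhat is the source code?\n'
    text = prompt_before + asm_body + prompt_after
    # '1' marks assembly characters, '0' marks everything else (Nova tokenizer convention)
    char_types = '0' * len(prompt_before) + '1' * len(asm_body) + '0' * len(prompt_after)

    enc = nova_tokenizer.encode(text, '', char_types)
    input_ids = torch.LongTensor(enc['input_ids'].tolist()).unsqueeze(0).cuda()
    nova_mask = torch.LongTensor(enc['nova_attention_mask']).unsqueeze(0).cuda()

    outputs = model.generate(
        inputs=input_ids, max_new_tokens=512, do_sample=True, temperature=0.2, top_p=0.95,
        num_return_sequences=n_samples, nova_attention_mask=nova_mask,
        no_mask_idx=torch.LongTensor([enc['no_mask_idx']]).cuda(),
        pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id,
    )
    # strip the prompt tokens and decode each sampled candidate
    return [tokenizer.decode(o[input_ids.size(1):], skip_special_tokens=True) for o in outputs]
```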

## Citation
```
@misc{jiang2024nova,
      title={Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning},
      author={Nan Jiang and Chengxiao Wang and Kevin Liu and Xiangzhe Xu and Lin Tan and Xiangyu Zhang},
      year={2024},
      eprint={2311.13721},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2311.13721},
}
```