semmyk committed on
Commit
8290881
·
1 Parent(s): 42d6e84

baseline08_beta0.3.0_01Oct25: refactor OAuth login; Marker converter: dropped llm_client; add force_ocr to phase in CLI option

README.md CHANGED
@@ -82,11 +82,11 @@ requires-python: ">=3.12"
 [![Python](https://img.shields.io/badge/Python->=3.12-blue?logo=python)](https://www.python.org/)
 [![MIT License](https://img.shields.io/badge/License-MIT-yellow?logo=mit)](LICENSE)

- A Gradio-based web application for converting PDF and HTML documents to Markdown format. Powered by the Marker library (a pipeline of deep learning models for document parsing) and optional LLM integration for enhanced processing. Supports batch processing of files and directories via an intuitive UI.

 ## Features
- - **PDF to Markdown**: Extract text, tables, and images from PDFs using Marker.
- - **HTML to Markdown**: Convert HTML files to clean Markdown.
 - **Batch Processing**: Upload multiple files or entire directories.
 - **LLM Integration**: Optional use of Hugging Face or OpenAI models for advanced conversion (e.g., via Llama or GPT models).
 - **Customizable Settings**: Adjust model parameters, output formats (Markdown/HTML), page ranges, and more via the UI.
@@ -104,21 +104,21 @@ parserpdf/
 ├── converters/ # Conversion logic
 │ ├── __init__.py
 │ ├── extraction_converter.py # Document extraction utilities
- │ ├── pdf_to_md.py # Marker-based PDF → Markdown
- │ ├── html_to_md.py # HTML → Markdown
 │ └── md_to_pdf.py # Markdown → PDF (pending full implementation)
 ├── file_handler/ # File handling utilities
 │ ├── __init__.py
 │ └── file_utils.py # Helpers for files, directories, and paths
 ├── llm/ # LLM client integrations
 │ ├── __init__.py
- │ ├── hf_client.py # Hugging Face client wrapper
- │ ├── openai_client.py # Marker OpenAI client
 │ ├── llm_login.py # Authentication handlers
 │ └── provider_validator.py # Provider validation
 ├── ui/ # Gradio UI components
 │ ├── __init__.py
- │ └── gradio_ui.py # UI layout and event handlers
 ├── utils/ # Utility modules
 │ ├── __init__.py
 │ ├── config.py # Configuration constants
@@ -132,8 +132,8 @@ parserpdf/
 │ ├── output_dir/ # Output directory
 │ ├── pdf/ # Sample PDFs
 ├── logs/ # Log files (gitignored)
- ├── tests/ # Unit tests
- ├── tests_converter.py # tests for converters
 └── scrapyard/ # Development scraps


@@ -165,10 +165,11 @@ parserpdf/
 HF_TOKEN=hf_xxx
 OPENAI_API_KEY=sk-xxx
 ```


 4. Install Marker (if not in requirements.txt):
 ```
- pip install marker-pdf
 ```

 ## Usage
@@ -180,7 +181,7 @@ parserpdf/
 2. Open the provided local URL (e.g., http://127.0.0.1:7860) in your browser.

 3. In the UI:
- - Upload PDF/HTML files or directories via the "PDF & HTML ➜ Markdown" tab.
 - Configure LLM/Marker settings in the accordions (e.g., select provider, model, tokens).
 - Click "Process All Uploaded Files" to convert.
 - View logs, JSON output, and download generated Markdown files.
@@ -208,13 +209,15 @@ parserpdf/
 ## Limitations & TODO
 - Markdown → PDF is pending full implementation.
 - HTML tab is deprecated; use main tab for mixed uploads.
- - Large files/directories may require increased `max_workers`.
 - No JSON/chunks output yet (flagged for future).

 ## Contributing
 Fork the repo, create a branch, and submit a PR.

- Ensure tests pass: - verify the application's functionality.
 ```
 pytest tests/
 ```

 [![Python](https://img.shields.io/badge/Python->=3.12-blue?logo=python)](https://www.python.org/)
 [![MIT License](https://img.shields.io/badge/License-MIT-yellow?logo=mit)](LICENSE)

+ A Gradio-based web application for converting PDF, HTML, and Word documents to Markdown format. Powered by the Marker library (a pipeline of deep learning models for document parsing) and optional LLM integration for enhanced processing. Supports batch processing of files and directories via an intuitive UI.

 ## Features
+ - **PDF to Markdown**: Extract text, tables, and images from PDF, HTML, and Word documents using Marker.
+ - **HTML to Markdown**: Convert HTML files to clean Markdown. #Deprecated
 - **Batch Processing**: Upload multiple files or entire directories.
 - **LLM Integration**: Optional use of Hugging Face or OpenAI models for advanced conversion (e.g., via Llama or GPT models).
 - **Customizable Settings**: Adjust model parameters, output formats (Markdown/HTML), page ranges, and more via the UI.

 ├── converters/ # Conversion logic
 │ ├── __init__.py
 │ ├── extraction_converter.py # Document extraction utilities
+ │ ├── pdf_to_md.py # Marker-based PDF, HTML, Word → Markdown
+ │ ├── html_to_md.py # HTML → Markdown #Deprecated
 │ └── md_to_pdf.py # Markdown → PDF (pending full implementation)
 ├── file_handler/ # File handling utilities
 │ ├── __init__.py
 │ └── file_utils.py # Helpers for files, directories, and paths
 ├── llm/ # LLM client integrations
 │ ├── __init__.py
+ │ ├── hf_client.py # Hugging Face client wrapper ##PutOnHold
+ │ ├── openai_client.py # Marker OpenAI client ##NotFullyImplemented
 │ ├── llm_login.py # Authentication handlers
 │ └── provider_validator.py # Provider validation
 ├── ui/ # Gradio UI components
 │ ├── __init__.py
+ │ └── gradio_ui.py # UI layout, event handlers, and coordination
 ├── utils/ # Utility modules
 │ ├── __init__.py
 │ ├── config.py # Configuration constants

 │ ├── output_dir/ # Output directory
 │ ├── pdf/ # Sample PDFs
 ├── logs/ # Log files (gitignored)
+ ├── tests/ # Unit tests ##ToBeUpdated
+ │ ├── tests_converter.py # tests for converters
 └── scrapyard/ # Development scraps


 HF_TOKEN=hf_xxx
 OPENAI_API_KEY=sk-xxx
 ```
+ - HuggingFace login (OAuth) integrated with Gradio (see the sketch after these setup steps):

 4. Install Marker (if not in requirements.txt):
 ```
+ pip install marker-pdf[full]
 ```
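A minimal sketch of how the HuggingFace OAuth login can be wired into a Gradio app. The component layout below is illustrative, not this repo's actual `gradio_ui.py`:

```
import gradio as gr

def whoami(profile: gr.OAuthProfile | None) -> str:
    # Gradio injects the OAuth profile into handlers that type-hint it;
    # it stays None until the user signs in via the LoginButton.
    return f"Logged in as {profile.username}" if profile else "Not logged in"

with gr.Blocks() as demo:
    gr.LoginButton()          # renders "Sign in with Hugging Face"
    status = gr.Markdown()
    demo.load(whoami, inputs=None, outputs=status)

demo.launch()
```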
 
 ## Usage

 2. Open the provided local URL (e.g., http://127.0.0.1:7860) in your browser.

 3. In the UI:
+ - Upload PDF/HTML/Word files or directories via the "PDF, HTML & Word ➜ Markdown" tab.
 - Configure LLM/Marker settings in the accordions (e.g., select provider, model, tokens).
 - Click "Process All Uploaded Files" to convert.
 - View logs, JSON output, and download generated Markdown files.

 ## Limitations & TODO
 - Markdown → PDF is pending full implementation.
 - HTML tab is deprecated; use main tab for mixed uploads.
+ - Large files/directories may require increased `max_workers` and higher processing power.
 - No JSON/chunks output yet (flagged for future).

 ## Contributing
 Fork the repo, create a branch, and submit a PR.
+ - GitHub
+ - HuggingFace Space Community

+ Ensure tests pass to verify the application's functionality. ##TardyOutdated
 ```
 pytest tests/
 ```
converters/extraction_converter.py CHANGED
@@ -53,6 +53,7 @@ class DocumentConverter:
 output_format: str = "markdown",
 output_dir: Optional[Union[str, Path]] = "output_dir",
 use_llm: Optional[bool] = None, #bool = False, #Optional[bool] = False, #True,

 page_range: Optional[str] = None, #str = None #Optional[str] = None,
 ):

@@ -68,6 +69,7 @@ class DocumentConverter:
 self.max_retries = max_retries ## pass to __call__
 self.output_dir = output_dir ## "output_dir": settings.DEBUG_DATA_FOLDER if debug else output_dir,
 self.use_llm = use_llm if use_llm else False #use_llm[0] if isinstance(use_llm, tuple) else use_llm, #False, #True,

 #self.page_range = page_range[0] if isinstance(page_range, tuple) else page_range ##SMY: iterating twice because self.page casting as hint type tuple!
 self.page_range = page_range if page_range else None
 # self.page_range = page_range[0] if isinstance(page_range, tuple) else page_range if isinstance(page_range, str) else None, ##Example: "0,4-8,16" ##Marker parses as List[int] #]debug #len(pdf_file)
@@ -80,6 +82,7 @@ class DocumentConverter:

 # 0) Instantiate the LLM Client (OPENAIChatClient): Get a provider‐agnostic chat function
 ##SMY: #future. Plan to integrate into Marker: uses its own LLM services (clients). As at 1.9.2, there's no huggingface client service.

 try:
 self.client = OpenAIChatClient(
 model_id=model_id,
@@ -95,16 +98,17 @@ class DocumentConverter:
 tb = traceback.format_exc() #exc.__traceback__
 logger.exception(f"✗ Error initialising OpenAIChatClient: {exc}\n{tb}")
 raise RuntimeError(f"✗ Error initialising OpenAIChatClient: {exc}\n{tb}") #.with_traceback(tb)
-
 # 1) # Define the custom configuration for the Hugging Face LLM.
 # Use typing.Dict and typing.Any for flexible dictionary type hints
 try:
 self.config_dict: Dict[str, Any] = self.get_config_dict(model_id=model_id, llm_service=str(self.llm_service), output_format=output_format)
- #self.config_dict.pop("page_range") if self.config_dict.get("page_range")[0] is None else None ##SMY: execute if page_range is none. `else None` ensures valid syntactic expression
-
 ##SMY: if falsely empty tuple () or None, pop the "page_range" key-value pair, else do nothing if truthy tuple value (i.e. keep as-is)
 self.config_dict.pop("page_range", None) if not self.config_dict.get("page_range") else None
 self.config_dict.pop("use_llm", None) if not self.config_dict.get("use_llm") or self.config_dict.get("use_llm") is False or self.config_dict.get("use_llm") == 'False' else None

 logger.log(level=20, msg="✔️ config_dict custom configured:", extra={"service": "openai"}) #, "config": str(self.config_dict)})

@@ -124,27 +128,17 @@ class DocumentConverter:
 logger.exception(f"✗ Error parsing/processing custom config_dict: {exc}\n{tb}")
 raise RuntimeError(f"✗ Error parsing/processing custom config_dict: {exc}\n{tb}") #.with_traceback(tb)

- # 3) Create the artifact dictionary and retrieve the LLM service. ##SMY: disused
- try:
- ##self.artifact_dict: Dict[str, Any] = self.get_create_model_dict ##SMY: Might have to eliminate function afterall
- #self.artifact_dict: Dict[str, Type[BaseModel]] = create_model_dict() ##SMY: BaseModel for Any??
- self.artifact_dict = {} ##dummy
- ##logger.log(level=20, msg="✔️ Create artifact_dict and llm_service retrieved:", extra={"llm_service": self.llm_service})
-
- except Exception as exc:
- tb = traceback.format_exc() #exc.__traceback__
- logger.exception(f"✗ Error creating artifact_dict or retrieving LLM service: {exc}\n{tb}")
- raise RuntimeError(f"✗ Error creating artifact_dict or retrieving LLM service: {exc}\n{tb}") #.with_traceback(tb)
-
- # 4) Load models if not already loaded in reload mode
 from globals import config_load_models
 try:
- if not config_load_models.model_dict or 'model_dict' not in globals():
 model_dict = load_models()
 '''if 'model_dict' not in globals():
 #model_dict = self.load_models()
 model_dict = load_models()'''
- else: model_dict = config_load_models.model_dict
 except OSError as exc_ose:
 tb = traceback.format_exc() #exc.__traceback__
 logger.warning(f"⚠️ OSError: the paging file is too small (to complete reload): {exc_ose}\n{tb}")
@@ -153,30 +147,28 @@ class DocumentConverter:
 tb = traceback.format_exc() #exc.__traceback__
 logger.exception(f"✗ Error loading models (reload): {exc}\n{tb}")
 raise RuntimeError(f"✗ Error loading models (reload): {exc}\n{tb}") #.with_traceback(tb)
-

- # 5) Instantiate Marker's MarkerConverter (PdfConverter) with config managed by config_parser
 try: # Assign llm_service if api_token. ##SMY: split and slicing ##Gets the string value
 llm_service_str = None if api_token == '' or api_token is None or self.use_llm is False else str(self.llm_service).split("'")[1] #

 # sets api_key required by Marker ## to handle Marker's assertion test on OpenAI
- #os.environ["OPENAI_API_KEY"] = api_token if api_token !='' or api_token is not None else self.openai_api_key ##SMY: looks lame
- os.environ["OPENAI_API_KEY"] = api_token if api_token and api_token != '' else os.getenv("OPENAI_API_KEY") or os.getenv("GEMINI_API_KEY") or os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
- #logger.log(level=20, msg="self.converter: instantiating MarkerConverter:", extra={"llm_service_str": llm_service_str, "api_token": api_token}) ##debug

 config_dict = config_parser.generate_config_dict()
- #config_dict["pdftext_worker"] = self.max_workers #1 ##SMY: move to get_config_dicts()

- #self.converter: MarkerConverter = MarkerConverter(
 self.converter = MarkerConverter(
- ##artifact_dict=self.artifact_dict,
 #artifact_dict=create_model_dict(),
 artifact_dict=model_dict if model_dict else create_model_dict(),
 config=config_dict,
 #config=config_parser.generate_config_dict(),
 #llm_service=self.llm_service ##SMY expecting str but self.llm_service, is service object marker.services of type BaseServices
 llm_service=llm_service_str, ##resolve
- )

 logger.log(level=20, msg="✔️ MarkerConverter instantiated successfully:", extra={"converter.config": str(self.converter.config.get("openai_base_url")), "use_llm":self.converter.use_llm})
 #return self.converter ##SMY: to query why did I comment out?. Bingo: "__init__() should return None, not 'PdfConverter'"
@@ -187,21 +179,20 @@ class DocumentConverter:

 # Define the custom configuration for HF LLM.
 def get_config_dict(self, model_id: str, llm_service=MarkerOpenAIService, output_format: Optional[str] = "markdown" ) -> Dict[str, Any]:
- """ Define the custom configuration for the Hugging Face LLM. """

 try:
- ## Enable higher quality processing with LLMs. ## See MarkerOpenAIService,
- # llm_service disused here
 ##llm_service = llm_service.removeprefix("<class '").removesuffix("'>") # e.g <class 'marker.services.openai.OpenAIService'>
 #llm_service = str(llm_service).split("'")[1] ## SMY: split and slicing
 self.use_llm = self.use_llm[0] if isinstance(self.use_llm, tuple) else self.use_llm
 self.page_range = self.page_range[0] if isinstance(self.page_range, tuple) else self.page_range #if isinstance(self.page_range, str) else None, ##SMY: passing as hint type tuple!

-
 config_dict = {
 "output_format" : output_format, #"markdown",
 "openai_model" : self.model_id, #self.client.model_id, #"model_name"
- "openai_api_key" : self.client.openai_api_key, #self.client.openai_api_key, #self.api_token,
 "openai_base_url": self.openai_base_url, #self.client.base_url, #self.base_url,
 "temperature" : self.temperature, #self.client.temperature,
 "top_p" : self.top_p, #self.client.top_p,
@@ -210,6 +201,7 @@ class DocumentConverter:
 "max_retries" : self.max_retries, #3, ## pass to __call__
 "output_dir" : self.output_dir,
 "use_llm" : self.use_llm, #False, #True,

 "page_range" : self.page_range, ##debug #len(pdf_file)
 }
 return config_dict
@@ -219,10 +211,6 @@ class DocumentConverter:
 raise RuntimeError(f"✗ Error configuring custom config_dict: {exc}\n{tb}") #").with_traceback(tb)
 #raise

- ''' # create/load models. Called to curtail reloading models at each instance
- def load_models():
- return create_model_dict()'''
-
 ##SMY: flagged for deprecation
 ##SMY: marker prefer default artifact dictionary (marker.models.create_model_dict) instead of overridding
 #def get_extraction_converter(self, chat_fn):

 output_format: str = "markdown",
 output_dir: Optional[Union[str, Path]] = "output_dir",
 use_llm: Optional[bool] = None, #bool = False, #Optional[bool] = False, #True,
+ force_ocr: Optional[bool] = None, #bool = False,
 page_range: Optional[str] = None, #str = None #Optional[str] = None,
 ):

 self.max_retries = max_retries ## pass to __call__
 self.output_dir = output_dir ## "output_dir": settings.DEBUG_DATA_FOLDER if debug else output_dir,
 self.use_llm = use_llm if use_llm else False #use_llm[0] if isinstance(use_llm, tuple) else use_llm, #False, #True,
+ self.force_ocr = force_ocr if force_ocr else False
 #self.page_range = page_range[0] if isinstance(page_range, tuple) else page_range ##SMY: iterating twice because self.page casting as hint type tuple!
 self.page_range = page_range if page_range else None
 # self.page_range = page_range[0] if isinstance(page_range, tuple) else page_range if isinstance(page_range, str) else None, ##Example: "0,4-8,16" ##Marker parses as List[int] #]debug #len(pdf_file)

 # 0) Instantiate the LLM Client (OPENAIChatClient): Get a provider‐agnostic chat function
 ##SMY: #future. Plan to integrate into Marker: uses its own LLM services (clients). As at 1.9.2, there's no huggingface client service.
+ '''
 try:
 self.client = OpenAIChatClient(
 model_id=model_id,

 tb = traceback.format_exc() #exc.__traceback__
 logger.exception(f"✗ Error initialising OpenAIChatClient: {exc}\n{tb}")
 raise RuntimeError(f"✗ Error initialising OpenAIChatClient: {exc}\n{tb}") #.with_traceback(tb)
+ '''
 # 1) # Define the custom configuration for the Hugging Face LLM.
 # Use typing.Dict and typing.Any for flexible dictionary type hints
 try:
 self.config_dict: Dict[str, Any] = self.get_config_dict(model_id=model_id, llm_service=str(self.llm_service), output_format=output_format)
+
+ ##SMY: execute if page_range is none. `else None` ensures valid syntactic expression
 ##SMY: if falsely empty tuple () or None, pop the "page_range" key-value pair, else do nothing if truthy tuple value (i.e. keep as-is)
 self.config_dict.pop("page_range", None) if not self.config_dict.get("page_range") else None
 self.config_dict.pop("use_llm", None) if not self.config_dict.get("use_llm") or self.config_dict.get("use_llm") is False or self.config_dict.get("use_llm") == 'False' else None
+ self.config_dict.pop("force_ocr", None) if not self.config_dict.get("force_ocr") or self.config_dict.get("force_ocr") is False or self.config_dict.get("force_ocr") == 'False' else None

 logger.log(level=20, msg="✔️ config_dict custom configured:", extra={"service": "openai"}) #, "config": str(self.config_dict)})

 logger.exception(f"✗ Error parsing/processing custom config_dict: {exc}\n{tb}")
 raise RuntimeError(f"✗ Error parsing/processing custom config_dict: {exc}\n{tb}") #.with_traceback(tb)

+ # 3) Load models if not already loaded in reload mode
 from globals import config_load_models
 try:
+ if config_load_models.model_dict:
+ model_dict = config_load_models.model_dict
+ #elif not config_load_models.model_dict or 'model_dict' not in globals():
+ else:
 model_dict = load_models()
 '''if 'model_dict' not in globals():
 #model_dict = self.load_models()
 model_dict = load_models()'''

 except OSError as exc_ose:
 tb = traceback.format_exc() #exc.__traceback__
 logger.warning(f"⚠️ OSError: the paging file is too small (to complete reload): {exc_ose}\n{tb}")

 tb = traceback.format_exc() #exc.__traceback__
 logger.exception(f"✗ Error loading models (reload): {exc}\n{tb}")
 raise RuntimeError(f"✗ Error loading models (reload): {exc}\n{tb}") #.with_traceback(tb)

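The block above reuses `config_load_models.model_dict` when it is already populated and only falls back to `load_models()` otherwise. A minimal sketch of that once-per-process caching idea, under assumed names:

```
_model_cache = {}

def get_models(load_fn):
    """Return the cached Marker model artifacts, loading them on first use."""
    if "models" not in _model_cache:
        _model_cache["models"] = load_fn()  # heavy: e.g. marker.models.create_model_dict()
    return _model_cache["models"]
```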
 
+ # 4) Instantiate Marker's MarkerConverter (PdfConverter) with config managed by config_parser
 try: # Assign llm_service if api_token. ##SMY: split and slicing ##Gets the string value
 llm_service_str = None if api_token == '' or api_token is None or self.use_llm is False else str(self.llm_service).split("'")[1] #

 # sets api_key required by Marker ## to handle Marker's assertion test on OpenAI
+ if llm_service_str:
+ os.environ["OPENAI_API_KEY"] = api_token if api_token and api_token != '' else os.getenv("OPENAI_API_KEY") or os.getenv("GEMINI_API_KEY") or os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACEHUB_API_TOKEN")
+ #logger.log(level=20, msg="self.converter: instantiating MarkerConverter:", extra={"llm_service_str": llm_service_str, "api_token": api_token}) ##debug

 config_dict = config_parser.generate_config_dict()
+ #config_dict["pdftext_worker"] = self.max_workers #1 ##SMY: moved to get_config_dicts()

+ #self.converter: marker.converters.pdf.PdfConverter
 self.converter = MarkerConverter(

 #artifact_dict=create_model_dict(),
 artifact_dict=model_dict if model_dict else create_model_dict(),
 config=config_dict,
 #config=config_parser.generate_config_dict(),
 #llm_service=self.llm_service ##SMY expecting str but self.llm_service, is service object marker.services of type BaseServices
 llm_service=llm_service_str, ##resolve
+ )

 logger.log(level=20, msg="✔️ MarkerConverter instantiated successfully:", extra={"converter.config": str(self.converter.config.get("openai_base_url")), "use_llm":self.converter.use_llm})
 #return self.converter ##SMY: to query why did I comment out?. Bingo: "__init__() should return None, not 'PdfConverter'"

 # Define the custom configuration for HF LLM.
 def get_config_dict(self, model_id: str, llm_service=MarkerOpenAIService, output_format: Optional[str] = "markdown" ) -> Dict[str, Any]:
+ """ Define the custom configuration for the Hugging Face LLM: combining Marker's cli_options and LLM. """

 try:
+ ## LLM: enable higher-quality processing. ## See MarkerOpenAIService,
 ##llm_service = llm_service.removeprefix("<class '").removesuffix("'>") # e.g <class 'marker.services.openai.OpenAIService'>
 #llm_service = str(llm_service).split("'")[1] ## SMY: split and slicing
 self.use_llm = self.use_llm[0] if isinstance(self.use_llm, tuple) else self.use_llm
 self.page_range = self.page_range[0] if isinstance(self.page_range, tuple) else self.page_range #if isinstance(self.page_range, str) else None, ##SMY: passing as hint type tuple!

+ ##SMY: TODO: convert to {inputs} and call from gradio_ui
 config_dict = {
 "output_format" : output_format, #"markdown",
 "openai_model" : self.model_id, #self.client.model_id, #"model_name"
+ "openai_api_key" : self.openai_api_key, #self.client.openai_api_key, #self.api_token,
 "openai_base_url": self.openai_base_url, #self.client.base_url, #self.base_url,
 "temperature" : self.temperature, #self.client.temperature,
 "top_p" : self.top_p, #self.client.top_p,

 "max_retries" : self.max_retries, #3, ## pass to __call__
 "output_dir" : self.output_dir,
 "use_llm" : self.use_llm, #False, #True,
+ "force_ocr" : self.force_ocr, #False,
 "page_range" : self.page_range, ##debug #len(pdf_file)
 }
 return config_dict

 raise RuntimeError(f"✗ Error configuring custom config_dict: {exc}\n{tb}") #").with_traceback(tb)
 #raise

 ##SMY: flagged for deprecation
 ##SMY: marker prefer default artifact dictionary (marker.models.create_model_dict) instead of overridding
 #def get_extraction_converter(self, chat_fn):
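A hedged distillation of the `get_config_dict` plus pop pattern above: build the config, then drop falsy keys (`page_range`, `use_llm`, and the new `force_ocr`) so Marker falls back to its own defaults. Names here are assumptions for illustration, not the repo's exact API:

```
from typing import Any, Dict, Optional

def build_marker_config(output_format: str = "markdown",
                        use_llm: Optional[bool] = None,
                        force_ocr: Optional[bool] = None,
                        page_range: Optional[str] = None) -> Dict[str, Any]:
    config: Dict[str, Any] = {
        "output_format": output_format,
        "use_llm": bool(use_llm),
        "force_ocr": bool(force_ocr),
        "page_range": page_range,  # e.g. "0,4-8,16"; Marker parses it into page indices
    }
    # Drop falsy entries (None/False/'') so Marker applies its defaults instead.
    return {k: v for k, v in config.items() if v}

print(build_marker_config(force_ocr=True))
# {'output_format': 'markdown', 'force_ocr': True}
```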
converters/pdf_to_md.py CHANGED
@@ -1,13 +1,13 @@
 # converters/pdf_to_md.py
 import os
 from pathlib import Path
- from typing import List, Dict, Optional, Union
 import traceback ## Extract, format and print information about Python stack traces.
 import time

- #from llm.hf_client import HFChatClient
 from converters.extraction_converter import DocumentConverter #, DocumentExtractor #as docextractor #ExtractionConverter #get_extraction_converter ## SMY: should disuse
- from file_handler.file_utils import collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir, write_markdown, dump_images


 from utils import config
@@ -43,7 +43,9 @@ def init_worker(#self,
 output_format: str, #: str = "markdown",
 output_dir: str, #: Union | None = "output_dir",
 use_llm: bool, #: bool | None = False,

 page_range: str, #: str | None = None

 ):

 #'''
@@ -58,35 +60,6 @@ def init_worker(#self,
 # Define global variables
 global docconverter
 global converter
-
-
- ##SMY: kept for future implementation. Replaced with DocumentConverter.
- '''
- # 1) Instantiate the DocumentExtractor
- logger.log(level=20, msg="initialising docextractor:", extra={"model_id": model_id, "hf_provider": hf_provider})
- try:
- docextractor = DocumentExtractor(
- provider=provider,
- model_id=model_id,
- hf_provider=hf_provider,
- endpoint_url=endpoint_url,
- backend_choice=backend_choice,
- system_message=system_message,
- max_tokens=max_tokens,
- temperature=temperature,
- top_p=top_p,
- stream=stream,
- api_token=api_token,
- )
- logger.log(level=20, msg="✔️ docextractor initialised:", extra={"model_id": model_id, "hf_provider": hf_provider})
- except Exception as exc:
- #logger.error(f"Failed to initialise DocumentExtractor: {exc}")
- tb = traceback.format_exc()
- logger.exception(f"init_worker: Error initialising DocumentExtractor → {exc}\n{tb}", exc_info=True)
- return f"✗ init_worker: error initialising DocumentExtractor → {exc}\n{tb}"
-
- self.docextractor = docextractor
- '''

 #'''
 # 1) Instantiate the DocumentConverter
@@ -105,6 +78,7 @@ def init_worker(#self,
 output_format, #: str = "markdown",
 output_dir, #: Union | None = "output_dir",
 use_llm, #: bool | None = False,

 page_range, #: str | None = None
 )
 logger.log(level=20, msg="✔️ docextractor initialised:", extra={"docconverter model_id": docconverter.converter.config.get("openai_model"), "docconverter use_llm": docconverter.converter.use_llm, "docconverter output_dir": docconverter.output_dir})
@@ -127,8 +101,9 @@ class PdfToMarkdownConverter:

 #def __init__(self, options: Dict | None = None):
 def __init__(self, options: Dict | None = None): #extractor: DocumentExtractor, options: Dict | None = None):
- self.options = options or {}
 self.output_dir_string = ''

 #self.OUTPUT_DIR = config.OUTPUT_DIR ##flag unused
 #self.MAX_RETRIES = config.MAX_RETRIES ##flag unused
 #self.docconverter = None #DocumentConverter
@@ -197,25 +172,28 @@ class PdfToMarkdownConverter:
 return {"file": md_file.name, "images": images_count, "filepath": md_file, "image_path": image_path} ####SMY should be Dict[str, int, str]. Dicts are not necessarily ordered.

 #def convert_files(src_path: str, output_dir: str, max_retries: int = 2) -> str:
- def convert_files(self, src_path: str, output_dir_string: str = None, max_retries: int = 2) -> Union[Dict, str]: #str:

 #def convert_files(self, src_path: str) -> str:
 """
 Worker task: use `extractor` to convert file with retry/backoff.
 Returns a short log line.
 """

- try:
 output_dir = create_outputdir(root=src_path, output_dir_string=self.output_dir_string)
 logger.info(f"✓ output_dir created: {output_dir}") #{create_outputdir(src_path)}"
 except Exception as exc:
 tb = traceback.format_exc()
 logger.exception("✗ error creating output_dir → {exc}\n{tb}", exc_info=True)
- return f"✗ error creating output_dir → {exc}\n{tb}"

 try:
 #if Path(src_path).suffix.lower() not in {".pdf", ".html", ".htm"}:
 #if not Path(src_path).name.endswith(tuple({".pdf", ".html"})): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
- if not Path(src_path).name.endswith((".pdf", ".html")): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):

 logger.log(level=20, msg=f"skipped {Path(src_path).name}", exc_info=True)
 return f"skipped {Path(src_path).name}"
 except Exception as exc:
@@ -226,7 +204,8 @@ class PdfToMarkdownConverter:
 #max_retries = self.MAX_RETRIES
 for attempt in range(1, max_retries + 1):
 try:
- info = self.extract(str(src_path), str(output_dir.stem)) #extractor.converter(str(src_path), str(output_dir)) #

 logger.log(level=20, msg=f"✓ : info about extracted {Path(src_path).name}: ", extra={"info": str(info)})
 ''' ##SMY: moving formating to calling Gradio
 img_count = info.get("images", 0)
@@ -239,7 +218,7 @@ class PdfToMarkdownConverter:
 except Exception as exc:
 if attempt == max_retries:
 tb = traceback.format_exc()
- return f"✗ {info.get('file')} → {exc}\n{tb}"
 #return f"✗ {md_filename} → {exc}\n{tb}"

 #time.sleep(2 ** attempt)

 # converters/pdf_to_md.py
 import os
 from pathlib import Path
+ from typing import List, Dict, Union, Optional
 import traceback ## Extract, format and print information about Python stack traces.
 import time

+ from ui.gradio_ui import gr
 from converters.extraction_converter import DocumentConverter #, DocumentExtractor #as docextractor #ExtractionConverter #get_extraction_converter ## SMY: should disuse
+ from file_handler.file_utils import write_markdown, dump_images, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir


 from utils import config

 output_format: str, #: str = "markdown",
 output_dir: str, #: Union | None = "output_dir",
 use_llm: bool, #: bool | None = False,
+ force_ocr: bool,
 page_range: str, #: str | None = None
+ progress: gr.Progress = gr.Progress(),
 ):

 #'''

 # Define global variables
 global docconverter
 global converter

 #'''
 # 1) Instantiate the DocumentConverter

 output_format, #: str = "markdown",
 output_dir, #: Union | None = "output_dir",
 use_llm, #: bool | None = False,
+ force_ocr,
 page_range, #: str | None = None
 )
 logger.log(level=20, msg="✔️ docextractor initialised:", extra={"docconverter model_id": docconverter.converter.config.get("openai_model"), "docconverter use_llm": docconverter.converter.use_llm, "docconverter output_dir": docconverter.output_dir})
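`init_worker` exists because `ProcessPoolExecutor` workers cannot share the parent's converter object; each worker process builds its own once and stores it in a module global, so per-file tasks reuse it. A minimal sketch of that pattern, under assumed names:

```
from concurrent.futures import ProcessPoolExecutor

_converter = None  # per-process global, set once by the initializer

def make_converter(config: dict):
    class _Stub:  # stand-in for the real DocumentConverter
        def convert(self, path: str) -> str:
            return f"converted {path}"
    return _Stub()

def _init_worker(config: dict) -> None:
    global _converter
    _converter = make_converter(config)  # heavy setup runs once per process

def _convert_one(path: str) -> str:
    return _converter.convert(path)      # each task reuses the process's converter

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2, initializer=_init_worker,
                             initargs=({"use_llm": False},)) as pool:
        print(list(pool.map(_convert_one, ["a.pdf", "b.pdf"])))
```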
 
 #def __init__(self, options: Dict | None = None):
 def __init__(self, options: Dict | None = None): #extractor: DocumentExtractor, options: Dict | None = None):
+ self.options = options or {} ##SMY: TOBE implemented - bring all Marker's options
 self.output_dir_string = ''
+ self.output_dir = self.output_dir_string ## placeholder
 #self.OUTPUT_DIR = config.OUTPUT_DIR ##flag unused
 #self.MAX_RETRIES = config.MAX_RETRIES ##flag unused
 #self.docconverter = None #DocumentConverter

 return {"file": md_file.name, "images": images_count, "filepath": md_file, "image_path": image_path} ####SMY should be Dict[str, int, str]. Dicts are not necessarily ordered.

 #def convert_files(src_path: str, output_dir: str, max_retries: int = 2) -> str:
+ #def convert_files(self, src_path: str, output_dir_string: str = None, max_retries: int = 2, progress = gr.Progress()) -> Union[Dict, str]: #str:
+ def convert_files(self, src_path: str, max_retries: int = 2, progress = gr.Progress()) -> Union[Dict, str]:
 #def convert_files(self, src_path: str) -> str:
 """
 Worker task: use `extractor` to convert file with retry/backoff.
 Returns a short log line.
 """

+ '''try: ##moved to gradio_ui. sets to PdfToMarkdownConverter.output_dir_string
 output_dir = create_outputdir(root=src_path, output_dir_string=self.output_dir_string)
 logger.info(f"✓ output_dir created: {output_dir}") #{create_outputdir(src_path)}"
 except Exception as exc:
 tb = traceback.format_exc()
 logger.exception("✗ error creating output_dir → {exc}\n{tb}", exc_info=True)
+ return f"✗ error creating output_dir → {exc}\n{tb}"'''
+ output_dir = Path(self.output_dir) ## takes the value from gradio_ui

 try:
 #if Path(src_path).suffix.lower() not in {".pdf", ".html", ".htm"}:
 #if not Path(src_path).name.endswith(tuple({".pdf", ".html"})): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
+ #if not Path(src_path).name.endswith((".pdf", ".html", ".docx", ".doc")): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
+ if not Path(src_path).name.endswith(config.file_types_tuple): #,".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls"})):
 logger.log(level=20, msg=f"skipped {Path(src_path).name}", exc_info=True)
 return f"skipped {Path(src_path).name}"
 except Exception as exc:

 #max_retries = self.MAX_RETRIES
 for attempt in range(1, max_retries + 1):
 try:
+ #info = self.extract(str(src_path), str(output_dir.stem)) #extractor.converter(str(src_path), str(output_dir)) #
+ info = self.extract(str(src_path), str(output_dir)) #extractor.converter(str(src_path), str(output_dir)) #
 logger.log(level=20, msg=f"✓ : info about extracted {Path(src_path).name}: ", extra={"info": str(info)})
 ''' ##SMY: moving formatting to calling Gradio
 img_count = info.get("images", 0)

 except Exception as exc:
 if attempt == max_retries:
 tb = traceback.format_exc()
+ return f"✗ {info.get('file', 'UnboundLocalError: info is None')} → {exc}\n{tb}"
 #return f"✗ {md_filename} → {exc}\n{tb}"

 #time.sleep(2 ** attempt)
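The retry loop in `convert_files` caps attempts and, when the commented `time.sleep(2 ** attempt)` is enabled, backs off exponentially between tries. A compact sketch of that shape, with assumed names:

```
import time

def convert_with_retries(convert, src_path: str, max_retries: int = 2) -> str:
    for attempt in range(1, max_retries + 1):
        try:
            return f"✓ {convert(src_path)}"
        except Exception as exc:
            if attempt == max_retries:
                return f"✗ {src_path} → {exc}"  # give up after the final attempt
            time.sleep(2 ** attempt)            # exponential backoff before retrying
```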
file_handler/file_utils.py CHANGED
@@ -6,15 +6,15 @@ import shutil
 import tempfile

 from itertools import chain
- from typing import List, Union, Any, Mapping
 from PIL import Image

 #import utils.config as config ##SMY: currently unused

- ##SMY: Might be deprecated vis duplicated. See marker/marker/config/parser.py ~ https://github.com/datalab-to/marker/blob/master/marker/config/parser.py#L169
 #def create_outputdir(root: Union[str, Path], out_dir:Union[str, Path] = None) -> Path: #List[Path]:
 def create_outputdir(root: Union[str, Path], output_dir_string:str = None) -> Path: #List[Path]:
- """ Create output dir under the input folder """

 ''' ##preserved for future implementation if needed again
 root = root if isinstance(root, Path) else Path(root)
@@ -24,10 +24,12 @@ def create_outputdir(root: Union[str, Path], output_dir_string:str = None) -> Pa
 out_dir = out_dir if out_dir else "output_md" ## SMY: default to outputdir in config file = "output_md"
 output_dir = root.parent / out_dir #"md_output" ##SMY: concatenating output str with src Path
 '''

 ## map to img_path. Opt to putting output within same output_md folder rather than individual source folders
 output_dir_string = output_dir_string if output_dir_string else "output_dir" ##redundant SMY: default to outputdir in config file = "output_md"
- output_dir = Path("data") / output_dir_string #"output_md" ##SMY: concatenating output str with src Path

 output_dir.mkdir(mode=0o2755, parents=True, exist_ok=True) #,mode=0o2755
 return output_dir

@@ -225,6 +227,17 @@ def check_create_file(filename: Union[str, Path]) -> Path:

 return filename_path

 def zip_processed_files(root_dir: str, file_paths: list[str], tz_hours=None, date_format='%d%b%Y_%H-%M-%S') -> Path:
 """
 Creates a zip file from a list of file paths (strings) and returns the Path object.
@@ -247,11 +260,14 @@ def zip_processed_files(root_dir: str, file_paths: list[str], tz_hours=None, dat
 raise ValueError(f"Root directory does not exist: {root_path}")

 # Create a temporary directory in a location where Gradio can access it.
- gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"

 #gradio_output_dir.mkdir(exist_ok=True)
 file_utils.check_create_dir(gradio_output_dir)
 final_zip_path = gradio_output_dir / f"outputs_processed_{utils.get_time_now_str(tz_hours=tz_hours, date_format=date_format)}.zip"
-

 # Use a context manager to create the zip file: use zipfile() opposed to shutil.make_archive
 # 'w' mode creates a new file, overwriting if it already exists.
 zip_unprocessed = 0
@@ -442,7 +458,7 @@ def write_markdown(
 Notes
 -----
 The function is intentionally lightweight: it only handles path resolution,
- directory creation, and file I/O. All rendering logic should be performed before
 calling this helper.
 """
 src = Path(src_path)
@@ -460,9 +476,11 @@ def write_markdown(

 ## Opt to putting output within same output_md folder rather than individual source folders
 #md_path = Path("data\\pdf") / "output_md" / f"{src.stem}" / md_name ##debug
- md_path = Path("data") / output_dir / f"{src.stem}" / md_name ##debug

 ##SMY: [resolved] Permission Errno13 - https://stackoverflow.com/a/57454275
- md_path.parent.mkdir(mode=0o2755, parents=True, exist_ok=True) ##SMY: create nested md_path if not exists

 md_path.parent.chmod(0)

 try:
@@ -531,7 +549,8 @@ def dump_images(
 #img_path = Path(src.parent) / f"{Path(output_dir).stem}" / f"{src.stem}" / img_name

 #img_path = Path("data\\pdf") / "output_md" / f"{src.stem}" / img_name ##debug
- img_path = Path("data") / output_dir / f"{src.stem}" / img_name ##debug

 #img_path.mkdir(mode=0o777, parents=True, exist_ok=True) ##SMY: create nested img_path if not exists
 #img_path.parent.mkdir(parents=True, exist_ok=True)

 import tempfile

 from itertools import chain
+ from typing import List, Optional, Union, Any, Mapping
 from PIL import Image

 #import utils.config as config ##SMY: currently unused

+ ##SMY: flagged: deprecated vis duplicated. See create_temp_folder() and marker/marker/config/parser.py ~ https://github.com/datalab-to/marker/blob/master/marker/config/parser.py#L169
 #def create_outputdir(root: Union[str, Path], out_dir:Union[str, Path] = None) -> Path: #List[Path]:
 def create_outputdir(root: Union[str, Path], output_dir_string:str = None) -> Path: #List[Path]:
+ """ Create output dir, defaulting to Temp """

 ''' ##preserved for future implementation if needed again
 root = root if isinstance(root, Path) else Path(root)

 out_dir = out_dir if out_dir else "output_md" ## SMY: default to outputdir in config file = "output_md"
 output_dir = root.parent / out_dir #"md_output" ##SMY: concatenating output str with src Path
 '''
+ root = create_temp_folder()

 ## map to img_path. Opt to putting output within same output_md folder rather than individual source folders
 output_dir_string = output_dir_string if output_dir_string else "output_dir" ##redundant SMY: default to outputdir in config file = "output_md"
+ #output_dir = Path("data") / output_dir_string #"output_md" ##SMY: concatenating output str with src Path
+ output_dir = Path(root) / output_dir_string #"output_md" ##SMY: concatenating output str with src Path
 output_dir.mkdir(mode=0o2755, parents=True, exist_ok=True) #,mode=0o2755
 return output_dir


 return filename_path

+ def create_temp_folder(tempfolder: Optional[str | Path] = ''):
+ """ Create a temp folder for Gradio, with an optional subfolder if supplied """
+ # Create a temporary directory in a location where Gradio can access it.
+ #gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"/ tempfolder if tempfolder else Path(tempfile.gettempdir()) / "gradio_temp_output"
+ #gradio_output_dir.mkdir(exist_ok=True)
+ #gradio_output_dir = check_create_dir(gradio_output_dir)
+
+ gradio_output_dir = check_create_dir(Path(tempfile.gettempdir()) / "gradio_temp_output"/ tempfolder if tempfolder else Path(tempfile.gettempdir()) / "gradio_temp_output")
+
+ return gradio_output_dir
+
 def zip_processed_files(root_dir: str, file_paths: list[str], tz_hours=None, date_format='%d%b%Y_%H-%M-%S') -> Path:
 """
 Creates a zip file from a list of file paths (strings) and returns the Path object.

 raise ValueError(f"Root directory does not exist: {root_path}")

 # Create a temporary directory in a location where Gradio can access it.
+ ##SMY: synced with create_temp_folder()
+ '''gradio_output_dir = Path(tempfile.gettempdir()) / "gradio_temp_output"
 #gradio_output_dir.mkdir(exist_ok=True)
 file_utils.check_create_dir(gradio_output_dir)
 final_zip_path = gradio_output_dir / f"outputs_processed_{utils.get_time_now_str(tz_hours=tz_hours, date_format=date_format)}.zip"
+ '''
+ final_zip_path = Path(root_dir).parent / f"outputs_processed_{utils.get_time_now_str(tz_hours=tz_hours, date_format=date_format)}.zip"
+
 # Use a context manager to create the zip file: use zipfile() opposed to shutil.make_archive
 # 'w' mode creates a new file, overwriting if it already exists.
 zip_unprocessed = 0

 Notes
 -----
 The function is intentionally lightweight: it only handles path resolution,
+ directory creation, and file I/O. All rendering logic is performed before
 calling this helper.
 """
 src = Path(src_path)

 ## Opt to putting output within same output_md folder rather than individual source folders
 #md_path = Path("data\\pdf") / "output_md" / f"{src.stem}" / md_name ##debug
+ #md_path = Path("data") / output_dir / f"{src.stem}" / md_name ##debug
+ md_path = Path(output_dir) / f"{src.stem}" / md_name ##debug
 ##SMY: [resolved] Permission Errno13 - https://stackoverflow.com/a/57454275
+ #md_path.parent.mkdir(mode=0o2755, parents=True, exist_ok=True) ##SMY: create nested md_path if not exists
+ md_path.parent.mkdir(parents=True, exist_ok=True) ##SMY: md_path now resides in Temp
 md_path.parent.chmod(0)

 try:

 #img_path = Path(src.parent) / f"{Path(output_dir).stem}" / f"{src.stem}" / img_name

 #img_path = Path("data\\pdf") / "output_md" / f"{src.stem}" / img_name ##debug
+ #img_path = Path("data") / output_dir / f"{src.stem}" / img_name ##debug
+ img_path = Path(output_dir) / f"{src.stem}" / img_name
 #img_path.mkdir(mode=0o777, parents=True, exist_ok=True) ##SMY: create nested img_path if not exists
 #img_path.parent.mkdir(parents=True, exist_ok=True)
llm/llm_login.py CHANGED
@@ -5,6 +5,7 @@ from time import sleep
 from typing import Optional

 from utils.logger import get_logger

 ## Get logger instance
 logger = get_logger(__name__)
@@ -14,6 +15,19 @@ def disable_immplicit_token():
 # Explicitly disable implicit token propagation; we rely on explicit auth or env var
 os.environ["HF_HUB_DISABLE_IMPLICIT_TOKEN"] = "1"

 def login_huggingface(token: Optional[str] = None):
 """
 Login to Hugging Face account. Prioritize CLI login for privacy and determinism.

 from typing import Optional

 from utils.logger import get_logger
+ from ui.gradio_ui import gr

 ## Get logger instance
 logger = get_logger(__name__)

 # Explicitly disable implicit token propagation; we rely on explicit auth or env var
 os.environ["HF_HUB_DISABLE_IMPLICIT_TOKEN"] = "1"

+ #def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,):
+ def get_login_token( api_token_arg, oauth_token):
+ """ Use the user's supplied token, else the logged-in user's OAuth token, else the token stored on the machine. Return the token. """
+ #oauth_token = get_token() if oauth_token is not None else api_token_arg
+ if api_token_arg != '': # or not None: #| None:
+ oauth_token = api_token_arg
+ elif oauth_token:
+ oauth_token = oauth_token.token
+ else: oauth_token = '' if not get_token() else get_token()
+
+ #return str(oauth_token) if oauth_token else '' ##token value or empty string
+ return oauth_token if oauth_token else '' ##token value or empty string
+
 def login_huggingface(token: Optional[str] = None):
 """
 Login to Hugging Face account. Prioritize CLI login for privacy and determinism.
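The precedence implemented by `get_login_token` is: explicit user token, then the Gradio OAuth token, then the token huggingface_hub has stored locally, else an empty string. An illustrative check (the import path is assumed):

```
from types import SimpleNamespace
from llm.llm_login import get_login_token  # assumed import path

oauth = SimpleNamespace(token="hf_oauth_xxx")  # stand-in for gr.OAuthToken
assert get_login_token("hf_explicit_xxx", oauth) == "hf_explicit_xxx"  # explicit wins
assert get_login_token("", oauth) == "hf_oauth_xxx"                    # OAuth next
print(get_login_token("", None))  # stored CLI token if present, else ''
```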
requirements.txt CHANGED
@@ -1,5 +1,8 @@
- gradio>=5.44.0
- marker-pdf[full]>=1.10.0 # pip install marker (GitHub: https://github.com/datalab-to/marker)
- weasyprint>=59.0 # optional fallback if pandoc is not available
- #pandoc==2.3 # for Markdown → PDF conversion
- python-magic==0.4.27 # file‑type detection

+ gradio>=5.44.0 # gradio[mcp]>=5.44.0
+ #mcp>=1.15.0 # MCP Python SDK (Model Context Protocol)
+ marker-pdf[full]>=1.10.0 # pip install marker (GitHub: https://github.com/datalab-to/marker)
+ weasyprint>=59.0 # optional fallback if pandoc is not available
+ #pandoc==2.3 # for Markdown → PDF conversion
+ python-magic==0.4.27 # file‑type detection
+ #pypdfium2 # Python binding to PDFium for PDF rendering, inspection, manipulation and creation
+ #huggingface_hub>=0.34.0 # HuggingFace integration
ui/gradio_ui.py CHANGED
@@ -1,4 +1,5 @@
1
  # ui/gradio_ui.py
 
2
  import gradio as gr
3
  from concurrent.futures import ProcessPoolExecutor, as_completed
4
  import asyncio
@@ -7,23 +8,21 @@ from pathlib import Path, WindowsPath
7
  from typing import Optional, Union #, Dict, List, Any, Tuple
8
 
9
  from huggingface_hub import get_token
10
- from numpy import append, iterable
11
 
12
  #import file_handler
 
13
  import file_handler.file_utils
14
- from utils.config import TITLE, DESCRIPTION, DESCRIPTION_PDF_HTML, DESCRIPTION_PDF, DESCRIPTION_HTML, DESCRIPTION_MD
15
  from utils.utils import is_dict, is_list_of_dicts
16
  from file_handler.file_utils import zip_processed_files, process_dicts_data, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir ## should move to handling file
17
  from file_handler.file_utils import find_file
18
  from utils.get_config import get_config_value
19
 
20
- #from llm.hf_client import HFChatClient ## SMY: unused. See converters.extraction_converter
21
  from llm.provider_validator import is_valid_provider, suggest_providers
22
- from llm.llm_login import is_loggedin_huggingface, login_huggingface
23
  from converters.extraction_converter import DocumentConverter as docconverter #DocumentExtractor #as docextractor
24
  from converters.pdf_to_md import PdfToMarkdownConverter, init_worker
25
- #from converters.md_to_pdf import MarkdownToPdfConverter
26
- #from converters.html_to_md import HtmlToMarkdownConverter ##SMY: PENDING: implementation
27
 
28
  import traceback ## Extract, format and print information about Python stack traces.
29
  from utils.logger import get_logger
@@ -32,7 +31,6 @@ logger = get_logger(__name__) ##NB: setup_logging() ## set logging
32
 
33
  # Instantiate converters class once – they are stateless
34
  pdf2md_converter = PdfToMarkdownConverter()
35
- #html2md_converter = HtmlToMarkdownConverter()
36
  #md2pdf_converter = MarkdownToPdfConverter()
37
 
38
 
@@ -42,25 +40,18 @@ from converters.extraction_converter import load_models
42
  from globals import config_load_models
43
  try:
44
  if not config_load_models.model_dict:
45
- config_load_models.model_dict = load_models()
 
46
  '''if 'model_dict' not in globals():
47
  global model_dict
48
  model_dict = load_models()'''
 
49
  except Exception as exc:
50
  #tb = traceback.format_exc() #exc.__traceback__
51
  logger.exception(f"βœ— Error loading models (reload): {exc}") #\n{tb}")
52
  raise RuntimeError(f"βœ— Error loading models (reload): {exc}") #\n{tb}")
53
 
54
- def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,):
55
- """ Use user's supplied token or Get token from logged-in users, else from token stored on the machine. Return token"""
56
- #oauth_token = get_token() if oauth_token is not None else api_token_arg
57
- if api_token_arg != '': # or not None: #| None:
58
- oauth_token = api_token_arg
59
- elif oauth_token:
60
- oauth_token = oauth_token
61
- else: get_token()
62
-
63
- return oauth_token.token if oauth_token else '' ##token value or empty string
64
 
65
  # pool executor to convert files called by Gradio
66
  ##SMY: TODO: future: refactor to gradio_process.py and
@@ -90,6 +81,7 @@ def convert_batch(
90
  #output_dir: Optional[Union[str, Path]] = "output_dir",
91
  output_dir_string: str = "output_dir_default",
92
  use_llm: bool = False, #Optional[bool] = False, #True,
 
93
  page_range: str = None, #Optional[str] = None,
94
  tz_hours: str = None,
95
  oauth_token: gr.OAuthToken | None=None,
@@ -103,15 +95,16 @@ def convert_batch(
103
  """
104
 
105
  # login: Update the Gradio UI to improve user-friendly eXperience - commencing
106
- #yield gr.update(interactive=False), f"Commencing Processing ... Getting login", {"process": "Commencing Processing"}, f"dummy_log.log"
107
- #progress((0,16), f"Commencing Processing ...")
 
108
 
109
  # get token from logged-in user:
110
  api_token = get_login_token(api_token_arg=api_token_gr, oauth_token=oauth_token)
111
  ##SMY: Strictly debug. Must not be live
112
- #logger.log(level=30, msg="Commencing: get_login_token", extra={"api_token]": api_token, "api_token_gr": api_token_gr})
113
 
114
- try:
115
  ##SMY: might deprecate. To replace with oauth login from Gradio ui or integrate cleanly.
116
  #login_huggingface(api_token) ## attempt login if not already logged in. NB: HF CLI login prompt would not display in Process Worker.
117
 
@@ -131,9 +124,8 @@ def convert_batch(
131
  tb = traceback.format_exc()
132
  logger.exception(f"βœ— Error during login_huggingface β†’ {exc}\n{tb}", exc_info=True) # Log the full traceback
133
  return [gr.update(interactive=True), f"βœ— An error occurred during login_huggingface β†’ {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] # return the exception message
134
-
135
- #progress((1,16), desc=f"Log in: {is_loggedin_huggingface}")
136
-
137
  ## debug
138
  #logger.log(level=30, msg="pdf_files_inputs", extra={"input_arg[0]:": pdf_files[0]})
139
 
@@ -143,22 +135,23 @@ def convert_batch(
143
  #outputs=[log_output, files_individual_JSON, files_individual_downloads],
144
  return [gr.update(interactive=True), "Initialising ProcessPool: No files uploaded.", {"Upload":"No files uploaded"}, f"dummy_log.log"]
145
 
146
- #progress((2,16), desc=f"Getting configuration values")
147
  # Get config values if not provided
148
- config_file = find_file("config.ini") ##from file_handler.file_utils
149
- model_id = get_config_value(config_file, "MARKER_CAP", "MODEL_ID") if not model_id else model_id
150
- openai_base_url = get_config_value(config_file, "MARKER_CAP", "OPENAI_BASE_URL") if not openai_base_url else openai_base_url
151
- openai_image_format = get_config_value(config_file, "MARKER_CAP", "OPENAI_IMAGE_FORMAT") if not openai_image_format else openai_image_format
152
- max_workers = get_config_value(config_file, "MARKER_CAP", "MAX_WORKERS") if not max_workers else max_workers
153
- max_retries = get_config_value(config_file, "MARKER_CAP", "MAX_RETRIES") if not max_retries else max_retries
154
- output_format = get_config_value(config_file, "MARKER_CAP", "OUTPUT_FORMAT") if not output_format else output_format
155
- output_dir_string = str(get_config_value(config_file, "MARKER_CAP", "OUTPUT_DIR") if not output_dir_string else output_dir_string)
156
- use_llm = get_config_value(config_file, "MARKER_CAP", "USE_LLM") if not use_llm else use_llm
157
- page_range = get_config_value(config_file,"MARKER_CAP", "PAGE_RANGE") if not page_range else page_range
158
- #progress((3,16), desc="Retrieved configuration values")
 
159
 
160
  # Create the initargs tuple from the Gradio inputs: # 'files' is an iterable, and handled separately.
161
- #progress((4,16), desc=f"Initialiasing init_args")
162
  yield gr.update(interactive=False), f"Initialising init_args", {"process": "Processing files ..."}, f"dummy_log.log"
163
  init_args = (
164
  provider,
@@ -180,83 +173,91 @@ def convert_batch(
180
  output_format,
181
  output_dir_string,
182
  use_llm,
 
183
  page_range,
184
  )
185
 
186
- #global docextractor ##SMY: deprecated.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
187
  try:
188
  results = [] ## initialised pool result holder
189
- # Create a pool with init_worker initialiser
190
  logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string}) #pdf_files_count
191
- #progress((5,16), desc=f"Initialising ProcessPoolExecutor: Processing Files ...")
192
  yield gr.update(interactive=False), f"Initialising ProcessPoolExecutor: Processing Files ...", {"process": "Processing files ..."}, f"dummy_log.log"
 
193
 
 
194
  with ProcessPoolExecutor(
195
  max_workers=max_workers,
196
  initializer=init_worker,
197
  initargs=init_args
198
  ) as pool:
199
- #logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string}) #pdf_files_count
200
- #progress((6,16), desc=f"Starting ProcessPool queue: Processing Files ...")
201
-
202
- # Update the Gradio UI to improve user-friendly eXperience
203
- #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
204
-
205
-
206
  # Map the files (pdf_files) to the conversion function (pdf2md_converter.convert_file)
207
  # The 'docconverter' argument is implicitly handled by the initialiser
208
  #futures = [pool.map(pdf2md_converter.convert_files, f) for f in pdf_files]
209
  #logs = [f.result() for f in as_completed(futures)]
210
  #futures = [pool.submit(pdf2md_converter.convert_files, file) for file in pdf_files]
211
  #logs = [f.result() for f in futures]
212
-
213
  try:
214
- #(7,16), desc=f"ProcessPoolExecutor: Creating output_dir")
215
- yield gr.update(interactive=False), f"Creating output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
216
- pdf2md_converter.output_dir_string = output_dir_string ##SMY: attempt setting directly to resolve pool.map iterable
217
- #progress((8,16), desc=f"ProcessPoolExecutor: Created output_dir.")
218
- yield gr.update(interactive=False), f"Created output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
219
-
220
- except Exception as exc:
221
- # Raise the exception to stop the Gradio app: exception to halt execution
222
- logger.exception("Error during creating output_dir", exc_info=True) # Log the full traceback
223
- traceback.print_exc() # Print the exception traceback
224
- #return f"An error occurred during pool.map: {str(exc)}", f"Error: {exc}", f"Error: {exc}" ## return the exception message
225
- # Update the Gradio UI to improve user-friendly eXperience
226
- yield gr.update(interactive=True), f"An error occurred creating output_dir: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
227
-
228
- try:
229
- #progress((9,16), desc=f"ProcessPoolExecutor: Pooling file conversion ...")
230
  yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion ...", {"process": "Processing files ..."}, f"dummy_log.log"
 
 
          # Use progress.tqdm to integrate with the executor map
          #results = pool.map(pdf2md_converter.convert_files, pdf_files) ##SMY iterables #max_retries #output_dir_string)
          for result_interim in progress.tqdm(
- iterable=pool.map(pdf2md_converter.convert_files, pdf_files), total=len(pdf_files)
          ):
              results.append(result_interim)
- #progress((10,16), desc=f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]")
          # Update the Gradio UI to improve user-friendly eXperience
          yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
 
 
- #progress((11,16), desc=f"ProcessPoolExecutor: Got Results from files conversion")
- yield gr.update(interactive=True), f"rocessPoolExecutor: Got Results from files conversion: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
      except Exception as exc:
          # Raise the exception to stop the Gradio app: exception to halt execution
          logger.exception("Error during pooling file conversion", exc_info=True) # Log the full traceback
- traceback.print_exc() # Print the exception traceback
- return [gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] ## return the exception message
          # Update the Gradio UI to improve user-friendly eXperience
- #yield gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
 
- #'''
      try:
          logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
          logs = []
          logs_files_images = []
          #logs.extend(results) ## performant pythonic
          #logs = list[results] ##
          logs = [result for result in results] ## pythonic list comprehension
- ## logs : [file , images , filepath, image_path]
          #logs_files_images = logs_files.extend(logs_images) #zip(logs_files, logs_images) ##SMY: in progress
          logs_count = 0
@@ -268,64 +269,48 @@ def convert_batch(
          # Update the Gradio UI to improve user-friendly eXperience
          #yield gr.update(interactive=False), f"Processing files: {logs_files_images[logs_count]}", {"process": "Processing files"}, f"dummy_log.log"
          logs_count = i+i_image
-
- #progress((12,16), desc="Processing results from files conversion") ##rekickin
- #logs_files_images.append(logs_filepath) ## to del
- #logs_files_images.extend(logs_images) ## to del
      except Exception as exc:
- logger.exception("Error during processing results logs → {exc}\n{tb}", exc_info=True) # Log the full traceback
- traceback.print_exc() # Print the exception traceback
          return [gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] ## return the exception message
          #yield gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
-
- #'''
  except Exception as exc:
      tb = traceback.format_exc()
      logger.exception(f"✗ Error during ProcessPoolExecutor → {exc}\n{tb}" , exc_info=True) # Log the full traceback
      #traceback.print_exc() # Print the exception traceback
- yield gr.update(interactive=True), f"✗ An error occurred during ProcessPoolExecutor→ {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" # return the exception message

- '''
- logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
- logs = []
- #logs.extend(results) ## performant pythonic
- #logs = list[results] ##
- logs = [result for result in results] ## pythonic list comprehension
- '''
-
- # Zip Processed md Files and images. Insert to first index
  try: ##from file_handler.file_utils
- #progress((13,16), desc="Zipping processed files and images")
- zipped_processed_files = zip_processed_files(root_dir=f"data/{output_dir_string}", file_paths=logs_files_images, tz_hours=tz_hours, date_format='%d%b%Y_%H-%M-%S') #date_format='%d%b%Y'
      logs_files_images.insert(0, zipped_processed_files)
- #logs_files_images.insert(1, "====================")

- #progress((14,16), desc="Zipped processed files and images")
      #yield gr.update(interactive=False), f"Processing zip and files: {logs_files_images}", {"process": "Processing files"}, f"dummy_log.log"

  except Exception as exc:
      tb = traceback.format_exc()
      logger.exception(f"✗ Error during zipping processed files → {exc}\n{tb}" , exc_info=True) # Log the full traceback
      #traceback.print_exc() # Print the exception traceback
- #return gr.update(interactive=True), f"✗ An error occurred during zipping files → {exc}\n{tb}", f"Error: {exc}", f"Error: {exc}" # return the exception message
      yield gr.update(interactive=True), f"✗ An error occurred during zipping files → {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" # return the exception message

  # Return processed files log
  try:
- #progress((15,16), desc="Formatting processed log results")

      ## # Convert logs list of dicts to formatted json string
      logs_return_formatted_json_string = file_handler.file_utils.process_dicts_data(logs) #"\n".join(log for log in logs) ##SMY outputs to gr.JSON component with no need for json.dumps(data, indent=)
- #logs_files_images_return = "\n".join(path for path in logs_files_images) ##TypeError: sequence item 0: expected str instance, WindowsPath found
-
- ##convert the List of Path objects to List of string for gr.Files output
- #logs_files_images_return = list(str(path) for path in logs_files_images)

      ## # Convert any Path objects to strings, but leave strings as-is
      logs_files_images_return = list(str(path) if isinstance(path, Path) else path for path in logs_files_images)
      logger.log(level=20, msg="File conversion complete. Sending outcome to Gradio:", extra={"logs_files_image_return": str(logs_files_images_return)}) ## debug: FileNotFoundError: [WinError 2] The system cannot find the file specified: 'Error or no image_path'

- #progress((16,16), desc="Complete processing and formatting file processing results")

      #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
      #return "\n".join(logs), "\n".join(logs_files_images) #"\n".join(logs_files)
@@ -338,8 +323,8 @@ def convert_batch(
      tb = traceback.format_exc()
      logger.exception(f"✗ Error during returning result logs → {exc}\n{tb}" , exc_info=True) # Log the full traceback
      #traceback.print_exc() # Print the exception traceback
- #return [gr.update(interactive=True), f"✗ An error occurred during returning result logs→ {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] # return the exception message
- yield [gr.update(interactive=True), f"✗ An error occurred during returning result logs→ {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] # return the exception message

  #return "\n".join(log for log in logs), "\n".join(str(path) for path in logs_files_images)
  #print(f'logs_files_images: {"\n".join(str(path) for path in logs_files_images)}')
@@ -517,7 +502,7 @@ def build_interface() -> gr.Blocks:
          #message = f"Accumulated {len(updated_files)} file(s) total.\n\nAll file paths:\n{file_info}"
          message = f"Accumulated {len(updated_files)} file(s) total: \n{filename_info}"

- return updated_files, message

      # with gr.Blocks(title=TITLE) as demo
      with gr.Blocks(title=TITLE, css=custom_css) as demo:
@@ -592,7 +577,7 @@ def build_interface() -> gr.Blocks:
          )

          # Clean UI: Model parameters hidden in collapsible accordion
- with gr.Accordion("⚙️ Marker Settings", open=False):
              gr.Markdown(f"#### **Marker Configuration**")
              with gr.Row():
                  openai_base_url_tb = gr.Textbox(
@@ -607,7 +592,7 @@ def build_interface() -> gr.Blocks:
                      value="webp",
                  )
                  output_format_dd = gr.Dropdown(
- choices=["markdown", "html"], #, "json", "chunks"], ##SMY: To be enabled later
                      #choices=["markdown", "html", "json", "chunks"],
                      label="Output Format",
                      value="markdown",
@@ -633,10 +618,15 @@ def build_interface() -> gr.Blocks:
                      value=2,
                      step=1 #0.01
                  )
- use_llm_cb = gr.Checkbox(
- label="Use LLM for Marker conversion",
- value=False
- )

                  page_range_tb = gr.Textbox(
                      label="Page Range (Optional)",
                      placeholder="Example: 0,1-5,8,12-15",
@@ -677,13 +667,14 @@ def build_interface() -> gr.Blocks:
          btn_pdf_convert = gr.Button("Convert PDF(s)")
          '''

          with gr.Column(elem_classes=["file-or-directory-area"]):
              with gr.Row():
                  file_btn = gr.UploadButton(
                  #file_btn = gr.File(
                      label="Upload Multiple Files",
                      file_count="multiple",
- file_types=["file"],
                      #height=25, #"sm",
                      size="sm",
                      elem_classes=["gradio-upload-btn"]
@@ -692,7 +683,7 @@ def build_interface() -> gr.Blocks:
                  #dir_btn = gr.File(
                      label="Upload a Directory",
                      file_count="directory",
- #file_types=["file"], #Warning: The `file_types` parameter is ignored when `file_count` is 'directory'
                      #height=25, #"0.5",
                      size="sm",
                      elem_classes=["gradio-upload-btn"]
@@ -702,8 +693,8 @@ def build_interface() -> gr.Blocks:
              output_textbox = gr.Textbox(label="Accumulated Files", lines=3) #, max_lines=4) #10

              with gr.Row():
- process_button = gr.Button("Process All Uploaded Files", variant="primary")
- clear_button = gr.Button("Clear All Uploads", variant="secondary")

          # --- PDF → Markdown tab ---
@@ -890,8 +881,10 @@ def build_interface() -> gr.Blocks:
              """
              #msg = f"Files list cleared: {do_logout()}" ## use as needed
              msg = f"Files list cleared."
- yield [], msg, '', ''
              #return [], f"Files list cleared.", [], []

          #hf_login_logout_btn.click(fn=custom_do_logout, inputs=None, outputs=hf_login_logout_btn)
          ##unused
@@ -905,14 +898,14 @@ def build_interface() -> gr.Blocks:
          file_btn.upload(
              fn=accumulate_files,
              inputs=[file_btn, uploaded_file_list],
- outputs=[uploaded_file_list, output_textbox]
          )

          # Event handler for the directory upload button
          dir_btn.upload(
              fn=accumulate_files,
              inputs=[dir_btn, uploaded_file_list],
- outputs=[uploaded_file_list, output_textbox]
          )

          # Event handler for the "Clear" button
@@ -957,6 +950,7 @@ def build_interface() -> gr.Blocks:
              output_format_dd,
              output_dir_tb,
              use_llm_cb,

              page_range_tb,
              tz_hours_num, #state_tz_hours
          ]
 
  # ui/gradio_ui.py
  import gradio as gr
  from concurrent.futures import ProcessPoolExecutor, as_completed
  import asyncio
 
  from typing import Optional, Union #, Dict, List, Any, Tuple

  from huggingface_hub import get_token
 
  #import file_handler
+ from file_handler import file_utils
  import file_handler.file_utils
+ from utils.config import TITLE, DESCRIPTION, DESCRIPTION_PDF_HTML, DESCRIPTION_PDF, DESCRIPTION_HTML, DESCRIPTION_MD, file_types_list, file_types_tuple
  from utils.utils import is_dict, is_list_of_dicts
  from file_handler.file_utils import zip_processed_files, process_dicts_data, collect_pdf_paths, collect_html_paths, collect_markdown_paths, create_outputdir ## should move to handling file
  from file_handler.file_utils import find_file
  from utils.get_config import get_config_value

  from llm.provider_validator import is_valid_provider, suggest_providers
+ from llm.llm_login import get_login_token, is_loggedin_huggingface, login_huggingface
  from converters.extraction_converter import DocumentConverter as docconverter #DocumentExtractor #as docextractor
  from converters.pdf_to_md import PdfToMarkdownConverter, init_worker
+ #from converters.md_to_pdf import MarkdownToPdfConverter ##SMY: PENDING: implementation
 
  import traceback ## Extract, format and print information about Python stack traces.
  from utils.logger import get_logger

  # Instantiate converters class once – they are stateless
  pdf2md_converter = PdfToMarkdownConverter()
  #md2pdf_converter = MarkdownToPdfConverter()
 
  from globals import config_load_models
  try:
      if not config_load_models.model_dict:
+ model_dict = load_models()
+ config_load_models.model_dict = model_dict
          '''if 'model_dict' not in globals():
              global model_dict
              model_dict = load_models()'''
+ logger.log(level=30, msg="Config_load_model: ", extra={"model_dict": str(model_dict)})
  except Exception as exc:
      #tb = traceback.format_exc() #exc.__traceback__
      logger.exception(f"✗ Error loading models (reload): {exc}") #\n{tb}")
      raise RuntimeError(f"✗ Error loading models (reload): {exc}") #\n{tb}")

+ #def get_login_token( api_token_arg, oauth_token: gr.OAuthToken | None=None,): ##moved to llm_login
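The two added lines above cache the loaded models on the shared `config_load_models` holder, so a Gradio reload does not trigger a fresh load. A minimal sketch of that load-once pattern, assuming a `load_models()` loader like the one this module calls (the stub below is illustrative, not the repo's implementation):

```python
class _ConfigLoadModels:
    model_dict: dict | None = None

config_load_models = _ConfigLoadModels()  # shared, module-level holder

def load_models() -> dict:
    """Stand-in for the repo's real loader."""
    return {"marker": "model-handle"}

def get_models() -> dict:
    """Load models on first call; reuse the cached dict on later calls/reloads."""
    if not config_load_models.model_dict:
        config_load_models.model_dict = load_models()
    return config_load_models.model_dict
```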
  # pool executor to convert files called by Gradio
  ##SMY: TODO: future: refactor to gradio_process.py and

      #output_dir: Optional[Union[str, Path]] = "output_dir",
      output_dir_string: str = "output_dir_default",
      use_llm: bool = False, #Optional[bool] = False, #True,
+ force_ocr: bool = True, #Optional[bool] = False,
      page_range: str = None, #Optional[str] = None,
      tz_hours: str = None,
      oauth_token: gr.OAuthToken | None=None,
 
      """

      # login: Update the Gradio UI to improve user-friendly eXperience - commencing
+ # [template]: #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
+ yield gr.update(interactive=False), f"Commencing Processing ... Getting login", {"process": "Commencing Processing"}, f"dummy_log.log"
+ progress((0,16), f"Commencing Processing ...")

      # get token from logged-in user:
      api_token = get_login_token(api_token_arg=api_token_gr, oauth_token=oauth_token)
      ##SMY: Strictly debug. Must not be live
+ #logger.log(level=30, msg="Commencing: get_login_token", extra={"api_token": api_token, "api_token_gr": api_token_gr})

+ '''try:
      ##SMY: might deprecate. To replace with oauth login from Gradio ui or integrate cleanly.
      #login_huggingface(api_token) ## attempt login if not already logged in. NB: HF CLI login prompt would not display in Process Worker.

          tb = traceback.format_exc()
          logger.exception(f"✗ Error during login_huggingface → {exc}\n{tb}", exc_info=True) # Log the full traceback
          return [gr.update(interactive=True), f"✗ An error occurred during login_huggingface → {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] # return the exception message
+ '''
+ progress((1,16), desc=f"Log in: {is_loggedin_huggingface(api_token)}")
129
  ## debug
130
  #logger.log(level=30, msg="pdf_files_inputs", extra={"input_arg[0]:": pdf_files[0]})
131
 
 
135
  #outputs=[log_output, files_individual_JSON, files_individual_downloads],
136
  return [gr.update(interactive=True), "Initialising ProcessPool: No files uploaded.", {"Upload":"No files uploaded"}, f"dummy_log.log"]
137
 
138
+ progress((2,16), desc=f"Getting configuration values")
139
  # Get config values if not provided
140
+ config_file = find_file("config.ini") ##from file_handler.file_utils ##takes a bit of time to process. #NeedOptimise
141
+ model_id = model_id if model_id else get_config_value(config_file, "MARKER_CAP", "MODEL_ID")
142
+ openai_base_url = openai_base_url if openai_base_url else get_config_value(config_file, "MARKER_CAP", "OPENAI_BASE_URL")
143
+ openai_image_format = openai_image_format if openai_image_format else get_config_value(config_file, "MARKER_CAP", "OPENAI_IMAGE_FORMAT")
144
+ max_workers = max_workers if max_workers else get_config_value(config_file, "MARKER_CAP", "MAX_WORKERS")
145
+ max_retries = max_retries if max_retries else get_config_value(config_file, "MARKER_CAP", "MAX_RETRIES")
146
+ output_format = output_format if output_format else get_config_value(config_file, "MARKER_CAP", "OUTPUT_FORMAT")
147
+ output_dir_string = output_dir_string if output_dir_string else str(get_config_value(config_file, "MARKER_CAP", "OUTPUT_DIR"))
148
+ use_llm = use_llm if use_llm else get_config_value(config_file, "MARKER_CAP", "USE_LLM")
149
+ page_range = page_range if page_range else get_config_value(config_file,"MARKER_CAP", "PAGE_RANGE")
150
+
151
+ progress((3,16), desc=f"Retrieved configuration values")
152
 
153
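The ladder of `x = x if x else get_config_value(...)` lines above fills any unset UI values from `config.ini`. The same fallback can be written with the stdlib `configparser`; the helper below is an illustrative sketch, not the repo's `get_config_value` (only the `MARKER_CAP` section and key names are taken from the diff):

```python
import configparser

def config_fallback(value, config_path: str, section: str, key: str):
    """Return `value` unless it is falsy, else read the default from an INI file."""
    if value:
        return value
    parser = configparser.ConfigParser()
    parser.read(config_path)                       # yields no sections if the file is missing
    return parser.get(section, key, fallback=None)

model_id = config_fallback(None, "config.ini", "MARKER_CAP", "MODEL_ID")
```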
  # Create the initargs tuple from the Gradio inputs: # 'files' is an iterable, and handled separately.
154
+ progress((4,16), desc=f"Initialiasing init_args")
155
  yield gr.update(interactive=False), f"Initialising init_args", {"process": "Processing files ..."}, f"dummy_log.log"
156
  init_args = (
157
  provider,
 
173
  output_format,
174
  output_dir_string,
175
  use_llm,
176
+ force_ocr,
177
  page_range,
178
  )
179
 
180
+ # create output_dir
181
+ try:
182
+ yield gr.update(interactive=False), f"Creating output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
183
+ progress((5,16), desc=f"ProcessPoolExecutor: Creating output_dir")
184
+
185
+ #pdf2md_converter.output_dir_string = output_dir_string ##SMY: attempt setting directly to resolve pool.map iterable
186
+
187
+ # Create Marker output_dir in temporary directory where Gradio can access it.
188
+ output_dir = file_utils.create_temp_folder(output_dir_string)
189
+ pdf2md_converter.output_dir = output_dir
190
+
191
+ logger.info(f"βœ“ output_dir created: ", extra={"output_dir": pdf2md_converter.output_dir.name, "in": str(pdf2md_converter.output_dir.parent)})
192
+ yield gr.update(interactive=False), f"Created output_dir ...", {"process": "Processing files ..."}, f"dummy_log.log"
193
+ progress((6,16), desc=f"βœ“ Created output_dir.")
194
+ except Exception as exc:
195
+ tb = traceback.format_exc()
196
+ tbp = traceback.print_exc() # Print the exception traceback
197
+ logger.exception("βœ— error creating output_dir β†’ {exc}\n{tb}", exc_info=True) # Log the full traceback
198
+
199
+ # Update the Gradio UI to improve user-friendly eXperience
200
+ yield gr.update(interactive=True), f"βœ— An error occurred creating output_dir: {str(exc)}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
201
+ return f"An error occurred creating output_dir: {str(exc)}", f"Error: {exc}", f"Error: {exc}" ## return the exception message
202
+
203
+ # Process file conversion leveraging ProcessPoolExecutor for efficiency
204
  try:
205
  results = [] ## initialised pool result holder
 
206
  logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string}) #pdf_files_count
 
207
  yield gr.update(interactive=False), f"Initialising ProcessPoolExecutor: Processing Files ...", {"process": "Processing files ..."}, f"dummy_log.log"
208
+ progress((7,16), desc=f"Initialising ProcessPoolExecutor: Processing Files ...")
209
 
210
+ # Create a pool with init_worker initialiser
211
  with ProcessPoolExecutor(
212
  max_workers=max_workers,
213
  initializer=init_worker,
214
  initargs=init_args
215
  ) as pool:
216
+ logger.log(level=30, msg="Initialising ProcessPoolExecutor: pool:", extra={"pdf_files": pdf_files, "files_len": len(pdf_files), "model_id": model_id, "output_dir": output_dir_string}) #pdf_files_count
217
+ progress((8,16), desc=f"Starting ProcessPool queue: Processing Files ...")
218
+
 
          # Map the files (pdf_files) to the conversion function (pdf2md_converter.convert_file)
          # The 'docconverter' argument is implicitly handled by the initialiser
          #futures = [pool.map(pdf2md_converter.convert_files, f) for f in pdf_files]
          #logs = [f.result() for f in as_completed(futures)]
          #futures = [pool.submit(pdf2md_converter.convert_files, file) for file in pdf_files]
          #logs = [f.result() for f in futures]
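The pool above uses the initializer pattern: each worker process builds the heavyweight Marker converter once, instead of pickling it for every task. A self-contained sketch of the same mechanics (the worker function and its arguments are illustrative, not the repo's):

```python
from concurrent.futures import ProcessPoolExecutor

_worker_state = {}

def init_worker(model_id: str, use_llm: bool):
    """Runs once per worker process; stash expensive state in a module global."""
    _worker_state["converter"] = f"converter({model_id}, use_llm={use_llm})"

def convert_one(path: str) -> str:
    return f"{_worker_state['converter']} -> {path}.md"

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2, initializer=init_worker,
                             initargs=("marker-model", False)) as pool:
        print(list(pool.map(convert_one, ["a.pdf", "b.pdf"])))
```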
 
          try:
              yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion ...", {"process": "Processing files ..."}, f"dummy_log.log"
+ progress((9,16), desc=f"ProcessPoolExecutor: Pooling file conversion ...")
+
              # Use progress.tqdm to integrate with the executor map
              #results = pool.map(pdf2md_converter.convert_files, pdf_files) ##SMY iterables #max_retries #output_dir_string)
              for result_interim in progress.tqdm(
+ iterable=pool.map(pdf2md_converter.convert_files, pdf_files) #, max_retries), total=len(pdf_files)
              ):
                  results.append(result_interim)

                  # Update the Gradio UI to improve user-friendly eXperience
                  yield gr.update(interactive=True), f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
+ progress((10,16), desc=f"ProcessPoolExecutor: Pooling file conversion result: [{str(result_interim)}[:20]]")

+ yield gr.update(interactive=True), f"ProcessPoolExecutor: Got Results from files conversion: [{str(result_interim)}[:20]]", {"process": "Processing files ..."}, f"dummy_log.log"
+ progress((11,16), desc=f"ProcessPoolExecutor: Got Results from files conversion")
          except Exception as exc:
              # Raise the exception to stop the Gradio app: exception to halt execution
              logger.exception("Error during pooling file conversion", exc_info=True) # Log the full traceback
+ tb = traceback.format_exc() # capture the traceback text (print_exc() returns None)

              # Update the Gradio UI to improve user-friendly eXperience
+ yield gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}\n{tb}"}, f"dummy_log.log" ## return the exception message
+ return [gr.update(interactive=True), f"An error occurred during pool.map: {str(exc)}", {"Error":f"Error: {exc}\n{tb}"}, f"dummy_log.log"] ## return the exception message
+ # Process file conversion results
      try:
+ progress((12,16), desc="Processing results from files conversion") ##rekickin
          logger.log(level=20, msg="ProcessPoolExecutor pool result:", extra={"results": str(results)})
+
          logs = []
          logs_files_images = []
+
          #logs.extend(results) ## performant pythonic
          #logs = list[results] ##
          logs = [result for result in results] ## pythonic list comprehension
+ # [template] ## logs : [file , images , filepath, image_path]

          #logs_files_images = logs_files.extend(logs_images) #zip(logs_files, logs_images) ##SMY: in progress
          logs_count = 0
 
          # Update the Gradio UI to improve user-friendly eXperience
          #yield gr.update(interactive=False), f"Processing files: {logs_files_images[logs_count]}", {"process": "Processing files"}, f"dummy_log.log"
          logs_count = i+i_image

      except Exception as exc:
+ tb = traceback.format_exc() # capture the traceback text (print_exc() returns None)
+ logger.exception(f"Error during processing results logs → {exc}\n{tb}", exc_info=True) # Log the full traceback
          return [gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] ## return the exception message
          #yield gr.update(interactive=True), f"An error occurred during processing results logs: {str(exc)}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" ## return the exception message
 
 
  except Exception as exc:
      tb = traceback.format_exc()
      logger.exception(f"✗ Error during ProcessPoolExecutor → {exc}\n{tb}" , exc_info=True) # Log the full traceback
      #traceback.print_exc() # Print the exception traceback
+ yield gr.update(interactive=True), f"✗ An error occurred during ProcessPoolExecutor → {exc}", {"Error":f"Error: {exc}"}, f"dummy_log.log" # return the exception message

+ # Zip Processed Files and images. Insert to first index
  try: ##from file_handler.file_utils
+ progress((13,16), desc="Zipping processed files and images")
+ zipped_processed_files = zip_processed_files(root_dir=f"{output_dir}", file_paths=logs_files_images, tz_hours=tz_hours, date_format='%d%b%Y_%H-%M-%S') #date_format='%d%b%Y'
      logs_files_images.insert(0, zipped_processed_files)

+ progress((14,16), desc="Zipped processed files and images")
      #yield gr.update(interactive=False), f"Processing zip and files: {logs_files_images}", {"process": "Processing files"}, f"dummy_log.log"

  except Exception as exc:
      tb = traceback.format_exc()
      logger.exception(f"✗ Error during zipping processed files → {exc}\n{tb}" , exc_info=True) # Log the full traceback
      #traceback.print_exc() # Print the exception traceback

      yield gr.update(interactive=True), f"✗ An error occurred during zipping files → {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" # return the exception message
+ return gr.update(interactive=True), f"✗ An error occurred during zipping files → {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" # return the exception message
  # Return processed files log
  try:
+ progress((15,16), desc="Formatting processed log results")
+
      ## # Convert logs list of dicts to formatted json string
      logs_return_formatted_json_string = file_handler.file_utils.process_dicts_data(logs) #"\n".join(log for log in logs) ##SMY outputs to gr.JSON component with no need for json.dumps(data, indent=)
+ #logs_files_images_return = "\n".join(path for path in logs_files_images) ##TypeError: sequence item 0: expected str instance, WindowsPath found

      ## # Convert any Path objects to strings, but leave strings as-is
      logs_files_images_return = list(str(path) if isinstance(path, Path) else path for path in logs_files_images)
      logger.log(level=20, msg="File conversion complete. Sending outcome to Gradio:", extra={"logs_files_image_return": str(logs_files_images_return)}) ## debug: FileNotFoundError: [WinError 2] The system cannot find the file specified: 'Error or no image_path'

+ progress((16,16), desc="Complete processing and formatting file processing results")
+ # [templates]
      #outputs=[process_button, log_output, files_individual_JSON, files_individual_downloads],
      #return "\n".join(logs), "\n".join(logs_files_images) #"\n".join(logs_files)
 
      tb = traceback.format_exc()
      logger.exception(f"✗ Error during returning result logs → {exc}\n{tb}" , exc_info=True) # Log the full traceback
      #traceback.print_exc() # Print the exception traceback
+ yield gr.update(interactive=True), f"✗ An error occurred during returning result logs → {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log" # return the exception message
+ return [gr.update(interactive=True), f"✗ An error occurred during returning result logs → {exc}\n{tb}", {"Error":f"Error: {exc}"}, f"dummy_log.log"] # return the exception message

  #return "\n".join(log for log in logs), "\n".join(str(path) for path in logs_files_images)
  #print(f'logs_files_images: {"\n".join(str(path) for path in logs_files_images)}')
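`convert_batch` is a generator handler: every `yield` pushes an interim `(button_state, log, json, file)` tuple to the four bound outputs, and the last `yield` carries the final payload. A stripped-down sketch of that contract (components and values are illustrative):

```python
import gradio as gr

def convert_batch_sketch(files: list[str]):
    yield gr.update(interactive=False), "Processing ...", {"process": "running"}, None
    logs = [f"{f}: ok" for f in files]       # stand-in for the conversion work
    yield gr.update(interactive=True), "\n".join(logs), {"process": "done"}, None

with gr.Blocks() as demo:
    btn = gr.Button("Process All Uploaded Files")
    log_output = gr.Textbox()
    files_json = gr.JSON()
    download = gr.File()
    btn.click(convert_batch_sketch, gr.State(["a.pdf"]), [btn, log_output, files_json, download])
```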
 
          #message = f"Accumulated {len(updated_files)} file(s) total.\n\nAll file paths:\n{file_info}"
          message = f"Accumulated {len(updated_files)} file(s) total: \n{filename_info}"

+ return updated_files, message, gr.update(interactive=True), gr.update(interactive=True)

      # with gr.Blocks(title=TITLE) as demo
      with gr.Blocks(title=TITLE, css=custom_css) as demo:
 
          )

          # Clean UI: Model parameters hidden in collapsible accordion
+ with gr.Accordion("⚙️ Marker Converter Settings", open=False):
              gr.Markdown(f"#### **Marker Configuration**")
              with gr.Row():
                  openai_base_url_tb = gr.Textbox(
 
                      value="webp",
                  )
                  output_format_dd = gr.Dropdown(
+ choices=["markdown", "html", "json"], #"chunks"], ##SMY: To be enabled later
                      #choices=["markdown", "html", "json", "chunks"],
                      label="Output Format",
                      value="markdown",
 
                      value=2,
                      step=1 #0.01
                  )
+ with gr.Column():
+ use_llm_cb = gr.Checkbox(
+ label="Use LLM for Marker conversion",
+ value=False
+ )
+ force_ocr_cb = gr.Checkbox(
+ label="Force OCR on all pages",
+ value=True,
+ )
                  page_range_tb = gr.Textbox(
                      label="Page Range (Optional)",
                      placeholder="Example: 0,1-5,8,12-15",
 
          btn_pdf_convert = gr.Button("Convert PDF(s)")
          '''

+ file_types_list.extend(file_types_tuple)
          with gr.Column(elem_classes=["file-or-directory-area"]):
              with gr.Row():
                  file_btn = gr.UploadButton(
                  #file_btn = gr.File(
                      label="Upload Multiple Files",
                      file_count="multiple",
+ file_types= file_types_list, #["file"], ##config.file_types_list
                      #height=25, #"sm",
                      size="sm",
                      elem_classes=["gradio-upload-btn"]
 
                  #dir_btn = gr.File(
                      label="Upload a Directory",
                      file_count="directory",
+ file_types= file_types_list, #["file"], #Warning: The `file_types` parameter is ignored when `file_count` is 'directory'
                      #height=25, #"0.5",
                      size="sm",
                      elem_classes=["gradio-upload-btn"]
 
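As the inline warning above notes, Gradio ignores `file_types` when `file_count="directory"`, so directory uploads still reach the server unfiltered. A small sketch of filtering accumulated paths against the same extension tuple (`file_types_tuple` is defined in `utils/config.py` in this commit):

```python
from pathlib import Path

file_types_tuple = (".pdf", ".html", ".docx", ".doc")  # from utils/config.py

def filter_supported(paths: list[str]) -> list[str]:
    """Keep only files whose extension is in the supported tuple (case-insensitive)."""
    return [p for p in paths if Path(p).suffix.lower() in file_types_tuple]

print(filter_supported(["a.PDF", "b.txt", "c.html"]))  # ['a.PDF', 'c.html']
```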
              output_textbox = gr.Textbox(label="Accumulated Files", lines=3) #, max_lines=4) #10

              with gr.Row():
+ process_button = gr.Button("Process All Uploaded Files", variant="primary", interactive=False)
+ clear_button = gr.Button("Clear All Uploads", variant="secondary", interactive=False)

          # --- PDF → Markdown tab ---
 
              """
              #msg = f"Files list cleared: {do_logout()}" ## use as needed
              msg = f"Files list cleared."
+ #yield [], msg, '', ''
              #return [], f"Files list cleared.", [], []
+ yield [], msg, None, None
+ return [], f"Files list cleared.", None, None

          #hf_login_logout_btn.click(fn=custom_do_logout, inputs=None, outputs=hf_login_logout_btn)
          ##unused
 
          file_btn.upload(
              fn=accumulate_files,
              inputs=[file_btn, uploaded_file_list],
+ outputs=[uploaded_file_list, output_textbox, process_button, clear_button]
          )

          # Event handler for the directory upload button
          dir_btn.upload(
              fn=accumulate_files,
              inputs=[dir_btn, uploaded_file_list],
+ outputs=[uploaded_file_list, output_textbox, process_button, clear_button]
          )

          # Event handler for the "Clear" button
 
              output_format_dd,
              output_dir_tb,
              use_llm_cb,
+ force_ocr_cb,
              page_range_tb,
              tz_hours_num, #state_tz_hours
          ]
utils/config.py CHANGED
@@ -28,6 +28,13 @@ DESCRIPTION_MD = (
      "Upload Markdown/LaTeX files and generate a polished PDF."
  )

  # Conversion defaults
  DEFAULT_MARKER_OPTIONS = {
      "include_images": True,
      "Upload Markdown/LaTeX files and generate a polished PDF."
  )

+ # File types
+ file_types_list = []
+ file_types_tuple = (".pdf", ".html", ".docx", ".doc")
+ #file_types_list = list[file_types_tuple]
+ #file_types_list.extend(file_types_tuple)
+
+
  # Conversion defaults
  DEFAULT_MARKER_OPTIONS = {
      "include_images": True,