KIFF/pyannote-speaker-diarization-endpoint

@@ -43,27 +43,78 @@ with open("audio.rttm", "w") as rttm:
 ## Advanced usage
-In case the number of speakers is known in advance, one can use the `num_speakers` option:
 ```python
-diarization = pipeline("audio.wav", num_speakers=2)
 ```
-One can also provide lower and/or upper bounds on the number of speakers using `min_speakers` and `max_speakers` options:
 ```python
-diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
 ```
-If you feel adventurous, you can try and play with the various pipeline hyper-parameters.
-For instance, one can use a more aggressive voice activity detection by increasing the value of `segmentation_onset` threshold:
 ```python
-hparams = pipeline.parameters(instantiated=True)
 hparams["segmentation_onset"] += 0.1
-pipeline.instantiate(hparams)
 ```
 ## Benchmark
 ### Real-time factor

 ## Advanced usage
+If the number of speakers is known in advance, pass `num_speakers` in the `parameters` dictionary:
 ```python
+handler = EndpointHandler()
+diarization = handler({"inputs": base64_audio, "parameters": {"num_speakers": 2}})
 ```
+You can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` parameters:
 ```python
+handler = EndpointHandler()
+diarization = handler({"inputs": base64_audio, "parameters": {"min_speakers": 2, "max_speakers": 5}})
 ```
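The `base64_audio` value used in the calls above is not defined in this README. As a minimal sketch, assuming the input is a 16 kHz, mono, 16-bit WAV file (which is what the handler's `np.int16` decoding and `SAMPLE_RATE` constant imply), the request payload could be built as follows. The helper names `wav_to_base64_pcm` and `make_payload` are hypothetical and not part of this repository.

```python
import base64
import io
import wave

SAMPLE_RATE = 16000  # assumed to match the SAMPLE_RATE constant in the handler


def wav_to_base64_pcm(wav_bytes: bytes) -> str:
    """Return the raw 16-bit mono PCM frames of a WAV file, base64-encoded.

    Hypothetical helper: it strips the WAV header because the handler
    interprets the decoded bytes directly as int16 samples.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != SAMPLE_RATE:
            raise ValueError(f"expected {SAMPLE_RATE} Hz audio, got {wav.getframerate()} Hz")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        frames = wav.readframes(wav.getnframes())
    return base64.b64encode(frames).decode("ascii")


def make_payload(wav_bytes: bytes, **parameters) -> dict:
    """Build the {"inputs": ..., "parameters": ...} dictionary the handler expects."""
    payload = {"inputs": wav_to_base64_pcm(wav_bytes)}
    if parameters:
        # e.g. num_speakers=2, or min_speakers=2 and max_speakers=5
        payload["parameters"] = parameters
    return payload
```

With this sketch, `make_payload(open("audio.wav", "rb").read(), num_speakers=2)` would produce a dictionary shaped like the one passed to `handler` in the first example. Audio in another sample rate or format would need to be converted before encoding.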
+If you're feeling adventurous, you can experiment with the pipeline hyperparameters.
+For instance, you can make voice activity detection more aggressive by increasing the `segmentation_onset` threshold:
 ```python
+hparams = handler.pipeline.parameters(instantiated=True)
 hparams["segmentation_onset"] += 0.1
+handler.pipeline.instantiate(hparams)
+```
+The handler used for API inference, updated to accept the speaker-count parameters, is:
+```python
+import base64
+from typing import Any, Dict
+import numpy as np
+import torch
+from pyannote.audio import Pipeline
+SAMPLE_RATE = 16000
+class EndpointHandler:
+    def __init__(self, path=""):
+        # load the pretrained diarization pipeline
+        self.pipeline = Pipeline.from_pretrained("KIFF/pyannote-speaker-diarization-endpoint")
+    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
+        """
+        Args:
+            data (:obj:`dict`):
+                "inputs": base64-encoded raw 16-bit PCM audio, read at SAMPLE_RATE
+                "parameters" (optional): keyword arguments forwarded to the pipeline,
+                e.g. num_speakers, or min_speakers and max_speakers
+        Return:
+            A :obj:`dict` with a "diarization" list of {"label", "start", "stop"} entries
+        """
+        # process input
+        inputs = data.pop("inputs", data)
+        parameters = data.pop("parameters", None)  # e.g. {"min_speakers": 2, "max_speakers": 5}
+        # decode the base64 audio data
+        audio_data = base64.b64decode(inputs)
+        # copy: np.frombuffer returns a read-only view, which torch.from_numpy warns about
+        audio_nparray = np.frombuffer(audio_data, dtype=np.int16).copy()
+        # prepare pyannote input as a (channel, time) float tensor
+        audio_tensor = torch.from_numpy(audio_nparray).float().unsqueeze(0)
+        pyannote_input = {"waveform": audio_tensor, "sample_rate": SAMPLE_RATE}
+        # apply the pretrained pipeline, forwarding any parameters as keyword arguments
+        diarization = self.pipeline(pyannote_input, **(parameters or {}))
+        # postprocess the prediction into JSON-serializable speaker turns
+        processed_diarization = [
+            {"label": str(label), "start": str(segment.start), "stop": str(segment.end)}
+            for segment, _, label in diarization.itertracks(yield_label=True)
+        ]
+        return {"diarization": processed_diarization}
 ```
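The handler returns each speaker turn with `start` and `stop` serialized as strings, so a client has to convert them back to numbers before doing arithmetic. The following is a minimal sketch of one way to consume the response, here summing speech time per speaker. The function name `speech_time_per_speaker` is hypothetical, and the labels in the usage note are illustrative values only.

```python
from collections import defaultdict
from typing import Dict, List


def speech_time_per_speaker(response: Dict[str, List[Dict[str, str]]]) -> Dict[str, float]:
    """Sum the duration (in seconds) of all turns attributed to each label."""
    totals: Dict[str, float] = defaultdict(float)
    for turn in response["diarization"]:
        # start/stop arrive as strings, so parse them before subtracting
        totals[turn["label"]] += float(turn["stop"]) - float(turn["start"])
    return dict(totals)
```

For instance, a response containing two turns for one label lasting 1.5 seconds each would yield a total of 3.0 seconds for that label.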
 ## Benchmark
 ### Real-time factor