# Document-Based Datasets

The **Document-Based** dataset type in UbiAI is designed to process scanned documents, PDFs, and images. Using an OCR (Optical Character Recognition) engine, UbiAI extracts the text from these scanned files, allowing for annotations and relationship creation between spans and words.

## Getting Started with Document-Based Datasets

Creating a Document-Based dataset in UbiAI involves several steps to ensure the accuracy and usability of your data.

## Define Labels

The process of defining labels for a Document-Based dataset is the same as explained on the previous page. You can add labels manually or use pre-set label sets to guide the annotation process.

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2FtGuGpVd7okVkfsEJP3fU%2FRecording%202025-01-10%20at%2018.51.11.gif?alt=media&#x26;token=b01eb5f4-c7ea-40f5-be98-faf20b86c181" alt=""><figcaption></figcaption></figure>

> * **Entity Labels**: These help identify and categorize specific entities such as people, locations, or organizations.
> * **Relationship Labels**: These define the connections between the entities in your document, such as linking "John" to "Paris" with a "lives\_in" relationship.
> * **Classification Labels**: These categorize the entire document using one of the following schemes: **Binary Classification**, **Single Classification**, or **Multi-Classification**.

## Upload Your Files

Once you have defined your labels, the next step is to upload the files that you want to annotate.

{% hint style="info" %}
Supported formats include:

* Text files
* HTML
* PDF
* JSON
* PNG
* JPEG, JPG
* ZIP
* Office formats (CSV, TSV)
  {% endhint %}

{% hint style="warning" %}
**Maximum File Size**: The maximum size for each uploaded file is 500MB.
{% endhint %}
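Before uploading in bulk, it can help to screen files locally against these constraints. The sketch below is a hypothetical helper of our own (not part of UbiAI) that checks a file's extension against the supported formats listed above and enforces the 500MB limit:

```python
import os

# Hypothetical local helper (not part of UbiAI): checks a file against the
# supported extensions and the 500MB size limit before uploading.
SUPPORTED_EXTENSIONS = {
    ".txt", ".html", ".pdf", ".json", ".png",
    ".jpeg", ".jpg", ".zip", ".csv", ".tsv",
}
MAX_FILE_SIZE = 500 * 1024 * 1024  # 500MB

def is_uploadable(file_path: str) -> bool:
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False
    return os.path.getsize(file_path) <= MAX_FILE_SIZE
```

Files that fail this check can be skipped or converted before upload, avoiding rejected requests.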

## Choose Pre-Annotation

UbiAI offers an automatic pre-annotation feature that can save you time during the annotation process. You can choose from the following options:

> * **With Pre-Annotation**: Select a pre-trained model like SpaCy or one of your previously trained models on UbiAI to auto-label your documents.
> * **No Pre-Annotation**: If you prefer to annotate your documents manually, you can select this option to upload documents with no pre-set annotations.

## Choose OCR Engine

Since the dataset involves documents or images, you will need to select an OCR engine to extract the text. You can choose from the following OCR engines:

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2FZ7uePBHz4aeMeryU8yfg%2FRecording%202025-01-10%20at%2019.22.32.gif?alt=media&#x26;token=12e03464-fde1-4e39-aee2-18fe08157f2d" alt=""><figcaption></figcaption></figure>

> * **Default OCR Engine**
> * **Amazon Textract**
> * **Google OCR**
> * **Microsoft Azure**

You can choose how the OCR engine will process your documents:

> * **Block**: Extract text in blocks.
> * **Line**: Extract text line by line.

If your document contains tables, you can choose to extract them (this option is only available for the Microsoft Azure OCR engine). Click **Upload** and wait for the pre-processing to complete.

## Annotate and Edit

Once your documents are uploaded and pre-annotations are applied (if selected), you can start the annotation process.

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2FIvwzoMJB01IsuE9FJu05%2FRecording%202025-01-10%20at%2019.24.35.gif?alt=media&#x26;token=5c6a3de3-2b4b-4561-9e57-30f3f6bb3a38" alt=""><figcaption></figcaption></figure>

> * **If Pre-Annotation Was Selected**: You can edit the pre-filled annotations to ensure they align with your needs. Review and modify them to ensure accuracy.
> * **If No Pre-Annotation Was Selected**: Begin annotating manually by selecting the appropriate labels and dragging them over the text. For entity and relationship annotations, simply click and drag over the relevant portions of the text.

For **classification annotations**, you can annotate the entire text by deciding which class it belongs to (the default is binary classification if no other label is chosen).

## Annotation Prediction Tool

If you need help with the annotation process, UbiAI provides an annotation assistance tool. To use this tool:

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2F555AL1YOvBlQyctRK5PU%2FRecording%202025-01-10%20at%2019.26.18.gif?alt=media&#x26;token=f5c20699-408f-4643-9b74-e6cb8cee9a29" alt=""><figcaption></figcaption></figure>

> * **Access the Tool**: From the top right corner, click on the drop-down menu next to the **Predict** button. Select **Configure Prediction**.
> * **Choose a Task**: Choose the type of task (e.g., relation extraction, NER, span categorizer, text classification).
> * **Select a Model**: Pick an external model (e.g., from Hugging Face) by providing a URL, or select one of UbiAI’s internal models like GPT.
> * **Click Predict**: Once you’ve selected the model and task, click on **Predict**. The platform will auto-annotate the text based on the selected model.

You can then edit the results and validate your dataset.

## Export or Expand Your Dataset

Once your dataset is validated and ready, you can choose to either add more documents or export your annotated dataset.

> * **Add More Data**: If you wish to expand your dataset, simply upload new documents and add them to your existing dataset.
> * **Export the Dataset**: To use your annotated dataset outside of UbiAI, click on **Export**. You can filter the data based on specific labels and select a split ratio.

{% hint style="info" %}
Supported formats include:

* Amazon Comprehend
* JSON
* SpaCy
* Text Classification Format
* Relations Format
* OCR Format
* Stanford CoreNLP
* IOB Format
  {% endhint %}

{% hint style="info" %}
A zip file containing the annotations along with the documents used during annotation will be downloaded. You will need to unzip the file before using the annotations to train a model.
{% endhint %}

{% hint style="warning" %}
For macOS users, it is recommended to unzip the file using WinZip in order to preserve file names.
{% endhint %}

## Using the UbiAI API for Document-Based Datasets

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2FR5YeEszPnkMjhMDIhHBS%2FRecording%202025-01-10%20at%2018.46.21.gif?alt=media&#x26;token=e7ecbb3f-3bd0-4bf8-be75-dc781eac0e05" alt=""><figcaption></figcaption></figure>

### Upload Files with API

To upload files using the UbiAI API, you can use the following code:

{% hint style="info" %}
If you would like to pre-annotate your files, check the "Auto annotate while uploading" box and select the method of pre-annotation as shown below:
{% endhint %}

```python
import requests
import json
import mimetypes
import os

# The upload endpoint expects the API token and file type appended to the URL.
url = "https://api.ubiai.tools:8443/api_v1/upload"
my_token = "Your_Token"
# Supported types: json, image, csv, zip, text_docs
file_type = "/json"

list_of_file_path = ['']  # paths of the local files to upload
urls = []
files = []
for file_path in list_of_file_path:
    files.append((
        'file',
        (os.path.basename(file_path),
         open(file_path, 'rb'),
         mimetypes.guess_type(file_path)[0]),
    ))

data = {
    'autoAssignToCollab': False,
    'taskType': 'TASK',
    'nbUsersPerDoc': '',
    'selectedUsers': '',
    'filesUrls': urls,
}

response = requests.post(url + my_token + file_type, files=files, data=data)
print(response.status_code)
res = json.loads(response.content.decode("utf-8"))
print(res)
```
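When uploading a whole folder of scanned documents, building `list_of_file_path` by hand is tedious. The sketch below is a small helper of our own (not part of the UbiAI API) that collects PDFs and images from a local directory so the resulting paths can be passed to the upload code above:

```python
import glob
import os

# Hypothetical helper (not part of the UbiAI API): gather every PDF and image
# in a folder so the paths can be used as list_of_file_path for the upload.
def collect_files(folder, patterns=("*.pdf", "*.png", "*.jpg", "*.jpeg")):
    paths = []
    for pattern in patterns:
        paths.extend(glob.glob(os.path.join(folder, pattern)))
    return sorted(paths)
```

For example, `collect_files("scans/")` would return the sorted paths of all PDFs and images in the `scans/` folder.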

### Export Files with API

To export files using the UbiAI API, you can use this code:

```python
import requests
import json

# The download endpoint expects the API token and export format appended to the URL.
url = "https://api.ubiai.tools:8443/api_v1/download"
my_token = "Your_Token"

# Available export formats (format code, serialization):
# ('aws', 'Lists')
# ('spacy', 'Json')
# ('DocBin_NER', 'Json')
# ('spacy_training', 'Json')
# ('classification', 'Json')
# ('ocr1', ''), ('ocr2', ''), ('ocr3', '')
# ('stanford', '')
# ('iob', '')
# ('iob_pos', '')
# ('iob_chatbot', '')
file_type = "/json"

split_ratio = ""  # optional train/test split ratio
params = {'splitRatio': split_ratio}

response = requests.get(url + my_token + file_type, params=params)
print(response.status_code)
res = response.content.decode("utf-8")
print(res)
```
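As noted earlier, exports are delivered as a zip archive that must be unzipped before training. A minimal sketch of handling the downloaded bytes programmatically (the helper name is our own; pass it `response.content` from the export request):

```python
import io
import zipfile

# Hypothetical helper: extract an exported annotation archive, given the raw
# bytes of the download response, and return the names of its members.
def save_and_extract(content, out_dir="annotations"):
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

This writes the annotation files and source documents into `out_dir`, ready to be used for model training.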
