# Document-Based Datasets

The **Document-Based** dataset type in UbiAI is designed to process scanned documents, PDFs, and images. Using an OCR (Optical Character Recognition) engine, UbiAI extracts the text from these scanned files, allowing for annotations and relationship creation between spans and words.

## Getting Started with Document-Based Datasets

Creating a Document-Based dataset in UbiAI involves several steps to ensure the accuracy and usability of your data.

## Define Labels

The process of defining labels for a Document-Based dataset is the same as explained on the previous page. You can add labels manually or use pre-set label sets to guide the annotation process.

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2FtGuGpVd7okVkfsEJP3fU%2FRecording%202025-01-10%20at%2018.51.11.gif?alt=media&#x26;token=b01eb5f4-c7ea-40f5-be98-faf20b86c181" alt=""><figcaption></figcaption></figure>

> * **Entity Labels**: These help identify and categorize specific entities such as people, locations, or organizations.
> * **Relationship Labels**: These define the connections between the entities in your document, such as linking "John" to "Paris" with a "lives\_in" relationship.
> * **Classification Labels**: These categorize the entire document using one of the following schemes: **Binary Classification**, **Single Classification**, or **Multi-Classification**.

## Upload Your Files

Once you have defined your labels, the next step is to upload the files that you want to annotate.

{% hint style="info" %}
Supported formats include:

* Text files
* HTML
* PDF
* JSON
* PNG
* JPEG, JPG
* ZIP
* Office formats (CSV, TSV)
  {% endhint %}

{% hint style="warning" %}
**Maximum File Size**: The maximum size for each uploaded file is 500MB.
{% endhint %}
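Before uploading in bulk, it can help to screen files locally against these constraints. The sketch below is a hypothetical helper of our own (not part of UbiAI) that checks a file's extension against the supported formats listed above and enforces the 500MB limit:

```python
import os

# Hypothetical local helper (not part of UbiAI): checks a file against the
# supported extensions and the 500MB size limit before uploading.
SUPPORTED_EXTENSIONS = {
    ".txt", ".html", ".pdf", ".json", ".png",
    ".jpeg", ".jpg", ".zip", ".csv", ".tsv",
}
MAX_FILE_SIZE = 500 * 1024 * 1024  # 500MB

def is_uploadable(file_path: str) -> bool:
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False
    return os.path.getsize(file_path) <= MAX_FILE_SIZE
```

Files that fail this check can be skipped or converted before upload, avoiding rejected requests.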

## Choose Pre-Annotation

UbiAI offers an automatic pre-annotation feature that can save you time during the annotation process. You can choose from the following options:

> * **With Pre-Annotation**: Select a pre-trained model like SpaCy or one of your previously trained models on UbiAI to auto-label your documents.
> * **No Pre-Annotation**: If you prefer to annotate your documents manually, you can select this option to upload documents with no pre-set annotations.

## Choose OCR Engine

Since the dataset involves documents or images, you will need to select an OCR engine to extract the text. You can choose from the following OCR engines:

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2FZ7uePBHz4aeMeryU8yfg%2FRecording%202025-01-10%20at%2019.22.32.gif?alt=media&#x26;token=12e03464-fde1-4e39-aee2-18fe08157f2d" alt=""><figcaption></figcaption></figure>

> * **Default OCR Engine**
> * **Amazon Textract**
> * **Google OCR**
> * **Microsoft Azure**

You can choose how the OCR engine will process your documents:

> * **Block**: Extract text in blocks.
> * **Line**: Extract text line by line.

If your document contains tables, you can choose to extract them (this option is only available for the Microsoft Azure OCR engine). Click **Upload** and wait for the pre-processing to complete.

## Annotate and Edit

Once your documents are uploaded and pre-annotations are applied (if selected), you can start the annotation process.

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2FIvwzoMJB01IsuE9FJu05%2FRecording%202025-01-10%20at%2019.24.35.gif?alt=media&#x26;token=5c6a3de3-2b4b-4561-9e57-30f3f6bb3a38" alt=""><figcaption></figcaption></figure>

> * **If Pre-Annotation Was Selected**: You can edit the pre-filled annotations to ensure they align with your needs. Review and modify them to ensure accuracy.
> * **If No Pre-Annotation Was Selected**: Begin annotating manually by selecting the appropriate labels and dragging them over the text. For entity and relationship annotations, simply click and drag over the relevant portions of the text.

For **classification annotations**, you can annotate the entire text by deciding which class it belongs to (the default is binary classification if no other label is chosen).

## Annotation Prediction Tool

If you need help with the annotation process, UbiAI provides an annotation assistance tool. To use this tool:

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2F555AL1YOvBlQyctRK5PU%2FRecording%202025-01-10%20at%2019.26.18.gif?alt=media&#x26;token=f5c20699-408f-4643-9b74-e6cb8cee9a29" alt=""><figcaption></figcaption></figure>

> * **Access the Tool**: From the top right corner, click on the drop-down menu next to the **Predict** button. Select **Configure Prediction**.
> * **Choose a Task**: Choose the type of task (e.g., relation extraction, NER, span categorizer, text classification).
> * **Select a Model**: Pick an external model (e.g., from Hugging Face) by providing a URL, or select one of UbiAI’s internal models like GPT.
> * **Click Predict**: Once you’ve selected the model and task, click on **Predict**. The platform will auto-annotate the text based on the selected model.

You can then edit the results and validate your dataset.

## Export or Expand Your Dataset

Once your dataset is validated and ready, you can choose to either add more documents or export your annotated dataset.

> * **Add More Data**: If you wish to expand your dataset, simply upload new documents and add them to your existing dataset.
> * **Export the Dataset**: To use your annotated dataset outside of UbiAI, click on **Export**. You can filter the data based on specific labels and select a split ratio.

{% hint style="info" %}
Supported formats include:

* Amazon Comprehend
* JSON
* SpaCy
* Text Classification Format
* Relations Format
* OCR Format
* Stanford CoreNLP
* IOB Format
  {% endhint %}

{% hint style="info" %}
A zip file containing the annotations along with the documents used during annotation will be downloaded. You will need to unzip the file before using the annotations to train a model.
{% endhint %}

{% hint style="warning" %}
For macOS users, it is recommended to unzip the file using WinZip in order to preserve file names.
{% endhint %}

## Using the UbiAI API for Document-Based Datasets

<figure><img src="https://3570583889-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYPwHMAbmrSLa2qrRaRPy%2Fuploads%2FR5YeEszPnkMjhMDIhHBS%2FRecording%202025-01-10%20at%2018.46.21.gif?alt=media&#x26;token=e7ecbb3f-3bd0-4bf8-be75-dc781eac0e05" alt=""><figcaption></figcaption></figure>

### Upload Files with API

To upload files using the UbiAI API, you can use the following code:

{% hint style="info" %}
If you would like to pre-annotate your files, check the "Auto annotate while uploading" box and select the method of pre-annotation as shown below:
{% endhint %}

```python
import requests
import json
import mimetypes
import os

# The upload endpoint expects the API token and file type appended to the URL.
url = "https://api.ubiai.tools:8443/api_v1/upload"
my_token = "Your_Token"
# Supported types: json, image, csv, zip, text_docs
file_type = "/json"

list_of_file_path = ['']  # paths of the local files to upload
urls = []
files = []
for file_path in list_of_file_path:
    files.append((
        'file',
        (os.path.basename(file_path),
         open(file_path, 'rb'),
         mimetypes.guess_type(file_path)[0]),
    ))

data = {
    'autoAssignToCollab': False,
    'taskType': 'TASK',
    'nbUsersPerDoc': '',
    'selectedUsers': '',
    'filesUrls': urls,
}

response = requests.post(url + my_token + file_type, files=files, data=data)
print(response.status_code)
res = json.loads(response.content.decode("utf-8"))
print(res)
```
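When uploading a whole folder of scanned documents, building `list_of_file_path` by hand is tedious. The sketch below is a small helper of our own (not part of the UbiAI API) that collects PDFs and images from a local directory so the resulting paths can be passed to the upload code above:

```python
import glob
import os

# Hypothetical helper (not part of the UbiAI API): gather every PDF and image
# in a folder so the paths can be used as list_of_file_path for the upload.
def collect_files(folder, patterns=("*.pdf", "*.png", "*.jpg", "*.jpeg")):
    paths = []
    for pattern in patterns:
        paths.extend(glob.glob(os.path.join(folder, pattern)))
    return sorted(paths)
```

For example, `collect_files("scans/")` would return the sorted paths of all PDFs and images in the `scans/` folder.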

### Export Files with API

To export files using the UbiAI API, you can use this code:

```python
import requests
import json

# The download endpoint expects the API token and export format appended to the URL.
url = "https://api.ubiai.tools:8443/api_v1/download"
my_token = "Your_Token"

# Available export formats (format code, serialization):
# ('aws', 'Lists')
# ('spacy', 'Json')
# ('DocBin_NER', 'Json')
# ('spacy_training', 'Json')
# ('classification', 'Json')
# ('ocr1', ''), ('ocr2', ''), ('ocr3', '')
# ('stanford', '')
# ('iob', '')
# ('iob_pos', '')
# ('iob_chatbot', '')
file_type = "/json"

split_ratio = ""  # optional train/test split ratio
params = {'splitRatio': split_ratio}

response = requests.get(url + my_token + file_type, params=params)
print(response.status_code)
res = response.content.decode("utf-8")
print(res)
```
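As noted earlier, exports are delivered as a zip archive that must be unzipped before training. A minimal sketch of handling the downloaded bytes programmatically (the helper name is our own; pass it `response.content` from the export request):

```python
import io
import zipfile

# Hypothetical helper: extract an exported annotation archive, given the raw
# bytes of the download response, and return the names of its members.
def save_and_extract(content, out_dir="annotations"):
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

This writes the annotation files and source documents into `out_dir`, ready to be used for model training.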
