Document-Based Datasets
The Document-Based dataset type in UbiAI is designed to process scanned documents, PDFs, and images. Using an OCR (Optical Character Recognition) engine, UbiAI extracts the text from these scanned files, allowing for annotations and relationship creation between spans and words.
Creating a Document-Based dataset in UbiAI involves several steps to ensure the accuracy and usability of your data.
The process of defining labels for a Document-Based dataset is the same as explained on the previous page. You can add labels manually or use predefined label sets to guide the annotation process.
Entity Labels: These help identify and categorize specific entities such as people, locations, or organizations.
Relationship Labels: These define the connections between the entities in your document, such as linking "John" to "Paris" with a "lives_in" relationship.
Classification Labels: These categorize the entire document. Three classification modes are available: Binary Classification, Single Classification, and Multi-Classification.
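The three label kinds can be pictured as a small schema. The sketch below is only an illustration: the `LabelSchema` and `Relation` classes, the mode names, and the example labels `invoice` and `contract` are hypothetical and are not part of UbiAI.

```python
from dataclasses import dataclass, field

# The three classification modes named in the text (identifiers are assumptions).
CLASSIFICATION_MODES = {"binary", "single", "multi"}


@dataclass
class LabelSchema:
    """Hypothetical container for entity, relationship, and classification labels."""
    entity_labels: list[str] = field(default_factory=list)
    relation_labels: list[str] = field(default_factory=list)
    classification_labels: list[str] = field(default_factory=list)
    classification_mode: str = "binary"

    def __post_init__(self):
        # Reject modes other than the three listed above.
        if self.classification_mode not in CLASSIFICATION_MODES:
            raise ValueError(f"unknown classification mode: {self.classification_mode}")


@dataclass
class Relation:
    """A labeled link from one entity (head) to another (tail)."""
    head: str
    tail: str
    label: str


schema = LabelSchema(
    entity_labels=["PERSON", "LOCATION", "ORGANIZATION"],
    relation_labels=["lives_in"],
    classification_labels=["invoice", "contract"],  # illustrative labels only
    classification_mode="single",
)

# The example from the text: "John" linked to "Paris" with "lives_in".
relation = Relation(head="John", tail="Paris", label="lives_in")
```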
Once you have defined your labels, the next step is to upload the documents that you want to annotate.
Maximum File Size: The maximum size for each uploaded file is 500 MB.
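Checking file sizes locally before uploading can save a failed upload. This is a minimal sketch; it assumes that 1 MB means 1024 * 1024 bytes, which the text does not specify, and `oversized_files` is a hypothetical helper name.

```python
import os

# 500 MB limit from the text; the bytes-per-MB convention is an assumption.
MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024


def oversized_files(paths, limit=MAX_FILE_SIZE_BYTES):
    """Return the paths whose size on disk exceeds the limit."""
    return [path for path in paths if os.path.getsize(path) > limit]
```

Calling `oversized_files(["scan1.pdf", "scan2.png"])` returns the files that should be split or compressed before upload.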
UbiAI offers an automatic pre-annotation feature that can save you time during the annotation process. You can choose from the following options:
With Pre-Annotation: Select a pre-trained model, such as a spaCy model or one of the models you previously trained on UbiAI, to auto-label your documents.
No Pre-Annotation: If you prefer to annotate your documents manually, you can select this option to upload documents with no pre-set annotations.
Since the dataset involves documents or images, you will need to select an OCR engine to extract the text. You can choose from the following OCR engines:
Default OCR Engine
Amazon Textract
Google OCR
Microsoft Azure
You can choose how the OCR engine will process your documents:
Block: Extract text in blocks.
Line: Extract text line by line.
If your document contains tables, you can choose to extract them. This option is only available for the Microsoft Azure OCR engine.
Click Upload and wait for the pre-processing to complete.
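The upload choices described above (pre-annotation, OCR engine, extraction mode, table extraction) interact in one way: table extraction requires Microsoft Azure. The sketch below captures that rule. The `UploadOptions` class, the option identifiers, and the default values are hypothetical; in UbiAI these choices are made in the upload dialog.

```python
from dataclasses import dataclass
from typing import Optional

# Identifiers are assumptions that mirror the engine and mode names in the text.
OCR_ENGINES = {"default", "amazon_textract", "google_ocr", "microsoft_azure"}
EXTRACTION_MODES = {"block", "line"}


@dataclass
class UploadOptions:
    ocr_engine: str = "default"
    extraction_mode: str = "line"  # arbitrary default chosen for this sketch
    extract_tables: bool = False
    pre_annotation_model: Optional[str] = None  # None means "No Pre-Annotation"

    def validate(self):
        """Return a list of problems; an empty list means the options are consistent."""
        problems = []
        if self.ocr_engine not in OCR_ENGINES:
            problems.append(f"unknown OCR engine: {self.ocr_engine}")
        if self.extraction_mode not in EXTRACTION_MODES:
            problems.append(f"unknown extraction mode: {self.extraction_mode}")
        if self.extract_tables and self.ocr_engine != "microsoft_azure":
            problems.append("table extraction requires the Microsoft Azure OCR engine")
        return problems
```

For example, `UploadOptions(ocr_engine="google_ocr", extract_tables=True).validate()` reports the table-extraction conflict instead of raising, so all problems can be shown at once.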
Once your documents are uploaded and pre-annotations are applied (if selected), you can start the annotation process.
If Pre-Annotation Was Selected: You can edit the pre-filled annotations to ensure they align with your needs. Review and modify them to ensure accuracy.
If No Pre-Annotation Was Selected: Begin annotating manually. For entity and relationship annotations, select the appropriate label, then click and drag over the relevant portion of the text.
For classification annotations, you annotate the entire text by deciding which class it belongs to (the default is binary classification if no other label is chosen).
If you need help with the annotation process, UbiAI provides an annotation assistance tool. To use this tool:
Access the Tool: From the top right corner, click on the drop-down menu next to the Predict button. Select Configure Prediction.
Choose a Task: Choose the type of task (e.g., relation extraction, NER, span categorizer, text classification).
Select a Model: Pick an external model (e.g., from Hugging Face) by providing a URL, or select one of UbiAI’s internal models like GPT.
Click Predict: Once you’ve selected the model and task, click on Predict. The platform will auto-annotate the text based on the selected model.
You can then edit the results and validate your dataset.
Once your dataset is validated and ready, you can choose to either add more documents or export your annotated dataset.
Add More Data: If you wish to expand your dataset, simply upload new documents and add them to your existing dataset.
Export the Dataset: To use your annotated dataset outside of UbiAI, click on Export. You can filter the data based on specific labels and select a split ratio.
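The effect of a label filter and a split ratio can be sketched locally. This is not UbiAI's export logic: the document structure (a dict with an `annotations` list of `{"label": ...}` items), the "at least one matching label" rule, and both function names are assumptions made for illustration.

```python
import random


def filter_by_labels(documents, wanted_labels):
    """Keep documents that carry at least one annotation with a wanted label."""
    wanted = set(wanted_labels)
    return [
        doc for doc in documents
        if wanted & {annotation["label"] for annotation in doc["annotations"]}
    ]


def split_dataset(documents, train_ratio=0.8, seed=0):
    """Shuffle deterministically, then cut into a training part and a test part."""
    if not 0.0 <= train_ratio <= 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With ten documents and `train_ratio=0.8`, `split_dataset` returns eight training documents and two test documents; fixing `seed` keeps the split reproducible.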
For macOS users, it is recommended to unzip the file using WinZip in order to preserve file names.
To upload files using the UbiAI API, you can use the following code:
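A minimal sketch of a file upload request, using only the Python standard library. Everything specific to UbiAI here is a placeholder, not documented API: the endpoint URL, any authentication headers, and the form field name `file` are assumptions, so take the real values from your UbiAI account. The function only builds the request; it does not send it.

```python
import mimetypes
import os
import urllib.request
import uuid


def build_upload_request(endpoint_url, file_path, field_name="file", extra_headers=None):
    """Build (but do not send) a multipart/form-data POST carrying one file.

    endpoint_url, field_name, and extra_headers are placeholders for whatever
    the real API requires; none of them come from UbiAI's documentation.
    """
    boundary = uuid.uuid4().hex
    filename = os.path.basename(file_path)
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    with open(file_path, "rb") as handle:
        payload = handle.read()
    # Assemble a single-part multipart body by hand.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    headers.update(extra_headers or {})
    return urllib.request.Request(endpoint_url, data=body, headers=headers, method="POST")
```

Once the real endpoint and credentials are filled in, the request can be sent with `urllib.request.urlopen(request)`.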
To export files using the UbiAI API, you can use this code:
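A minimal sketch of downloading and unpacking an export, using only the Python standard library. It assumes the export is delivered as a ZIP archive from a single URL; the URL, any authentication headers, and both function names are placeholders, not documented UbiAI API.

```python
import shutil
import urllib.request
import zipfile


def download_export(export_url, destination_path, extra_headers=None, timeout=60):
    """Stream the response body of a GET request into a local file."""
    request = urllib.request.Request(export_url, headers=extra_headers or {}, method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(destination_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination_path


def extract_export(archive_path, target_dir):
    """Unpack the downloaded archive and return the names of its members."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Streaming with `shutil.copyfileobj` avoids holding a large export in memory before it is written to disk.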