
Text-Based Datasets


Last updated 5 months ago

The Text-Based dataset type in UbiAI is designed to process textual data at either the word level or the character level, enabling efficient annotation of words, characters, and the relationships between them. This dataset type is ideal for tasks like Named Entity Recognition (NER), relationship extraction, and text classification.

Getting Started with Text-Based Datasets

Creating a Text-Based dataset in UbiAI is straightforward, with clear steps to guide you through the entire process.

Select Tokenization Type

The first step in creating your Text-Based dataset is selecting the type of tokenization that best suits your task. UbiAI provides two options for tokenization: Span-Based Tokenization and Character-Based Tokenization.

Span-Based Tokenization

In span-based tokenization, you define continuous spans of text as single entities or relationships.

  • Use Cases: Span-based tokenization is ideal for tasks like Named Entity Recognition, where you need to identify specific entities such as people, places, and organizations in a text.

  • Example: Consider the sentence “John went to Paris.” Here, "John" and "Paris" would be marked as separate entities, with "John" being classified as a person and "Paris" as a location. The span-based tokenization would clearly define these as individual entities.
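The example above can be sketched in code: a span-based annotation is a contiguous run of characters with start and end offsets plus a label. The field names below are illustrative, not UbiAI's actual schema.

```python
# Minimal sketch of a span-based annotation: each entity is a
# contiguous character span with start/end offsets and a label.
# (Field names are illustrative, not UbiAI's schema.)
text = "John went to Paris."

spans = [
    {"start": 0, "end": 4, "label": "PERSON"},      # "John"
    {"start": 13, "end": 18, "label": "LOCATION"},  # "Paris"
]

for span in spans:
    print(text[span["start"]:span["end"]], "->", span["label"])
```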

Character-Based Tokenization

Character-based tokenization splits text down to the individual characters, rather than words or spans of text.

  • Use Cases: This approach is useful for tasks that require a deeper understanding of how words are built or how individual characters contribute to the overall meaning, such as character-level language modeling or precise character recognition.

  • Example: For the word “John,” character-based tokenization would break it down into individual characters like “J”, “o”, “h”, and “n”.
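In code, character-based tokenization reduces to splitting a string into its individual characters, as this minimal sketch shows:

```python
# Character-based tokenization: split text into individual characters.
def char_tokenize(text: str) -> list[str]:
    return list(text)

print(char_tokenize("John"))  # ['J', 'o', 'h', 'n']
```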

Define Labels

Next, you’ll define labels for your dataset to guide the annotation process and achieve more accurate results. UbiAI offers several label options:

Entity Labels (Optional but Recommended)

Entity labels help in identifying and categorizing words or phrases that refer to specific entities (e.g., names, locations, organizations). These labels guide the annotation of entities in the text.

Relationship Labels (Optional but Recommended)

Relationship labels define the connections between entities in a sentence. For example, in a sentence like "John lives in Paris," the relationship could be "lives_in" linking the "John" and "Paris" entities.
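A common way to represent such a relationship is a (head, relation, tail) triple over annotated entities. The sketch below is illustrative and not UbiAI's internal format.

```python
# A relation links two annotated entities; one simple representation
# is a (head, relation, tail) triple. Illustrative sketch only.
entities = {"e1": "John", "e2": "Paris"}
relation = ("e1", "lives_in", "e2")

head, rel, tail = relation
print(f"{entities[head]} --{rel}--> {entities[tail]}")
```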

Classification Labels (Optional but Recommended)

Classification labels help in categorizing entire documents or sections of text. UbiAI offers several types of classification:

  • Binary Classification: Used when you need to classify text into two categories (e.g., Positive/Negative sentiment).

  • Single Classification: Applied when there is only one class per instance, and you need to assign a single label to each text (e.g., Topic classification where each text belongs to one of several predefined topics).

  • Multi-Classification: Useful when a piece of text can belong to multiple categories at once (e.g., a document that discusses both "Politics" and "Economics").
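The three modes above differ only in the shape of the label: binary and single classification assign exactly one label per text, while multi-classification allows several at once. The example records below are illustrative, not UbiAI's export schema.

```python
# Binary: one of exactly two classes.
binary_example = {"text": "Great product!", "label": "Positive"}

# Single: one of N predefined classes.
single_example = {"text": "Fed raises interest rates", "label": "Finance"}

# Multi: a list of classes, since a text may belong to several at once.
multi_example = {"text": "Budget bill passes", "labels": ["Politics", "Economics"]}
```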

Upload Your Textual Data

Once you have defined your labels, you’ll need to upload the textual data you wish to annotate. UbiAI supports a wide range of file formats, making it easy to import data from different sources.

Supported formats include:

  • Text files

  • HTML

  • PDF

  • JSON

  • Office formats (CSV, TSV)

  • ZIP files containing text-based documents

Maximum file size: Uploaded files must be strictly under 500MB.
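The 500MB limit can be checked locally before uploading; here is a minimal sketch using the standard library:

```python
import os

MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # 500 MB

def is_uploadable(file_path: str) -> bool:
    """Return True if the file is strictly under the 500 MB limit."""
    return os.path.getsize(file_path) < MAX_UPLOAD_BYTES
```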

Choose Pre-Annotation

UbiAI offers an automatic pre-annotation option to assist in labeling your documents. You can select between two options:

  • With Pre-Annotation: You can select a model for pre-annotation. UbiAI allows you to pick a model like SpaCy or one of your previously trained models to auto-label your documents. This saves time, especially for large datasets.

  • No Pre-Annotation: If you prefer to annotate manually, select this option. Your documents will be uploaded without any pre-set annotations, allowing you to annotate them from scratch.

Click on Upload and wait for the pre-processing to complete.

Annotate and Edit

Once your text is uploaded and pre-annotations are applied (if selected), you can start the annotation process.

  • If Pre-Annotation Was Selected: You can edit the pre-filled annotations to ensure they align with your needs. Review and modify them to ensure accuracy.

  • If No Pre-Annotation Was Selected: Begin annotating manually by selecting the appropriate labels and dragging them over the text.

Sometimes you are unsure how to annotate certain entities and wish to leave a note that you can revisit later. With UbiAI, you can add comments by right-clicking on the entity and selecting "Comments".

In addition to annotating individual entities, you can now assign key-value properties to each annotated entity by right-clicking and selecting "Properties List". This is useful for creating knowledge graphs where each entity might have multiple child nodes.
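Key-value properties map naturally onto node attributes in a knowledge graph. The sketch below shows one plausible shape for an annotated entity with properties; it is illustrative, not UbiAI's storage format.

```python
# An annotated entity with key-value properties, suitable as a
# knowledge-graph node. Illustrative sketch only.
entity = {
    "text": "Paris",
    "label": "LOCATION",
    "properties": {"country": "France", "type": "city"},
}

print(entity["text"], entity["properties"])
```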

Here are the annotation guidelines you need to know:

  • For Entity Annotations, simply choose the label, then click and drag over the relevant portion of the text. The selected text will be annotated and highlighted in color, and you can click on it to edit.

  • For Relation Annotations, select the relation button at the top of the annotation interface, then click on an existing relation or add a new one. Then click on two entities to create a relation between them.

  • For Classification Annotations, you can annotate the entire text by deciding which class it belongs to. Simply look for Classification in the top menu and click on the class to select it.

The default classification is binary classification if no other label is chosen.

Annotation Prediction Tool

If you need help with the annotation process, UbiAI provides an annotation assistance tool. To use this tool:

  • Access the Tool: From the top right corner, click on the drop-down menu next to the Predict button. Select Configure Prediction.

  • Choose a Task: Choose the type of task (e.g., relation extraction, NER, span categorizer, text classification).

  • Select a Model: Pick an external model (e.g., from Hugging Face) by providing a URL, or select one of UbiAI’s internal models like GPT.

  • Click Predict: Once you’ve selected the model and task, click on Predict. The platform will auto-annotate the text based on the selected model.

You can then edit the results and validate your dataset.

Export or Expand Your Dataset

Once your dataset is validated and ready, you can choose to either add more documents or export your annotated dataset.

  • Add More Data: If you wish to expand your dataset, simply upload new documents and add them to your existing dataset.

  • Export the Dataset: To use your annotated dataset outside of UbiAI, click on Export. You can filter the data based on specific labels and select a split ratio.

Supported formats include:

  • Amazon Comprehend

  • JSON

  • SpaCy

  • Text Classification Format

  • Relations Format

  • Stanford CoreNLP

  • IOB Format
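Of the formats above, IOB (Inside-Outside-Beginning) is a standard token-level tagging scheme: B- marks the first token of an entity, I- a continuation of the same entity, and O a token outside any entity. A sketch for "John went to Paris":

```python
# IOB tagging sketch: each token carries a B-, I-, or O tag.
tokens = ["John", "went", "to", "Paris"]
tags = ["B-PER", "O", "O", "B-LOC"]

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```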

A ZIP file containing the annotations, along with the documents used during annotation, will be downloaded. You will need to unzip the file before using the annotations to train a model.
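The unzipping step can also be scripted; here is a minimal sketch using Python's standard library (the file paths are hypothetical):

```python
import zipfile

def extract_export(zip_path: str, out_dir: str) -> list[str]:
    """Extract the exported archive and return the names of the files inside."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
        return zf.namelist()
```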

For macOS users, it is recommended to unzip the file using WinZip in order to preserve file names.

Using the UbiAI API for Text-Based Datasets

Upload files with API

You can select File Type and Upload using this API code:

If you would like to pre-annotate your files, check the "Auto annotate while uploading" box and select the method of pre-annotation:

import requests
import json
import mimetypes
import os

url = "https://api.ubiai.tools:8443/api_v1/upload"
my_token = "Your_Token"
# Available types: json, image, csv, zip, text_docs
file_type = "/json"

list_of_file_path = ['']
urls = []
files = []
for file_path in list_of_file_path:
    # Send each file as a multipart field with its name and MIME type.
    files.append((
        'file',
        (os.path.basename(file_path), open(file_path, 'rb'),
         mimetypes.guess_type(file_path)[0])
    ))

data = {
    'autoAssignToCollab': False,
    'taskType': 'TASK',
    'nbUsersPerDoc': '',
    'selectedUsers': '',
    'filesUrls': urls
}

response = requests.post(url + my_token + file_type, files=files, data=data)
print(response.status_code)
res = json.loads(response.content.decode("utf-8"))
print(res)

Export files with API

Select Export Type and Export using this API code:

import requests
import json

url = "https://api.ubiai.tools:8443/api_v1/download"
my_token = "Your_Token"
# Available export types:
# ('aws', 'Lists')
# ('spacy', 'Json')
# ('DocBin_NER', 'Json')
# ('spacy_training', 'Json')
# ('classification', 'Json')
# ('ocr1', ''), ('ocr2', ''), ('ocr3', '')
# ('stanford', '')
# ('iob', '')
# ('iob_pos', '')
# ('iob_chatbot', '')
file_type = "/json"
split_ratio = ""
params = {'splitRatio': split_ratio}

response = requests.get(url + my_token + file_type, params=params)
print(response.status_code)
res = response.content.decode("utf-8")
print(res)