Text-Based Datasets
The Text-Based dataset type in UbiAI processes text at either the word level or the character level, so annotations can be anchored to words or to individual characters. It is well suited to tasks such as Named Entity Recognition (NER), relationship extraction, and text classification.
Creating a Text-Based dataset in UbiAI is straightforward, with clear steps to guide you through the entire process.
The first step in creating your Text-Based dataset is selecting the type of tokenization that best suits your task. UbiAI provides two options for tokenization: Span-Based Tokenization and Character-Based Tokenization.
In span-based tokenization, you define continuous spans of text as single entities or relationships.
Use Cases: Span-based tokenization is ideal for tasks like Named Entity Recognition, where you need to identify specific entities such as people, places, and organizations in a text.
Example: Consider the sentence “John went to Paris.” Here, "John" and "Paris" would be marked as separate entities, with "John" being classified as a person and "Paris" as a location. The span-based tokenization would clearly define these as individual entities.
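As a rough illustration of the example above, a span annotation can be thought of as a start offset, an end offset, and a label. The sketch below is not UbiAI's internal or export format; the `Span` class, the `find_span` helper, and the label names `PERSON` and `LOCATION` are hypothetical and exist only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A contiguous stretch of text with a label (hypothetical structure)."""
    start: int  # index of the first character in the span
    end: int    # index one past the last character in the span
    label: str


def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of `phrase` and wrap it in a Span."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Span(start, start + len(phrase), label)


text = "John went to Paris."
spans = [
    find_span(text, "John", "PERSON"),
    find_span(text, "Paris", "LOCATION"),
]

for span in spans:
    # Slicing with the offsets recovers the annotated text.
    print(text[span.start:span.end], span.label)
```

Storing offsets rather than copies of the text keeps each annotation tied to an exact position, which matters when the same word appears more than once in a document.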
Character-based tokenization splits text down to the individual characters, rather than words or spans of text.
Use Cases: This approach is useful for tasks that depend on how words are built or on how individual characters contribute to meaning, such as character-level language modeling or precise character recognition.
Example: For the word “John,” character-based tokenization would break it down into individual characters like “J”, “o”, “h”, and “n”.
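The character-level split described above can be sketched in a few lines. This is only an illustration of the concept, not UbiAI's tokenizer; the `char_tokens` function name is an assumption.

```python
def char_tokens(text: str) -> list[tuple[int, str]]:
    """Pair every character with its position in the text."""
    return list(enumerate(text))


tokens = char_tokens("John")
for index, char in tokens:
    print(index, char)
```

Because every character carries its own index, an annotation can start or end anywhere inside a word instead of only at word boundaries.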
Next, you’ll define labels for your dataset to guide the annotation process and achieve more accurate results. UbiAI offers several label options:
Entity labels help in identifying and categorizing words or phrases that refer to specific entities (e.g., names, locations, organizations). These labels guide the annotation of entities in the text.
Relationship labels define the connections between entities in a sentence. For example, in a sentence like "John lives in Paris," the relationship could be "lives_in" linking the "John" and "Paris" entities.
Classification labels help in categorizing entire documents or sections of text into categories. There are different types of classification UbiAI offers:
Binary Classification: Used when you need to classify text into two categories (e.g., Positive/Negative sentiment).
Single Classification: Applied when there is only one class per instance, and you need to assign a single label to each text (e.g., Topic classification where each text belongs to one of several predefined topics).
Multi-Classification: Useful when a piece of text can belong to multiple categories at once (e.g., a document that discusses both "Politics" and "Economics").
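To make the three label families more concrete, here is a minimal sketch of how entities, a relationship, and classification labels might be represented in code. None of this reflects UbiAI's actual schema: the `Entity` and `Relation` classes, the `validate_classes` helper, and the rule that a document needs at least one class are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int
    end: int
    label: str


@dataclass(frozen=True)
class Relation:
    head: Entity  # entity the relationship starts from
    tail: Entity  # entity the relationship points to
    label: str


def validate_classes(assigned, allowed, mode):
    """Check assigned classes against a classification mode.

    mode is 'binary', 'single', or 'multi' (hypothetical names).
    """
    assigned, allowed = set(assigned), set(allowed)
    if not assigned <= allowed:
        return False  # an unknown class was used
    if mode == "binary":
        return len(allowed) == 2 and len(assigned) == 1
    if mode == "single":
        return len(assigned) == 1
    if mode == "multi":
        return len(assigned) >= 1  # assumption: at least one class required
    raise ValueError(f"unknown mode: {mode}")


text = "John lives in Paris."
john = Entity(0, 4, "PERSON")
paris = Entity(14, 19, "LOCATION")
lives_in = Relation(head=john, tail=paris, label="lives_in")

topics = ["Politics", "Economics", "Sports"]
print(validate_classes(["Politics", "Economics"], topics, "multi"))
print(validate_classes(["Politics", "Economics"], topics, "single"))
```

The only difference between the single and multi modes in this sketch is how many classes one text may carry, which mirrors the distinction drawn in the list above.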
Once you have defined your labels, you’ll need to upload the textual data you wish to annotate. UbiAI supports a wide range of file formats, making it easy to import data from different sources.
Maximum file size: Each file you upload must be strictly under 500 MB.
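If you prepare files with a script, a quick local check can catch oversized files before you upload them. The sketch below assumes that 500 MB means 500 × 1024 × 1024 bytes; the documentation does not say whether the limit is measured in decimal or binary megabytes, so adjust `MAX_UPLOAD_BYTES` if needed. The function names are hypothetical.

```python
import os

# Assumption: 1 MB is treated as 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 500 * 1024 * 1024


def is_uploadable(size_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True only when the size is strictly under the limit."""
    return size_bytes < limit


def check_file(path: str) -> bool:
    """Check a file on disk against the upload limit."""
    return is_uploadable(os.path.getsize(path))


print(is_uploadable(10 * 1024 * 1024))   # a 10 MB file
print(is_uploadable(MAX_UPLOAD_BYTES))   # a file exactly at the limit
```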
UbiAI offers an automatic pre-annotation option to assist in labeling your documents. You can select between two options:
With Pre-Annotation: Select a model, such as spaCy or one of your previously trained models, to auto-label your documents. This saves time, especially for large datasets.
No Pre-Annotation: If you prefer to annotate manually, select this option. Your documents will be uploaded without any pre-set annotations, allowing you to annotate them from scratch.
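Conceptually, pre-annotation produces suggested spans that a human then reviews. The sketch below stands in for a real model with a simple dictionary lookup, so it is not how spaCy or UbiAI generate suggestions. The `pre_annotate` function and the gazetteer contents are hypothetical.

```python
import re


def pre_annotate(text: str, gazetteer: dict[str, str]) -> list[tuple[int, int, str]]:
    """Suggest (start, end, label) spans for known phrases.

    A dictionary lookup used here as a stand-in for a trained model.
    """
    suggestions = []
    for phrase, label in gazetteer.items():
        # \b keeps "Paris" from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text):
            suggestions.append((match.start(), match.end(), label))
    return sorted(suggestions)


gazetteer = {"John": "PERSON", "Paris": "LOCATION"}
suggested = pre_annotate("John went to Paris.", gazetteer)
for start, end, label in suggested:
    print(start, end, label)
```

Whatever produces the suggestions, the output has the same shape as a manual annotation, which is why pre-filled annotations can be edited in the same interface afterwards.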
Click on Upload and wait for the pre-processing to complete.
Once your text is uploaded and pre-annotations are applied (if selected), you can start the annotation process.
If Pre-Annotation Was Selected: You can edit the pre-filled annotations to ensure they align with your needs. Review and modify them to ensure accuracy.
If No Pre-Annotation Was Selected: Begin annotating manually by selecting the appropriate label and then clicking and dragging over the text.
Here are the annotation guidelines you need to know:
For entity annotations, choose the label, then click and drag over the relevant portion of the text. The selected text is annotated and highlighted in color, and you can click it to edit the annotation.
For relation annotations, select the relation button at the top of the annotation interface, then click an existing relation or add a new one. Then click two entities to create a relation between them.
For classification annotations, you annotate the entire text by deciding which class it belongs to. Look for Classification in the top menu and click the class to select it.
If you need help with the annotation process, UbiAI provides an annotation assistance tool. To use this tool:
Access the Tool: From the top right corner, click on the drop-down menu next to the Predict button. Select Configure Prediction.
Choose a Task: Choose the type of task (e.g., relation extraction, NER, span categorizer, text classification).
Select a Model: Pick an external model (e.g., from Hugging Face) by providing a URL, or select one of UbiAI’s internal models like GPT.
Click Predict: Once you’ve selected the model and task, click on Predict. The platform will auto-annotate the text based on the selected model.
You can then edit the results and validate your dataset.
Once your dataset is validated and ready, you can choose to either add more documents or export your annotated dataset.
Add More Data: If you wish to expand your dataset, simply upload new documents and add them to your existing dataset.
Export the Dataset: To use your annotated dataset outside of UbiAI, click on Export. You can filter the data based on specific labels and select a split ratio.
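For readers unfamiliar with split ratios, the sketch below shows what an 80/20 split means in practice. UbiAI performs the split for you at export time; this code only illustrates the idea and does not describe the platform's implementation. The function name, the fixed seed, and the rounding rule are assumptions.

```python
import random


def split_dataset(documents, train_ratio=0.8, seed=42):
    """Shuffle documents and divide them into training and test sets."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(documents)              # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


docs = [f"doc_{i}" for i in range(10)]
train, test = split_dataset(docs, train_ratio=0.8)
print(len(train), len(test))
```

Shuffling before cutting avoids a test set made up only of the most recently uploaded documents.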
For macOS users, it is recommended to unzip the exported file with WinZip in order to preserve file names.
You can select File Type and Upload using this API code:
Select Export Type and Export using this API code: