Create Your Dataset

In artificial intelligence, everything relies on data. The quality of your AI model hinges on the quality, structure, and relevance of the dataset it’s trained on. If you understand your project goals and invest effort in creating the right dataset, you’ll significantly boost the chances of achieving good results.

Why Create Your Data on UbiAI?

Despite its importance, dataset creation is often seen as a challenging and tedious process. Collecting, annotating, and validating data can be complex and time-consuming. UbiAI simplifies this process with powerful yet easy-to-use tools that make your workflow efficient and less stressful. Let’s walk through the dataset creation process step by step.

Getting Started with Datasets

To begin creating a dataset, navigate to the Datasets menu on the left sidebar of the UbiAI interface. This section serves as the central hub for managing all your datasets.

Here’s what you’ll find:

List of Existing Datasets: A complete overview of all your datasets, including key metadata such as creation date, size, and versions.
Sort and Filter Options: Tools to organize and quickly locate datasets based on specific criteria, such as name, type, or creation date. For example:
If you only want to see span-based datasets, click the “Span-Based” button in the top menu to filter them.
Use the search bar to find a specific dataset: type its name or a relevant keyword.
Use the Sort By option to arrange datasets by criteria like date created, last modified, or alphabetical order, allowing for easy navigation.
New Dataset Button: Located in the top-right corner, this button initiates the dataset creation process.

This centralized view ensures that you can manage multiple datasets at once, whether you’re working on a single project or juggling several tasks with a team.

If you want to view more details of a specific dataset, simply click on the “More Details” button next to it. This will take you to a detailed view where you can inspect the dataset’s properties, make edits, or update its content as needed.

The UbiAI Dataset Creation Steps

Creating a dataset in UbiAI follows a consistent process across all tasks, with slight variations depending on your specific use case. Below is the general workflow:

This process starts as soon as you click the New Dataset button in the top-right corner.

Step 1: Selecting a Dataset Type

The first step in creating a new dataset is choosing the dataset type that aligns with your project’s goals. UbiAI offers support for various dataset types to suit different tasks:

Prompt-Response: Ideal for tasks involving conversational AI or prompt tuning. These datasets involve pairing specific prompts with corresponding responses.
Text-Based: Designed for natural language processing tasks such as Named Entity Recognition (NER) and other annotation needs involving free-form text. This dataset type supports span-based or character-based annotations.
Document-Based: Ideal for working with documents such as PDFs, scanned images, or other document formats. UbiAI’s integrated OCR (Optical Character Recognition) technology helps you annotate these files by automatically extracting text from images or scanned documents.
Image-Based: Designed for computer vision tasks such as object detection, image segmentation, or image classification.

Each dataset type has unique features and workflows to optimize performance for its intended use case. Make sure you choose the type that best fits your project’s goals.
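
To make the difference between dataset types more concrete, here is a minimal Python sketch of what a single record might look like for a prompt-response dataset and for a span-based text dataset. The class and field names (`PromptResponseRecord`, `TextRecord`, `Span`, `start`, `end`, `label`) are assumptions for illustration only and do not describe UbiAI's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class PromptResponseRecord:
    """One prompt paired with its expected response (hypothetical schema)."""
    prompt: str
    response: str


@dataclass
class Span:
    """A labeled character range: text[start:end] carries the label."""
    start: int
    end: int
    label: str


@dataclass
class TextRecord:
    """Free-form text plus its span annotations (hypothetical schema)."""
    text: str
    spans: list[Span] = field(default_factory=list)

    def labeled_texts(self) -> list[tuple[str, str]]:
        # Resolve each span to the substring it covers.
        return [(self.text[s.start:s.end], s.label) for s in self.spans]


qa = PromptResponseRecord(
    prompt="Summarize: The invoice is due on 1 March.",
    response="Payment is due 1 March.",
)

ner = TextRecord(
    text="Ada Lovelace was born in London.",
    spans=[Span(0, 12, "PERSON"), Span(25, 31, "LOCATION")],
)
print(ner.labeled_texts())
```

A prompt-response record is just a pair of strings, while a text record keeps the raw text intact and stores annotations as offsets into it.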

Step 2: Naming and Choosing a Language

Once you’ve selected the dataset type, the next step is to give it a name and specify the primary language:

Dataset Name: Choose a descriptive name that reflects the purpose or content of the dataset. This will make it easier to locate and manage as your library of datasets grows.
Primary Language: Select the language in which your data is written. This is essential for ensuring accurate tokenization and annotation. Tokenization refers to breaking down text into smaller units (tokens), which are the building blocks of AI model training.

UbiAI also supports multilingual projects: select the multilingual option in the language menu when creating your dataset.
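
As a rough illustration of what tokenization produces, the sketch below splits text into word and punctuation tokens and records each token's character offsets. It uses a simple regular expression and is not UbiAI's tokenizer; the `tokenize` function is a hypothetical helper.

```python
import re

# Runs of word characters, or single non-space punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[tuple[str, int, int]]:
    """Return (token, start, end) triples with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("UbiAI supports OCR.")
for token, start, end in tokens:
    print(f"{token!r:12} [{start}:{end}]")
```

A rule like this relies on spaces and punctuation to find word boundaries, so it would treat an unbroken run of Chinese or Japanese characters as a single token. That is one reason a language-aware tokenizer, and therefore the primary language setting, matters.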

Step 3: Adding a Detailed Project Description

Providing a comprehensive project description is vital for successful dataset creation. This step sets the foundation for annotation consistency and collaboration.

Include the following in your description:
Project Objectives: Clearly state the goal of the dataset and what you are using it for.
Annotation Guidelines: Provide detailed instructions on how to annotate the data, including entity definitions, relation definitions, and classification criteria.
Examples and Edge Cases: Illustrate complex scenarios to ensure annotators understand how to handle them.

This description becomes your blueprint, ensuring that everyone involved in the project adheres to the same standards.

Step 4: Customizing the Dataset

The next step involves customizing the dataset for your task. Depending on the dataset type, you may need to define various labels:

Entities: For text and document datasets, specify entity types such as names, dates, or locations.
Relations: Define and establish relationships between different entities within your data.
Classifications: Create distinct classes for image or text classification tasks.

For prompt-response datasets, this step is unnecessary, as annotations are generated differently.
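
The labels defined in this step can be thought of as a small schema. The sketch below is a hypothetical representation (the `LabelSchema` class and its fields are illustrative and not part of UbiAI) that checks whether every relation connects entity types that have actually been defined. Giving each relation a fixed head and tail entity type is itself an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class LabelSchema:
    """Hypothetical container for the labels defined in this step."""
    entities: set[str] = field(default_factory=set)
    # Relation name -> (head entity type, tail entity type)
    relations: dict[str, tuple[str, str]] = field(default_factory=dict)
    classifications: set[str] = field(default_factory=set)

    def undefined_relation_endpoints(self) -> list[str]:
        """List relations whose head or tail entity type is not defined."""
        problems = []
        for name, (head, tail) in sorted(self.relations.items()):
            for role, entity in (("head", head), ("tail", tail)):
                if entity not in self.entities:
                    problems.append(f"{name}: {role} '{entity}' is not a defined entity")
        return problems


schema = LabelSchema(
    entities={"PERSON", "DATE", "LOCATION"},
    relations={
        "BORN_IN": ("PERSON", "LOCATION"),
        "WORKS_FOR": ("PERSON", "ORGANIZATION"),  # ORGANIZATION is not defined
    },
    classifications={"biography", "news"},
)
print(schema.undefined_relation_endpoints())
```

Catching an undefined label at this stage is cheaper than discovering it after annotators have already started work.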

Step 5: Uploading or Generating the Dataset

At this stage, you can upload your data files or create a dataset from scratch directly within UbiAI. The platform supports various file types, including:

TXT, PDF, HTML, DOCX, Native PDF with OCR, JPG, PNG, JSON, CSV, TSV, and ZIP.
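
Before uploading, it can help to confirm locally that every file has a supported extension. The sketch below is a pre-check on file names only; the extension set is derived from the list above, and the `split_by_support` helper is hypothetical.

```python
from pathlib import Path

# Extensions derived from the supported file types listed above.
SUPPORTED_EXTENSIONS = {
    ".txt", ".pdf", ".html", ".docx", ".jpg", ".png",
    ".json", ".csv", ".tsv", ".zip",
}


def split_by_support(paths: list[str]) -> tuple[list[str], list[str]]:
    """Separate paths into (supported, unsupported) by file extension."""
    supported, unsupported = [], []
    for path in paths:
        # Compare case-insensitively so "REPORT.PDF" is accepted.
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


ok, rejected = split_by_support(["notes.txt", "REPORT.PDF", "scan.tiff", "data.csv"])
print("supported:", ok)
print("unsupported:", rejected)
```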

For prompt-response datasets, UbiAI offers a unique feature to generate data automatically, saving time and manual effort.

Step 6: Annotating the Dataset

Annotation is one of the most important stages in dataset creation. UbiAI simplifies this process with the following options:

Manual Annotation: Provides full control for highly specific tasks.
Auto-Annotation: Uses AI to predict and suggest annotations based on your patterns.
Response Generation: For prompt-response datasets, UbiAI can automatically create high-quality responses.
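
To give a feel for what pattern-based suggestions look like, the sketch below pre-annotates text with regular expressions. This is a generic rule-based technique shown as an assumption for illustration; it does not describe how UbiAI's auto-annotation works internally, and the patterns and label names are hypothetical.

```python
import re

# Hypothetical label -> pattern mapping.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def suggest_annotations(text: str) -> list[dict]:
    """Return suggested spans sorted by start offset."""
    suggestions = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            suggestions.append(
                {"start": match.start(), "end": match.end(),
                 "label": label, "text": match.group()}
            )
    return sorted(suggestions, key=lambda s: s["start"])


text = "Contact ana@example.com before 2024-03-01."
suggestions = suggest_annotations(text)
for s in suggestions:
    print(s)
```

Suggestions like these still need human review: a pattern finds everything that looks like a date, not everything that is one.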

To undo previous annotations, press Ctrl + Z. To restore annotations you have undone, press Ctrl + Y.

You can track the annotation progress by looking at the completion percentage bar along with the number of finished documents.
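
The completion figure is the share of finished documents. A minimal sketch, assuming each document carries a hypothetical `status` string and that "validated" marks a finished document:

```python
def completion_percentage(statuses: list[str], done_status: str = "validated") -> float:
    """Percentage of documents whose status equals done_status."""
    if not statuses:
        return 0.0  # Avoid dividing by zero for an empty dataset.
    finished = sum(1 for status in statuses if status == done_status)
    return round(100 * finished / len(statuses), 1)


statuses = ["validated", "validated", "pending", "rejected", "validated"]
print(f"{completion_percentage(statuses)}% complete")
```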

Step 7: Validating and Saving the Dataset

Once the annotation process is complete, it’s time to validate your dataset:

Quality Checks: Ensure that annotations are accurate and complete.
Version Control: UbiAI tracks every change you make, allowing you to revert to previous versions or compare different iterations.
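
Some quality checks can be automated. The sketch below flags spans that fall outside the text, use an undefined label, or overlap an earlier span. The record layout (`start`, `end`, `label`) and the `check_spans` helper are assumptions for illustration, and whether overlapping spans count as an error depends on your annotation guidelines.

```python
def describe(span: dict) -> str:
    return f"{span['label']}[{span['start']}:{span['end']}]"


def check_spans(text: str, spans: list[dict], allowed_labels: set[str]) -> list[str]:
    """Return human-readable problems found in a document's span annotations."""
    problems = []
    ordered = sorted(spans, key=lambda s: (s["start"], s["end"]))
    for span in ordered:
        if span["label"] not in allowed_labels:
            problems.append(f"unknown label: {describe(span)}")
        if not 0 <= span["start"] < span["end"] <= len(text):
            problems.append(f"out of bounds: {describe(span)}")
    # Track the earlier span that reaches furthest right; any later span
    # starting before its end overlaps it.
    furthest = None
    for span in ordered:
        if furthest is not None and span["start"] < furthest["end"]:
            problems.append(f"overlap: {describe(span)} with {describe(furthest)}")
        if furthest is None or span["end"] > furthest["end"]:
            furthest = span
    return problems


text = "Ada Lovelace was born in London."
spans = [
    {"start": 0, "end": 12, "label": "PERSON"},
    {"start": 4, "end": 12, "label": "PERSON"},     # overlaps the first span
    {"start": 25, "end": 40, "label": "LOCATION"},  # runs past the end of the text
    {"start": 13, "end": 16, "label": "VERB"},      # label not in the schema
]
problems = check_spans(text, spans, {"PERSON", "LOCATION"})
for problem in problems:
    print(problem)
```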

Even after saving, datasets remain editable. Access the dataset details to add new entries, refine labels, or adjust annotations as needed.

You can edit or delete an existing entity, relation, or document class across all your documents by clicking Dataset Version Settings. In the dataset settings, you can also edit the dataset name, description, and labels.

To access the annotation interface, simply click on any document within the dataset page.

Validation Shortcut: Shift + Down Arrow

Rejection Shortcut: Shift + Left Arrow

Using Your Dataset

After validation, your dataset is ready for use. You have several options to integrate it into your AI workflow:

Fine-Tuning on UbiAI: Use your dataset to fine-tune models directly within the UbiAI platform.
Exporting Data: Export datasets in different formats depending on the task for use in external projects.
Access via API: UbiAI’s API lets you interact with datasets programmatically. Key features include creating projects and uploading files, downloading validated documents, training machine learning models, and running predictions with your trained models.
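
Exported span annotations often need converting to whatever format a downstream training library expects. As one example, the sketch below converts character-offset spans to token-level BIO tags, a common format for NER. The input layout, the `spans_to_bio` helper, and the simple regex tokenizer are assumptions for illustration; they do not describe UbiAI's export formats.

```python
import re


def spans_to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Convert (start, end, label) character spans to (token, BIO tag) pairs.

    Assumes span boundaries line up with token boundaries.
    """
    tagged = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when it lies fully inside it.
            if start <= match.start() and match.end() <= end:
                # The token at the span start gets B-, later tokens get I-.
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Ada Lovelace was born in London."
spans = [(0, 12, "PERSON"), (25, 31, "LOCATION")]
bio = spans_to_bio(text, spans)
for token, tag in bio:
    print(f"{token}\t{tag}")
```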

