Create Your Dataset
In artificial intelligence, everything relies on data. The quality of your AI model hinges on the quality, structure, and relevance of the dataset it’s trained on. If you understand your project goals and invest effort in creating the right dataset, you’ll significantly boost the chances of achieving good results.
Despite its importance, dataset creation is often seen as a challenging and tedious process. Collecting, annotating, and validating data can be complex and time-consuming, leaving many frustrated. UbiAI simplifies this process with powerful yet easy-to-use tools that make your workflow stress-free and efficient. Let’s walk through the dataset creation process step by step.
To begin creating a dataset, navigate to the Datasets menu on the left sidebar of the UbiAI interface. This section serves as the central hub for managing all your datasets.
Here’s what you’ll find:
List of Existing Datasets: A complete overview of all your datasets, including key metadata such as creation date, size, and versions.
Sort and Filter Options: Tools to organize and quickly locate datasets based on specific criteria, such as name, type, or creation date. For example:
If you only want to see span-based datasets, click the “Span-Based” button in the top menu to filter them.
If you’re looking for a specific dataset, use the search bar: type the dataset’s name or a relevant keyword.
Use the Sort By option to arrange datasets by criteria like date created, last modified, or alphabetical order, allowing for easy navigation.
New Dataset Button: Located in the top-right corner, this button initiates the dataset creation process.
This centralized view ensures that you can manage multiple datasets at once, whether you’re working on a single project or juggling several tasks with a team.
Creating a dataset in UbiAI follows a consistent process across all tasks, with slight variations depending on your specific use case. Below is the general workflow:
This process starts as soon as you click the New Dataset button in the top-right corner.
The first step in creating a new dataset is choosing the dataset type that aligns with your project’s goals. UbiAI offers support for various dataset types to suit different tasks:
Prompt-Response: Ideal for tasks involving conversational AI or prompt tuning. These datasets involve pairing specific prompts with corresponding responses.
Text-Based: Designed for natural language processing tasks such as Named Entity Recognition (NER) or other annotation needs involving free-form text. This dataset type supports span-based or character-based annotations.
Document-Based: This dataset type is ideal for working with documents such as PDFs, scanned images, or other document formats. UbiAI’s OCR (Optical Character Recognition) technology is integrated to assist you in annotating these types of files by automatically extracting text from images or scanned documents.
Image-Based: Designed for computer vision tasks such as object detection, image segmentation, or image classification.
Each dataset type has unique features and workflows to optimize performance for its intended use case. Make sure you choose the type that best fits your project’s goals.
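To make the prompt-response type more concrete, here is a minimal sketch of what such pairs can look like when stored as JSON Lines. The field names `prompt` and `response`, the sample pairs, and the JSONL layout are assumptions for illustration only; they are not UbiAI’s documented import format.

```python
import json

# Hypothetical prompt-response pairs; the field names are illustrative assumptions.
pairs = [
    {"prompt": "Summarize: The meeting moved to Friday.",
     "response": "Meeting rescheduled to Friday."},
    {"prompt": "Translate to French: Good morning",
     "response": "Bonjour"},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(pairs)
print(jsonl_text)
```

Keeping one pair per line makes it easy to append new examples and to review them individually before upload.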
Once you’ve selected the dataset type, the next step is to give it a name and specify the primary language:
Dataset Name: Choose a descriptive name that reflects the purpose or content of the dataset. This will make it easier to locate and manage as your library of datasets grows.
Primary Language: Select the language in which your data is written. This is essential for ensuring accurate tokenization and annotation. Tokenization refers to breaking down text into smaller units (tokens), which are the building blocks of AI model training.
UbiAI also supports multilingual projects. To use this, select the multilingual option in the language menu when creating your dataset.
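The idea of tokenization can be sketched in a few lines. The example below is a deliberately simple regex tokenizer that also records character offsets, which is what span-based annotation relies on. It is not UbiAI’s tokenizer; real tokenizers are language-specific, which is why the primary language setting matters.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens with character offsets.

    Returns a list of (token, start, end) tuples, where end is exclusive.
    """
    # \w+ matches a run of word characters; [^\w\s] matches a single
    # punctuation character. Whitespace is skipped.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


tokens = tokenize("UbiAI supports NER.")
for token, start, end in tokens:
    print(f"{token!r}: {start}-{end}")
```

A rule this simple works reasonably for space-delimited languages but would not segment text in languages written without spaces between words.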
Providing a comprehensive project description is vital for successful dataset creation. This step sets the foundation for annotation consistency and collaboration.
Include the following in your description:
Project Objectives: Clearly state the goal of the dataset and what you are using it for.
Annotation Guidelines: Provide detailed instructions on how to annotate the data, including entity definitions, relation definitions, and classification criteria.
Examples and Edge Cases: Illustrate complex scenarios to ensure annotators understand how to handle them.
This description becomes your blueprint, ensuring that everyone involved in the project adheres to the same standards.
The next step involves customizing the dataset for your task. Depending on the dataset type, you may need to define various labels:
Entities: For text and document datasets, specify entity types such as names, dates, or locations.
Relations: Define the relationships that can hold between entities in your data.
Classifications: Create distinct classes for image or text classification tasks.
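Before entering labels in the interface, it can help to draft them as a small schema and check it for obvious mistakes. The sketch below is one way to do that. The `LabelSchema` class, its fields, and the sample labels are hypothetical and exist only for this illustration; they do not mirror any UbiAI data model.

```python
from dataclasses import dataclass, field


@dataclass
class LabelSchema:
    """Hypothetical container for the labels planned for a dataset."""
    entities: list = field(default_factory=list)
    # Each relation is a (name, head_entity, tail_entity) tuple.
    relations: list = field(default_factory=list)
    classifications: list = field(default_factory=list)

    def problems(self):
        """Return human-readable schema problems (empty list if none)."""
        issues = []
        groups = (("entity", self.entities),
                  ("classification", self.classifications))
        for group_name, labels in groups:
            duplicates = {label for label in labels if labels.count(label) > 1}
            for label in sorted(duplicates):
                issues.append(f"duplicate {group_name} label: {label}")
        # A relation should only connect entity types that are defined.
        for name, head, tail in self.relations:
            for endpoint in (head, tail):
                if endpoint not in self.entities:
                    issues.append(
                        f"relation {name} references unknown entity: {endpoint}")
        return issues


schema = LabelSchema(
    entities=["PERSON", "DATE", "LOCATION"],
    relations=[("BORN_IN", "PERSON", "LOCATION"),
               ("WORKS_AT", "PERSON", "ORG")],
    classifications=["positive", "negative"],
)
print(schema.problems())
```

Here the check flags `WORKS_AT` because `ORG` was never declared as an entity, the kind of gap that is cheaper to catch before annotation starts.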
At this stage, you can upload your data files or create a dataset from scratch directly within UbiAI. The platform supports various file types, including:
TXT, PDF, HTML, DOCX, Native PDF with OCR, JPG, PNG, JSON, CSV, TSV, and ZIP.
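If you are preparing a large batch of files, a quick local check can catch unsupported types before upload. The sketch below sorts filenames by extension using the list above. Matching on the file extension alone is an assumption of this example, and the function name and sample filenames are hypothetical; the platform may validate files differently.

```python
from pathlib import Path

# Extensions taken from the supported file types listed above.
SUPPORTED_EXTENSIONS = {
    ".txt", ".pdf", ".html", ".docx", ".jpg",
    ".png", ".json", ".csv", ".tsv", ".zip",
}


def split_by_support(filenames):
    """Partition filenames into (supported, unsupported) lists.

    The comparison is case-insensitive, so "A.PDF" counts as a PDF.
    """
    supported, unsupported = [], []
    for name in filenames:
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


ok, rejected = split_by_support(
    ["contract.PDF", "notes.txt", "scan.tiff", "batch.zip"])
print("supported:", ok)
print("unsupported:", rejected)
```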
For prompt-response datasets, UbiAI offers a unique feature to generate data automatically, saving time and manual effort.
Annotation is one of the most important stages in dataset creation. UbiAI simplifies this process with three options:
Manual Annotation: Provides full control for highly specific tasks.
Auto-Annotation: Uses AI to predict and suggest annotations based on your patterns.
Response Generation: For prompt-response datasets, UbiAI can automatically create high-quality responses.
To undo an annotation, press Ctrl + Z. To restore an annotation you have just undone, press Ctrl + Y.
Once the annotation process is complete, it’s time to validate your dataset:
Quality Checks: Ensure that annotations are accurate and complete.
Version Control: UbiAI tracks every change you make, allowing you to revert to previous versions or compare different iterations.
Even after saving, datasets remain editable. Access the dataset details to add new entries, refine labels, or adjust annotations as needed.
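For exported span annotations, some quality checks can also be scripted. The sketch below flags unknown labels, out-of-bounds spans, spans padded with whitespace, and overlapping spans. The annotation layout (`start`, `end`, `label` with an exclusive end offset) is an assumption for illustration, not UbiAI’s export schema, and overlapping spans may be legitimate in some annotation schemes.

```python
def check_annotations(text, annotations, allowed_labels):
    """Return a list of problems found in span annotations for one document.

    Each annotation is a dict with `start` and `end` character offsets
    (end exclusive) and a `label`. This layout is assumed for the sketch.
    """
    problems = []
    in_bounds = []
    for ann in annotations:
        start, end, label = ann["start"], ann["end"], ann["label"]
        if label not in allowed_labels:
            problems.append(f"unknown label {label!r} at {start}-{end}")
        if not (0 <= start < end <= len(text)):
            problems.append(f"span {start}-{end} is out of bounds")
            continue  # Skip further checks that need a valid slice.
        span_text = text[start:end]
        if span_text != span_text.strip():
            problems.append(
                f"span {start}-{end} has leading or trailing whitespace")
        in_bounds.append(ann)

    # Overlap check: walk spans in start order and remember the furthest
    # end seen so far; a span starting before that point overlaps.
    furthest_end = 0
    for ann in sorted(in_bounds, key=lambda a: (a["start"], a["end"])):
        if ann["start"] < furthest_end:
            problems.append(
                f"span {ann['start']}-{ann['end']} overlaps an earlier span")
        furthest_end = max(furthest_end, ann["end"])
    return problems


text = "Ada Lovelace was born in London."
annotations = [
    {"start": 0, "end": 12, "label": "PERSON"},
    {"start": 24, "end": 31, "label": "LOCATION"},  # includes a leading space
    {"start": 4, "end": 12, "label": "SURNAME"},    # unknown label, overlaps
    {"start": 30, "end": 40, "label": "DATE"},      # runs past the text
]
for problem in check_annotations(text, annotations, {"PERSON", "LOCATION", "DATE"}):
    print(problem)
```

Checks like these complement manual review; they catch mechanical errors but cannot judge whether a label is semantically right.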
After validation, your dataset is ready for use. You have several options to integrate it into your AI workflow:
Fine-Tuning on UbiAI: Use your dataset to fine-tune models directly within the UbiAI platform.
Exporting Data: Export datasets in different formats depending on the task for use in external projects.
Access via API: UbiAI’s API lets you interact with datasets programmatically. Key features include creating projects and uploading files, downloading validated documents, training machine learning models, and running predictions using your trained models.