Supported File Format Details

UbiAI supports a variety of file formats for dataset creation and annotation, ensuring flexibility for different types of data. Below is a guide on how to prepare your files for upload, detailing the specific requirements for each format.

Text Formats

TXT (Plain Text Files)

Each .txt file should contain raw, unformatted text, with no HTML tags or non-standard encoding. Remove anything that could interfere with the annotation process, such as extraneous line breaks or hidden formatting characters.

This is meant to be a sample document for training text classification.
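If your source text comes from a word processor or a web page, a small cleanup pass before upload can help. The sketch below is one possible approach, not a UbiAI tool: the function name clean_text is hypothetical, and the choice to drop all Unicode control and format characters is an assumption that may be too aggressive for scripts that rely on zero-width joiners.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Return plain text with normalized line endings and no hidden characters."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control (Cc) and format (Cf) characters such as zero-width spaces
    # and byte-order marks, but keep newlines and tabs.
    # Assumption: the text does not depend on zero-width joiners.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Strip trailing whitespace from every line.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

Write the result back with open(path, "w", encoding="utf-8") so the file is saved as plain UTF-8 text.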

PDF (Portable Document Format)

UbiAI accepts both native (text-based) PDFs and scanned PDFs, which require OCR (Optical Character Recognition) for text extraction. Scanned documents should be legible and high enough in resolution for OCR. Avoid uploading low-resolution scans, as they may result in poor text extraction.

HTML (HyperText Markup Language)

HTML files should contain structured text data, including basic tags such as <p> for paragraphs and <h1> for headers. UbiAI extracts the text content within these tags. Ensure that the HTML file is well-structured and that text is not embedded within images or other non-text elements. Avoid using inline styles that could interfere with text extraction.

<html>
  <body>
    <h1>Document Title</h1>
    <p>This is a sample paragraph for annotation.</p>
  </body>
</html>
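To get a rough idea of which text sits inside paragraph and heading tags before you upload, you can preview it locally. The sketch below uses Python's standard html.parser module. It is not UbiAI's extraction logic; the names TextPreview and preview_text are hypothetical, and the code assumes every <p> and heading tag is explicitly closed.

```python
from html.parser import HTMLParser


class TextPreview(HTMLParser):
    """Collect text found inside <p> and <h1>..<h6> elements."""

    TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of text-bearing tags currently open
        self.blocks = []    # one string per extracted block
        self._buffer = []   # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in self.TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.TEXT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join the pieces and collapse internal whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        # Text outside <p>/<h*> (for example inside <div> only) is skipped.
        if self.depth > 0:
            self._buffer.append(data)


def preview_text(html: str) -> list[str]:
    parser = TextPreview()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

If preview_text returns an empty list for a file, the text is probably held in images or in tags other than paragraphs and headings.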

DOCX (Microsoft Word Documents)

DOCX files should contain structured text, with sections and headings if applicable. UbiAI extracts text from DOCX files, ignoring formatting elements such as fonts or colors. Tables and images will be ignored during text extraction unless you check the table extraction option during setup.

Image Formats

For image classification tasks, UbiAI supports JPG and PNG files. Files should contain clear images. Some compression is fine, but excessive compression degrades image quality. Ideally, each image should represent a single object or category.

JSON Format

If you have pre-annotated data, you can upload it in JSON format. The JSON file should contain entities and relationships if applicable. Each entity or relationship in the JSON file should be clearly defined. If your data contains nested objects, ensure they are structured properly with consistent keys and values. The JSON should follow the format below:

CSV Format

For structured text data, UbiAI supports CSV files. Each row represents one document, and the file should be encoded in UTF-8.

For Text Generation tasks, make sure to upload a CSV file with four specific columns:

System Prompt: The prompt provided by the system.
User Prompt: The prompt given by the user.
Input: Optional context or information.
Response: The generated response from the system.
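
One way to produce such a file is with Python's standard csv module. This is a minimal sketch: the function name write_generation_csv is hypothetical, and the exact header spelling (taken from the column names listed above) is an assumption that you should check against what your UbiAI project expects.

```python
import csv

# Header names copied from the column list above; exact spelling is an assumption.
COLUMNS = ["System Prompt", "User Prompt", "Input", "Response"]


def write_generation_csv(path, rows):
    """Write rows (dicts keyed by COLUMNS) to a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            # Input is optional, so it defaults to an empty string.
            writer.writerow({"Input": "", **row})
```

The csv module quotes fields that contain commas, quotes, or line breaks, so prompts and responses can be passed in unchanged.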

TSV Format

UbiAI also supports TSV files in which tokens are pre-tagged using the IOB (inside, outside, beginning) format. Ensure that each token is correctly labeled with the appropriate IOB tag.

The document separator -DOCSTART- -X- O O should be used at the beginning of each new document.
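
The sketch below shows one way to generate such a file from tokenized documents. Several details are assumptions rather than confirmed UbiAI requirements: the two-column layout (token, tab, tag), the IOB2 rule that an I- tag must follow a B- or I- tag of the same type, and the helper names is_valid_iob and to_tsv. The separator line is copied verbatim from the text above.

```python
DOC_SEPARATOR = "-DOCSTART- -X- O O"  # copied verbatim from this guide


def is_valid_iob(tags):
    """Check IOB2 consistency: every I-X must follow B-X or I-X (assumed rule)."""
    previous = "O"
    for tag in tags:
        # Only "O", "B-<label>" and "I-<label>" are accepted.
        if tag != "O" and tag[:2] not in ("B-", "I-"):
            return False
        # An inside tag must continue an entity of the same label.
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            return False
        previous = tag
    return True


def to_tsv(documents):
    """Serialize documents, each a list of (token, tag) pairs, to TSV text."""
    lines = []
    for doc in documents:
        tags = [tag for _, tag in doc]
        if not is_valid_iob(tags):
            raise ValueError(f"inconsistent IOB tags: {tags}")
        # Each document starts with the separator line.
        lines.append(DOC_SEPARATOR)
        # Assumed layout: one token per line, token and tag separated by a tab.
        lines.extend(f"{token}\t{tag}" for token, tag in doc)
    return "\n".join(lines) + "\n"
```

For example, to_tsv([[("John", "B-PER"), ("Smith", "I-PER"), ("works", "O")]]) yields the separator line followed by three tab-separated token lines.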

ZIP Format

For bulk uploads, you can also upload a ZIP file containing multiple documents in TXT, PDF, or HTML format. Ensure that the files are properly structured and that there are no unnecessary subfolders within the ZIP archive.

Compress your documents into a single ZIP file. Ensure all documents inside the ZIP are in the correct format and that the ZIP file is not corrupted.
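
A flat, verified archive can be built with Python's standard zipfile module. The sketch below is one possible approach under stated assumptions: the function name build_flat_zip is hypothetical, files are matched by extension only, and every file is placed at the archive root so that no subfolders are created.

```python
import zipfile
from pathlib import Path

ALLOWED = {".txt", ".pdf", ".html"}  # formats listed above for ZIP uploads


def build_flat_zip(source_dir, zip_path):
    """Zip every TXT, PDF, or HTML file under source_dir at the archive root."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(source_dir).rglob("*")):
            if path.is_file() and path.suffix.lower() in ALLOWED:
                # Flattening folders can make two files collide by name.
                if path.name in added:
                    raise ValueError(f"duplicate file name: {path.name}")
                # arcname=path.name drops the folder part of the path.
                archive.write(path, arcname=path.name)
                added.append(path.name)
    # testzip() returns the first corrupt member name, or None if all CRCs match.
    with zipfile.ZipFile(zip_path) as archive:
        if archive.testzip() is not None:
            raise ValueError("archive failed its integrity check")
    return added
```

The function returns the names it added, which makes it easy to confirm that unsupported files were skipped before you upload.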


Last updated 5 months ago