Supported File Format Details
UbiAI supports a variety of file formats for dataset creation and annotation, ensuring flexibility for different types of data. Below is a guide on how to prepare your files for upload, detailing the specific requirements for each format.
Text Formats
TXT (Plain Text Files)
Each file should contain raw, unformatted text stored in a .txt file, with no HTML tags or special encoding. Make sure your text files are free of artifacts that could interfere with the annotation process, such as extraneous line breaks or hidden formatting characters.
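The cleanup described above can be sketched in a few lines of Python. This is a minimal illustration, not part of UbiAI: the function name clean_text is hypothetical, and treating Unicode control and format characters as "hidden formatting" is an assumption you may want to relax for scripts that rely on joiner characters.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize raw text before saving it as a .txt file (illustrative helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control (Cc) and format (Cf) characters such as zero-width spaces
    # and byte-order marks, but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Remove trailing whitespace on every line.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"
```

Write the result with open(path, "w", encoding="utf-8") so the file is saved as plain UTF-8 text.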
PDF (Portable Document Format)
UbiAI accepts native PDFs as well as scanned documents, which require OCR (Optical Character Recognition) for text extraction. Scanned pages should be legible, with clear text and a resolution high enough for OCR. Avoid uploading low-resolution scans, as they may result in poor text extraction.
HTML (HyperText Markup Language)
HTML files should contain structured text data, including basic tags such as <p> for paragraphs and <h1> for headers. UbiAI extracts the text content within these tags. Ensure that the HTML file is well-structured and that text is not embedded within images or other non-text elements. Avoid using inline styles that could interfere with text extraction.
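To check before upload which text sits inside paragraph and header tags, a rough preview can be built with Python's standard html.parser module. This is only a sketch for inspecting your own files. It is not UbiAI's extractor, and the class name TextPreview and the chosen tag set are assumptions.

```python
from html.parser import HTMLParser

# Tags whose text we want to preview (an assumption for this sketch).
TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextPreview(HTMLParser):
    """Collect (tag, text) pairs for paragraph and header elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # collected (tag, text) pairs
        self._open = None   # text tag currently being read, if any
        self._parts = []    # text fragments inside the open tag

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS and self._open is None:
            self._open = tag
            self._parts = []

    def handle_data(self, data):
        if self._open is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            # Join fragments and collapse internal whitespace.
            text = " ".join("".join(self._parts).split())
            if text:
                self.blocks.append((tag, text))
            self._open = None


def preview_text(html: str):
    parser = TextPreview()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Text that only appears inside an image (for example in a screenshot) never shows up in the preview, which is a quick way to spot content that would be lost.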
DOCX (Microsoft Word Documents)
DOCX files should contain structured text, with sections and headings if applicable. UbiAI extracts text from DOCX files, ignoring formatting elements such as fonts or colors. Tables and images will be ignored during text extraction unless you check the table extraction option during setup.
Image Formats
For image classification tasks, UbiAI supports the following image formats: JPG and PNG. Files should contain clear images. Make sure the images are not overly compressed, because excessive compression degrades image quality. Ideally, each image should represent a single object or category.
JSON Format
If you have pre-annotated data, you can upload it in JSON format. The JSON file should contain entities and relationships if applicable. Each entity or relationship in the JSON file should be clearly defined. If your data contains nested objects, ensure they are structured properly with consistent keys and values. The JSON should follow the format below:
CSV Format
For structured text data, UbiAI supports CSV files. Each row represents one document, and the file should be encoded in UTF-8.
For Text Generation tasks, make sure to upload a CSV file with four specific columns:
System Prompt: The prompt provided by the system.
User Prompt: The prompt given by the user.
Input: Optional context or information.
Response: The generated response from the system.
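A Text Generation CSV with these four columns can be produced with Python's csv module, as in the minimal sketch below. The function name write_text_generation_csv is hypothetical, and it is an assumption that the header row should use the column names exactly as listed above.

```python
import csv

# Column names taken from the list above; exact header spelling is an assumption.
COLUMNS = ["System Prompt", "User Prompt", "Input", "Response"]


def write_text_generation_csv(path, rows):
    """Write rows (a list of dicts) to a UTF-8 CSV with the four columns."""
    # newline="" lets the csv module control line endings;
    # encoding="utf-8" matches the encoding requirement stated above.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            # "Input" is optional, so missing values are written as empty cells.
            writer.writerow({col: row.get(col, "") for col in COLUMNS})
```

For example, write_text_generation_csv("train.csv", [{"System Prompt": "You are a helpful assistant.", "User Prompt": "Summarize the text.", "Response": "A short summary."}]) writes one header row and one data row with an empty Input cell. The csv module quotes any field that contains commas or line breaks.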
TSV Format
UbiAI also supports TSV files, where tokens are pre-tagged using the IOB format. Ensure that each token is correctly labeled with the appropriate IOB tag.
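As an illustration of IOB tagging, the sketch below converts token-level entity spans into one tag per token and prints them as tab-separated lines. The two-column layout (token, then tag), the B-/I- prefix convention, and the label names are assumptions for the example. Check them against the format your project expects.

```python
def spans_to_iob(tokens, spans):
    """Return one IOB tag per token.

    spans is a list of (start, end, label) tuples over token indices,
    with end exclusive. Spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)           # "O" marks tokens outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags


def to_tsv(tokens, tags):
    """Render token/tag pairs as tab-separated lines, one token per line."""
    return "".join(f"{tok}\t{tag}\n" for tok, tag in zip(tokens, tags))


tokens = ["Ada", "Lovelace", "visited", "London"]
tags = spans_to_iob(tokens, [(0, 2, "PERSON"), (3, 4, "LOCATION")])
print(to_tsv(tokens, tags), end="")
```

Every token receives exactly one tag, so the number of lines in the TSV equals the number of tokens.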
ZIP Format
For bulk uploads, you can also upload a ZIP file containing multiple documents in TXT, PDF, or HTML format. Ensure that the files are properly structured and that there are no unnecessary subfolders within the ZIP archive.
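One way to avoid subfolders is to build the archive with Python's zipfile module and store every file under its bare name. The sketch below assumes a hypothetical helper name, build_flat_zip, and assumes that only .txt, .pdf, and .html files should be included.

```python
import zipfile
from pathlib import Path

# Extensions to include, based on the formats listed above.
ALLOWED = {".txt", ".pdf", ".html"}


def build_flat_zip(source_dir, zip_path):
    """Zip all supported files under source_dir with no folder structure."""
    added = []
    seen = set()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(source_dir).rglob("*")):
            if not path.is_file() or path.suffix.lower() not in ALLOWED:
                continue
            # Flattening can make two files collide, so fail loudly instead
            # of silently storing duplicate entries.
            if path.name in seen:
                raise ValueError(f"duplicate file name: {path.name}")
            seen.add(path.name)
            # arcname=path.name stores the file at the archive root.
            zf.write(path, arcname=path.name)
            added.append(path.name)
    return added
```

Calling build_flat_zip("documents", "upload.zip") returns the list of file names placed at the root of the archive, which you can review before uploading.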