Sinequa natively offers a multitude of connectors to connect to virtually any enterprise data source, wherever it resides.
- Off-the-shelf connections to most enterprise sources and systems, with hundreds of prepackaged connectors
- New connectors added regularly to the standard catalog to better meet customers' expectations and support modern and emerging content management systems
- Connectors developed and maintained internally, ensuring full integration, optimized analysis, and ease of configuration
- Full compliance with the sources' native security model
- No code! Just fill in a few parameters that are specific to each connector type.
Complete toolkit to develop custom connectors for home-grown data sources.
- Connector toolkit for incorporating content from internally built and legacy systems
- Standard connectors (JSON, XML, Database, HTTP, File System) can be enhanced with plug-ins to address particular data sources
- Any SaaS product that exposes a web services API can be indexed in a matter of hours with the help of a custom connector
- Custom connector code is already available as open source (e.g., the Slack connector: https://github.com/sinequa/plugins/tree/master/SlackConnector)
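To illustrate the pattern a custom connector follows, the sketch below iterates over source records (here, a JSON payload such as a SaaS web services API might return) and maps each record to an indexable document. This is purely illustrative Python, not the actual Sinequa connector SDK; the class and field names are invented for the example.

```python
import json

class JsonApiConnector:
    """Hypothetical sketch of a custom connector: fetch records, emit documents."""

    def __init__(self, payload: str):
        # In a real connector this payload would come from the source's
        # web services API rather than a literal string.
        self.records = json.loads(payload)

    def fetch_documents(self):
        # Map each source record to a document with typed fields.
        for rec in self.records:
            yield {
                "id": rec["id"],
                "title": rec.get("title", ""),
                "text": rec.get("body", ""),
            }

payload = '[{"id": "1", "title": "Hello", "body": "World"}]'
docs = list(JsonApiConnector(payload).fetch_documents())
print(docs)  # [{'id': '1', 'title': 'Hello', 'text': 'World'}]
```

A real connector would add incremental-change detection and security mapping on top of this basic fetch-and-map loop.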
Supported document formats
Sinequa provides the ability to recognize and process more than 300 formats of structured and unstructured data.
- Native support for more than 300 document formats
- Text in any Unicode character set, including double-byte character sets such as Chinese and Japanese
- Markup languages, such as HTML, XML
- XMP-compatible formats, such as JPEG, TIFF, PSD, EPS
- Microsoft Office documents, such as Word, Excel, PowerPoint, and RTF
- Adobe PDF
- Open Office document formats
- Compressed archives (Zip, Rar, Tar, 7z, Pst, bz2, etc.) can be indexed as single containers or recursively as folders containing files and other archives
- Structured formats (JSON, CSV, SAS datasets, SAP IDOC, AutoCAD DWG, 3D parts, etc.)
- Image formats, like BMP, JPEG, PNG, GIF
- An extensive list of document converters to extract text from any document format
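The recursive-container behavior described for compressed archives can be sketched with Python's standard zipfile module: each archive is treated as a folder, and nested archives are descended into in turn. This is a minimal illustration of the idea, not Sinequa's implementation.

```python
import io
import zipfile

def list_archive(data: bytes, prefix: str = "") -> list:
    """Recursively list file paths in a zip, descending into nested zips."""
    paths = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            if name.endswith(".zip"):
                # Treat the nested archive as a folder and recurse into it.
                paths.extend(list_archive(zf.read(name), prefix + name + "/"))
            else:
                paths.append(prefix + name)
    return paths

# Build a nested archive in memory for demonstration.
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as zf:
    zf.writestr("note.txt", "inner file")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as zf:
    zf.writestr("readme.txt", "outer file")
    zf.writestr("inner.zip", inner.getvalue())

print(list_archive(outer.getvalue()))  # ['readme.txt', 'inner.zip/note.txt']
```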
The Sinequa platform also supports Optical Character Recognition (OCR). It reads the standard output formats generated by OCR applications, enabling you to index and search vast amounts of paper documents that were not natively created in electronic form.
Optimized or customized content ingestion, thanks to pre-parameterized templates and support for on-demand, scheduled, and trigger-based indexing.
- On-demand mode: triggered manually by the administrator, as needed
- Scheduled mode: automatic execution at pre-set intervals or according to a set calendar, using the built-in Sinequa scheduler or any third-party scheduler
- Trigger mode: execution triggered automatically by events (e.g., whenever a document is added to a particular location or after 1,000 records are added to a database)
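Trigger mode amounts to firing an indexing job when a configured event or threshold occurs. The sketch below shows the record-count example from above as hypothetical Python; the class and callback names are illustrative, not Sinequa's API.

```python
class RecordTrigger:
    """Hypothetical trigger: fire an indexing job every `threshold` new records."""

    def __init__(self, threshold: int, on_trigger):
        self.threshold = threshold
        self.on_trigger = on_trigger
        self.pending = 0

    def record_added(self):
        self.pending += 1
        if self.pending >= self.threshold:
            self.on_trigger()  # e.g., start an incremental indexing job
            self.pending = 0

runs = []
trigger = RecordTrigger(threshold=1000, on_trigger=lambda: runs.append("index"))
for _ in range(2500):
    trigger.record_added()
print(len(runs))  # 2 indexing runs fired (at 1,000 and 2,000 records)
```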
Indexing can be full or incremental:
- Complete indexing: the source is fully indexed (or re-indexed); used for initial indexing of a new data source or when a datastore is replaced rather than updated
- Incremental indexing: only new or updated data is indexed
- Real-time indexing: the data source itself indicates what must be indexed since the last indexing task, keeping the index closely synchronized with the content source
- Collection-cache mode: ability to re-index without accessing the data source, which is beneficial when new semantic extractors are set up to identify new concepts, named entities, or relations after NLP resources are updated
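The incremental mode above boils down to selecting only documents changed since the last run. A minimal sketch of that selection, using plain integers as stand-in modification timestamps (the function and field names are invented for illustration):

```python
def incremental_batch(source: dict, last_run: int) -> list:
    """Return the ids of documents modified after the last indexing run."""
    return sorted(doc_id for doc_id, mtime in source.items() if mtime > last_run)

# Toy source: document id -> last-modified timestamp.
source = {"a.pdf": 100, "b.docx": 250, "c.xlsx": 300}

print(incremental_batch(source, last_run=200))  # ['b.docx', 'c.xlsx']
print(incremental_batch(source, last_run=0))    # full re-index: all three ids
```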
Comprehensive document scanning and in-depth text analysis are foundational for a customizable indexing pipeline, including native integrations with SciBite, Azure Media Services, and others.
Deep text analysis at indexing time including:
- Recursive scan of multi-record documents (CSV, JSON, XML, PST files, compressed archives, etc.)
- Text conversion to HTML format from any native binary format
- NLP at indexing time (language detection, part-of-speech tagging, etc.)
- Multi-layer full-text indexing of the entire data content
- Standard named entity extractions
- Standard or tailored text mining
- Data mapping from data source metadata and extracted entities to typed index columns
- Multiple entry points via plug-ins to customize the indexing pipeline as needed
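The pipeline idea behind these bullets, each document passes through a chain of stages, with plug-in entry points between them, can be sketched as plain Python. The stages here (HTML tag stripping, a naive capitalized-word "entity" rule) are toy stand-ins for Sinequa's converters and NLP components, not the real ones.

```python
import re

def strip_html(doc: dict) -> dict:
    # Toy converter stage: reduce HTML to plain text.
    doc["text"] = re.sub(r"<[^>]+>", "", doc["html"])
    return doc

def extract_entities(doc: dict) -> dict:
    # Toy "named entity" stage: capitalized words of two or more letters.
    doc["entities"] = re.findall(r"\b[A-Z][a-z]+\b", doc["text"])
    return doc

def run_pipeline(doc: dict, stages: list) -> dict:
    # Each stage takes a document dict and returns it enriched; custom
    # plug-in stages could be spliced in anywhere in the list.
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_pipeline({"html": "<p>Alice met Bob in Paris.</p>"},
                   [strip_html, extract_entities])
print(doc["entities"])  # ['Alice', 'Bob', 'Paris']
```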