Datadog-TokuMX Integration

Overview

This check collects TokuMX metrics, including:

  • Opcounters
  • Replication lag
  • Cache table utilization and storage size

Setup

Installation

The TokuMX check is packaged with the Agent, so simply install the Agent on your TokuMX servers. If you need the newest version of the check, install the dd-check-tokumx package.

Configuration

Prepare TokuMX

  1. Install the Python MongoDB module (pymongo) on your TokuMX server using the following command:

    sudo pip install --upgrade "pymongo<3.0"
    
  2. You can verify that the module is installed using this command:

    python -c "import pymongo" 2>&1 | grep ImportError && \
    echo -e "\033[0;31mpymongo python module - Missing\033[0m" || \
    echo -e "\033[0;32mpymongo python module - OK\033[0m"
    
  3. Start the mongo shell. In it, create a read-only user for the Datadog Agent in the admin database:

    // Authenticate as the admin user.
    use admin
    db.auth("admin", "<YOUR_TOKUMX_ADMIN_PASSWORD>")
    // Add a read-only user for the Datadog Agent.
    db.addUser("datadog", "<UNIQUEPASSWORD>", true)
    
  4. Verify that you created the user with the following command (run it from a regular shell, not the mongo shell):

    python -c 'from pymongo import Connection; print Connection().admin.authenticate("datadog", "<UNIQUEPASSWORD>")' | \
    grep True && \
    echo -e "\033[0;32mdatadog user - OK\033[0m" || \
    echo -e "\033[0;31mdatadog user - Missing\033[0m"
    

For more details about creating and managing users in MongoDB, refer to the MongoDB documentation.
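
The module check in step 2 can also be done from Python itself, without relying on shell color codes. The following is a minimal sketch, assuming Python 3.4 or later; the helper name module_available is hypothetical and is not part of the Agent.

```python
import importlib.util


def module_available(name):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Mirror the wording of the shell check from step 2.
    status = "OK" if module_available("pymongo") else "Missing"
    print("pymongo python module - " + status)
```

Run it with the same Python interpreter that the Agent uses, since a module installed for one interpreter is not necessarily visible to another.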

Connect the Agent

Create a file tokumx.yaml in the Agent’s conf.d directory. See the sample tokumx.yaml for all available configuration options:

init_config:

instances:
  - server: mongodb://datadog:<UNIQUEPASSWORD>@localhost:27017

Restart the Agent to start sending TokuMX metrics to Datadog.
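
MongoDB-style connection URIs require reserved characters in the username and password (such as @, : and /) to be percent-encoded. If the password you chose contains such characters, the server value can be built as in the sketch below. It assumes Python 3, and build_server_uri is a hypothetical helper, not something the Agent provides.

```python
from urllib.parse import quote_plus


def build_server_uri(user, password, host="localhost", port=27017):
    """Build a mongodb:// URI with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}".format(
        quote_plus(user), quote_plus(password), host, port
    )


# Example with a made-up password containing reserved characters.
print(build_server_uri("datadog", "p@ss:w/rd"))
# mongodb://datadog:p%40ss%3Aw%2Frd@localhost:27017
```

Paste the printed value after server: in tokumx.yaml.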

Validation

Run the Agent’s info subcommand and look for tokumx under the Checks section:

  Checks
  ======
    [...]

    tokumx
    -------
      - instance #0 [OK]
      - Collected 26 metrics, 0 events & 1 service check

    [...]

Compatibility

The tokumx check is compatible with all major platforms.

Data Collected

Metrics

tokumx.asserts.msgps
(gauge)
The number of message assertions raised per second.
shown as assertion
tokumx.asserts.regularps
(gauge)
The number of regular assertions raised per second.
shown as assertion
tokumx.asserts.rolloversps
(gauge)
The number of times that the rollover counters roll over per second. The counters roll over to zero every 2^30 assertions.
shown as assertion
tokumx.asserts.userps
(gauge)
The number of user assertions raised per second.
shown as assertion
tokumx.asserts.warningps
(gauge)
The number of warnings raised per second.
shown as assertion
tokumx.connections.available
(gauge)
The number of unused available incoming connections the database can provide.
shown as connection
tokumx.connections.current
(gauge)
The number of connections to the database server from clients.
shown as connection
tokumx.cursors.timedOut
(gauge)
The total number of cursors that have timed out since the server process started.
shown as cursor
tokumx.cursors.totalOpen
(gauge)
The number of cursors that TokuMX is maintaining for clients.
shown as cursor
tokumx.ft.alerts.checkpointFailures
(gauge)
The number of checkpoints that have failed for any reason.
shown as event
tokumx.ft.alerts.locktreeRequestsPending
(gauge)
The number of requests for Document-level Locks in the locktree that are waiting for other requests to release their locks.
shown as request
tokumx.ft.alerts.longWaitEvents.cachePressure.countps
(gauge)
Rate at which a thread had to wait more than 1 second for evictions to create space in the cachetable for it to page in data it needed.
shown as event
tokumx.ft.alerts.longWaitEvents.cachePressure.timeps
(gauge)
Fraction of time (microseconds/second) that a thread had to wait more than 1 second for evictions to create space in the cachetable for it to page in data it needed.
shown as fraction
tokumx.ft.alerts.longWaitEvents.checkpointBegin.countps
(gauge)
Rate at which the begin checkpoint phase of checkpoint has run (these should be fairly quick).
shown as event
tokumx.ft.alerts.longWaitEvents.checkpointBegin.timeps
(gauge)
Fraction of time (microseconds/second) that a begin checkpoint phase has spent blocking other threads.
shown as fraction
tokumx.ft.alerts.longWaitEvents.fsync.countps
(gauge)
Rate at which fsync operations took more than 1 second.
shown as event
tokumx.ft.alerts.longWaitEvents.fsync.timeps
(gauge)
Fraction of time (microseconds/second) spent performing fsync operations that took longer than 1 second.
shown as fraction
tokumx.ft.alerts.longWaitEvents.locktreeWait.countps
(gauge)
Rate at which a thread had to wait more than 1 second to acquire a document-level lock in the locktree.
shown as event
tokumx.ft.alerts.longWaitEvents.locktreeWait.timeps
(gauge)
Fraction of time (microseconds/second) spent by threads waiting more than 1 second to acquire a document-level lock in the locktree.
shown as fraction
tokumx.ft.alerts.longWaitEvents.locktreeWaitEscalation.countps
(gauge)
Rate at which a thread had to wait more than 1 second to acquire a document-level lock because the locktree was at the memory limit and needed to run escalation.
shown as event
tokumx.ft.alerts.longWaitEvents.locktreeWaitEscalation.timeps
(gauge)
Fraction of time (microseconds/second) spent by threads waiting more than 1 second to acquire a document-level lock because the locktree was at the memory limit and needed to run escalation.
shown as fraction
tokumx.ft.alerts.longWaitEvents.logBufferWaitps
(gauge)
Rate at which a writing client had to wait more than 100ms for access to the log buffer.
shown as event
tokumx.ft.cachetable.evictions.full.leaf.clean.bytesps
(gauge)
Rate of full evictions of leaf nodes.
shown as byte
tokumx.ft.cachetable.evictions.full.leaf.clean.countps
(gauge)
Rate of full evictions of leaf nodes.
shown as event
tokumx.ft.cachetable.evictions.full.leaf.dirty.bytesps
(gauge)
Rate of full evictions of leaf nodes that need to be written back to disk.
shown as byte
tokumx.ft.cachetable.evictions.full.leaf.dirty.countps
(gauge)
Rate of full evictions of leaf nodes that need to be written back to disk.
shown as event
tokumx.ft.cachetable.evictions.full.leaf.dirty.timeps
(gauge)
Fraction of time (microseconds/second) spent performing full evictions of leaf nodes, including the time spent serializing, compressing, and writing those nodes to disk.
shown as fraction
tokumx.ft.cachetable.evictions.full.nonleaf.clean.bytesps
(gauge)
Rate of full evictions of nonleaf nodes.
shown as byte
tokumx.ft.cachetable.evictions.full.nonleaf.clean.countps
(gauge)
Rate of full evictions of nonleaf nodes.
shown as event
tokumx.ft.cachetable.evictions.full.nonleaf.dirty.bytesps
(gauge)
Rate of full evictions of nonleaf nodes that need to be written back to disk.
shown as byte
tokumx.ft.cachetable.evictions.full.nonleaf.dirty.countps
(gauge)
Rate of full evictions of nonleaf nodes that need to be written back to disk.
shown as event
tokumx.ft.cachetable.evictions.full.nonleaf.dirty.timeps
(gauge)
Fraction of time (microseconds/second) spent performing full evictions of nonleaf nodes, including the time spent serializing, compressing, and writing those nodes to disk.
shown as fraction
tokumx.ft.cachetable.evictions.partial.leaf.clean.bytesps
(gauge)
Rate of partial evictions of leaf nodes.
shown as byte
tokumx.ft.cachetable.evictions.partial.leaf.clean.countps
(gauge)
Rate of partial evictions of leaf nodes.
shown as event
tokumx.ft.cachetable.evictions.partial.nonleaf.clean.bytesps
(gauge)
Rate of partial evictions of nonleaf nodes.
shown as byte
tokumx.ft.cachetable.evictions.partial.nonleaf.clean.countps
(gauge)
Rate of partial evictions of nonleaf nodes.
shown as event
tokumx.ft.cachetable.miss.countps
(gauge)
Rate of internal cache misses. This metric is similar to MongoDB’s btree misses and page faults.
shown as miss
tokumx.ft.cachetable.miss.full.countps
(gauge)
Rate of full internal cache misses.
shown as miss
tokumx.ft.cachetable.miss.full.timeps
(gauge)
Fraction of time (microseconds/second) the database has had to wait for a disk read to complete for a full cache miss.
shown as fraction
tokumx.ft.cachetable.miss.partial.countps
(gauge)
Rate of partial internal cache misses.
shown as miss
tokumx.ft.cachetable.miss.partial.timeps
(gauge)
Fraction of time (microseconds/second) the database has had to wait for a disk read to complete for a partial cache miss.
shown as fraction
tokumx.ft.cachetable.miss.timeps
(gauge)
Fraction of time (microseconds/second) the database has had to wait for a disk read to complete for cache misses.
shown as fraction
tokumx.ft.cachetable.size.current
(gauge)
Total amount of uncompressed data currently in the database's internal cache.
shown as byte
tokumx.ft.cachetable.size.limit
(gauge)
Total amount of uncompressed data that will fit in TokuMX’s internal cache.
shown as byte
tokumx.ft.cachetable.size.writing
(gauge)
Total size of nodes that are currently queued up to be written to disk for eviction.
shown as byte
tokumx.ft.checkpoint.begin.timeps
(gauge)
Fraction of time (microseconds/second) that a begin checkpoint phase has spent blocking other threads.
shown as fraction
tokumx.ft.checkpoint.countps
(gauge)
Rate at which checkpoints are completed.
shown as event
tokumx.ft.checkpoint.lastComplete.time
(gauge)
The time spent, in seconds, by the most recently completed checkpoint.
shown as second
tokumx.ft.checkpoint.timeps
(gauge)
Fraction of time (seconds/second) spent doing checkpoints.
shown as fraction
tokumx.ft.checkpoint.write.leaf.bytes.compressedps
(gauge)
The rate at which leaf nodes are written to disk during checkpoints, after compression.
shown as byte
tokumx.ft.checkpoint.write.leaf.bytes.uncompressedps
(gauge)
The rate at which leaf nodes are written to disk during checkpoints, before compression.
shown as byte
tokumx.ft.checkpoint.write.leaf.countps
(gauge)
The rate at which leaf nodes are written to disk during checkpoints.
shown as write
tokumx.ft.checkpoint.write.leaf.timeps
(gauge)
The fraction of time spent writing leaf nodes to disk during checkpoints.
shown as fraction
tokumx.ft.checkpoint.write.nonleaf.bytes.compressedps
(gauge)
The rate at which nonleaf nodes are written to disk during checkpoints, after compression.
shown as byte
tokumx.ft.checkpoint.write.nonleaf.bytes.uncompressedps
(gauge)
The rate at which nonleaf nodes are written to disk during checkpoints, before compression.
shown as byte
tokumx.ft.checkpoint.write.nonleaf.countps
(gauge)
The rate at which nonleaf nodes are written to disk during checkpoints.
shown as write
tokumx.ft.checkpoint.write.nonleaf.timeps
(gauge)
The fraction of time spent writing nonleaf nodes to disk during checkpoints.
shown as fraction
tokumx.ft.compressionRatio.leaf
(gauge)
The size ratio of leaf nodes before and after compression.
shown as fraction
tokumx.ft.compressionRatio.nonleaf
(gauge)
The size ratio of nonleaf nodes before and after compression.
shown as fraction
tokumx.ft.compressionRatio.overall
(gauge)
The size ratio of nodes before and after compression.
shown as fraction
tokumx.ft.fsync.countps
(gauge)
The rate at which the database flushed the operating system’s file buffers to disk.
shown as operation
tokumx.ft.fsync.timeps
(gauge)
The fraction of time (microseconds/second) used to fsync to disk.
shown as fraction
tokumx.ft.locktree.size.current
(gauge)
Total memory the locktree is currently using.
shown as byte
tokumx.ft.locktree.size.limit
(gauge)
Maximum number of bytes that the locktree is allowed to use.
shown as byte
tokumx.ft.log.bytesps
(gauge)
The rate at which the logger writes to disk.
shown as byte
tokumx.ft.log.countps
(gauge)
The rate of individual log writes.
shown as write
tokumx.ft.log.timeps
(gauge)
The fraction of time spent performing log writes.
shown as fraction
tokumx.ft.serializeTime.leaf.compressps
(gauge)
Fraction of time spent compressing leaf nodes before writing them to disk (for checkpoint or when evicted while dirty).
shown as fraction
tokumx.ft.serializeTime.leaf.decompressps
(gauge)
Fraction of time spent decompressing leaf nodes after reading them off disk.
shown as fraction
tokumx.ft.serializeTime.leaf.deserializeps
(gauge)
Fraction of time spent deserializing leaf nodes and their partitions after reading them off disk.
shown as fraction
tokumx.ft.serializeTime.leaf.serializeps
(gauge)
Fraction of time spent serializing leaf nodes and their partitions before writing them to disk (for checkpoint or when evicted while dirty).
shown as fraction
tokumx.ft.serializeTime.nonleaf.compressps
(gauge)
Fraction of time spent compressing nonleaf nodes before writing them to disk (for checkpoint or when evicted while dirty).
shown as fraction
tokumx.ft.serializeTime.nonleaf.decompressps
(gauge)
Fraction of time spent decompressing nonleaf nodes after reading them off disk.
shown as fraction
tokumx.ft.serializeTime.nonleaf.deserializeps
(gauge)
Fraction of time spent deserializing nonleaf nodes and their partitions after reading them off disk.
shown as fraction
tokumx.ft.serializeTime.nonleaf.serializeps
(gauge)
Fraction of time spent serializing nonleaf nodes and their partitions before writing them to disk (for checkpoint or when evicted while dirty).
shown as fraction
tokumx.mem.resident
(gauge)
The amount of memory currently used by the database process.
shown as mebibyte
tokumx.mem.virtual
(gauge)
The amount of virtual memory used by the database process.
shown as mebibyte
tokumx.metrics.document.deletedps
(gauge)
The number of documents deleted per second.
shown as document
tokumx.metrics.document.insertedps
(gauge)
The number of documents inserted per second.
shown as document
tokumx.metrics.document.returnedps
(gauge)
The number of documents returned by queries per second.
shown as document
tokumx.metrics.document.updatedps
(gauge)
The number of documents updated per second.
shown as document
tokumx.metrics.getLastError.wtime.numps
(gauge)
The number of getLastError operations per second with a specified write concern (i.e. w) that wait for one or more members of a replica set to acknowledge the write operation.
shown as operation
tokumx.metrics.getLastError.wtime.totalMillisps
(gauge)
The fraction of time (ms/s) spent performing getLastError operations with write concern (i.e. w) that wait for one or more members of a replica set to acknowledge the write operation.
shown as fraction
tokumx.metrics.getLastError.wtimeoutsps
(gauge)
The number of times per second that write concern operations have timed out as a result of the wtimeout threshold to getLastError.
shown as event
tokumx.metrics.operation.idhackps
(gauge)
The rate of queries that contain the _id field.
shown as query
tokumx.metrics.operation.scanAndOrderps
(gauge)
The rate of queries that return sorted results but cannot perform the sort operation using an index.
shown as query
tokumx.metrics.queryExecutor.scannedps
(gauge)
The rate of index items scanned during queries and query-plan evaluation.
shown as operation
tokumx.metrics.repl.apply.batches.numps
(gauge)
The number of batches applied across all databases per second.
shown as operation
tokumx.metrics.repl.apply.batches.totalMillisps
(gauge)
The fraction of time (ms/s) spent applying operations from the oplog.
shown as fraction
tokumx.metrics.repl.apply.opsps
(gauge)
The rate of oplog operations.
shown as operation
tokumx.metrics.repl.buffer.count
(gauge)
The number of operations in the oplog buffer.
shown as operation
tokumx.metrics.repl.buffer.sizeBytes
(gauge)
The current size of the contents of the oplog buffer.
shown as byte
tokumx.metrics.repl.network.bytesps
(gauge)
The rate at which data is read from the replication sync source.
shown as byte
tokumx.metrics.repl.network.getmores.numps
(gauge)
The rate of getmore operations.
shown as operation
tokumx.metrics.repl.network.getmores.totalMillisps
(gauge)
The fraction of time (ms/s) spent collecting data from getmore operations.
shown as fraction
tokumx.metrics.repl.network.opsps
(gauge)
The rate of operations read from the replication source.
shown as operation
tokumx.metrics.repl.network.readersCreatedps
(gauge)
The rate at which oplog query processes are created.
shown as process
tokumx.metrics.repl.oplog.insert.numps
(gauge)
The rate at which operations are inserted into the oplog.
shown as operation
tokumx.metrics.repl.oplog.insert.totalMillisps
(gauge)
The fraction of time (ms/s) spent inserting operations into the oplog.
shown as fraction
tokumx.metrics.repl.oplog.insertBytesps
(gauge)
The rate (in bytes) at which data is inserted into the oplog.
shown as byte
tokumx.metrics.ttl.deletedDocumentsps
(gauge)
The rate at which documents are deleted from collections with a ttl index.
shown as document
tokumx.metrics.ttl.passesps
(gauge)
The number of times per second the background process removes documents from collections with a ttl index.
shown as event
tokumx.opcounters.commandps
(gauge)
The total number of commands per second issued to the database.
shown as command
tokumx.opcounters.deleteps
(gauge)
The number of delete operations per second.
shown as operation
tokumx.opcounters.getmoreps
(gauge)
The number of getmore operations per second.
shown as operation
tokumx.opcounters.insertps
(gauge)
The number of insert operations per second.
shown as operation
tokumx.opcounters.queryps
(gauge)
The total number of queries per second.
shown as query
tokumx.opcounters.updateps
(gauge)
The number of update operations per second.
shown as operation
tokumx.opcountersRepl.commandps
(gauge)
The total number of replicated commands issued to the database per second.
shown as command
tokumx.opcountersRepl.deleteps
(gauge)
The number of replicated delete operations per second.
shown as operation
tokumx.opcountersRepl.getmoreps
(gauge)
The number of replicated getmore operations per second.
shown as operation
tokumx.opcountersRepl.insertps
(gauge)
The number of replicated insert operations per second.
shown as operation
tokumx.opcountersRepl.queryps
(gauge)
The total number of replicated queries per second.
shown as query
tokumx.opcountersRepl.updateps
(gauge)
The number of replicated update operations per second.
shown as operation
tokumx.stats.coll.count
(gauge)
The number of objects or documents in this collection.
shown as document
tokumx.stats.coll.nindexes
(gauge)
The number of indexes on this collection.
shown as index
tokumx.stats.coll.nindexesbeingbuilt
(gauge)
The number of indexes currently being built.
shown as index
tokumx.stats.coll.size
(gauge)
The total size in memory of all records in a collection. Does not include the record header, but does include the record’s padding. Does not include the size of any indexes associated with the collection.
shown as byte
tokumx.stats.coll.storageSize
(gauge)
The total amount of storage allocated to this collection for document storage.
shown as byte
tokumx.stats.coll.totalIndexSize
(gauge)
The total size of all indexes on this collection.
shown as byte
tokumx.stats.coll.totalIndexStorageSize
(gauge)
The total size on disk of all indexes on this collection (after compression).
shown as byte
tokumx.stats.dataSize
(gauge)
The total size of the data held in this database including the padding factor.
shown as byte
tokumx.stats.db.avgObjSize
(gauge)
The average size of each document.
shown as byte
tokumx.stats.db.collections
(gauge)
The number of collections in the database.
tokumx.stats.db.dataSize
(gauge)
The total size of the data held in this database including the padding factor.
shown as byte
tokumx.stats.db.indexes
(gauge)
The total number of indexes across all collections in the database.
shown as index
tokumx.stats.db.indexSize
(gauge)
The total size of all indexes created on this database.
shown as byte
tokumx.stats.db.indexStorageSize
(gauge)
The total size on disk of all indexes created on this database (after compression).
shown as byte
tokumx.stats.db.objects
(gauge)
The number of documents in the database across all collections.
shown as document
tokumx.stats.db.storageSize
(gauge)
The total amount of space allocated to collections in this database for document storage.
shown as byte
tokumx.stats.idx.avgObjSize
(gauge)
The average size of each index entry.
shown as byte
tokumx.stats.idx.count
(gauge)
The number of documents in this index.
shown as index
tokumx.stats.idx.deletes
(gauge)
The number of delete operations performed on this index.
shown as operation
tokumx.stats.idx.inserts
(gauge)
The number of insert operations performed on this index.
shown as operation
tokumx.stats.idx.nscanned
(gauge)
The number of index entries scanned for queries using this index.
shown as index
tokumx.stats.idx.nscannedObjects
(gauge)
The number of collection objects examined after scanning an index entry for a query using this index.
shown as object
tokumx.stats.idx.queries
(gauge)
The number of query operations performed using this index.
shown as query
tokumx.stats.idx.size
(gauge)
The total size of this index.
shown as byte
tokumx.stats.idx.storageSize
(gauge)
The total size on disk of this index (after compression).
shown as byte
tokumx.stats.indexes
(gauge)
The total number of indexes across all collections in the database.
shown as index
tokumx.stats.indexSize
(gauge)
The total size of all indexes created on this database.
shown as byte
tokumx.stats.objects
(gauge)
The number of documents in the database across all collections.
shown as document
tokumx.stats.storageSize
(gauge)
The total amount of space allocated to collections in this database for document storage.
shown as byte
tokumx.uptime
(gauge)
The time that the TokuMX process has been active.
shown as second
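
Many of the timeps metrics above are reported as microseconds of work per second of wall-clock time, and the totalMillisps metrics as milliseconds per second. Dividing by the number of units in one second gives a dimensionless fraction, which is often easier to read as a percentage. A small sketch in Python; the function name and the sample values are illustrative only.

```python
UNITS_PER_SECOND = {"us": 1_000_000, "ms": 1_000, "s": 1}


def busy_percent(value, unit="us"):
    """Convert a time-per-second rate into a percentage of wall-clock time."""
    return 100.0 * value / UNITS_PER_SECOND[unit]


print(busy_percent(250000))          # 250,000 us/s -> 25.0
print(busy_percent(40, unit="ms"))   # 40 ms/s -> 4.0
```

For example, a tokumx.ft.fsync.timeps value of 250000 would mean roughly a quarter of each second is spent in fsync calls.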

Events

Replication state changes:

This check emits an event each time a TokuMX node has a change in its replication state.
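
Conceptually, detecting such a change means comparing the state seen on the previous run with the current one. The sketch below is not the check's actual code. The state codes are the standard MongoDB replica set member states (only a subset is listed), and the function name and event fields are hypothetical.

```python
# Subset of MongoDB replica set member state codes (assumed to apply to TokuMX).
REPLSET_STATES = {
    1: "PRIMARY",
    2: "SECONDARY",
    3: "RECOVERING",
    7: "ARBITER",
    8: "DOWN",
    9: "ROLLBACK",
}


def state_change_event(node, previous, current):
    """Return an event dict if the node's state changed, otherwise None."""
    # No event on the first observation or when nothing changed.
    if previous is None or previous == current:
        return None
    old = REPLSET_STATES.get(previous, "UNKNOWN")
    new = REPLSET_STATES.get(current, "UNKNOWN")
    return {
        "title": "{} changed state: {} -> {}".format(node, old, new),
        "previous": old,
        "current": new,
    }


print(state_change_event("db1", 2, 1)["title"])
# db1 changed state: SECONDARY -> PRIMARY
```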

Service Checks

tokumx.can_connect:

Returns CRITICAL if the Agent cannot connect to TokuMX to collect metrics, otherwise OK.
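
To reproduce the basic idea outside the Agent, a plain TCP probe is often enough for a first diagnosis. This sketch only tests that the port accepts connections and does not authenticate, so it is a simplification of what the check does. The function name is hypothetical, and 0 and 2 are the Datadog service check status codes for OK and CRITICAL.

```python
import socket

OK, CRITICAL = 0, 2  # Datadog service check status codes


def can_connect(host="localhost", port=27017, timeout=2.0):
    """Return OK if a TCP connection to host:port succeeds, otherwise CRITICAL."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return CRITICAL
```

Calling can_connect() on the TokuMX host with the port from tokumx.yaml should return 0 when the server is listening.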

Troubleshooting

Need help? Contact Datadog Support.

Further Reading