Troubleshooting Discovery

This guide helps you diagnose and resolve common discovery issues in Tripl-i. Use the systematic approach and tools provided to quickly identify and fix problems that may arise during infrastructure discovery.

Diagnostic Framework

Troubleshooting Workflow

Discovery Health Check

The Tripl-i platform provides built-in health check capabilities to verify your discovery system is functioning properly:

  1. Discovery Services Status

    • Check if discovery agents are running
    • Verify service connectivity
    • Monitor background processes
  2. Network Connectivity

    • Test API endpoint accessibility
    • Verify network routes to target systems
    • Check firewall configurations
  3. Credential Vault

    • Test stored credentials
    • Verify credential access
    • Check for expired credentials
  4. Recent Discovery Activity

    • Review discovery run history
    • Check success/failure rates
    • Identify patterns in issues
  5. Error Analysis

    • Review recent error logs
    • Identify recurring problems
    • Track resolution progress
  6. Resource Usage

    • Monitor CPU and memory utilization
    • Check disk space availability
    • Track network bandwidth usage

Common Issues

No Discovery Data

Symptoms

  • CIs not appearing in CMDB
  • Discovery shows "No devices found"
  • Empty discovery results

Diagnostic Steps

1. Network Connectivity:
Test: ping target_device
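Test: telnet target_device 22 (SSH) or telnet target_device 135 (WMI/RPC)
Test: SNMP uses UDP 161, which telnet cannot probe; query it with an SNMP client such as snmpget instead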
Check: Firewall rules
Check: Network ACLs

2. Discovery Service:
Check: Service status
Check: Worker processes
Check: Queue backlog
Review: Service logs

3. Target Availability:
Verify: Device is powered on
Verify: Services are running
Check: Local firewall
Check: SELinux/AppArmor

4. Discovery Scope:
Review: IP ranges
Review: Exclusion rules
Check: Discovery filters
Verify: Schedule active
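
The TCP part of the network connectivity step can also be scripted when many targets need checking. The following is a minimal sketch using only the Python standard library; the function names and the port table are illustrative and are not part of any Tripl-i tooling. SNMP is left out because it runs over UDP and a TCP connect test says nothing about it.

```python
import socket

# TCP ports named in the diagnostic steps above. SNMP (161) is UDP and is
# intentionally not covered by a TCP connect test.
DISCOVERY_TCP_PORTS = {22: "SSH", 135: "WMI/RPC"}


def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_target(host: str, timeout: float = 3.0) -> dict:
    """Map each discovery protocol name to its reachability for one host."""
    return {
        name: check_tcp_port(host, port, timeout)
        for port, name in DISCOVERY_TCP_PORTS.items()
    }
```

Calling `check_target("10.1.1.50")` returns a dictionary such as `{"SSH": ..., "WMI/RPC": ...}` with one boolean per protocol. A `False` result only shows that the connection failed from the machine running the script, so run it from the discovery server itself.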

Common Solutions

Firewall Configuration
# Windows - Allow WMI
netsh advfirewall firewall add rule name="WMI-In" dir=in action=allow protocol=TCP localport=135
netsh advfirewall firewall add rule name="WMI-Async-In" dir=in action=allow protocol=TCP localport=49152-65535

# Linux - Allow SSH
sudo ufw allow from 10.0.0.0/8 to any port 22
sudo iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 22 -j ACCEPT

# Network Device (Cisco IOS-style syntax) - Allow SNMP
# "public" is a placeholder; use a site-specific community string
access-list 100 permit udp host 10.1.1.100 any eq 161
snmp-server community public RO 100

Service Configuration
# Windows - Enable WMI
sc config winmgmt start= auto
net start winmgmt

# Enable Remote Registry
sc config RemoteRegistry start= auto
net start RemoteRegistry

# Linux - Configure SSH
sudo systemctl enable sshd
sudo systemctl start sshd

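# Configure sudo for discovery, then restrict and validate the new file
echo "discovery ALL=(ALL) NOPASSWD: /usr/bin/dmidecode, /bin/netstat, /sbin/ip" | sudo tee /etc/sudoers.d/discovery
sudo chmod 0440 /etc/sudoers.d/discovery
sudo visudo -cf /etc/sudoers.d/discovery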

Authentication Failures

Symptoms

  • "Access denied" errors
  • "Invalid credentials" messages
  • Partial discovery with auth errors

Diagnostic Tests

# Test Windows credentials
$cred = Get-Credential
Test-WSMan -ComputerName target-server -Credential $cred -Authentication Negotiate

# Test WMI access
# Get-WmiObject requires Windows PowerShell 5.1; PowerShell 7+ replaces it with Get-CimInstance
Get-WmiObject -Class Win32_OperatingSystem -ComputerName target-server -Credential $cred

# Test specific permissions
Get-WmiObject -Class Win32_Process -ComputerName target-server -Credential $cred

# Test Linux SSH
ssh -o PasswordAuthentication=yes discovery@target-host 'echo "Connection successful"'

# Test sudo permissions
ssh discovery@target-host 'sudo -l'

# Test specific commands
ssh discovery@target-host 'sudo dmidecode -t system'

Permission Requirements

Windows Requirements:
Local Groups:
- Performance Monitor Users
- Event Log Readers
- Distributed COM Users

User Rights:
- Log on as a service
- Access this computer from network

DCOM Permissions:
- Local Launch
- Remote Launch
- Local Activation
- Remote Activation

Linux Requirements:
SSH Access: Required
Sudo Commands:
- /usr/bin/dmidecode
- /bin/netstat or /sbin/ss
- /sbin/ip or /sbin/ifconfig
- /usr/bin/lsof (optional)
- /bin/ps

File Access:
- /proc/* (read)
- /sys/* (read)
- /etc/os-release (read)
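
The Linux sudo list above can be compared against the text printed by `sudo -l` on a target. The sketch below is a minimal, assumption-laden parser: it only looks for absolute paths in the output, it does not understand a blanket `ALL` grant, and the function names are hypothetical. Each group lists alternatives (for example netstat or ss), and one match per group is treated as sufficient. The optional lsof entry is omitted.

```python
import re

# Each tuple is a group of alternatives taken from the requirements above;
# one allowed command per group is enough. lsof is optional and left out.
REQUIRED_COMMAND_GROUPS = [
    ("/usr/bin/dmidecode",),
    ("/bin/netstat", "/sbin/ss"),
    ("/sbin/ip", "/sbin/ifconfig"),
    ("/bin/ps",),
]


def allowed_commands(sudo_l_output: str) -> set:
    """Extract absolute paths from the text printed by `sudo -l`."""
    return set(re.findall(r"/[\w./+-]+", sudo_l_output))


def missing_command_groups(sudo_l_output: str,
                           groups=REQUIRED_COMMAND_GROUPS) -> list:
    """Return the groups for which no alternative appears in the output."""
    allowed = allowed_commands(sudo_l_output)
    return [group for group in groups if not allowed.intersection(group)]
```

Feed it the output of `ssh discovery@target-host 'sudo -l'`; an empty list means every required group has at least one permitted command. Paths differ between distributions, so adjust the groups to match where the binaries actually live on your targets.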

Incomplete Discovery

Symptoms

  • Missing software inventory
  • Partial hardware information
  • No relationship data
  • Incomplete attributes

Root Cause Analysis

Check Collection Modules:
1. Agent Configuration:
- Verify enabled collectors
- Check module errors
- Review timeout settings

2. Data Collection:
- Process discovery enabled?
- Software scanning active?
- Network connections tracked?

3. Processing Pipeline:
- Normalization errors?
- Pattern matching failures?
- Enrichment timeouts?

Module-Specific Fixes

Software Discovery Issues
# Windows - Registry access
# Check if remote registry is enabled
sc \\target-server query RemoteRegistry

# Linux - Package manager access
# Verify package database readable
ssh discovery@target "rpm -qa | head -5"
ssh discovery@target "dpkg -l | head -5"

# Fix: Add discovery user to required groups (run on the target host)
sudo usermod -a -G rpm discovery # For RPM-based systems

Network Connection Discovery
# Enable network discovery
# Windows
netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=Yes

# Linux - ensure ss/netstat available
# Check if commands exist
which ss netstat lsof

# Install if missing
sudo yum install -y iproute # For ss
sudo apt install -y net-tools # For netstat

Performance Issues

Symptoms

  • Slow discovery completion
  • High CPU/memory usage
  • Network congestion
  • Timeout errors

Performance Diagnostics

-- Analyze discovery performance
WITH discovery_stats AS (
    SELECT
        discovery_method,
        target_type,
        AVG(duration_seconds) AS avg_duration,
        MAX(duration_seconds) AS max_duration,
        COUNT(*) AS total_discoveries,
        SUM(CASE WHEN status = 'timeout' THEN 1 ELSE 0 END) AS timeouts
    FROM discovery_runs
    WHERE created_at >= NOW() - INTERVAL '24 hours'
    GROUP BY discovery_method, target_type
)
SELECT * FROM discovery_stats
ORDER BY avg_duration DESC;

Performance Tuning

Optimization Strategies:
1. Parallel Processing:
Default: 10 concurrent
High Performance: 50 concurrent
Conservative: 5 concurrent

2. Timeout Adjustments:
Network Devices: 30s → 60s
Busy Servers: 60s → 120s
Slow Links: 30s → 90s

3. Discovery Scope:
- Reduce frequency for stable devices
- Use incremental for frequent scans
- Limit deep discovery to off-hours

4. Resource Limits:
CPU: Max 70% utilization
Memory: Max 4GB per worker
Network: Max 100Mbps total
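
To illustrate what the concurrency settings above control, here is a minimal sketch of bounded parallel discovery using Python's standard thread pool. It is not how Tripl-i schedules work internally; `run_discovery` and the `discover_one` callable are hypothetical names, and only the three worker counts come from the list above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Worker counts taken from the "Parallel Processing" settings above.
CONCURRENCY = {"conservative": 5, "default": 10, "high_performance": 50}


def run_discovery(targets, discover_one, profile="default"):
    """Run discover_one(target) for every target with bounded parallelism.

    discover_one is a caller-supplied function (hypothetical here).
    Returns (results, errors), both dictionaries keyed by target.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=CONCURRENCY[profile]) as pool:
        futures = {pool.submit(discover_one, t): t for t in targets}
        for future in as_completed(futures):
            target = futures[future]
            try:
                results[target] = future.result()
            except Exception as exc:
                # One failed target must not abort the rest of the batch.
                errors[target] = exc
    return results, errors
```

Per-target timeouts such as the 60s or 120s values above would be enforced inside `discover_one` (for example as a socket or SSH timeout), since a thread pool cannot forcibly stop a running task. Raising the worker count shortens total run time only until CPU, memory, or bandwidth limits are reached.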

Data Quality Issues

Symptoms

  • Duplicate CIs created
  • Incorrect classifications
  • Missing relationships
  • Stale data

Data Validation

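# CI deduplication check (MongoDB aggregation; db is a pymongo Database handle)
def check_duplicates(db):
    # Group CIs by name and serial number and keep groups with more than one member.
    # Note: CIs without a serialNumber are grouped by name alone.
    duplicates = db.cis.aggregate([
        {"$group": {
            "_id": {"name": "$name", "serial": "$serialNumber"},
            "count": {"$sum": 1},
            "ids": {"$push": "$_id"},
        }},
        {"$match": {"count": {"$gt": 1}}},
    ])

    for dup in duplicates:
        print(f"Duplicate found: {dup['_id']} ({dup['count']} instances)")
        # Merge or remove duplicates; merge_duplicate_cis is a helper you
        # must provide, and it should be reviewed before running on live data
        merge_duplicate_cis(dup["ids"])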

Data Cleanup Procedures

Cleanup Tasks:
1. Remove Orphaned CIs:
- No discovery update > 30 days
- No relationships
- Status = "Unknown"

2. Fix Misclassified Items:
- Re-run AI classification
- Apply pattern matching
- Manual review flagged items

3. Rebuild Relationships:
- Clear stale connections
- Re-discover network topology
- Validate service dependencies
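
The orphan criteria in task 1 can be expressed as a small filter. This sketch assumes that all three conditions must hold before a CI is treated as orphaned, which is the conservative reading of the list; the field names (`last_discovered`, `relationships`, `status`) are assumptions about the CI record layout rather than a documented schema.

```python
from datetime import datetime, timedelta, timezone

ORPHAN_AGE = timedelta(days=30)


def is_orphaned(ci: dict, now: datetime) -> bool:
    """True when a CI meets all three orphan criteria listed above."""
    stale = now - ci["last_discovered"] > ORPHAN_AGE
    unlinked = not ci.get("relationships")
    unknown = ci.get("status") == "Unknown"
    return stale and unlinked and unknown


def find_orphans(cis, now=None) -> list:
    """Return the CIs that qualify for cleanup, without modifying anything."""
    now = now or datetime.now(timezone.utc)
    return [ci for ci in cis if is_orphaned(ci, now)]
```

Listing candidates first and deleting in a separate, reviewed step keeps the cleanup reversible. The timestamps are expected to be timezone-aware so that the subtraction is unambiguous.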

Advanced Diagnostics

Debug Mode

# Enable debug logging for specific target
nopesight discovery debug --target 10.1.1.50 --verbose

# Debug output example:
[DEBUG] Starting discovery for 10.1.1.50
[DEBUG] Using credential: windows_domain_cred
[DEBUG] Attempting WMI connection...
[DEBUG] WMI connection established
[DEBUG] Querying Win32_ComputerSystem...
[DEBUG] Result: {Name: "SERVER01", Domain: "CORP.LOCAL", ...}
[DEBUG] Querying Win32_OperatingSystem...
[ERROR] Access denied to Win32_Process class
[DEBUG] Falling back to limited discovery mode

Network Packet Analysis

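# Capture discovery traffic (SSH 22, RPC/WMI 135, SMB 445, SNMP 161)
sudo tcpdump -i eth0 -w discovery.pcap \
    'host 10.1.1.50 and (port 22 or port 135 or port 445 or port 161)'

# Analyze WMI traffic
sudo tcpdump -nn -r discovery.pcap 'port 135' | head -20

# Check for SNMP timeouts. tcpdump does not print timeout messages, so
# compare request and response counts; unanswered requests indicate timeouts
sudo tcpdump -nn -r discovery.pcap 'udp port 161' | \
    grep -cE 'GetRequest|GetNextRequest|GetBulk'
sudo tcpdump -nn -r discovery.pcap 'udp port 161' | grep -c 'GetResponse'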

Discovery Agent Diagnostics

# Agent health check
nopesight-agent diagnose

# Output:
=== Tripl-i Agent Diagnostics ===
Version: 3.2.1
Status: Running
Uptime: 5d 14h 23m

Configuration:
Server: https://nopesight.company.com ✓
API Key: ****1234 ✓
Department: IT ✓

Connectivity:
Server reachable: ✓
Last check-in: 2 minutes ago ✓
SSL Certificate: Valid ✓

Collectors:
System Info: ✓ Enabled
Software: ✓ Enabled
Network: ✗ Error: Permission denied on /proc/net/tcp
Processes: ✓ Enabled

Recent Errors:
[2024-01-15 10:23:45] NetworkCollector: Cannot read /proc/net/tcp
[2024-01-15 09:15:32] SoftwareCollector: dpkg timeout

Recommendations:
1. Add agent user to 'proc' group for network collection
2. Increase software scan timeout to 60s

Troubleshooting Tools

Built-in Diagnostics

Web UI Tools:
Discovery Test:
- Target specific device
- Test specific credential
- Use specific method
- View real-time logs

Credential Tester:
- Validate credentials
- Check permissions
- Test connectivity
- Show capabilities

Pattern Debugger:
- Test pattern matching
- View match details
- Debug regex
- Validate logic

Command Line Tools

# Discovery CLI toolkit

# Test specific discovery method
nopesight discover test \
--method wmi \
--target 10.1.1.50 \
--credential prod_windows \
--debug

# Validate discovery scope
nopesight discover validate-scope \
--ranges "10.0.0.0/16,192.168.0.0/24" \
--show-conflicts

# Analyze discovery queue
nopesight queue status --queue discovery
nopesight queue peek discovery --count 10

# Force discovery retry
nopesight discover retry --failed --last 1h

Log Analysis

# Common log locations
/var/log/nopesight/discovery.log # Main discovery log
/var/log/nopesight/agent.log # Agent logs
/var/log/nopesight/scheduler.log # Scheduler logs
/var/log/nopesight/error.log # Error aggregation

# Useful grep patterns (run from /var/log/nopesight, or pass the full log path)
# Find authentication failures
grep -i "auth\|denied\|permission" discovery.log

# Find timeout issues
grep -i "timeout\|timed out" discovery.log

# Find network errors
grep -i "unreachable\|refused\|network" discovery.log

# Find pattern matching issues
grep -i "pattern\|match\|regex" discovery.log

# Performance issues
grep -i "slow\|performance\|exceeded" discovery.log
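
When the same searches are run repeatedly, a short script can apply all five patterns in one pass and report a count per category. This is a minimal sketch that reuses the keywords from the grep patterns above; the category names and function names are illustrative.

```python
import re
from collections import Counter

# Same keywords as the grep patterns above, matched case-insensitively.
PATTERNS = {
    "auth": re.compile(r"auth|denied|permission", re.I),
    "timeout": re.compile(r"timeout|timed out", re.I),
    "network": re.compile(r"unreachable|refused|network", re.I),
    "pattern": re.compile(r"pattern|match|regex", re.I),
    "performance": re.compile(r"slow|performance|exceeded", re.I),
}


def categorize(lines) -> Counter:
    """Count matching lines per category; a line may count in several."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def categorize_file(path: str) -> Counter:
    """Apply categorize() to a log file, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return categorize(handle)
```

For example, `categorize_file("/var/log/nopesight/discovery.log")` returns a `Counter` keyed by category. The keywords are broad (for instance "match" and "network" appear in many benign messages), so treat the counts as a starting point for where to look rather than as an error total.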

Recovery Procedures

Emergency Recovery

When Discovery Completely Fails:
1. Stop all discovery:
systemctl stop nopesight-discovery
nopesight discovery pause --all

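2. Clear stuck jobs:
nopesight queue clear discovery --stuck
# Redis DEL does not expand wildcards, so list the matching keys with --scan first.
# This removes all queue data under the prefix, not only stuck jobs.
redis-cli --scan --pattern 'bull:discovery:*' | xargs -r redis-cli DEL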

3. Reset discovery state:
nopesight discovery reset --confirm

4. Restart services:
systemctl start nopesight-discovery
systemctl restart nopesight-scheduler

5. Test with single target:
nopesight discover now --target 10.1.1.1

6. Resume normal operations:
nopesight discovery resume --all

Data Recovery

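-- Restore CIs from discovery history (PostgreSQL)
-- Run inside a transaction and check the row count before committing
INSERT INTO cis (name, type, attributes, last_discovered)
SELECT DISTINCT ON (h.raw_data->>'hostname')
    h.raw_data->>'hostname' AS name,
    h.raw_data->>'device_type' AS type,
    h.raw_data->'attributes' AS attributes,
    h.discovered_at AS last_discovered
FROM discovery_history h
WHERE h.discovered_at >= '2024-01-14'
  AND h.status = 'success'
  AND h.raw_data->>'hostname' IS NOT NULL
  -- NOT EXISTS avoids the NULL pitfall of NOT IN
  AND NOT EXISTS (
      SELECT 1 FROM cis c
      WHERE c.tenant_id = 'IT'
        AND c.name = h.raw_data->>'hostname'
  )
-- Keep only the most recent successful record per hostname
ORDER BY h.raw_data->>'hostname', h.discovered_at DESC;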

Prevention Strategies

Monitoring Setup

Proactive Monitoring:
Metrics to Track:
- Discovery success rate < 95%
- Average duration increasing
- Timeout rate > 5%
- Queue depth > 1000
- Error rate > 2%

Alerts:
- Discovery failures > 10 in 5 min
- No discoveries in expected window
- Credential failures spike
- Resource exhaustion warning

Dashboards:
- Real-time discovery status
- Success rate trends
- Performance metrics
- Error categorization
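
The numeric thresholds under "Metrics to Track" translate directly into a check that a monitoring job could run on each sample. This is a minimal sketch; the metric key names are illustrative and rates are assumed to be percentages between 0 and 100.

```python
def evaluate_metrics(metrics: dict) -> list:
    """Return the names of metrics that breach the thresholds listed above.

    Metrics missing from the input are skipped rather than treated as breaches.
    """
    checks = {
        "success_rate": lambda v: v < 95,    # success rate below 95%
        "timeout_rate": lambda v: v > 5,     # timeout rate above 5%
        "queue_depth": lambda v: v > 1000,   # more than 1000 queued jobs
        "error_rate": lambda v: v > 2,       # error rate above 2%
    }
    return [
        name
        for name, breached in checks.items()
        if name in metrics and breached(metrics[name])
    ]
```

A call such as `evaluate_metrics({"success_rate": 90.0, "queue_depth": 1500})` returns the breached metric names in the order the checks are defined. The "average duration increasing" metric is a trend rather than a fixed threshold, so it needs a comparison against earlier samples and is not covered here.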

Best Practices

  1. Regular Maintenance

    • Weekly credential validation
    • Monthly discovery audit
    • Quarterly pattern review
    • Annual architecture review
  2. Documentation

    • Document all custom patterns
    • Maintain troubleshooting runbook
    • Record common solutions
    • Update network diagrams
  3. Testing

    • Test credentials before production
    • Validate patterns in dev
    • Load test discovery system
    • Practice recovery procedures

Getting Help

Support Resources

Internal Resources:
- Discovery team Slack: #discovery-help
- Wiki: https://wiki.company.com/nopesight
- Runbooks: https://runbooks.company.com

Tripl-i Support:
- Email: support@nopesight.com
- Portal: https://support.nopesight.com
- Phone: 1-800-NOPESIGHT

Community:
- Forums: https://community.nopesight.com
- GitHub: https://github.com/nopesight/patterns
- Slack: nopesight-users.slack.com

Diagnostic Package

#!/bin/bash
# Create diagnostic package for support

DIAG_DIR="/tmp/nopesight-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$DIAG_DIR" || exit 1

# Collect system info
nopesight system info > "$DIAG_DIR/system-info.txt"
nopesight discovery status > "$DIAG_DIR/discovery-status.txt"

# Collect recent logs (last 10,000 lines of each log file)
tail -n 10000 /var/log/nopesight/*.log > "$DIAG_DIR/recent-logs.txt"

# Collect configuration (sanitized); review it for secrets before sending
nopesight config export --sanitize > "$DIAG_DIR/config.yaml"

# Create archive
tar -czf "$DIAG_DIR.tar.gz" -C /tmp "$(basename "$DIAG_DIR")"
echo "Diagnostic package created: $DIAG_DIR.tar.gz"

Next Steps