Troubleshooting Discovery

This guide helps you diagnose and resolve common discovery issues in Tripl-i. Use the systematic approach and tools provided to quickly identify and fix problems that may arise during infrastructure discovery.

Diagnostic Framework

Troubleshooting Workflow

Discovery Health Check

The Tripl-i platform provides built-in health check capabilities to verify your discovery system is functioning properly:

Discovery Services Status
- Check if discovery agents are running
- Verify service connectivity
- Monitor background processes
Network Connectivity
- Test API endpoint accessibility
- Verify network routes to target systems
- Check firewall configurations
Credential Vault
- Test stored credentials
- Verify credential access
- Check for expired credentials
Recent Discovery Activity
- Review discovery run history
- Check success/failure rates
- Identify patterns in issues
Error Analysis
- Review recent error logs
- Identify recurring problems
- Track resolution progress
Resource Usage
- Monitor CPU and memory utilization
- Check disk space availability
- Track network bandwidth usage

Common Issues

No Discovery Data

Symptoms

CIs not appearing in CMDB
Discovery shows "No devices found"
Empty discovery results

Diagnostic Steps

1. Network Connectivity:
   Test: ping target_device
   Test: telnet target_device PORT   (22 = SSH, 135 = WMI/RPC; SNMP on 161 is UDP and cannot be tested with telnet)
   Check: Firewall rules
   Check: Network ACLs
   
2. Discovery Service:
   Check: Service status
   Check: Worker processes
   Check: Queue backlog
   Review: Service logs

3. Target Availability:
   Verify: Device is powered on
   Verify: Services are running
   Check: Local firewall
   Check: SELinux/AppArmor

4. Discovery Scope:
   Review: IP ranges
   Review: Exclusion rules
   Check: Discovery filters
   Verify: Schedule active
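
The TCP connectivity tests in step 1 can be scripted when many targets need checking. The following is a minimal sketch using only the Python standard library. The function names `check_tcp_port` and `check_target` and the default port list are illustrative assumptions, not part of Tripl-i. It covers TCP only, so SNMP (UDP 161) still has to be verified with an SNMP client.

```python
import socket


# Hypothetical helper: not part of the Tripl-i tooling.
def check_tcp_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False


def check_target(host, ports=(22, 135), timeout=3.0):
    """Map each TCP port to its reachability for a single target.

    22 = SSH, 135 = WMI/RPC endpoint mapper.
    """
    return {port: check_tcp_port(host, port, timeout) for port in ports}
```

Calling `check_target("target_device")` returns a dictionary that maps each port to True or False. A False result does not distinguish a firewall drop from a stopped service, so follow up with the firewall and service checks in steps 1 and 3.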

Common Solutions

Firewall Configuration

# Windows - Allow WMI
netsh advfirewall firewall add rule name="WMI-In" dir=in action=allow protocol=TCP localport=135
netsh advfirewall firewall add rule name="WMI-Async-In" dir=in action=allow protocol=TCP localport=49152-65535

# Linux - Allow SSH
sudo ufw allow from 10.0.0.0/8 to any port 22
sudo iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 22 -j ACCEPT

# Network Device - Allow SNMP (Cisco IOS syntax)
# The community command expects a standard ACL; replace "public" with your own community string
access-list 10 permit host 10.1.1.100
snmp-server community public RO 10

Service Configuration

# Windows - Enable WMI
sc config winmgmt start= auto
net start winmgmt

# Enable Remote Registry
sc config RemoteRegistry start= auto
net start RemoteRegistry

# Linux - Configure SSH
sudo systemctl enable sshd
sudo systemctl start sshd

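# Configure sudo for discovery
echo "discovery ALL=(ALL) NOPASSWD: /usr/bin/dmidecode, /bin/netstat, /sbin/ip" | sudo tee /etc/sudoers.d/discovery
sudo chmod 0440 /etc/sudoers.d/discovery
# Validate the syntax of the new file
sudo visudo -cf /etc/sudoers.d/discovery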

Authentication Failures

Symptoms

"Access denied" errors
"Invalid credentials" messages
Partial discovery with auth errors

Diagnostic Tests

# Test Windows credentials
$cred = Get-Credential
Test-WSMan -ComputerName target-server -Credential $cred -Authentication Negotiate

# Test WMI access
Get-WmiObject -Class Win32_OperatingSystem -ComputerName target-server -Credential $cred

# Test specific permissions
Get-WmiObject -Class Win32_Process -ComputerName target-server -Credential $cred

# Test Linux SSH
ssh -o PasswordAuthentication=yes discovery@target-host 'echo "Connection successful"'

# Test sudo permissions
ssh discovery@target-host 'sudo -l'

# Test specific commands
ssh discovery@target-host 'sudo dmidecode -t system'

Permission Requirements

Windows Requirements:
  Local Groups:
    - Performance Monitor Users
    - Event Log Readers
    - Distributed COM Users
    
  User Rights:
    - Log on as a service
    - Access this computer from the network
    
  DCOM Permissions:
    - Local Launch
    - Remote Launch
    - Local Activation
    - Remote Activation

Linux Requirements:
  SSH Access: Required
  Sudo Commands:
    - /usr/bin/dmidecode
    - /bin/netstat or /sbin/ss
    - /sbin/ip or /sbin/ifconfig
    - /usr/bin/lsof (optional)
    - /bin/ps
    
  File Access:
    - /proc/* (read)
    - /sys/* (read)
    - /etc/os-release (read)
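
The Linux sudo requirements can be checked automatically by comparing the output of `sudo -l` against the required command list. This is a minimal sketch: the function name `missing_sudo_commands` is hypothetical, the required list is a subset of the commands named above, and the parser assumes each rule sits on a single line of the form `(ALL) NOPASSWD: /path/one, /path/two`. Wrapped or aliased rules would need extra handling.

```python
import re

# Subset of the commands from the Linux requirements; adjust paths per distribution.
REQUIRED_COMMANDS = ("/usr/bin/dmidecode", "/bin/netstat", "/sbin/ip")


def missing_sudo_commands(sudo_l_output, required=REQUIRED_COMMANDS):
    """Return required commands that do not appear in `sudo -l` output.

    A rule granting ALL commands satisfies every requirement.
    """
    allowed = set()
    for line in sudo_l_output.splitlines():
        # Rule lines start with a run-as spec, e.g. "(ALL) NOPASSWD: /sbin/ip"
        match = re.match(r"\s*\([^)]*\)\s*(?:[A-Z]+:\s*)*(.+)", line)
        if not match:
            continue
        for command in match.group(1).split(","):
            # Keep only the executable path, dropping any arguments
            allowed.add(command.strip().split(" ")[0])
    if "ALL" in allowed:
        return []
    return [cmd for cmd in required if cmd not in allowed]
```

Feed it the text captured from `ssh discovery@target-host 'sudo -l'`. An empty list means every required command is covered.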

Incomplete Discovery

Symptoms

Missing software inventory
Partial hardware information
No relationship data
Incomplete attributes

Root Cause Analysis

Check Collection Modules:
  1. Agent Configuration:
     - Verify enabled collectors
     - Check module errors
     - Review timeout settings
     
  2. Data Collection:
     - Process discovery enabled?
     - Software scanning active?
     - Network connections tracked?
     
  3. Processing Pipeline:
     - Normalization errors?
     - Pattern matching failures?
     - Enrichment timeouts?

Module-Specific Fixes

Software Discovery Issues

# Windows - Registry access
# Check if remote registry is enabled
sc \\target-server query RemoteRegistry

# Linux - Package manager access
# Verify package database readable
ssh discovery@target "rpm -qa | head -5"
ssh discovery@target "dpkg -l | head -5"

# Fix: Add discovery user to required groups
sudo usermod -a -G rpm discovery  # For RPM-based systems

Network Connection Discovery

# Enable network discovery
# Windows
netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=Yes

# Linux - ensure ss/netstat available
# Check if commands exist
which ss netstat lsof

# Install if missing
sudo yum install -y iproute net-tools    # RHEL/CentOS: provides ss and netstat
sudo apt install -y iproute2 net-tools   # Debian/Ubuntu: provides ss and netstat

Performance Issues

Symptoms

Slow discovery completion
High CPU/memory usage
Network congestion
Timeout errors

Performance Diagnostics

-- Analyze discovery performance
WITH discovery_stats AS (
  SELECT 
    discovery_method,
    target_type,
    AVG(duration_seconds) as avg_duration,
    MAX(duration_seconds) as max_duration,
    COUNT(*) as total_discoveries,
    SUM(CASE WHEN status = 'timeout' THEN 1 ELSE 0 END) as timeouts
  FROM discovery_runs
  WHERE created_at >= NOW() - INTERVAL '24 hours'
  GROUP BY discovery_method, target_type
)
SELECT * FROM discovery_stats
ORDER BY avg_duration DESC;

Performance Tuning

Optimization Strategies:
  1. Parallel Processing:
     Default: 10 concurrent
     High Performance: 50 concurrent
     Conservative: 5 concurrent
     
  2. Timeout Adjustments:
     Network Devices: 30s → 60s
     Busy Servers: 60s → 120s
     Slow Links: 30s → 90s
     
  3. Discovery Scope:
     - Reduce frequency for stable devices
     - Use incremental for frequent scans
     - Limit deep discovery to off-hours
     
  4. Resource Limits:
     CPU: Max 70% utilization
     Memory: Max 4GB per worker
     Network: Max 100Mbps total
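
The parallel processing setting caps how many targets are probed at the same time. The sketch below illustrates that idea with a bounded thread pool. It is not how Tripl-i implements its workers. The `run_discovery` function and the `probe` callable are hypothetical, and per-target timeouts are assumed to be enforced inside `probe` (for example through a socket timeout).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Concurrency profiles from the tuning list above
CONCURRENCY = {"default": 10, "high_performance": 50, "conservative": 5}


def run_discovery(targets, probe, max_workers=CONCURRENCY["default"]):
    """Run probe(target) for every target with at most max_workers in flight.

    Returns {target: result}; a probe that raises is recorded as the exception.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(probe, target): target for target in targets}
        for future in as_completed(futures):
            target = futures[future]
            try:
                results[target] = future.result()
            except Exception as exc:
                # One failing target must not stop the remaining probes
                results[target] = exc
    return results
```

Lowering `max_workers` trades total run time for less load on the network and on the discovery host, which is the reasoning behind the conservative profile.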

Data Quality Issues

Symptoms

Duplicate CIs created
Incorrect classifications
Missing relationships
Stale data

Data Validation

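# CI Deduplication Check
# `db` is a pymongo Database handle; `merge_duplicate_cis` is your own merge routine
def check_duplicates(db, merge_duplicate_cis=None):
    """Report CIs that share a name and serial number, optionally merging them."""
    duplicates = db.cis.aggregate([
        # Group CIs by name and serial number, collecting their ids
        {"$group": {
            "_id": {"name": "$name", "serial": "$serialNumber"},
            "count": {"$sum": 1},
            "ids": {"$push": "$_id"}
        }},
        # Keep only groups with more than one member
        {"$match": {"count": {"$gt": 1}}}
    ])

    for dup in duplicates:
        print(f"Duplicate found: {dup['_id']} ({dup['count']} instances)")
        # Merge or remove duplicates only when a merge routine is supplied
        if merge_duplicate_cis is not None:
            merge_duplicate_cis(dup['ids'])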
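
The deduplication check leaves the merge step undefined. One possible merge policy is sketched below, operating on plain dictionaries rather than database documents: the most recently discovered record survives, and attributes from older records fill any gaps. The function name `merge_ci_records` and the field names `last_discovered`, `attributes`, and `merged_from` are assumptions made for illustration.

```python
def merge_ci_records(records):
    """Merge duplicate CI dicts into one, preferring the most recent values.

    Each record is assumed to carry '_id', 'last_discovered' (sortable),
    and optionally 'attributes' (dict).
    """
    if not records:
        raise ValueError("no records to merge")
    ordered = sorted(records, key=lambda r: r["last_discovered"])
    merged = dict(ordered[-1])  # the newest record is the survivor
    attributes = {}
    for record in ordered:  # oldest first, so newer values overwrite older ones
        attributes.update(record.get("attributes", {}))
    merged["attributes"] = attributes
    # Record which ids were folded in, so they can be deleted afterwards
    merged["merged_from"] = [r["_id"] for r in ordered[:-1]]
    return merged
```

Keeping the list of absorbed ids makes the merge auditable and lets relationships that pointed at the old records be re-linked before those records are removed.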

Data Cleanup Procedures

Cleanup Tasks:
  1. Remove Orphaned CIs:
     - No discovery update > 30 days
     - No relationships
     - Status = "Unknown"
     
  2. Fix Misclassified Items:
     - Re-run AI classification
     - Apply pattern matching
     - Manual review flagged items
     
  3. Rebuild Relationships:
     - Clear stale connections
     - Re-discover network topology
     - Validate service dependencies
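
The orphaned CI criteria in task 1 translate directly into a filter. This sketch assumes all three conditions must hold before a CI is treated as orphaned, which is the conservative reading of the list. The function name and the dictionary fields (`last_discovered`, `relationships`, `status`) are hypothetical.

```python
from datetime import datetime, timedelta


def find_orphaned_cis(cis, now=None, max_age_days=30):
    """Return CIs that match all three orphan criteria from the cleanup list."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [
        ci for ci in cis
        if ci["last_discovered"] < cutoff      # no discovery update for over 30 days
        and not ci.get("relationships")        # no relationships
        and ci.get("status") == "Unknown"      # status is "Unknown"
    ]
```

Review the returned list before deleting anything, since a device that is merely powered off for maintenance can match the same criteria.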

Advanced Diagnostics

Debug Mode

# Enable debug logging for specific target
nopesight discovery debug --target 10.1.1.50 --verbose

# Debug output example:
[DEBUG] Starting discovery for 10.1.1.50
[DEBUG] Using credential: windows_domain_cred
[DEBUG] Attempting WMI connection...
[DEBUG] WMI connection established
[DEBUG] Querying Win32_ComputerSystem...
[DEBUG] Result: {Name: "SERVER01", Domain: "CORP.LOCAL", ...}
[DEBUG] Querying Win32_OperatingSystem...
[ERROR] Access denied to Win32_Process class
[DEBUG] Falling back to limited discovery mode

Network Packet Analysis

# Capture discovery traffic
sudo tcpdump -i eth0 -w discovery.pcap \
  'host 10.1.1.50 and (port 22 or port 135 or port 445 or port 161)'

# Analyze WMI traffic
sudo tcpdump -nn -r discovery.pcap 'port 135' | head -20

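# Check for SNMP timeouts: tcpdump does not label timeouts, so compare
# the number of requests sent with the number of responses received
sudo tcpdump -nn -r discovery.pcap 'udp port 161' | grep -cE "Get(Next)?Request|GetBulk"
sudo tcpdump -nn -r discovery.pcap 'udp port 161' | grep -c "GetResponse"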

Discovery Agent Diagnostics

# Agent health check
nopesight-agent diagnose

# Output:
=== Tripl-i Agent Diagnostics ===
Version: 3.2.1
Status: Running
Uptime: 5d 14h 23m

Configuration:
  Server: https://nopesight.company.com ✓
  API Key: ****1234 ✓
  Department: IT ✓

Connectivity:
  Server reachable: ✓
  Last check-in: 2 minutes ago ✓
  SSL Certificate: Valid ✓

Collectors:
  System Info: ✓ Enabled
  Software: ✓ Enabled  
  Network: ✗ Error: Permission denied on /proc/net/tcp
  Processes: ✓ Enabled

Recent Errors:
  [2024-01-15 10:23:45] NetworkCollector: Cannot read /proc/net/tcp
  [2024-01-15 09:15:32] SoftwareCollector: dpkg timeout

Recommendations:
  1. Add agent user to 'proc' group for network collection
  2. Increase software scan timeout to 60s

Troubleshooting Tools

Built-in Diagnostics

Web UI Tools:
  Discovery Test:
    - Target specific device
    - Test specific credential
    - Use specific method
    - View real-time logs
    
  Credential Tester:
    - Validate credentials
    - Check permissions
    - Test connectivity
    - Show capabilities
    
  Pattern Debugger:
    - Test pattern matching
    - View match details
    - Debug regex
    - Validate logic

Command Line Tools

# Discovery CLI toolkit

# Test specific discovery method
nopesight discover test \
  --method wmi \
  --target 10.1.1.50 \
  --credential prod_windows \
  --debug

# Validate discovery scope
nopesight discover validate-scope \
  --ranges "10.0.0.0/16,192.168.0.0/24" \
  --show-conflicts

# Analyze discovery queue
nopesight queue status --queue discovery
nopesight queue peek discovery --count 10

# Force discovery retry
nopesight discover retry --failed --last 1h
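
To reason about scope problems outside the CLI, the range logic can be reproduced with Python's `ipaddress` module. This is a sketch under stated assumptions: it treats a conflict as two CIDR ranges that overlap, which may differ from what `--show-conflicts` reports, and both function names are hypothetical.

```python
import ipaddress
from itertools import combinations


def find_scope_conflicts(ranges):
    """Return pairs of CIDR ranges that overlap each other."""
    networks = [ipaddress.ip_network(r.strip()) for r in ranges]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.version == b.version and a.overlaps(b)
    ]


def in_scope(address, ranges, exclusions=()):
    """True if address falls inside a range and outside every exclusion."""
    ip = ipaddress.ip_address(address)
    included = any(ip in ipaddress.ip_network(r) for r in ranges)
    excluded = any(ip in ipaddress.ip_network(r) for r in exclusions)
    return included and not excluded
```

`in_scope` is useful when a device is missing from results: if it returns False for the device address, the cause is the scope or an exclusion rule rather than connectivity or credentials.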

Log Analysis

# Common log locations
/var/log/nopesight/discovery.log     # Main discovery log
/var/log/nopesight/agent.log         # Agent logs
/var/log/nopesight/scheduler.log     # Scheduler logs
/var/log/nopesight/error.log         # Error aggregation

# Useful grep patterns
# Find authentication failures
grep -i "auth\|denied\|permission" discovery.log

# Find timeout issues  
grep -i "timeout\|timed out" discovery.log

# Find network errors
grep -i "unreachable\|refused\|network" discovery.log

# Find pattern matching issues
grep -i "pattern\|match\|regex" discovery.log

# Performance issues
grep -i "slow\|performance\|exceeded" discovery.log
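
When a log is large, counting matches per category gives a quicker overview than running each grep separately. The sketch below reuses the same keyword groups as the grep patterns above. The category names and the `categorize_log` function are illustrative, not part of the product.

```python
import re
from collections import Counter

# Same keyword groups as the grep patterns above
CATEGORIES = {
    "authentication": re.compile(r"auth|denied|permission", re.IGNORECASE),
    "timeout": re.compile(r"timeout|timed out", re.IGNORECASE),
    "network": re.compile(r"unreachable|refused|network", re.IGNORECASE),
    "pattern": re.compile(r"pattern|match|regex", re.IGNORECASE),
    "performance": re.compile(r"slow|performance|exceeded", re.IGNORECASE),
}


def categorize_log(lines):
    """Count log lines per category; one line may count in several categories."""
    counts = Counter()
    for line in lines:
        for name, pattern in CATEGORIES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Pass it an open file handle, for example the one returned by `open("/var/log/nopesight/discovery.log")`, and print the resulting counter. These keywords are broad, so treat the counts as a starting point for the targeted greps rather than an exact error tally.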

Recovery Procedures

Emergency Recovery

When Discovery Completely Fails:
  1. Stop all discovery:
     systemctl stop nopesight-discovery
     nopesight discovery pause --all
     
  2. Clear stuck jobs:
     nopesight queue clear discovery --stuck
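     # DEL does not expand wildcards, so list the matching keys first
     redis-cli --scan --pattern 'bull:discovery:*' | xargs -r redis-cli DEL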
     
  3. Reset discovery state:
     nopesight discovery reset --confirm
     
  4. Restart services:
     systemctl start nopesight-discovery
     systemctl restart nopesight-scheduler
     
  5. Test with single target:
     nopesight discover now --target 10.1.1.1
     
  6. Resume normal operations:
     nopesight discovery resume --all

Data Recovery

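-- Restore CIs from discovery history
-- Takes the latest successful record per hostname and skips CIs that already exist.
-- NOT EXISTS is used because NOT IN returns no rows if any existing name is NULL.
INSERT INTO cis (name, type, attributes, last_discovered)
SELECT DISTINCT ON (h.raw_data->>'hostname')
  h.raw_data->>'hostname' as name,
  h.raw_data->>'device_type' as type,
  h.raw_data->'attributes' as attributes,
  h.discovered_at as last_discovered
FROM discovery_history h
WHERE h.discovered_at >= '2024-01-14'
  AND h.status = 'success'
  AND h.raw_data->>'hostname' IS NOT NULL
  AND NOT EXISTS (
    SELECT 1 FROM cis c
    WHERE c.tenant_id = 'IT'
      AND c.name = h.raw_data->>'hostname'
  )
ORDER BY h.raw_data->>'hostname', h.discovered_at DESC;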

Prevention Strategies

Monitoring Setup

Proactive Monitoring:
  Metrics to Track:
    - Discovery success rate < 95%
    - Average duration increasing
    - Timeout rate > 5%
    - Queue depth > 1000
    - Error rate > 2%
    
  Alerts:
    - Discovery failures > 10 in 5 min
    - No discoveries in expected window
    - Credential failures spike
    - Resource exhaustion warning
    
  Dashboards:
    - Real-time discovery status
    - Success rate trends
    - Performance metrics
    - Error categorization
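
The numeric thresholds under "Metrics to Track" can be evaluated mechanically once the metrics are collected. The following sketch assumes the metrics arrive as a dictionary with hypothetical keys (`success_rate`, `timeout_rate`, and `error_rate` as percentages, `queue_depth` as a count). How those values are gathered is outside its scope.

```python
def evaluate_metrics(metrics):
    """Return a list of human-readable threshold violations."""
    # (metric key, breach test, message), using the thresholds listed above
    checks = [
        ("success_rate", lambda v: v < 95, "Discovery success rate below 95%"),
        ("timeout_rate", lambda v: v > 5, "Timeout rate above 5%"),
        ("queue_depth", lambda v: v > 1000, "Queue depth above 1000"),
        ("error_rate", lambda v: v > 2, "Error rate above 2%"),
    ]
    return [
        message
        for key, breached, message in checks
        if key in metrics and breached(metrics[key])
    ]
```

Metrics missing from the dictionary are skipped rather than treated as violations, so a partial collection failure does not trigger false alerts.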

Best Practices

Regular Maintenance
- Weekly credential validation
- Monthly discovery audit
- Quarterly pattern review
- Annual architecture review
Documentation
- Document all custom patterns
- Maintain troubleshooting runbook
- Record common solutions
- Update network diagrams
Testing
- Test credentials before production
- Validate patterns in dev
- Load test discovery system
- Practice recovery procedures

Getting Help

Support Resources

Internal Resources:
  - Discovery team Slack: #discovery-help
  - Wiki: https://wiki.company.com/nopesight
  - Runbooks: https://runbooks.company.com
  
Tripl-i Support:
  - Email: support@nopesight.com
  - Portal: https://support.nopesight.com
  - Phone: 1-800-NOPESIGHT
  
Community:
  - Forums: https://community.nopesight.com
  - GitHub: https://github.com/nopesight/patterns
  - Slack: nopesight-users.slack.com

Diagnostic Package

#!/bin/bash
# Create diagnostic package for support

DIAG_DIR="/tmp/nopesight-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$DIAG_DIR"

# Collect system info
nopesight system info > "$DIAG_DIR/system-info.txt"
nopesight discovery status > "$DIAG_DIR/discovery-status.txt"

# Collect recent logs (last 10000 lines of each log file)
tail -n 10000 /var/log/nopesight/*.log > "$DIAG_DIR/recent-logs.txt"

# Collect configuration (sanitized)
nopesight config export --sanitize > "$DIAG_DIR/config.yaml"

# Create archive
tar -czf "$DIAG_DIR.tar.gz" -C /tmp "$(basename "$DIAG_DIR")"
echo "Diagnostic package created: $DIAG_DIR.tar.gz"

Next Steps

📖 Best Practices - CMDB best practices
📖 Performance Tuning - System optimization
📖 Support Guide - Getting help
