OBJECTIVE: Accurate information on spread-of-disease at diagnosis would increase the usefulness of hospital-activity data for cancer research. This study evaluates the accuracy of codes recorded in hospital-activity data for assigning spread-of-disease at diagnosis in non-small cell lung cancer (NSCLC). METHODS: The reference (gold) standard was TNM stage as assigned at a multidisciplinary meeting. To allow comparison with hospital-activity data, TNM stage was mapped to spread-of-disease (local, regional, distant). Sensitivity, specificity and positive predictive values were calculated, stratified by whether the patient had surgery. RESULTS: Data from both the reference standard and the hospital-activity database were available for 2,184 patients. According to the reference standard, local disease was present at diagnosis in 57.0% of surgical patients and 12.6% of non-surgical patients. Hospital-activity data overestimated the proportion of patients with local disease (surgical: 71.9%, non-surgical: 48.5%). There was a corresponding underestimation of distant spread-of-disease: surgical (reference standard: 4.0%, hospital-activity data: 2.7%); non-surgical (reference standard: 45.9%, hospital-activity data: 36.8%). Consequently, hospital-activity data had good sensitivity but poor specificity for local disease, and poor sensitivity but good specificity for distant (metastatic) disease. CONCLUSION: Secondary diagnosis codes in hospital-activity data do not accurately capture spread-of-disease at diagnosis for patients with non-small cell lung cancer, even when the clinical notes contain TNM clinical stage as documented at a multidisciplinary meeting. IMPLICATIONS: Changes are needed to coding rules, and to the ICD codes themselves, to allow for coding of regional and distant spread without specification of the precise site.
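
ILLUSTRATION: The accuracy measures named in the METHODS can be sketched as follows. This is a minimal sketch, not the study's analysis code: the function name `accuracy_for_category`, the `Accuracy` container and the ten toy patient records are hypothetical and invented purely for illustration; they are not study data. Each spread-of-disease category is treated in turn as the "positive" class, with the reference standard as truth and the hospital-activity code as the test result.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Accuracy:
    sensitivity: float
    specificity: float
    ppv: float


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a measure is undefined.
    return numerator / denominator if denominator else float("nan")


def accuracy_for_category(pairs: Iterable[Tuple[str, str]], category: str) -> Accuracy:
    """Compute sensitivity, specificity and PPV for one category.

    pairs: (reference_standard, hospital_activity) labels per patient,
           each one of "local", "regional" or "distant".
    """
    tp = fp = fn = tn = 0
    for reference, hospital in pairs:
        if hospital == category and reference == category:
            tp += 1  # both sources agree the category is present
        elif hospital == category:
            fp += 1  # hospital data assigns it, reference does not
        elif reference == category:
            fn += 1  # reference assigns it, hospital data misses it
        else:
            tn += 1  # both sources agree the category is absent
    return Accuracy(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
    )


# Hypothetical toy records (NOT study data): (reference, hospital-activity).
toy_pairs = (
    [("local", "local")] * 3
    + [("regional", "local")] * 2
    + [("regional", "regional")] * 2
    + [("distant", "regional")] * 1
    + [("distant", "distant")] * 2
)

for name in ("local", "regional", "distant"):
    print(name, accuracy_for_category(toy_pairs, name))
```

In this toy example, shifting patients toward a less advanced category in the hospital-activity column gives high sensitivity but reduced specificity for local disease, and reduced sensitivity but high specificity for distant disease. To stratify by surgery, the same function would be applied separately to the surgical and non-surgical records.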