Reversed urban–rural gradient in COVID-19 seroprevalence and related factors in a nationally representative survey, Poland, 29 March to 14 May 2021

Background: We anticipated that people in rural areas and small towns with lower population density, lower connectivity and jobs less dependent on social interaction would be less exposed to COVID-19. Still, other variables correlated with socioeconomic inequalities may have a greater impact on transmission.
Aim: We investigated how COVID-19 affected rural and urban communities in Poland, focussing on the most exposed groups and disparities in SARS-CoV-2 transmission.
Methods: A random digit dial sample of Polish adults stratified by region and age was drawn from 29 March to 14 May 2021. Serum samples were tested for anti-S1 and anti-N IgG antibodies, and positive results in both assays were considered indicative of past infection. Seroprevalence estimates were weighted to account for non-response. Adjusted odds ratios (AORs) were calculated using multivariable logistic regression.
Results: There was serological evidence of infection in 32.2% (95% CI: 30.2–34.4) of adults in rural areas/small towns (< 50,000 population) and 26.6% (95% CI: 24.9–28.3) in larger cities. Regional SARS-CoV-2 seroprevalence ranged from 23.4% (95% CI: 18.3–29.5) to 41.0% (95% CI: 33.5–49.0) and was moderately positively correlated (R = 0.588; p = 0.017; n = 16) with the proportion of respondents living in rural areas or small cities. Upon multivariable adjustment, both men (AOR = 1.60; 95% CI: 1.09–2.35) and women (AOR = 2.26; 95% CI: 1.58–3.21) from these areas were more likely to be seropositive than residents of larger cities.
Conclusions: We found an inverse urban–rural gradient of SARS-CoV-2 infections during early stages of the COVID-19 pandemic in Poland and suggest that vulnerabilities of populations living in rural areas need to be addressed.


Concept
Because of the design of the recruitment process, we applied a two-stage adjustment to our SARS-CoV-2 seroprevalence estimates to account for non-response at the level of the telephone interview and at the level of presenting for the test and receiving a valid test result. The method is an extension of the Horvitz-Thompson (HT) estimator for population parameters with non-response weighting, which has been shown to be an unbiased, although not efficient, estimator [1].
We assume non-informative non-response, that is, that given the covariates, the probability of coming for the test does not depend on the test outcome, and we use an approach discussed by Little et al. [2]. That paper considers the sampling weights resulting from the sampling design in combination with the non-response weights to account for the lack of participation of the sampled individuals. In our case, as the probability of participation in the telephone survey (CATI, computer-assisted telephone interview) is related to a mixture of design factors (phone number coverage and availability) and the lack of consent for the interview, we call the corresponding weight simply the stage 1 weight, $w_1$. Secondly, the inverse of the probability of presenting for a laboratory test and receiving valid test results among those who took part in the telephone survey is denoted as the stage 2 weight, $w_2$. The weighted estimate of the population mean is given by the formula:
$$\bar{y}_w = \sum_j \sum_k \hat{P}_{jk}\, \bar{y}_{jk}, \qquad \hat{P}_{jk} = \frac{N_{j+}}{N} \cdot \frac{n_{jk}}{n_{j+}}$$
where $\hat{P}_{jk}$ is the estimated proportion of the population in demographic stratum $j$ who in addition are in the individual characteristic cell $k$; $\bar{y}_{jk}$ is the mean of the outcome among tested individuals in that cell; $N_{j+}$ is the size of the demographic stratum in the total population of size $N$; $n_{jk}$ is the number of individuals in the sample recruited in the first step (CATI interview) who fall into demographic stratum $j$ and have characteristic $k$, measured for the CATI respondents; and $n_{j+}$ is the number of all CATI respondents in demographic stratum $j$.
It has been suggested that in the case of two-stage weighting the resulting estimates are approximately unbiased if the stage 1 weight is constant within the strata defined to develop the stage 2 weight [2]. The demographic variables used for stage 1 were therefore always considered in the development of stage 2 weights.
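The two-stage logic above can be illustrated with a minimal Python sketch. It assumes a ratio-type weighted mean (sum of weighted outcomes divided by the sum of weights), which is one common way of applying such weights and is not necessarily the exact estimator used in the study. The function name `weighted_prevalence` and the example records are hypothetical.

```python
def weighted_prevalence(records):
    """Weighted proportion of positive results.

    `records` holds one tuple per tested person:
    (stage 1 weight, probability of presenting for the test, result as 0 or 1).
    The stage 2 weight is the inverse of the probability of presenting.
    """
    numerator = 0.0
    denominator = 0.0
    for w1, p_present, result in records:
        w = w1 * (1.0 / p_present)  # combined two-stage weight
        numerator += w * result
        denominator += w
    return numerator / denominator


# Hypothetical data, for illustration only.
example = [(1.0, 0.5, 1), (2.0, 0.5, 0), (1.0, 0.25, 1)]
estimate = weighted_prevalence(example)
```

People who were unlikely to present for testing (small probability) receive a larger weight, so they stand in for similar respondents who did not present.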

Development of stage 1 weight, $w_1$
The stage 1 weight was used to adjust the distribution of our sample of CATI telephone survey respondents to the structure of the reference Polish population as of June 2020, as provided by the National Statistical Office of Poland. We initially considered the following factors for stage 1 weighting:
- administrative region (voivodeship); there are 16 voivodeships in Poland
- sex (male; female)
- age group (20-39; 40-59; 60-69; ≥70 years)
- type of residence (rural; urban)
- COVID-19 vaccination status (at least one dose; not vaccinated)
The COVID-19 vaccination status was added as a possible indicator of interest in or fear of the pandemic that could affect the likelihood of participating in the study. The study was conducted 3-4 months after the vaccine became available in Poland, first to older age groups and vulnerable populations and, at approximately the time of the study, to the whole adult population. Vaccination coverage in the general population was available from the eHealth Centre, thanks to an electronic system newly developed to track COVID-19 vaccination coverage [3].
The stratification above resulted in 16 × 2 × 4 × 2 × 2 = 512 strata. To ensure that the resulting strata were not empty, we analysed the distribution of the number of study participants in each stratum.
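The stratum count and the check for sparse strata can be sketched as follows. The factor levels come from the list above. The function `sparse_strata`, the tuple encoding of strata and the threshold argument are illustrative assumptions, not the study's code.

```python
from collections import Counter
from itertools import product
from math import prod

# Number of levels per stage 1 factor, as listed in the text.
LEVELS = {
    "voivodeship": 16,
    "sex": 2,
    "age_group": 4,
    "residence": 2,
    "vaccination": 2,
}

n_strata = prod(LEVELS.values())  # 16 * 2 * 4 * 2 * 2


def sparse_strata(records, max_count=4):
    """Return strata with at most `max_count` participants, including empty ones.

    `records` is an iterable of tuples of level indices, one tuple per
    participant, in the order of the keys of LEVELS (hypothetical encoding).
    """
    counts = Counter(records)
    all_strata = product(*(range(k) for k in LEVELS.values()))
    return {s: counts.get(s, 0) for s in all_strata if counts.get(s, 0) <= max_count}
```

Enumerating the full cross-classification, rather than only the strata observed in the data, is what makes empty strata visible.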

Supplementary figure S2. Distribution of strata by the number of cases in each stratum. A: all strata; B: only strata with ≤5 cases
The stratification resulted in one empty stratum and 25 strata with a count of 1-4, which were considered too small for weighting. To avoid empty weighting strata, we arbitrarily decided to combine males and females; the proportion of men and women was comparable among the CATI respondents and the general population (Supplementary table S1). The stratum weights $w_1$ were determined by dividing the proportions of the strata in the reference population by the proportions of the corresponding cells in the stratified sample. They were normalised so that the sum of the weights was equal to the total sample size of CATI respondents. Large weights were not removed at this stage. The $w_1$ ranged from 0.21 to 35.6. However, the 99th percentile was 5.4, indicating that the vast majority of weights were within acceptable limits.
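The calculation of the stratum weights described above can be sketched in a few lines. The function name `stage1_weights`, the stratum labels and the counts are hypothetical. The sketch assumes every stratum in the sample also exists in the reference population.

```python
def stage1_weights(pop_counts, sample_counts):
    """Post-stratification weight per stratum.

    Each weight is the population share of the stratum divided by its share
    among respondents, rescaled so that the weights summed over all
    respondents equal the sample size.
    """
    pop_total = sum(pop_counts.values())
    n = sum(sample_counts.values())
    raw = {
        s: (pop_counts[s] / pop_total) / (sample_counts[s] / n)
        for s in sample_counts
    }
    # Sum of weights over respondents (each stratum weight counted once per person).
    total = sum(raw[s] * sample_counts[s] for s in sample_counts)
    return {s: raw[s] * n / total for s in raw}


# Hypothetical example: stratum "a" is under-represented in the sample.
population = {"a": 600, "b": 400}
sample = {"a": 30, "b": 70}
weights = stage1_weights(population, sample)
```

In this example, respondents in stratum "a" get a weight of 2.0 because the stratum makes up 60% of the population but only 30% of the sample.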
The CATI respondents completed an interview targeting COVID-19 risk factors. The interview also allowed for a more detailed analysis of non-response at the level of presenting for the serological test and for the design of an appropriate adjustment. Because cross-classifying all the stage 1 weighting variables with the risk factor variables would have produced cells that were either very small or empty, we applied a logistic regression model predicting presentation for testing among the CATI interview respondents. The stage 1 variables were also included in the model to maintain the possibility of generalizing the final results to the population [2,4]. The factors initially considered to predict presentation for a serological test included the demographic factors used for $w_1$, vaccination status, reported symptoms, social contacts, employment status and type of work, and participation in meetings, trips and private gatherings (the full list is given with Supplementary table S5). The model was used to predict the probability of success (presentation for a serological test), i.e. the propensity score. The stage 2 weight for an individual who presented for testing, $w_2$, was defined as the inverse of the propensity score.
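As a minimal sketch of the propensity-score step, the code below fits a logistic regression with a single binary covariate using Newton-Raphson in plain Python and takes the inverse of the fitted probability as the stage 2 weight. The study's model included many covariates and a backward selection step; the single covariate, the function names and the data here are simplifying assumptions for illustration only.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p            # gradient, intercept
            g1 += (yi - p) * xi     # gradient, slope
            v = p * (1.0 - p)
            h00 += v                # information matrix entries
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def propensity(b0, b1, xi):
    """Predicted probability of presenting for the test."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))


# Hypothetical CATI respondents: x = 1 if vaccinated, y = 1 if presented for testing.
x = [0] * 10 + [1] * 10
y = [1] * 4 + [0] * 6 + [1] * 8 + [0] * 2
b0, b1 = fit_logistic(x, y)

# Stage 2 weights are defined only for those who presented for testing.
w2 = [1.0 / propensity(b0, b1, xi) for xi, yi in zip(x, y) if yi == 1]
```

With one binary covariate the fitted probabilities equal the observed response rates in each group (0.4 and 0.8 here), so the weights are 2.5 and 1.25.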
Finally, the weight $w_k$ for individual $k$ was defined as the product of the stage 1 and stage 2 weights, normalized to obtain $\sum_{k=1}^{T} w_k = T$:
$$w_k = \frac{T\, w_{1,j(k)}\, w_{2,k}}{\sum_{i=1}^{T} w_{1,j(i)}\, w_{2,i}}$$
where $w_{1,j(k)}$ is the stage 1 weight of the demographic stratum $j$ to which individual $k$ belongs, $w_{2,k}$ is that individual's stage 2 weight, and $T$ denotes the total sample of individuals tested. These weights varied from 0.12 to 49.5, with the 1st percentile equal to 0.17 and the 99th percentile equal to 7.0. We trimmed outlying weights to keep them within acceptable limits [5]. Upon trimming, weights varied from 0.15 to 8.81 (mean = 1.00, standard deviation = 0.72).
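The product, normalisation and trimming described above can be sketched as follows. The text does not specify the trimming rule, so capping at a fixed upper bound followed by renormalisation is an assumption here, as are the function name `final_weights` and the example values.

```python
def final_weights(w1, w2, cap=None):
    """Combine stage 1 and stage 2 weights and normalise them to sum to T.

    `w1` and `w2` are per-person lists for the T tested individuals.
    If `cap` is given, weights above it are set to `cap` and the weights are
    renormalised (an assumed trimming rule, for illustration only).
    """
    raw = [a * b for a, b in zip(w1, w2)]
    T = len(raw)
    scale = T / sum(raw)
    weights = [r * scale for r in raw]
    if cap is not None:
        weights = [min(w, cap) for w in weights]
        scale = T / sum(weights)
        weights = [w * scale for w in weights]
    return weights


# Hypothetical weights for four tested individuals.
untrimmed = final_weights([1, 2, 1, 4], [2, 2, 4, 5])
trimmed = final_weights([1, 2, 1, 4], [2, 2, 4, 5], cap=2.0)
```

Renormalising after capping keeps the mean weight at 1 but can push the largest weight slightly above the cap, which is why the reported maximum after trimming need not equal the chosen bound.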
Supplementary table S1 summarizes the distribution of demographic characteristics in the general adult population of Poland, the CATI respondents, and the serosurvey participants. Supplementary table S2 compares the detailed sociodemographic characteristics of the final sample of serosurvey participants to those of the CATI respondents. The weighted distribution (adj%) is provided for the serosurvey participants, calculated using the final weights, $w_k$. Upon weighting, the survey sample closely matches the desired population distribution.

Multivariable Regression Analysis
The multivariable logistic regression analysis was carried out on weighted data to examine effect modification, i.e. two-way interactions between place of residence and risk factors for COVID-19 infection of subject-matter importance. The initial model included all of the investigated main effects and all possible two-way interactions with place of residence (dichotomized as rural areas or small cities with populations of less than 50,000 vs. mid-sized or large cities with populations of 50,000 or more).
By applying a backward stepwise approach, we "forced" a complete assessment of all possible interactions (with place of residence) that might be required in the final model. Then, in accordance with the model hierarchy (i.e., interactions must be removed before main effects), we removed all terms from the model that did not meet the significance criterion (p<0.05). One of the difficulties encountered was the collinearity between household size and cohabiting with at least one child (pairwise correlation coefficient = 0.6813), both important from a subject-matter standpoint. In this dataset, however, the latter predictor (cohabiting with at least one child) provided somewhat redundant information, since this information was also contained in the first predictor (household size). The inclusion of household size in the multivariable model masked the significance of the second collinear variable (cohabiting with at least one child and its interaction term with place of residence). After consultation with COVID-19 data modelers, we arbitrarily chose to retain household size in the final model.
The following terms did not meet the significance criterion and were eliminated in subsequent steps: (1) interaction term between cohabiting with a child younger than 18 years old and place of residence; (2) cohabiting with a child younger than 18 years old; (3) interaction term between contact with a known COVID-19 case and place of residence; (4) interaction term between having worked mainly remotely and place of residence; (5) interaction term between having received at least one dose of vaccine against COVID-19 and place of residence; (6) interaction term between income from old age/disability pension and place of residence; (7) income from old age/disability pension; (8) interaction term between being unemployed and place of residence; (9) being unemployed; (10) interaction term between having worked during restrictions and place of residence.
The final multivariable logistic regression model used to estimate adjusted odds ratios (ORs) is presented in Supplementary table S6. For main effects not included in interactions, we report the default ORs from the procedure output. ORs for factors involved in interactions with place of residence were calculated with the Stata lincom postestimation command [6]. The results from this model feed three tables presented in the paper that group risk factors for SARS-CoV-2 IgG antibodies (past infections) by the socio-demographic characteristics of the respondents (Table 1), household-related exposures (Table 2), and work-related and other risk factors (Table 3).
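For a factor involved in an interaction, the OR within one residence group is obtained from a linear combination of coefficients: the main effect plus the interaction term. The sketch below shows that arithmetic with a Wald confidence interval. The function name and the numerical inputs are hypothetical. It illustrates the kind of quantity a linear-combination command reports and is not a reproduction of the study's Stata output.

```python
import math


def or_from_linear_combination(b_main, b_inter, var_main, var_inter, cov, z=1.96):
    """Odds ratio and Wald 95% CI for the sum of two logit coefficients.

    The variance of the sum is var_main + var_inter + 2 * cov, where `cov`
    is the covariance between the two coefficient estimates.
    """
    estimate = b_main + b_inter
    se = math.sqrt(var_main + var_inter + 2.0 * cov)
    return (
        math.exp(estimate),
        math.exp(estimate - z * se),
        math.exp(estimate + z * se),
    )


# Hypothetical coefficients on the log-odds scale.
odds_ratio, lower, upper = or_from_linear_combination(
    b_main=0.0, b_inter=math.log(2.0), var_main=0.01, var_inter=0.01, cov=0.0
)
```

Ignoring the covariance term would give the wrong interval width, which is why the combined OR cannot be read off the two separate coefficient rows of the regression output.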

Impact of weights on estimated seroprevalence
Application of weights lowered the overall seroprevalence estimate marginally, from 31.6% to 29.8% (95% CI: 28.4-31.2), but had no impact on the observed rural-urban gradient or the other trends, which were seen in both weighted and unweighted data. Supplementary table S7 shows the replication of the main results calculated on unweighted (raw) data. T-tests were used to compare coefficients across these regressions.

Impact of exclusion of laboratory borderline results on rural-urban gradient
Our aim was to determine how the exclusion of borderline laboratory results could have affected the reported findings on the rural-urban gradient. We considered a worst-case scenario in which all borderline records are retained in the analysis and recoded against the expected gradient, i.e., borderline results in rural areas or small towns are treated as negative, whereas in larger cities they are treated as positive.
There were 228 records with borderline laboratory results, 106 from residents of rural areas or small towns with fewer than 50,000 residents and 122 from residents of mid-sized or large cities with 50,000 or more inhabitants.
The observed rural-urban gradient and other trends were consistent regardless of the subset of data used. Supplementary table S8 shows the replication of the main results calculated in this scenario. T-tests were used to compare coefficients across the two regressions.
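The worst-case recoding of borderline results can be written as a small rule. The labels "positive", "negative" and "borderline", and the function name, are hypothetical encodings chosen for this sketch.

```python
def recode_worst_case(result, rural_or_small_town):
    """Recode a borderline result against the expected gradient.

    Borderline results become negative in rural areas or small towns and
    positive in larger cities. Other results are returned unchanged.
    """
    if result != "borderline":
        return result
    return "negative" if rural_or_small_town else "positive"


# Hypothetical records: (laboratory result, lives in rural area or small town).
records = [("borderline", True), ("borderline", False), ("positive", True)]
recoded = [recode_worst_case(result, rural) for result, rural in records]
```

Because the recoding deliberately works against the finding, a gradient that survives it cannot be explained by how borderline results were handled.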

Supplementary figure S1. Study enrollment flow chart and studied sample in the OBSER-CO cross-sectional survey in Poland, 29 March to 14 May 2021
Supplementary table S1. Sociodemographic characteristics of the final sample of CATI respondents and serosurvey participants compared to the general (Polish) population in the OBSER-CO cross-sectional survey in Poland, 29 March to 14 May 2021

Supplementary table S2. Sociodemographic characteristics of the final sample of serosurvey participants compared to CATI respondents in the OBSER-CO cross-sectional survey in Poland, 29 March to 14 May 2021
1 weighted distribution calculated using $w_1$ weights
2 weighted distribution calculated using the final weights, $w_k$
3 for Chi-Square difference between the weighted distributions of serosurvey and CATI participants

Factors initially considered to predict presentation for the serological test:
- The demographic factors (gender, age group, residence type, voivodeship) as in the $w_1$ definition
- Vaccination status (not vaccinated; vaccinated with one dose; vaccinated with 2 doses)
- Having experienced any symptoms from the list: fever, cough, shortness of breath, loss of smell and/or taste, sore throat, running nose, muscle or joint aches, fatigue, headache, stomach ache, nausea/vomiting, diarrhoea, rash, conjunctivitis, chills, loss of appetite, bloody nose, confusion, other neurological symptoms
- Having experienced any of the typical COVID-19 symptoms: fever, cough, shortness of breath, loss of smell and/or taste
- Having close relatives and friends (who were not members of the household) met daily
- Employment status (employed; unemployed; on pension or retired; student or in training)
- Type of work (remote or mainly remote; other)
- Participation in regular meetings outside of work (e.g., sports or art clubs)
- Participation in an organized trip or mission outside of the city of residence
- Participation in a private gathering such as a wedding
Recall period used for the questions: since March 2020 (approximately 13 months). All the above factors were initially entered into the logistic regression model. The backward selection procedure, with a retention level <0.1, removed two factors: being on a pension or retired, and being on sick leave due to symptoms compatible with COVID-19.
Supplementary table S5. Final multivariable logistic regression model predicting the probability of presenting for the serological test among the CATI respondents