What could be a better approach than the apply method?
I am working on fuzzy keyword matching. The first dataset consists of 20,180 rows and the second of about 10,000 rows. I am using the .apply method to find the matches, with a progress bar to watch the iterations per second; it reports around 23 iterations per second. How do I increase the speed, or is there a better approach than this code for fuzzy matching?
choices = df_conm['conm'].to_list()  # build the choices list once instead of on every row
df1['match'] = df1['title'].progress_apply(lambda x: process.extractOne(x, choices, score_cutoff=100))
df1
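Since score_cutoff=100 keeps only perfect scores, a hashed lookup on a normalized key can replace the full fuzzy scan entirely (and rapidfuzz's process.extractOne is a much faster drop-in for fuzzywuzzy if cutoffs below 100 are ever needed, as a later answer in this thread reports). A minimal sketch; the df_conm contents here are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"title": ["Acme Corp", "Globex", "Initech"]})
df_conm = pd.DataFrame({"conm": ["ACME CORP", "INITECH"]})

# score_cutoff=100 keeps only perfect matches, so an O(1) dictionary lookup on a
# normalized key replaces the O(n) fuzzy scan over all ~10,000 choices per row.
# (token_set_ratio-style scorers can also return 100 for reordered tokens, so
# this is an approximation of the fuzzy behaviour, not an exact equivalent.)
choices = {c.lower(): c for c in df_conm["conm"]}
df1["match"] = df1["title"].str.lower().map(choices)
```

Rows with no exact normalized match come back as NaN, which mirrors extractOne returning None below the cutoff.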
See also questions close to this topic

How to replace values in pandas dataframe
My goal is to write a program that will replace unique values in a pandas dataframe.
The following code attempts the operation:
# replace values
print(f" {s1['A1'].value_counts().index}")
for i in s1['A1'].value_counts().index:
    s1['A1'].replace(i, 1)

print(f" {s2['A1'].value_counts().index}")
for i in s2['A1'].value_counts().index:
    s2['A1'].replace(i, 2)

print("s1 after replacing values")
print(s1)
print("******************")
print("s2 after replacing values")
print(s2)
print("******************")
Expected: the values in the first dataframe s1 should be replaced with 1s, and the values in the second dataframe s2 should be replaced with 2s.
Actual:
Int64Index([8, 5, 2, 7, 6], dtype='int64')
Int64Index([2, 8, 5, 6, 7, 4, 3], dtype='int64')
s1 after replacing values
    A1        A2   A3  Class
3    5  0.440671  2.3      1
9    8  0.070035  2.9      1
14   2  0.868410  1.5      1
29   6  0.587487  2.6      1
34   8  0.652936  3.0      1
38   8  0.181508  3.0      1
45   8  0.953230  3.0      1
54   7  0.737604  2.7      1
68   5  0.187475  2.2      1
70   5  0.511385  2.3      1
71   8  0.688134  3.0      1
73   2  0.054908  1.5      1
87   8  0.461797  3.0      1
90   2  0.756518  1.5      1
91   2  0.761448  1.5      1
93   5  0.858036  2.3      1
94   5  0.306459  2.2      1
98   5  0.692804  2.2      1
******************
s2 after replacing values
    A1        A2   A3  Class
0    2  0.463134  1.5      3
1    8  0.746065  3.0      3
2    6  0.264391  2.5      2
4    2  0.410438  1.5      3
5    2  0.302902  1.5      2
..  ..       ...  ...    ...
92   5  0.775842  2.3      2
95   5  0.844920  2.2      2
96   5  0.428071  2.2      2
97   5  0.356044  2.2      3
99   5  0.815400  2.2      3
Any help understanding how to replace the values in these dataframes would be greatly appreciated. Thank you.
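A likely cause (my reading of the output above, not something stated in the thread) is that Series.replace returns a new Series rather than modifying the dataframe in place, so the loop's results are discarded. A sketch of the fix with a small made-up frame:

```python
import pandas as pd

s1 = pd.DataFrame({"A1": [8, 5, 2, 7, 6], "Class": [1] * 5})

# replace() returns a new Series; assign it back. Passing the whole set of
# unique values at once also removes the need for the loop entirely.
s1["A1"] = s1["A1"].replace(s1["A1"].value_counts().index.tolist(), 1)
```

The same pattern with the value 2 handles s2.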

Conda environment missing dependencies when trying to download THOR tracker
I am new to Linux in general and I've recently gotten into vision coding. I am trying to install the THOR real-time tracker from https://github.com/xlsr/THOR. While trying to create the conda environment I got a missing-dependencies error:
~/Downloads/THOR$ conda env create -f environment.yml
Solving environment: failed

ResolvePackageNotFound:
  - ca-certificates==2019.1.23=0
  - freetype==2.9.1=h8a8886c_1
  - ninja==1.9.0=py37hfd86e86_0
  - openssl==1.1.1c=h7b6447c_1
  - libgcc-ng==8.2.0=hdf63c60_1
  - mkl==2019.4=243
  - readline==7.0=h7b6447c_5
  - libstdcxx-ng==8.2.0=hdf63c60_1
  - cudatoolkit==10.0.130=0
  - mkl_random==1.0.2=py37hd81dba3_0
  - olefile==0.46=py37_0
  - six==1.12.0=py37_0
  - pytorch==1.1.0=py3.7_cuda10.0.130_cudnn7.5.1_0
  - mkl_fft==1.0.12=py37ha843d7b_0
  - intel-openmp==2019.4=243
  - libffi==3.2.1=hd88cf55_4
  - cffi==1.12.3=py37h2e261b9_0
  - zstd==1.3.7=h0b5b093_0
  - numpy-base==1.16.4=py37hde5b4d6_0
  - sqlite==3.28.0=h7b6447c_0
  - python==3.7.3=h0371630_0
  - jpeg==9b=h024ee3a_2
  - torchvision==0.3.0=py37_cu10.0.130_1
  - pillow==6.0.0=py37h34e0f95_0
  - libpng==1.6.37=hbc83047_0
  - ncurses==6.1=he6710b0_1
  - libedit==3.1.20181209=hc058e9b_0
  - libgfortran-ng==7.3.0=hdf63c60_0
  - libtiff==4.0.10=h2733197_2
  - zlib==1.2.11=h7b6447c_3
  - xz==5.2.4=h14c3975_4
  - numpy==1.16.4=py37h7e9f1db_0
  - blas==1.0=mkl
I have tried installing a few of them with sudo install, pip, or conda, but either the required version or the package itself is not found. The tracker is from 2019 and I do not think it has been updated since, but it is the best real-time tracker I have found that does not require recognition and training, which is exactly what I am looking for. I would appreciate any help, or suggestions for other trackers that use the GPU and do not need training and recognition.
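A common workaround for ResolvePackageNotFound with an old environment.yml (an assumption on my part, not something from the original post) is to drop the platform-specific build strings so conda can resolve builds that are still available today. A sketch of the string transformation applied to each entry under `dependencies:`:

```python
def relax_pin(spec: str) -> str:
    """Drop the trailing build string from a conda spec,
    e.g. 'readline==7.0=h7b6447c_5' -> 'readline=7.0'."""
    parts = [p for p in spec.split("=") if p]  # tolerate both '=' and '==' separators
    return "=".join(parts[:2])                 # keep only name and version

print(relax_pin("readline==7.0=h7b6447c_5"))   # -> readline=7.0
```

After relaxing the pins, `conda env create -f environment.yml` has a much better chance of solving, though a project this old may still need its Python/PyTorch versions bumped by hand.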
How to understand a code to calculate fluxes
I am unable to understand how this code works. window is a line defined by two points, and the window is where we calculate the flux. I want to understand exactly what the code is doing.
times = np.zeros((len(x)), dtype=int)
center_x, center_y = 0.5 * (window.point1[0] + window.point2[0]), 0.5 * (window.point1[1] + window.point2[1])
safety_value = 3.0
radius = safety_value * 0.5 * (math.sqrt((window.point1[0] - window.point2[0])**2 + (window.point1[1] - window.point2[1])**2))
# fictive boarders
left_boarder, right_boarder = center_x - radius, center_x + radius
bottom_boarder, up_boarder = center_y - radius, center_y + radius
i = 1
while (i < len(x) - 1):
    if (left_boarder <= x[i] <= right_boarder) and (x[i] > x[i-1]) and \
       np.sign(window.perpendicular_distance_unnormalized(x[i], y[i])) != np.sign(window.perpendicular_distance_unnormalized(x[i-1], y[i-1])) and \
       window.is_point_on_segment(x[i], y[i]) and window.is_point_on_segment(x[i-1], y[i-1]):
        times[i] += 1
        i += 1
    else:
        i += 1
return times
result = 0
for i in range(y_min, y_max):
    result += UX[x_, i]
return result

def calculate_vel_fluxes(ux, windows):
    vel_fluxes = np.empty((len(windows), ux.shape[0]))
    for i in range(len(windows)):
        for j in range(ux.shape[0]):
            vel_fluxes[i, j] = uFlux(ux[j], windows[i])
    return vel_fluxes
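The core trick in the first block is crossing detection: the trajectory crosses the window's line whenever the sign of the (unnormalized) perpendicular distance flips between consecutive points; the bounding-box and on-segment checks just restrict this to the window itself. A toy illustration of the sign-flip idea (my own example, using the line y = 0):

```python
import numpy as np

# a trajectory crosses the line y = 0 exactly where the sign of y flips
# between consecutive samples
y = np.array([-1.0, -0.5, 0.3, 0.8, -0.2])
crossed = np.sign(y[1:]) != np.sign(y[:-1])
# crossed -> [False, True, False, True]: crossings between samples 1-2 and 3-4
```

In the original code, `times[i]` counts these sign flips per sample, and `calculate_vel_fluxes` then sums the velocity field at the crossing locations.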

issue with parsing data from txt file using python
I am trying to take in the following data set, but I keep getting errors with the parsing. I'm trying to create a GeoJSON file that looks like this:
"type": "FeatureCollection", "features": [ { "id": "0", "type": "Feature", "properties": { "description": "PlaceBC", "name": "A" }, "geometry": { "type": "LineString", "coordinates": [ [ 103.9364834, 1.3218368 ], [ 103.9364744, 1.3218821 ], [ 103.9367017, 1.3219285 ], [ 103.9364707, 1.321643 ], [ 103.9363887, 1.3216271 ], [ 103.9344606, 1.3235089 ], [ 103.9355026, 1.3237205 ], [ 103.934106, 1.3217046 ] ] } }
So I'm obtaining the data from a txt file but here I'm using a sample set for quick testing.
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
import io

col = ['lat', 'long', 'name', 'description']

data = '''lat=1.3218368,long=103.9364834,107244,Place BC
lat=1.3218821,long=103.9364744,107243,Place BC
lat=1.3219285,long=103.9367017,107242,Place BC
lat=1.321643,long=103.9364707,107241,Place BC
lat=1.3216271,long=103.9363887,107240,Place BC
lat=1.3235089,long=103.9344606,107148,Place BC
lat=1.3237205,long=103.9355026,107115,Place BC
lat=1.3217046,long=103.934106,107065,Place BC
lat=1.3203204,long=103.9366324,107053,Place BC
lat=1.3206557,long=103.9373536,107052,Place BC
lat=1.3206271,long=103.9374192,107051,Place BC
lat=1.3205511,long=103.9371742,107050,Place BC
lat=1.3206044,long=103.9375056,107049,Place BC
lat=1.3207561,long=103.9371863,107048,Place BC
lat=1.3204307,long=103.9368537,107047,Place BC
lat=1.3204877,long=103.9368389,107046,Place BC
lat=1.3205465,long=103.9368269,107045,Place BC
lat=1.320612,long=103.9368246,107044,Place BC'''

# load csv as dataframe (replace io.StringIO(data) with the csv filename),
# use converters to clean up lat and long columns upon loading
df = pd.read_csv(io.StringIO(data), names=col, sep=',', engine='python',
                 converters={'lat': lambda x: float(x.split('=')[1]),
                             'long': lambda x: float(x.split('=')[1])})

# input the data from the text file
#df = pd.read_csv("latlong.txt", names=col, sep=',', engine='python',
#                 converters={'lat': lambda x: float(x.split('=')[1]),
#                             'long': lambda x: float(x.split('=')[1])})

# load dataframe as geodataframe
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long, df.lat))

# groupby on name and description, while converting the grouped geometries to a LineString
gdf = gdf.groupby(['name', 'description'])['geometry'].apply(lambda x: LineString(x.tolist())).reset_index()

gdf.to_json()
These are the errors I am getting:
AttributeError                            Traceback (most recent call last)
~\AppData\Roaming\Python\Python39\site-packages\shapely\speedups\_speedups.pyx in shapely.speedups._speedups.geos_linestring_from_py()

AttributeError: 'list' object has no attribute '__array_interface__'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14056/627086265.py in <module>
     32 gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long, df.lat))
     33 #groupby on name and description, while converting the grouped geometries to a LineString
---> 34 gdf = gdf.groupby(['name', 'description'])['geometry'].apply(lambda x: LineString(x.tolist())).reset_index()
     35
     36 gdf.to_json()

~\AppData\Roaming\Python\Python39\site-packages\pandas\core\groupby\generic.py in apply(self, func, *args, **kwargs)
    221 )
    222 def apply(self, func, *args, **kwargs):
--> 223     return super().apply(func, *args, **kwargs)
    224
    225 @doc(_agg_template, examples=_agg_examples_doc, klass="Series")

~\AppData\Roaming\Python\Python39\site-packages\pandas\core\groupby\groupby.py in apply(self, func, *args, **kwargs)
   1273 with option_context("mode.chained_assignment", None):
   1274     try:
-> 1275         result = self._python_apply_general(f, self._selected_obj)
   1276     except TypeError:
   1277         # gh-20949

~\AppData\Roaming\Python\Python39\site-packages\pandas\core\groupby\groupby.py in _python_apply_general(self, f, data)
   1307 data after applying f
   1308 """
-> 1309 keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1310
   1311 return self._wrap_applied_output(

~\AppData\Roaming\Python\Python39\site-packages\pandas\core\groupby\ops.py in apply(self, f, data, axis)
    839 # group might be modified
    840 group_axes = group.axes
--> 841 res = f(group)
    842 if not _is_indexed_like(res, group_axes, axis):
    843     mutated = True

~\AppData\Local\Temp/ipykernel_14056/627086265.py in <lambda>(x)
     32 gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long, df.lat))
     33 #groupby on name and description, while converting the grouped geometries to a LineString
---> 34 gdf = gdf.groupby(['name', 'description'])['geometry'].apply(lambda x: LineString(x.tolist())).reset_index()
     35
     36 gdf.to_json()

~\AppData\Roaming\Python\Python39\site-packages\shapely\geometry\linestring.py in __init__(self, coordinates)
     46 BaseGeometry.__init__(self)
     47 if coordinates is not None:
---> 48     self._set_coords(coordinates)
     49
     50 @property

~\AppData\Roaming\Python\Python39\site-packages\shapely\geometry\linestring.py in _set_coords(self, coordinates)
     95 def _set_coords(self, coordinates):
     96     self.empty()
---> 97     ret = geos_linestring_from_py(coordinates)
     98     if ret is not None:
     99         self._geom, self._ndim = ret

~\AppData\Roaming\Python\Python39\site-packages\shapely\speedups\_speedups.pyx in shapely.speedups._speedups.geos_linestring_from_py()

ValueError: LineStrings must have at least 2 coordinate tuples
I'm not sure what's causing the error while I'm parsing the data. Could someone tell me what the mistake is here? Thank you!
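The final ValueError says a LineString needs at least two coordinate tuples, so my best guess (hedged, not confirmed by the post) is that some (name, description) groups contain only a single point. Filtering those groups out before building LineStrings avoids the error; sketched here with plain pandas and made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "B"],
    "lat": [1.3218368, 1.3218821, 1.3203204],
    "long": [103.9364834, 103.9364744, 103.9366324],
})

# keep only groups with at least 2 points, since LineString(x.tolist())
# raises for single-point groups
df = df[df.groupby("name")["lat"].transform("size") >= 2]
```

In the original code the same `transform('size') >= 2` filter would go on the GeoDataFrame, grouped by `['name', 'description']`, right before the groupby-apply.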

Creating a dataframe from list and existing dataframe
I have a dataframe in the form of
column 1 column 2 column 3
And I would like to add values to it. I have a list which I would like to add, in the form of:
a= [['Master Vithal', ' Vithal Zubeida'], ['Firozshah Mistry', ' B Irani'], ['Grigor']]
How do I add the elements of this list so that a[0] goes to column 1, and so on? Thanks in advance!
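On one reading of "a[0] will go to column 1" (an assumption on my part; the question is ambiguous), each sub-list becomes one column, padded with None so ragged sub-lists like ['Grigor'] fit:

```python
import pandas as pd

a = [['Master Vithal', ' Vithal Zubeida'], ['Firozshah Mistry', ' B Irani'], ['Grigor']]

# pad each sub-list to equal length, then use it as one column
width = max(len(row) for row in a)
columns = {f"column {i + 1}": row + [None] * (width - len(row)) for i, row in enumerate(a)}
df = pd.DataFrame(columns)
```

If instead each sub-list is meant to be a row, `pd.DataFrame(a, columns=[...])` does that directly, again padding short rows with missing values.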

Pandas apply with conditions
I need to use two different .apply calls according to whether the rows meet a certain condition in another column. I've tried this, but it didn't work: it keeps only one of the two and fills the rest of the rows with NaNs.
ocxcat['PORCENTAJE'] = ocxcat[ocxcat.INTENSIDAD_DE_LA_OCUPACIÓN == "Subocupado"].CODUSU.apply(lambda x: 100*x/1987)
ocxcat['PORCENTAJE'] = ocxcat[ocxcat.INTENSIDAD_DE_LA_OCUPACIÓN == "Ocupado pleno"].CODUSU.apply(lambda x: 100*x/8253)
Any ideas?
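The second assignment overwrites the first: each filtered .apply only produces values for its own rows, and assigning it to the full column leaves every other row NaN. np.select applies each divisor only where its condition holds; a sketch with made-up CODUSU values:

```python
import pandas as pd
import numpy as np

ocxcat = pd.DataFrame({
    "INTENSIDAD_DE_LA_OCUPACIÓN": ["Subocupado", "Ocupado pleno"],
    "CODUSU": [1987, 8253],
})

# each condition picks its own divisor; rows matching neither stay NaN
ocxcat["PORCENTAJE"] = np.select(
    [ocxcat["INTENSIDAD_DE_LA_OCUPACIÓN"] == "Subocupado",
     ocxcat["INTENSIDAD_DE_LA_OCUPACIÓN"] == "Ocupado pleno"],
    [100 * ocxcat["CODUSU"] / 1987,
     100 * ocxcat["CODUSU"] / 8253],
    default=np.nan,
)
```

Both branches are computed vectorized, so no .apply is needed at all.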

Python - Assign a variable in a lambda apply function to calculate correlation
I have a dataframe whose column count has the potential to grow exponentially. I'm trying to calculate the correlation between two columns, multiple times, and part of the calculation involves the growing set of columns. I'm creating the columns needed for the correlation calculation in a for loop, and when I try to calculate the correlation, I get an error saying:
'DataFrame' object has no attribute 'col'
I've tried assigning the new column name to a variable and putting that variable in the lambda function, but that also doesn't work.
How do I update the correlation piece of the code to use the new columns created in the for loop?
Here is the for loop that creates the new columns. colname is a list of all column names:
for col in colname:
    df[col+'_RR'] = df['p_'+col] - df['r2500_ret']
    df[col+'_sec_rr'] = df['ret'] - df[col+'_RR']

# Calculate Correlation
dfcorr = df.groupby('symbol').apply(lambda v: v.col+'_sec_rr'.corr(v.col+'_RR')).to_frame().rename(columns={0:'jets_correlation'})
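The error comes from `v.col`: attribute access looks for a column literally named "col", and the string method call then fails too. Bracket indexing builds the name from the loop variable instead. A self-contained sketch with toy data (the column names are my own stand-ins):

```python
import pandas as pd

df = pd.DataFrame({
    "symbol": ["X", "X", "X", "Y", "Y", "Y"],
    "a_RR":     [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "a_sec_rr": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})
col = "a"

# v[col + '_sec_rr'] resolves the column name from the loop variable,
# which is what v.col cannot do
dfcorr = (
    df.groupby("symbol")
      .apply(lambda v: v[col + "_sec_rr"].corr(v[col + "_RR"]))
      .to_frame()
      .rename(columns={0: "jets_correlation"})
)
```

Here symbol X is perfectly positively correlated and Y perfectly negatively, which makes the fix easy to verify.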

How do I apply a function to a dataframe and then to a list of dataframes?
I need to calculate the rolling three-month return for multiple investment portfolios. I would like to use a function for this because I have several portfolios and benchmarks to calculate it for.
For the sake of my understanding
 I first want to know how to calculate it on one dataframe using an apply method.
 Then, I would like to know how to apply a similar function to a list of dataframes.
The dataframe
stefi <- tibble::tribble(
  ~date,        ~return,
  "1996-04-30", 0.0126,
  "1996-05-31", 0.0126,
  "1996-06-30", 0.0119,
  "1996-07-31", 0.0144,
  "1996-08-31", 0.0132,
  "1996-09-30", 0.0136,
  "1996-10-31", 0.0135,
  "1996-11-30", 0.0127,
  "1996-12-31", 0.0143,
  "1997-01-31", 0.0144
)
My attempt at a function
Here is my function to calculate the three-month return. The math is fine, but I'm not sure whether it's supposed to return a vector, a variable, or something else for the purposes of the apply method I need to use.

calc_3m_return <- function(x){
  y <- (x + 1) * ((lag(X, 1)) + 1) * ((lag(X, 2)) + 1) - 1
  return(y)
}
I'm not having luck with apply or lapply.

lapply(stefi$return, calc_3m_return, return)
R's output
> lapply(stefi$return, calc_3m_return, return)
Error in FUN(X[[i]], ...) : unused argument (return)
What I can get working
I manage to get the desired result with the following steps:
#Calculate return function
calc_3m_return <- function(return){
  y <- (return + 1) * ((lag(return, 1)) + 1) * ((lag(return, 2)) + 1) - 1
  return(y)
}

#Calculate 3-month return and append it to the dataframe
stefi <- stefi %>% mutate(return_3m = calc_3m_return(return))
Result
# A tibble: 6 x 3
  date       return return_3m
  <date>      <dbl>     <dbl>
1 1996-04-30 0.0126   NA
2 1996-05-31 0.0126   NA
3 1996-06-30 0.0119    0.0376
4 1996-07-31 0.0144    0.0395
5 1996-08-31 0.0132    0.0401
6 1996-09-30 0.0136    0.0418

Can I use NLP or NN to build a model for correction of historical text PostOCR based on list of manual corrections?
Could you point me in the right direction with a project I am working on?
Summary of Problem

My father has for years been collecting scans of old documents in Russian from various archives in Russia relating to our family history. He has developed his own pipeline for doing this, and has completed a workflow and application that does the following:
 Performs OCR on the archive scans (proprietary software which he says is the Best available)
 Runs a Russian spellcheck on the text
 Developed an app using a Levenshtein library that goes through each word the spellcheck cannot solve and provides a list of likely corrections that he can choose from (or write in his own if none fit)
 From his own corrections in step 3, he generates a CSV of the errors, and corrects the errors automatically if they pop up again.
This solution he has developed mostly works for him and he has been satisfied with his work so far, but he has an improvement he would like to make:
Take the list of corrections from step 4 and generalize them, so that even if an error is not exactly the same, the system can still recognize obvious errors resulting from a few issues:
 Bad OCR output, such as interpreting П as 11 or 1l
 OCR (written for modern Russian) being unable to identify no-longer-used letters such as Ҍ
 The Russian dictionary not containing old Russian spellings of words, which may use no-longer-used letters like і or Ҍ
What I have Tried
I only have 19,854 corrections in a CSV file to work with. It is not a lot of data. I also have the OCR documents.
I have learned about what Levenshtein is and other techniques to measure distances between words.
I have read on how Neural Networks work and how people do transfer learning to adapt models that were trained on huge amounts of data.
I was inspired by NVIDIA DLSS and Pixel phone software, and decided to read up on techniques for using models to upscale images for improved OCR clarity. I have found the NLP model hub HuggingFace with existing NLP models, but none seem to fit the bill from the search terms I used. I will ask the model authors there, as well as on their GitHub repositories.
My Current Path
So I have decided there are 3 things I can attempt, but I am not sure where to start:
 Find an existing Russian model trained on similar error → correction data and adapt it with my father's data. I believe such a dataset is unlikely to exist, as existing non-ML spellcheck techniques already work well.
 Build my own model from scratch on this data. This is difficult because I do not know how to build a model completely from scratch. I do have a potential solution to not having enough data: I could develop a script to generate more cases for my model to train on.
 I could work on enhancing the quality of the original scans prior to OCR. This seems like it could fix some issues before they occur, but not all of them.
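The "generate more cases" idea from point 2 can be sketched as applying known OCR confusions in reverse to already-correct words, synthesizing extra (error, correction) training pairs. The confusion table below is a made-up illustration, seeded from the П → 11/1l examples above:

```python
import random

# hypothetical confusion table: each correct character maps to plausible
# OCR misreadings (extend this from the real corrections CSV)
CONFUSIONS = {"П": ["11", "1l", "ІІ"], "щ": ["іц"]}

def corrupt(word, rng):
    """Replace each character with a random known misreading, if one exists."""
    return "".join(rng.choice(CONFUSIONS[ch]) if ch in CONFUSIONS else ch
                   for ch in word)

rng = random.Random(0)
pairs = [(corrupt(w, rng), w) for w in ["Правительствующій"] * 3]
```

Mining the real confusion table from the 19,854 corrections (character-level diffs like the difflib experiment below) would make the synthetic errors match the OCR engine's actual failure modes.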
My Code
As of right now I do not have any code, but here is a snippet of the source data:
1Іравительствующій	Правительствующій	18
благотворительница	благотворительница	18
благотворительнымъ	благотворительнымъ	18
благотворительныхъ	благотворительныхъ	18
головокружительной	головокружительной	18
жертвовательницамъ	жертвовательницамъ	18
ІІравительствующій	Правительствующій	18
ІІреосвяіденствомъ	Преосвященствомъ	18
неприкосновенность	неприкосновенность	18
And a snippet of my experimentation with cleaning and word distances on the data, using pandas and difflib:
import pandas as pd
from difflib import ndiff

def delete_equal_words(df):
    for index, row in df.iterrows():
        if row[0] == row[1]:
            df.drop(index, inplace=True)
    return df

def print_word_differences(df):
    df_common_corrections = data = {'insert': [], 'delete': []}
    for index, row in df.iterrows():
        print('{} => {}'.format(row[0], row[1]))
        diff = ndiff(str(row[0]), str(row[1]))
        #print(''.join(diff), end="")
        for i, s in enumerate(diff):
            if s[0] == ' ':
                continue  # unchanged character
            elif s[0] == '-':
                print(u'Delete "{}" from position {}'.format(s[2], i))  # s[2] is the character; s[0] is the diff marker
            elif s[0] == '+':
                print(u'Add "{}" to position {}'.format(s[2], i))

def csv_to_dataframe(file):
    df = pd.read_csv(file, delimiter='\t', names=['original', 'fixed', 'str length'])
    #print("df: ", df)
    df = delete_equal_words(df)
    df = df.drop_duplicates(subset=["original"])  # drop_duplicates returns a copy; assign it back
    df.dropna(inplace=True)
    print("df shape: ", df.shape)
    print(df)
    print_word_differences(df)
    df.to_csv('out.csv', index=False)

How to obtain the witness function of the Stein operator
With the Stein operator:
We can define the Kernel Stein discrepancy as:
I found a nice explanation of the witness function modified by the Stein operator at: https://slideslive.com/38917868/relativegoodnessoffittestsformodelswithlatentvariables, where this function is explaining how the two distributions are different:
I tried to reproduce this figure as an explanation of the witness function, but I can't quite understand how he calculated the witness function g* in order to code it up:
import torch
import matplotlib.pyplot as plt

p = torch.distributions.normal.Normal(torch.tensor([0.0]), torch.tensor([1.0]))
q = torch.distributions.normal.Normal(torch.tensor([1.0]), torch.tensor([1.0]))
x = torch.arange(-4, 4, .1)

fig = plt.figure(1)
ax = plt.gca()
ax.plot(x, torch.exp(p.log_prob(x)), 'red')
ax.plot(x, torch.exp(q.log_prob(x)), 'blue')
ax.grid(True)
ax.spines['left'].set_position('zero')
ax.spines['right'].set_color('none')
ax.spines['bottom'].set_position('zero')
ax.spines['top'].set_color('none')
ax.set_xlim(-4, 4)
ax.set_ylim(-.7, .5)

"""How to compute the $g^*$ (witness function) to plot it."""
Can you please explain how he computed the witness function g* (the green line), and how to code it up?
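For the kernelized Stein discrepancy, the optimal witness has the closed form g*(·) = E_{x∼q}[s_p(x) k(x, ·) + ∂_x k(x, ·)], where s_p = ∇ log p is the score of the model p. That is standard KSD theory rather than anything confirmed by the slides, so treat the sketch below (NumPy, RBF kernel, Monte Carlo over samples from q) as my best guess at what the green line shows:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(1.0, 1.0, size=5000)   # samples from q = N(1, 1)
grid = np.linspace(-4, 4, 81)          # where we evaluate g*
sigma2 = 1.0                           # RBF bandwidth (a free choice)

k = lambda x, t: np.exp(-(x - t) ** 2 / (2 * sigma2))
dk_dx = lambda x, t: -(x - t) / sigma2 * k(x, t)

score_p = -xs                          # d/dx log N(x; 0, 1) = -x for p = N(0, 1)

# Monte Carlo estimate of g*(t) = E_q[ score_p(x) k(x, t) + d/dx k(x, t) ]
g_star = np.mean(score_p[:, None] * k(xs[:, None], grid[None, :])
                 + dk_dx(xs[:, None], grid[None, :]), axis=0)
```

Plotting `grid` against `g_star` on the axes from the question should reproduce a curve of the same shape as the one in the talk, up to kernel-bandwidth and normalization choices.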

Using Levenshtein similarity for comparing values within a column and computing the score
s_t = lev.distance(publications["title"], publications["title"])
I want to compare the similarity between all the titles within the title column in the publications.csv dataset downloaded from https://dbs.unileipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution.
The similarity must be calculated using the Levenshtein distance, and then the scores must be computed and saved as s_t.
My question therefore is: does anybody know how to compute the Levenshtein similarity between all the strings within a column?
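A direct answer sketch: compute the distance for every pair of titles and normalize it into a similarity in [0, 1]. The pure-Python distance below stands in for the Levenshtein package's lev.distance, which likewise takes one pair of strings at a time — that is why passing two whole columns, as in the snippet above, fails:

```python
def lev_distance(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, or substitution/match
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

titles = ["deep learning", "deep learning!", "graph matching"]
# similarity = 1 - distance / length of the longer string
s_t = [[1 - lev_distance(a, b) / max(len(a), len(b), 1) for b in titles]
       for a in titles]
```

With a real pandas column, `titles = publications["title"].tolist()` gives the same all-pairs matrix; note it is O(n²) pairs, so for large columns a faster scorer (e.g. the rapidfuzz library) is worth considering.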

delete duplicated text in csv file with fuzzywuzzy
I have a csv file called CompaniesShort.csv (download link: link); it has 19 articles in it, some of which are duplicated. If I use "remove duplicated lines" in Notepad++, it removes only EXACT (100%) duplicates and does not remove >=70% duplicates. So I wrote this code, and it works:
import csv
import dedupe
from fuzzywuzzy import process, fuzz

with open("CompaniesShort.csv", encoding="utf8", errors='ignore') as inputfile:
    lists = inputfile.read().split("\n")
print(len(lists))

#process.extract(query, choices)
clean = process.dedupe(lists)
keyall = list(clean)
print(clean)

save_file = open("CompaniesShort_clean.csv", "w", encoding="utf8", errors='ignore')
writer = csv.writer(save_file)
for keys in keyall:
    writer.writerow(keys)
save_file.close()
However, I have another CSV file that contains 3,000 articles. The above code ran for a long time (30 min) but didn't generate any output. I understand that the code tries to gather everything together rather than writing one entry at a time. I need your help, thank you.
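A sketch of an incremental alternative (my own restructuring; difflib stands in for fuzzywuzzy's scorer so the example is dependency-free): keep the accepted articles and emit each new one as soon as it matches nothing kept so far, so output can be written as you go instead of all at the end. In real use, rapidfuzz's fuzz.ratio is a much faster scorer for 3,000 articles than fuzzywuzzy.

```python
from difflib import SequenceMatcher

def dedupe_stream(lines, threshold=0.7):
    """Yield each line that is less than `threshold` similar to every kept line."""
    kept = []
    for line in lines:
        if not any(SequenceMatcher(None, line, k).ratio() >= threshold for k in kept):
            kept.append(line)
            yield line  # in the real script: writer.writerow([line]) here

rows = ["alpha corp", "alpha corp.", "beta llc"]
out = list(dedupe_stream(rows))
```

This is still O(n²) comparisons in the worst case, but rows stream to disk immediately, so a long run at least produces partial output instead of none.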

Parallelize for loop in pd.concat
I need to merge two large datasets based on string columns which don't perfectly match. I have wide datasets which can help me determine the best match more accurately than string distance alone, but I first need to return several 'top matches' for each string.
Reproducible example:
import pandas as pd
from fuzzywuzzy import process

def example_function(idx, string, comparisons, n):
    tup = process.extract(string, comparisons, limit = n)
    df2_index = [i[2] for i in tup]
    scores = [i[1] for i in tup]
    return pd.DataFrame({
        "df1_index": [idx] * n,
        "df2_index": df2_index,
        "score": scores
    })

s1 = pd.Series(["two apples", "one orange", "my banana", "red grape", "huge kiwi"])
s2 = pd.Series(["a couple of apples", "old orange", "your bananas", "purple grape", "tropical fruit"])

pd.concat([example_function(index, value, s2, 2) for index, value in s1.items()]).reset_index()
I've been unsuccessful at parallelizing this function. The closest thing to what I'm trying to do seems to be the multiprocessing implementation, but even with starmap I am not getting results. I'd imagine there's a simple way to achieve this, but I have not yet found a method that works.
I'm open to any advice on how to optimize my code, but parallel processing would be an appropriate solution in this case, since it looks like the job would take about 45 hours (in hindsight, a generous estimate) if done sequentially.
UPDATE
Thank you for the solutions. I have a df1 which is 7,000 rows and a df2 which is 70,000 rows. For the results below I've searched all 70,000 rows from df2 for each of the first 20 rows from df1.
 dataframe concat method (original method): 96 sec
 dictionary chain method: 90 sec
 add parallel with dask (4 workers): 77 sec
 use rapidfuzz instead of fuzzywuzzy: 6.73 sec
 use rapidfuzz with dask (4 workers): 5.29 sec
Here is the optimized code:
import pandas as pd
from dask.distributed import Client
from dask import delayed
from rapidfuzz import process, fuzz
from itertools import chain

client = Client(n_workers = 4, processes = False)

def example_function(idx, string, comparisons, n):
    tup = process.extract(string, comparisons, scorer = fuzz.WRatio, limit = n)
    return [{'idx': idx, 'index2': t[2], 'score': t[1]} for t in tup]

jobs = [delayed(example_function)(index, value, t3, 20) for index, value in t1.items()]
data = delayed(jobs)
df = pd.DataFrame.from_records(chain(*data.compute()))
print(df)

client.close()
Parallel processing didn't have quite the impact I was expecting. Perhaps this function isn't set up ideally for it, or perhaps it will scale to a larger impact as I include more iterations. Either way, it did make a difference, so I'm using it in my personal solution. Thanks, all.
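On the starmap attempt mentioned above: pool.starmap needs the worker function defined at module top level and an iterable of argument tuples; the usual silent-failure culprits are lambdas (which don't pickle) or local functions. A dependency-free sketch of the shape that works (the equal-position character counter is a stand-in scorer, not process.extract):

```python
from multiprocessing.dummy import Pool  # thread pool; same starmap API as multiprocessing.Pool

def best_match(idx, string, comparisons):
    # stand-in scorer: count of equal characters at equal positions
    scores = [sum(a == b for a, b in zip(string, c)) for c in comparisons]
    j = max(range(len(scores)), key=scores.__getitem__)
    return (idx, j, scores[j])

s1 = ["two apples", "one orange"]
s2 = ["a couple of apples", "old orange"]
with Pool(2) as pool:
    # one (idx, string, choices) tuple per task
    results = pool.starmap(best_match, [(i, v, s2) for i, v in enumerate(s1)])
```

Swapping `multiprocessing.dummy.Pool` for `multiprocessing.Pool` gives true process parallelism with the identical call pattern, provided the worker stays at module scope.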

Using FuzzyWuzzy with pandas
I am trying to calculate the similarity between the cities in my dataframe and one static city name. (Eventually I want to iterate through a dataframe and choose the best-matching city name from that dataframe, but I am testing my code on this simplified scenario.) I am using fuzzywuzzy's token set ratio. For some reason it calculates the first row correctly, and then seems to assign that same value to all rows.
code:
import pandas as pd
from fuzzywuzzy import fuzz

test_df = pd.DataFrame({"City": ["Amsterdam", "Amsterdam", "Rotterdam", "Zurich", "Vienna", "Prague"]})
test_df = test_df.assign(Score = lambda d: fuzz.token_set_ratio("amsterdam", test_df["City"]))
print(test_df.shape)
test_df.head()
Result:
        City  Score
0  Amsterdam    100
1  Amsterdam    100
2  Rotterdam    100
3     Zurich    100
4     Vienna    100
If I do the comparison one by one it works:
print(fuzz.token_set_ratio("amsterdam", "Amsterdam"))
print(fuzz.token_set_ratio("amsterdam", "Rotterdam"))
print(fuzz.token_set_ratio("amsterdam", "Zurich"))
print(fuzz.token_set_ratio("amsterdam", "Vienna"))
Results:
100
67
13
13
Thank you in advance!
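The cause (my reading, not confirmed in the thread): fuzz.token_set_ratio expects two strings, and the lambda above passes the whole Series, which gets stringified once, producing a single score broadcast to every row. Applying the scorer element-wise fixes that. SequenceMatcher stands in for fuzzywuzzy here so the sketch has no extra dependency; with fuzzywuzzy the apply body would be `fuzz.token_set_ratio("amsterdam", c)`:

```python
import pandas as pd
from difflib import SequenceMatcher

# stand-in for fuzz.token_set_ratio: 0-100 similarity of two strings
score = lambda a, b: round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

test_df = pd.DataFrame({"City": ["Amsterdam", "Rotterdam", "Zurich"]})
# element-wise apply scores each city separately instead of the whole Series at once
test_df["Score"] = test_df["City"].apply(lambda c: score("amsterdam", c))
```

The same one-row-at-a-time pattern is what makes the one-by-one prints in the question work while the assign-with-a-Series version does not.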