Zoe Oboler and Izzy Blair, "Comparing Artwork Metadata", link: https://zoeo11.github.io/
Comparing Artwork Metadata is a collaborative effort between Izzy Blair and Zoe Oboler. We are using Python, pandas, and Matplotlib in Google Colab for analysis and visualization. Our dataset is a comprehensive spreadsheet of all the items currently in the collection of the Met Museum in New York City. It covers more than 400,000 items, so there is a very large amount of data to work through (this is also why we are not hosting the data on GitHub, which restricts file sizes). Each row (representing one object) holds 43 columns of data, including the department of the museum, the nationality and time period of the artist, the type of object, the date the object was recorded/collected, and more. We have compiled visualizations grouping the objects in various ways, which will inform our predictive model.
Our model will use this data to predict the date of creation (i.e., "Object Begin Date") based on other criteria. This will be an especially helpful tool for predicting the Begin Date of older objects, many of which are missing information in the database.
Collaboration Plan: We will be working together at least once or twice a week (we have met up four times and collaborated virtually more often) to decide what metrics we want to display and how to visualize them. Our original plan was to compare metadata from more than one museum; however, the data is so large that we have only had the capacity to work on the Met Museum data thus far. There are still plenty of ways to analyze a single dataset, as so much information is provided about each object. Some questions we hope to answer are: How do objects from different cultures compare in terms of collection date, creation date, and object type? How do different object types differ in terms of collection size and place of origin? What time period contains the most objects and the most diverse set of artists?
%%shell
jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
= pd.read_csv("/content/drive/MyDrive/MetObjects.csv")
df df.head()
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/MetObjects.csv'
= pd.read_csv("/content/drive/MyDrive/ArtMetaData/MetObjects.csv")
df df.head()
<ipython-input-4-a5a233a04e20>:1: DtypeWarning: Columns (7,8,9,10,11,18,27,28,29,30,31,32,33,34,35,36,37,39) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("/content/drive/MyDrive/ArtMetaData/MetObjects.csv")
| | Object Number | Is Highlight | Is Public Domain | Object ID | Department | Object Name | Title | Culture | Period | Dynasty | ... | Subregion | Locale | Locus | Excavation | River | Classification | Rights and Reproduction | Link Resource | Metadata Date | Repository |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1979.486.1 | False | False | 1 | American Decorative Arts | Coin | One-dollar Liberty Head Coin | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Metal | NaN | http://www.metmuseum.org/art/collection/search/1 | 4/3/2017 8:00:08 AM | Metropolitan Museum of Art, New York, NY |
1 | 1980.264.5 | False | False | 2 | American Decorative Arts | Coin | Ten-dollar Liberty Head Coin | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Metal | NaN | http://www.metmuseum.org/art/collection/search/2 | 4/3/2017 8:00:08 AM | Metropolitan Museum of Art, New York, NY |
2 | 67.265.9 | False | False | 3 | American Decorative Arts | Coin | Two-and-a-Half Dollar Coin | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Metal | NaN | http://www.metmuseum.org/art/collection/search/3 | 4/3/2017 8:00:08 AM | Metropolitan Museum of Art, New York, NY |
3 | 67.265.10 | False | False | 4 | American Decorative Arts | Coin | Two-and-a-Half Dollar Coin | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Metal | NaN | http://www.metmuseum.org/art/collection/search/4 | 4/3/2017 8:00:08 AM | Metropolitan Museum of Art, New York, NY |
4 | 67.265.11 | False | False | 5 | American Decorative Arts | Coin | Two-and-a-Half Dollar Coin | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | Metal | NaN | http://www.metmuseum.org/art/collection/search/5 | 4/3/2017 8:00:08 AM | Metropolitan Museum of Art, New York, NY |
5 rows × 43 columns
df.shape
(448203, 43)
As you can see below, there are many columns of data. Several of them contain a large share of NaN values, which we will clean up in order to make the data usable.
print(df.columns.tolist())
['Object Number', 'Is Highlight', 'Is Public Domain', 'Object ID', 'Department', 'Object Name', 'Title', 'Culture', 'Period', 'Dynasty', 'Reign', 'Portfolio', 'Artist Role', 'Artist Prefix', 'Artist Display Name', 'Artist Display Bio', 'Artist Suffix', 'Artist Alpha Sort', 'Artist Nationality', 'Artist Begin Date', 'Artist End Date', 'Object Date', 'Object Begin Date', 'Object End Date', 'Medium', 'Dimensions', 'Credit Line', 'Geography Type', 'City', 'State', 'County', 'Country', 'Region', 'Subregion', 'Locale', 'Locus', 'Excavation', 'River', 'Classification', 'Rights and Reproduction', 'Link Resource', 'Metadata Date', 'Repository']
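To make the missingness concrete, a quick per-column check (our addition, not part of the original notebook) could look like this:

# Fraction of missing values per column, highest first.
missing = df.isna().mean().sort_values(ascending=False)
print(missing.head(10))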
df['Object Begin Date'].value_counts()
Object Begin Date
1800 23579
1700 16355
1900 10583
1600 7157
1888 5936
...
-2599 1
1423 1
-1599 1
646 1
1017 1
Name: count, Length: 2074, dtype: int64
df['Classification'].value_counts()
Classification
Prints 69260
Prints|Ephemera 30033
Photographs 26821
Drawings 25230
Books 14685
...
Prints|Drawings|Books 1
Books|Manuscripts|Ornament & Architecture 1
Ornament & Architecture|Books 1
Albums|Books 1
Paper-Documents|Prints 1
Name: count, Length: 1077, dtype: int64
df['Period'].value_counts()
Period
Edo period (1615–1868) 8710
New Kingdom 6594
Middle Kingdom 4646
New Kingdom, Ramesside 3758
Qing dynasty (1644–1911) 3713
...
Late Archaic or Classical 1
Imperial, Late Flavian–Hadrianic 1
late Central or early Eastern Javanese period 1
Early Imperial, Augustan, probably 1
Ramesside/Third Intermediate Period 1
Name: count, Length: 1695, dtype: int64
The cell below transforms the date column, which has an unusable format for some rows: objects that took many years to make have two four-digit years concatenated into a single number. In order to have just one usable year per object, we compute the average of the two years and then update the date column to reflect this change.
df2 = df
# Rows encoding two concatenated four-digit years have values above 9999.
two_year_rows = df2[df2['Object Begin Date'] > 9999].copy()

# Compute the average of the two years
two_year_rows['Object Begin Date'] = (two_year_rows['Object Begin Date'] // 10000 + two_year_rows['Object Begin Date'] % 10000) / 2

# Update the original DataFrame with the modified values
df.update(two_year_rows)
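As a quick sanity check of the decoding arithmetic (our addition, using a made-up encoded value):

# A hypothetical encoded value: 1750 and 1760 packed into one number.
encoded = 17501760
decoded = (encoded // 10000 + encoded % 10000) / 2
assert decoded == 1755.0  # midpoint of the two years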
Below, we filter our original dataframe to show the distribution of classifications for which the MET holds more than 2,000 objects.
filtered_df = df['Classification'].value_counts()[df['Classification'].value_counts() > 2000]

# Plotting the bar chart for the filtered counts
filtered_df.plot(kind='bar')
plt.xlabel('Classification')
plt.ylabel('Count')
plt.title('Distribution of Classification (Counts > 2000)')
plt.show()
Distribution of Classification: There are far more types (classifications) of objects than shown here, but we condensed the display to those classifications with more than 2,000 pieces. We did this by filtering the value counts to those greater than 2,000, then used Matplotlib to visualize the result as a bar graph.
# Group by 'Object Begin Date' (year) and count the objects per year
df3 = df2.loc[(df2['Object Begin Date'] >= 0) & (df2['Object Begin Date'] <= 2024)]
obj_per_year = df3.groupby('Object Begin Date').size()

# Plot the line graph
obj_per_year.plot(kind='line')
plt.xlabel('Year')
plt.ylabel('Number of Objects')
plt.title('Number of Objects per Creation Year')
plt.grid(True)
plt.show()
Number of Objects per Creation Year: We made this line plot by grouping the filtered dataframe by creation year (Object Begin Date) and plotting the number of objects for each year. It is important to note that, due to scale, we filtered out years before 0 (BCE), which is something we will need to address later in the project.
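One possible way to address the BCE years later (our suggestion, not part of the notebook) is to plot the full year range with a symmetric log x-axis, so the sparse ancient tail stays visible:

# Include BCE objects (negative years) and compress the long ancient tail.
all_years = df2.groupby('Object Begin Date').size()
ax = all_years.plot(kind='line')
ax.set_xscale('symlog')  # linear near zero, logarithmic for large |year|
plt.xlabel('Year (negative = BCE)')
plt.ylabel('Number of Objects')
plt.show()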
Number of Prints per Culture: This plot takes print_df (a dataframe of just the Prints) and groups it by Culture, then displays the counts as a bar graph.
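The cell that produced this plot is missing from the export; a minimal reconstruction, assuming print_df is the subset of df whose Classification is exactly 'Prints' and using a display cutoff like the other charts, might look like:

# Assumed definition: all objects classified exactly as 'Prints'.
print_df = df[df['Classification'] == 'Prints']

# Count prints per culture and plot the most common cultures as a bar graph.
prints_per_culture = print_df['Culture'].value_counts()
prints_per_culture[prints_per_culture > 500].plot(kind='bar')
plt.xlabel('Culture')
plt.ylabel('Number of Prints')
plt.title('Number of Prints per Culture')
plt.show()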
df['Culture'].value_counts()
Culture
American 22167
French 18224
Japan 16374
China 13844
Italian 6580
...
German, Weimar 1
American Eskimo (Alaska) 1
possibly German, Königsberg 1
Russian (Petrograd) 1
French, presumably Paris 1
Name: count, Length: 7101, dtype: int64
newfiltered_df = df['Culture'].value_counts()[df['Culture'].value_counts() > 500]

# Plotting the bar chart for the filtered counts
newfiltered_df.plot(kind='bar')
plt.xlabel('Culture')
plt.ylabel('Count')
plt.title('Distribution of Cultures (Counts > 500)')
plt.show()
Distribution of Cultures: Similarly to the Distribution of Classification graph, we plotted how many objects in the collection come from each culture (we only had space to show cultures with more than 500 objects at the museum). It is interesting to compare this metric, where America has the most objects, to the prints-per-culture distribution, where American prints make up a much smaller share.
newfiltered_df = df['Artist Display Name'].value_counts()[df['Artist Display Name'].value_counts() > 500]

# Plotting the bar chart for the filtered counts
newfiltered_df.plot(kind='bar')
plt.xlabel('Artist Display Name')
plt.ylabel('Count')
plt.title('Distribution of Artists (Counts > 500)')
plt.show()
We show here the distribution of artists by display name (the name as displayed at the MET museum).
newfiltered_df = df['Period'].value_counts()[df['Period'].value_counts() > 500]

# Plotting the bar chart for the filtered counts
newfiltered_df.plot(kind='bar')
plt.xlabel('Period')
plt.ylabel('Count')
plt.title('Distribution of Period (Counts > 500)')
plt.show()
Here we display the distribution of Period. Period is a broad generalization of time period, but this chart gives us a good idea of which periods most objects come from.
Model Plan: Our models use the location, culture, artist, and classification variables to determine the year a piece of art was made. Our EDA graphs above show year of creation against varying factors such as region, culture, and artist. For our models, one will determine time of creation from the above factors INCLUDING information about the artist; another will predict time of creation from the same factors but EXCLUDING information about the artist. We want the latter because many pieces were found with no information about their artist, and such a model could help predict an object's time of creation even when the artist is unknown.
Below, we split the data into different eras, which we will use to train models that predict Object Begin Date. We split into eras because there is so much data that our computers do not have sufficient RAM to process it all at once, even after splitting the data in other ways. We tried splitting the data randomly, but found our models were not very predictive in those cases.
# Create a new df called dfme with only the rows from the last 100 years, based on df2["Object Begin Date"]
dfme = df2[df2['Object Begin Date'] >= 1922]
dfme.shape
(73015, 43)
We also drop NaN values so that our model can use the data to predict. We found that imputing values here skews the data, as most of the objects have begin dates closer to 1900-2000. So, in order to avoid this skewing, we simply dropped NaN values.
# Drop dfme rows with NaN values in "Classification" or "Culture"
dfme = dfme.dropna(subset=['Classification', 'Culture'])
dfme.shape
(2668, 43)
dfme1 = df2[(df2['Object Begin Date'] < 1922) & (df2['Object Begin Date'] >= 1822)]
dfme1 = dfme1.dropna(subset=['Classification', 'Culture'])

dfme2 = df2[(df2['Object Begin Date'] < 1822) & (df2['Object Begin Date'] >= 1750)]
dfme2 = dfme2.dropna(subset=['Classification', 'Culture'])
Below, we have our K-Nearest-Neighbors model to predict the Object Begin Date based on Classification and Culture. We expect that this will not be a very predictive model: Culture is a broad generalization of an object (as you can see from our data above), so the model will probably generalize in its predictions as well. Also, we are only predicting for objects made after 1922. Still, this is an important first step, and we will later test the model's accuracy using k-fold cross-validation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=10)

features = ["Classification", "Culture"]
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])

X_train_dict = dfme[features].to_dict(orient="records")
y_train = dfme["Object Begin Date"]

# One-hot encode the categorical features
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)

# Standardize the encoded features
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)

model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc, y_train)

y_train_pred = model.predict(X_train_sc)
y_train_pred

scores = cross_val_score(pipeline, X_train_dict, y_train,
                         cv=17, scoring="neg_mean_squared_error")
Below, we plot the MSE (mean squared error) of our model as k increases. This tests the model's accuracy and gives us the value of k with the minimum error, which here seems to be around six or seven.
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])

def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_train_dict, y_train,
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
test_errs.sort_values()
KeyboardInterrupt: the cross-validation sweep was interrupted partway through (the traceback ends inside DictVectorizer's dense toarray conversion).
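Since the interrupt landed in the dense one-hot conversion, one way to make the sweep cheaper (a sketch of our suggestion, not the notebook's method) is to keep the matrix sparse and skip mean-centering, which sparse input requires:

# Sparse variant of the k sweep: DictVectorizer(sparse=True) avoids
# materializing a dense one-hot matrix, and StandardScaler(with_mean=False)
# scales without centering, as required for sparse input.
def get_cv_error_sparse(k):
    pipeline = Pipeline([
        ("vectorizer", DictVectorizer(sparse=True)),
        ("scaler", StandardScaler(with_mean=False)),
        ("fit", KNeighborsRegressor(n_neighbors=k)),
    ])
    return np.mean(-cross_val_score(
        pipeline, X_train_dict, y_train,
        cv=10, scoring="neg_mean_squared_error"))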
Below, we plot a color-mapped scatter of our first prediction model. As you can see, it is not very predictive, as we expected.
import matplotlib.pyplot as plt

pipeline.fit(X_train_dict, y_train)
predicted_values = pipeline.predict(X_train_dict)

top_cultures = dfme['Culture'].value_counts().nlargest(10).index

filtered_df = dfme[(dfme['Culture'].isin(top_cultures)) & (dfme['Object Begin Date'] > 0)].copy()
# Predict for just the filtered rows so each point's color matches its own prediction
filtered_preds = pipeline.predict(filtered_df[features].to_dict(orient="records"))
filtered_df.reset_index(drop=True, inplace=True)

plt.figure(figsize=(10, 8))
plt.scatter(filtered_df['Culture'], filtered_df['Object Begin Date'], c=filtered_preds, cmap='viridis')
plt.colorbar(label='Predicted Object Begin Date')
plt.xlabel('Culture')
plt.ylabel('Object Begin Date')
plt.title('Predicted Object Begin Date based on top 10 cultures')
plt.xticks(rotation=45, ha='right')
plt.grid(True)
plt.show()
Below we will do the same thing for a new era, using dataframe dfme1.
# For dfme1, new era (1822–1921)
features = ["Classification", "Culture"]
pipeline1 = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])

X_train_dict1 = dfme1[features].to_dict(orient="records")
y_train1 = dfme1["Object Begin Date"]

vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict1)
X_train1 = vec.transform(X_train_dict1)

scaler = StandardScaler()
scaler.fit(X_train1)
X_train_sc1 = scaler.transform(X_train1)

model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc1, y_train1)

y_train_pred1 = model.predict(X_train_sc1)
y_train_pred1

scores1 = cross_val_score(pipeline1, X_train_dict1, y_train1,
                          cv=17, scoring="neg_mean_squared_error")
Again, we get the MSE for this prediction model to test the accuracy.
# dfme1
def get_cv_error1(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_train_dict1, y_train1,
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error1)

test_errs.plot.line()
test_errs.sort_values()
# dfme1
pipeline1.fit(X_train_dict1, y_train1)
predicted_values1 = pipeline1.predict(X_train_dict1)

top_cultures1 = dfme1['Culture'].value_counts().nlargest(10).index

filtered_df1 = dfme1[(dfme1['Culture'].isin(top_cultures1)) & (dfme1['Object Begin Date'] > 0)].copy()
# Predict for just the filtered rows so colors align with the plotted points
filtered_preds1 = pipeline1.predict(filtered_df1[features].to_dict(orient="records"))
filtered_df1.reset_index(drop=True, inplace=True)

plt.figure(figsize=(10, 8))
plt.scatter(filtered_df1['Culture'], filtered_df1['Object Begin Date'], c=filtered_preds1, cmap='viridis')
plt.colorbar(label='Predicted Object Begin Date')
plt.xlabel('Culture')
plt.ylabel('Object Begin Date')
plt.title('Predicted Object Begin Date based on top 10 cultures')
plt.xticks(rotation=45, ha='right')
plt.grid(True)
plt.show()
And again for the third era.
# For dfme2, new era (1750–1821)
features = ["Classification", "Culture"]
pipeline2 = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])

X_train_dict2 = dfme2[features].to_dict(orient="records")
y_train2 = dfme2["Object Begin Date"]

vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict2)
X_train2 = vec.transform(X_train_dict2)

scaler = StandardScaler()
scaler.fit(X_train2)
X_train_sc2 = scaler.transform(X_train2)

model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc2, y_train2)

y_train_pred2 = model.predict(X_train_sc2)
y_train_pred2

scores2 = cross_val_score(pipeline2, X_train_dict2, y_train2,
                          cv=17, scoring="neg_mean_squared_error")
# dfme2
def get_cv_error2(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mse = np.mean(-cross_val_score(
        pipeline, X_train_dict2, y_train2,
        cv=10, scoring="neg_mean_squared_error"
    ))
    return mse

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
test_errs = ks.apply(get_cv_error2)

test_errs.plot.line()
test_errs.sort_values()
# dfme2
pipeline2.fit(X_train_dict2, y_train2)
predicted_values2 = pipeline2.predict(X_train_dict2)

top_cultures2 = dfme2['Culture'].value_counts().nlargest(10).index

filtered_df2 = dfme2[(dfme2['Culture'].isin(top_cultures2)) & (dfme2['Object Begin Date'] > 0)].copy()
# Predict for just the filtered rows so colors align with the plotted points
filtered_preds2 = pipeline2.predict(filtered_df2[features].to_dict(orient="records"))
filtered_df2.reset_index(drop=True, inplace=True)

plt.figure(figsize=(10, 8))
plt.scatter(filtered_df2['Culture'], filtered_df2['Object Begin Date'], c=filtered_preds2, cmap='viridis')
plt.colorbar(label='Predicted Object Begin Date')
plt.xlabel('Culture')
plt.ylabel('Object Begin Date')
plt.title('Predicted Object Begin Date based on top 10 cultures')
plt.xticks(rotation=45, ha='right')
plt.grid(True)
plt.show()
Below, we will test the accuracy of a model with three criteria (Classification, Culture, and Medium) versus a model with the same criteria plus Artist Display Name. As we discussed above, we will look at these two models in comparison because many objects do not have an Artist Display Name, or any information about the artist at all.
= ["Classification", "Culture", "Medium"]
features = df2[features].to_dict(orient="records")
X_dict
np.mean(-cross_val_score(pipeline, X_dict, y, cv=10, scoring="neg_mean_squared_error")
)
= ["Classification", "Culture", "Medium", "Artist Display Name"]
features = df2[features].to_dict(orient="records")
X_dict
np.mean(-cross_val_score(pipeline, X_dict, y, cv=10, scoring="neg_mean_squared_error")
)
As you can see, the prediction with the Artist Name included had a lower MSE, meaning it is a more accurate prediction. This is to be expected, as more information about an object allows a more accurate prediction.
While this is a helpful data point, we think both models still have potential use cases for our project: found objects being identified may not have a known artist, and neither do many of the older objects in the collection.
As part of our testing process, we chose an actual object that we already know information about. We found this object on the museum's website: https://www.metmuseum.org/art/collection/search/16885
We will use the model to predict the Object Begin Date for this object and see how accurate the prediction is.
= ["Classification", "Culture", "Medium", "Artist Display Name", "Region"]
features
= pd.get_dummies(dfme[features])
X_train = dfme["Object Begin Date"]
y_train
X_train.columns
= pd.Series(index=X_train.columns, dtype=float)
x_new
"Classification_Drawings"] = 1
x_new["Culture_American"] = 1
x_new["Medium_Graphite"] = 1
x_new["Region_American"] = 1
x_new[
0, inplace=True)
x_new.fillna(
x_new
= X_train.mean()
X_train_mean = X_train.std()
X_train_std
= (X_train - X_train_mean) / X_train_std
X_train_sc = (x_new - X_train_mean) / X_train_std
x_new_sc
# Find index of 30 nearest neighbors.
= np.sqrt(((X_train_sc - x_new_sc) ** 2).sum(axis=1))
dists = dists.sort_values()[:30].index
i_nearest
# Average the labels of these 30 nearest neighbors
y_train.loc[i_nearest].mean()
The model is therefore pretty accurate for this drawing. The museum website does not list a creation date for this object, only the date it was gifted to the museum, 1925, so the prediction is quite close to the data we do have for this particular object.
Conclusions: Because our data is inundated with recently dated objects, we had to create a predictive model that accurately predicts dates without being skewed toward more recent years. So, instead of imputing missing values to train our model, we dropped all NaNs. We then used k-fold cross-validation to test our model's accuracy. We found that models that used more criteria (predicting from Culture, Classification, Artist Display Name, Region, and Medium rather than just Culture and Classification) were more accurate. This makes sense for two reasons: there is more data to train on, and some of these categories are very broad; one object classified as "American" might be vastly different from another American object. So, we now know that the more criteria we have, the more accurately the model predicts Object Begin Date. We were also selective about which criteria to use to train the model; for instance, we compared predictions with and without Artist Display Name because many objects have no information about the artist.
In conclusion, it is possible to predict the Object Begin Date of an object from the MET Museum using other criteria. However, you must be selective about which criteria you use. You must also be cautious about missing values so as not to skew the predictive data.